A Tour of Python Packages for Scientific Computing

Python Packages and Modules

A module is simply a file containing Python code which defines variables, functions and classes, and a package is a collection of modules.

Use the keyword import to import a module or package into your Python environment. We access variables, functions, classes, etc. from a module or package using the dot notation.
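For example, using the math module from the standard library:

import math
print(math.pi)       # access the variable pi with dot notation
print(math.sqrt(2))  # call the function sqrt with dot notation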

See Mathematical Python for an introduction to Python, SciPy and Jupyter with mathematical applications.

NumPy, SciPy and Matplotlib

NumPy is the core numerical computing package in Python. It provides the ndarray object which represents vectors, matrices and arrays of any dimension. Every other package we talk about today is built on NumPy and ndarray.

SciPy is a library containing packages for numerical integration, linear algebra, signal processing, and much more.

Matplotlib is a plotting library.

Let’s begin by importing NumPy under the alias np and matplotlib.pyplot as plt.

import numpy as np
import matplotlib.pyplot as plt

Basic Plotting

The procedure is simple:

  • Create a NumPy array of \(x\) values

  • Use NumPy functions to create an array of \(y\) values

  • Plot with the command plt.plot(x,y)

See more matplotlib.pyplot commands.

Plot \(y = \sin(2 \pi x)\) over the interval \([0,6]\).

x = np.linspace(0,6,200)
y = np.sin(2*np.pi*x)
plt.plot(x,y)
plt.show()

Plot the parametric curve given by \(x = 2 k \cos(t) - a \cos(k t)\), \(y = 2 k \sin(t) - a \sin(k t)\) over the interval \(t \in [0,2 \pi]\) for different values \(a\) and \(k\).

k = 6
a = 11
t = np.linspace(0,2*np.pi,1000)
x = 2*k*np.cos(t) - a*np.cos(k*t)
y = 2*k*np.sin(t) - a*np.sin(k*t)
plt.plot(x,y)
plt.axis('equal')
plt.axis('off')
plt.show()

Numerical Integration

The function scipy.integrate.quad computes approximations of definite integrals.

import scipy.integrate as spi

Plot the Gaussian \(e^{-x^2}\) over the interval \([-3,3]\) and verify the formula

\[ \int_{-\infty}^{\infty} e^{-x^2} \, dx = \sqrt{\pi} \]
x = np.linspace(-3,3,1000)
y = np.exp(-x**2)
plt.plot(x,y)
plt.show()

Compute the integral and an estimate of the error

I, err = spi.quad(lambda x: np.exp(-x**2),-np.inf,np.inf)
print(I)
1.7724538509055159
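
The second value returned by quad (called err here) is an estimate of the absolute error of the approximation:

print(err)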

Compare to \(\sqrt{\pi}\)

np.pi**0.5
1.7724538509055159

Logistic Equation

The function scipy.integrate.odeint computes approximations of solutions of differential equations. Plot numerical solutions of the logistic equation \(y' = y(1-y)\) for different initial conditions \(y(0)\).

def odefun(y,t):
    return y*(1-y)

t = np.linspace(0,1,100)

for y0 in np.arange(-0.4,3,0.2):
    y = spi.odeint(odefun,y0,t)
    plt.plot(t,y,'b')
plt.show()

Euler’s 3-body Problem

Write a function called euler3body which takes input parameters:

  • m1 is the mass of star 1

  • x1 is the position vector of star 1

  • m2 is the mass of star 2

  • x2 is the position vector of star 2

  • u0 is the initial values vector of the planet \([x(0),x'(0),y(0),y'(0)]\)

  • tf is the final time value

  • N is the number of t values per year (default value N=100)

and plots the approximations \(x(t)\) versus \(y(t)\).

def euler3body(m1,x1,m2,x2,u0,tf,N=100):

    def odefun(u,t):
        G = 4*np.pi**2 # gravitational constant in units of AU, years and solar masses
        d1 = np.sqrt((u[0] - x1[0])**2 + (u[2] - x1[1])**2)
        d2 = np.sqrt((u[0] - x2[0])**2 + (u[2] - x2[1])**2)
        dudt = np.zeros(4)
        dudt[0] = u[1]
        dudt[1] = -G*(m1*(u[0] - x1[0])/d1**3 + m2*(u[0] - x2[0])/d2**3)
        dudt[2] = u[3]
        dudt[3] = -G*(m1*(u[2] - x1[1])/d1**3 + m2*(u[2] - x2[1])/d2**3)
        return dudt
    
    t = np.linspace(0,tf,N*tf)
    U = spi.odeint(odefun,u0,t)
    plt.plot(U[:,0],U[:,2])
    plt.plot(x1[0],x1[1],'r.',markersize=m1*10)
    plt.plot(x2[0],x2[1],'r.',markersize=m2*10)
    plt.axis('equal')
    plt.show()

Test the function with input where we know the output. We can model the orbit of the Earth around the Sun by setting \(m_1=1\) and \(m_2=0\) with Star 1 at the origin, and \(\mathbf{u}_0=[1,0,0,2\pi]\) to start the planet at 1 AU from the Sun with velocity \(2\pi\) AU/year to produce a nearly circular orbit.

euler3body(1,[0,0],0,[0,0],[1,0,0,2*np.pi],1)

Success! Let’s try to create some cool orbits!

euler3body(2,[-1,0],1,[1,0],[0,5,3,0],30,200)
euler3body(1.5,[-2,0],2.5,[2,0],[0,4.8,3,0],20)
euler3body(2,[-1,1],1,[1,-1],[0,10,0,5],30)

Image Deblurring

The subpackage scipy.linalg contains many functions and algorithms for numerical linear algebra.

import scipy.linalg as la

Blurring images by Toeplitz matrices

Represent an image as a matrix \(X\). Use the function scipy.linalg.toeplitz to create Toeplitz matrices \(A_c\) and \(A_r\). Matrix multiplication on the left \(A_c X\) blurs vertically (in the columns) and on the right \(X A_r\) blurs horizontally (in the rows).

Start with a simple example. Create a \(256 \times 256\) matrix of zeros and ones which represents the image of a square.

N = 256

Z = np.zeros((N//4,N//4))
O = np.ones((N//4,N//4))
X = np.block([[Z,Z,Z,Z],[Z,O,O,Z],[Z,O,O,Z],[Z,Z,Z,Z]])
plt.imshow(X,cmap='binary')
plt.show()

Create a Toeplitz matrix where the values decrease as we move away from the diagonal.

c = np.zeros(N)
s = 5
c[:s] = (s - np.arange(0,s))/(3*s)
Ac = la.toeplitz(c)

Ac[:5,:5]
array([[0.33333333, 0.26666667, 0.2       , 0.13333333, 0.06666667],
       [0.26666667, 0.33333333, 0.26666667, 0.2       , 0.13333333],
       [0.2       , 0.26666667, 0.33333333, 0.26666667, 0.2       ],
       [0.13333333, 0.2       , 0.26666667, 0.33333333, 0.26666667],
       [0.06666667, 0.13333333, 0.2       , 0.26666667, 0.33333333]])
plt.imshow(Ac[:20,:20],cmap='binary')
plt.colorbar()
plt.show()

Check the condition number of \(A_c\).

np.linalg.cond(Ac)
24782.3310425676

Blur the image \(X\) vertically.

plt.imshow(Ac @ X,cmap='binary')
plt.show()

Do the same but in the horizontal direction.

r = np.zeros(N)
s = 20
r[:s] = (s - np.arange(0,s))/(3*s)
Ar = la.toeplitz(r)

plt.imshow(X @ Ar.T,cmap='binary')
plt.show()

Combine both vertical and horizontal blurring.

plt.imshow(Ac @ X @ Ar.T,cmap='binary')
plt.show()

Inverting the noise

Let \(B = A_c X A_r^T\) be the blurred image and let \(E\) represent some noise added in the recording, so the recorded image is

\[ B + E = A_c X A_r^T + E \]

How do we find \(X\)? Let’s do an example with a real picture.

kitten = plt.imread('data/kitten.jpg').astype(np.float64)

kitten[:3,:3]
array([[150., 153., 158.],
       [150., 153., 158.],
       [149., 153., 157.]])
kitten.shape
(256, 256)
plt.imshow(kitten,cmap='gray')
plt.show()

Add noise to the image.

noise = 0.01*np.random.randn(256,256)
B_E = Ac @ kitten @ Ar.T + noise

plt.imshow(B_E,cmap='gray')
plt.show()

Compute the solution directly

\[ X = A_c^{-1} (B + E) ( A_r^T )^{-1} \]
X1 = la.solve(Ac,B_E)  # undo the vertical blur: X1 = Ac^(-1) (B+E)
X2 = la.solve(Ar,X1.T) # undo the horizontal blur
X2 = X2.T

plt.imshow(X2,cmap='gray')
plt.show()

What happened? We computed

\[ X = A_c^{-1} B ( A_r^T )^{-1} + A_c^{-1} E ( A_r^T )^{-1} \]

and the result is dominated by the inverted noise \(A_c^{-1} E ( A_r^T )^{-1}\).
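
We can check this numerically (a quick aside; the exact values depend on the random noise): the recorded noise is tiny, but applying the inverses amplifies it enormously because the small singular values of \(A_c\) and \(A_r\) get inverted.

print(np.linalg.norm(noise)) # the noise itself is small
E1 = la.solve(Ac,noise)      # Ac^(-1) E
E2 = la.solve(Ar,E1.T).T     # Ac^(-1) E (Ar^T)^(-1)
print(np.linalg.norm(E2))    # the inverted noise is large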

Truncated SVD

We need to avoid inverting the noise, so instead we compute the solution using the truncated pseudoinverse

\[ X = (A_c)_k^+ (B + E) (A_r^T)_k^+ \]
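Here the subscript \(k\) denotes truncation: if \(A = P \Sigma Q^T\) is the singular value decomposition, then

\[ A_k^+ = Q_k \Sigma_k^{-1} P_k^T \]

where \(P_k\) and \(Q_k\) consist of the first \(k\) columns of \(P\) and \(Q\), and \(\Sigma_k\) is the diagonal matrix of the \(k\) largest singular values.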
Pc,Sc,QTc = la.svd(Ac)
Pr,Sr,QTr = la.svd(Ar)

plt.figure(figsize=(12,4))
plt.subplot(1,2,1)
plt.plot(Sc)
plt.grid(True)
plt.title('Singular Values of $A_c$')

plt.subplot(1,2,2)
plt.plot(Sr)
plt.grid(True)
plt.title('Singular Values of $A_r$')
plt.show()

Compute the truncated pseudoinverse by cutting off small singular values.

k = 200
Ac_k_plus = QTc[:k,:].T @ np.diag(1/Sc[:k]) @ Pc[:,:k].T
Ar_k_plus = QTr[:k,:].T @ np.diag(1/Sr[:k]) @ Pr[:,:k].T
X = Ac_k_plus @ B_E @ Ar_k_plus
plt.imshow(X,cmap='gray')
plt.show()

We rescued the kitten!

Computed Tomography

The following example uses tomographic X-ray data of a walnut. The dataset was prepared by the Finnish Inverse Problems Society (FIPS). Import the MATLAB data file with scipy.io.loadmat.

from scipy.io import loadmat
data = loadmat('./data/ct.mat')

The data file contains a measurement matrix \(A\) and the projections vector \(m\). The measurement matrix is large, so it is stored in a sparse format.

Measurement Matrix \(A\)

A = data['A']
A.data.nbytes
7762752
A.shape
(9840, 6724)
type(A)
scipy.sparse.csc.csc_matrix
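
The matrix is mostly zeros, which is why the sparse format pays off. A quick check of its density (an aside, not part of the original exploration):

A.nnz / (A.shape[0]*A.shape[1]) # fraction of nonzero entries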

Each row of the measurement matrix represents a projection of an X-ray through the sample at a particular angle. We can visualize each row by reshaping it into a matrix.

index = 300 # or np.random.randint(0,A.shape[0]) for a random row
proj = A[index,:].reshape(82,82)
plt.spy(proj)
plt.show()
print('Row: ',index)
Row:  300

Sinogram: Measured Projections

The array \(m\) collects the measured projections: 82 values at each of 120 different angles.

sinogram = data['m']
sinogram.nbytes
78720
sinogram.shape
(82, 120)
type(sinogram)
numpy.ndarray
plt.figure(figsize=(10,6))
plt.imshow(sinogram)
plt.show()

Compute Truncated SVD Solution

We need to solve \(A\mathbf{x} = \mathbf{b}\), where \(\mathbf{b}\) is the vector of projections; however, there are errors in the measured projections, so we actually have

\[ A \hat{\mathbf{x}} = \mathbf{b} + \mathbf{e} \]

where \(\mathbf{e}\) is noise. We need to compute the truncated pseudoinverse to avoid inverting the noise

\[ \hat{\mathbf{x}} = A_k^+ ( \mathbf{b} + \mathbf{e} ) \]
b = sinogram.reshape([82*120,1],order='F') # stack the columns (one projection per angle) into a single vector

We proceed as in the last example, but now we need the sparse linear algebra routines from scipy.sparse.linalg.

from scipy.sparse import linalg as sla

Truncate the measurement matrix after the largest \(k\) singular values and compute the pseudoinverse.

k = 200
P,S,QT = sla.svds(A,k=k)
P.shape
(9840, 200)
QT.shape
(200, 6724)
A_k_plus = QT.T @ np.diag(1/S) @ P.T
X = A_k_plus @ b
plt.figure(figsize=(8,8))
plt.imshow(X.reshape(82,82).T,cmap='bone')
plt.show()

True Solution

Compare to the true solution provided by FIPS.

Solution of the Computed Tomography Example

pandas

pandas is the main Python package for data analysis. Load tabular data with the all-important pandas.read_csv function. Let's do an example with Vancouver weather data taken from Vancouver Weather Statistics.

import pandas as pd
data = pd.read_csv('./data/weatherstats_vancouver_hourly.csv',
                   parse_dates=['date_time_local'],
                   usecols=[0,2,4,5,6,8,10])
data.head()
      date_time_local  pressure_station wind_dir  wind_dir_10s  wind_speed  relative_humidity  temperature
0 2020-08-12 08:00:00               NaN      NaN           NaN         NaN                NaN          NaN
1 2020-08-12 07:00:00            101.48        W          27.0        17.0               77.0         14.3
2 2020-08-12 06:00:00            101.40        W          27.0        18.0               78.0         13.6
3 2020-08-12 05:00:00            101.35        W          27.0        21.0               73.0         14.2
4 2020-08-12 04:00:00            101.30        W          28.0        26.0               71.0         14.5
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 7 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   date_time_local    15000 non-null  datetime64[ns]
 1   pressure_station   14999 non-null  float64       
 2   wind_dir           14823 non-null  object        
 3   wind_dir_10s       14998 non-null  float64       
 4   wind_speed         14999 non-null  float64       
 5   relative_humidity  14998 non-null  float64       
 6   temperature        14998 non-null  float64       
dtypes: datetime64[ns](1), float64(5), object(1)
memory usage: 820.4+ KB

Use the datetime functionality to convert the datetime column into columns with year, month, day and hour.

data['year'] = data['date_time_local'].dt.year
data['month'] = data['date_time_local'].dt.month
data['day'] = data['date_time_local'].dt.day
data['hour'] = data['date_time_local'].dt.hour

Plot the average monthly temperature in 2019.

data2019 = data[data['year'] == 2019]
data2019.groupby('month')['temperature'].mean().plot(kind='bar')
plt.title('Average Monthly Temperature 2019')
plt.xlabel('Month'), plt.ylabel('Temperature (C)')
plt.show()

Plot the average hourly wind speed in September 2019.

data2019[data2019['month'] == 9].groupby('hour')['wind_speed'].mean().plot(kind='bar')
plt.title('Average Hourly Windspeed September 2019')
plt.xlabel('Hour'), plt.ylabel('Wind Speed (km/h)')
plt.show()

scikit-learn

scikit-learn provides simple implementations of many machine learning algorithms.

Hand-written digits dataset

scikit-learn comes with built-in datasets for experimentation. Let's import the digits dataset and use the data to create a model which predicts the correct digit for a new image sample.

from sklearn.datasets import load_digits
digits = load_digits()
digits.keys()
dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])
images = digits.images
images.shape
(1797, 8, 8)
plt.imshow(images[1001,:,:],cmap='binary',interpolation='gaussian')
plt.show()

The images array is a 3D array where, for each index i, the 2D array images[i,:,:] is a numeric array which represents an 8 by 8 pixel image of a hand-written digit.

images[1001,:,:]
array([[ 0.,  0.,  0.,  1., 15.,  2.,  0.,  0.],
       [ 0.,  0.,  0.,  5., 15.,  0.,  4.,  0.],
       [ 0.,  0.,  0., 13.,  8.,  1., 16.,  3.],
       [ 0.,  0.,  5., 15.,  2.,  5., 15.,  0.],
       [ 0.,  5., 15., 16., 16., 16.,  8.,  0.],
       [ 0., 14., 12., 12., 14., 16.,  2.,  0.],
       [ 0.,  0.,  0.,  0., 12., 12.,  0.,  0.],
       [ 0.,  0.,  0.,  2., 16.,  5.,  0.,  0.]])
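
The classifiers below work with digits.data, which stores the same images flattened into vectors of length 64, one row per image. A quick sanity check (an aside, not in the original):

print(digits.data.shape)                                      # (1797, 64)
print(np.allclose(digits.data[1001],images[1001].flatten()))  # True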

K-nearest neighbors classifier

The K-nearest neighbors classifier is simple to understand: treat each known digit as a point in 64-dimensional space, place a new sample as a point in the same space, and predict its label from the labels of the K nearest points in the training set.

from sklearn.neighbors import KNeighborsClassifier as KNN
from sklearn.model_selection import train_test_split
clf = KNN(n_neighbors = 10)
X_train, X_test, y_train, y_test = train_test_split(digits.data,digits.target,test_size=0.3)
clf.fit(X_train,y_train)
KNeighborsClassifier(n_neighbors=10)
clf.score(X_test,y_test)
0.9814814814814815
plt.imshow(X_test[10,:].reshape(8,8),cmap='binary',interpolation='gaussian')
plt.show()
clf.predict(X_test[10,:].reshape(1,64))
array([1])

Easy as that! We have a model which is around 98% accurate on our testing data!

Principal Component Analysis

Each sample in the digits dataset is an 8 by 8 pixel image of a handwritten digit. We represent each image as a vector in 64-dimensional space. Let’s use principal component analysis to project that 64-dimensional space of digits down to 2D while preserving as much of the variance in the data as possible.

Instantiate a PCA object

Principal component analysis projects the data onto orthogonal components in the feature space so that each component captures the maximum amount of variance. We’ll apply PCA to the digits dataset and observe the results and then we’ll do the computation for ourselves to see what’s going on under the hood.

from sklearn.decomposition import PCA
pca = PCA(n_components=2) # We want 2 principal components so that we can plot the dataset in 2D

Compute the principal components

pca.fit(digits.data)
PCA(n_components=2)

The model has computed the 2 principal components. What are they? They are vectors in the feature space. In other words, they are 8 by 8 pixel images. What do they look like? We can access the components using the .components_ attribute.

pca.components_.shape
(2, 64)
p1 = pca.components_[0,:] # First principal component
p2 = pca.components_[1,:] # Second principal component
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
plt.imshow(p1.reshape(8,8),cmap='binary'), plt.title('First Principal Component')
plt.subplot(1,2,2)
plt.imshow(p2.reshape(8,8),cmap='binary'), plt.title('Second Principal Component')
plt.show()

This means that our PCA object is now equipped to project each image onto these two principal components. Looking at the 2 principal components, we can see that the best 2D representation of the dataset is the result of how much a digit looks like a 3 and how much it looks like a 0.
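
To see what's going on under the hood, here is a minimal sketch using only NumPy: PCA centers the data and takes the top right singular vectors of the centered data matrix as the principal components.

Xc = digits.data - digits.data.mean(axis=0) # center the data
U,S,VT = np.linalg.svd(Xc,full_matrices=False)
# the first two right singular vectors match pca.components_ up to sign
print(np.allclose(np.abs(VT[:2]),np.abs(pca.components_)))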

Visualizing the digits dataset

Now we can apply the .transform method to the whole dataset and plot the result.

digits_2D = pca.transform(digits.data)

plt.figure(figsize=(18,8))

plt.scatter(digits_2D[:,0],digits_2D[:,1],c=digits.target,s=100,alpha=0.5,lw=0,cmap=plt.cm.get_cmap('jet', 10));
plt.colorbar(ticks=range(0,10)); plt.clim([-0.5,9.5]);
plt.show()

Notice how all the 3s are to the left along the horizontal axis. This is because the first principal component is a 3 except with the colors inverted. Similarly, the 0s are at the bottom along the vertical axis because the second principal component is a 0 again with the colors inverted.

NetworkX

NetworkX is a Python package for network analysis. Let’s create a random directed graph and compute the PageRank of each node.

import networkx as nx
n_pages = 20
n_links = 100
edges = [(np.random.randint(n_pages),np.random.randint(n_pages)) for _ in range(0,n_links)]
G = nx.DiGraph(edges)
nx.draw(G,pos=nx.spring_layout(G),with_labels=True)
plt.show()
ranks = nx.pagerank(G)
nx.draw(G,node_color=list(ranks.values()),cmap='winter',with_labels=True)
plt.show()
ranks
{15: 0.02258797829683527,
 5: 0.01626682002127012,
 17: 0.035435882015889604,
 14: 0.04462296899294341,
 11: 0.08729250687828811,
 18: 0.12191048411632714,
 9: 0.10596713008981737,
 0: 0.013524048602123665,
 6: 0.04644446989308877,
 12: 0.06107989825014321,
 7: 0.050463620910014474,
 2: 0.02813135395343686,
 19: 0.05707906269087664,
 10: 0.024324538491661322,
 8: 0.08890507274924601,
 16: 0.03530366616292815,
 1: 0.02429965225761032,
 3: 0.04727598165748564,
 13: 0.04940686793486309,
 4: 0.03967799603515077}
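
Under the hood, the PageRank vector is the stationary distribution of a random surfer who follows a random outgoing link with probability 0.85 and jumps to a random page otherwise. Here is a rough sketch of that computation by power iteration (an aside; nx.pagerank handles dangling nodes, edge weights and convergence more carefully):

d = 0.85                                # damping factor (the nx default)
nodes = list(G.nodes())
n = len(nodes)
M = nx.to_numpy_array(G,nodelist=nodes) # adjacency matrix
M[M.sum(axis=1) == 0] = 1/n             # dangling nodes link to every page
M = M/M.sum(axis=1,keepdims=True)       # normalize rows to sum to 1
r = np.ones(n)/n
for _ in range(100):
    r = d*(M.T @ r) + (1 - d)/n         # power iteration
print(dict(zip(nodes,np.round(r,4))))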