DataScience Workbook / 05: Introduction to Programming / 3. Introduction to Python programming / 3.6. SciPy library - algorithms for scientific computing

Introduction

You can find the official SciPy documentation on their website at: https://docs.scipy.org/doc/scipy/tutorial/index.html ⤴

SciPy ⤴ is a Python library for scientific and technical computing that provides a collection of algorithms and functions for tasks such as optimization, signal processing, linear algebra, and more. It is built on top of NumPy ⤴, another popular Python library for numerical computing, and is a fundamental tool for data analysis and scientific computing.

The library is open-source and free to use, with a large community of contributors who are continuously adding new features and improving existing ones. Additionally, SciPy works well with other popular scientific computing libraries, such as Matplotlib and Pandas, making it a valuable tool for data scientists and researchers in various fields.

Getting started with SciPy

SciPy is NOT a built-in Python module, meaning it is not included with the standard Python distribution. It is an external library that can be installed and imported into a Python environment.

Install scipy library

To install SciPy, you can use pip, which is the standard package installer for Python. You can run the following command in your terminal or command prompt to install SciPy:

pip install scipy

This will download and install the latest version of SciPy from the Python Package Index (PyPI).

An alternative way to install SciPy is using Conda. This way you can install different variants of scipy library in separate virtual environments, depending on the requirements of your project. You can create and activate a new conda environment, and then install scipy libarary:

conda install scipy

This will download and install the latest version of SciPy from the conda repository.

This command will install scipy and any necessary dependencies in your current Conda environment. If you don’t have conda installed yet, you can follow the guide provided in the tutorial . If you are using the Anaconda distribution, conda is already installed by default.

conda provides additional benefits over pip, such as the ability to create and manage multiple environments for different projects with different dependencies, and the ability to install packages from both the Conda and PyPI (Python Package Index) repositories.

If you are working in a virtual environment, you can install packages without administrative privileges by activating the virtual environment before running the installation command.

Import scipy library

Once installed, SciPy can be imported into Python scripts and used to perform numerical computations and data manipulation.

import scipy as sp

This will import the SciPy library and give it an alias of sp, which is a commonly used abbreviation for SciPy.

You can then use the SciPy functions and classes in your code by prefixing them with sp, such as sp.array() or sp.stats.norm.

Import methods from submodules

To import selected methods from SciPy submodules, you can use the from ... import ... syntax.

For example, to import the linalg.solve method from the scipy.linalg submodule, you can use the following code:

from scipy.linalg import solve

After importing the method in this way, you can use it directly in your code without the need to prefix it with the submodule name, as shown below:

import numpy as np
from scipy.linalg import solve

A = np.array([[1, 2], [3, 4]])
b = np.array([5, 6])

x = solve(A, b)                                 # use imported method directly

print(x)

In this example, we import the solve method from scipy.linalg and then use it to solve a system of linear equations.

print(x)
[-4.   4.5]

Submodules in SciPy

The SciPy library contains several submodules that cater to specific scientific domains. Each submodule offers a range of specialized functionalities that can help you solve a wide variety of scientific and technical computing problems.

submodule description documentation
scipy.io Contains functions for reading and writing data in various file formats, such as MATLAB, NetCDF, and WAV. docs ⤴
scipy.datasets Provides a selection of datasets for use in scientific and statistical analysis, such as iris and MNIST. docs ⤴
scipy.constants Contains physical and mathematical constants, such as the speed of light and the golden ratio. docs ⤴
scipy.stats Contains a range of statistical functions, including probability distributions, descriptive statistics, hypothesis testing, and regression analysis. docs ⤴
scipy.cluster Offers algorithms for clustering data, including k-means, hierarchical clustering, and spectral clustering. docs ⤴
scipy.optimize Contains various optimization algorithms for minimizing or maximizing functions, including nonlinear least squares, least-squares minimization, and root finding. docs ⤴
scipy.interpolate Provides functions for interpolating 1D and higher-dimensional data, including splines, radial basis functions, and smoothing. docs ⤴
scipy.integrate Provides functions for numerical integration, including single and multi-dimensional integration, as well as ODE solvers. docs ⤴
scipy.linalg Contains linear algebra functions, including matrix operations, eigenvalue problems, and decompositions. docs ⤴
scipy.fftpack Provides functions for computing fast Fourier transforms (FFT) and related operations. docs ⤴
scipy.odr Offers orthogonal distance regression (ODR) for fitting models to data with errors in both the x and y dimensions. docs ⤴
scipy.sparse Provides tools for working with sparse matrices, including operations such as matrix multiplication, linear solvers, and eigenvalue problems. docs ⤴
scipy.spatial Provides functions for working with spatial data, including distance calculations, nearest-neighbor searches, and Voronoi diagrams. docs ⤴
scipy.signal Offers a range of signal processing functions, including filtering, Fourier analysis, wavelets, and convolution. docs ⤴
scipy.ndimage Provides functions for n-dimensional image processing and filtering, such as smoothing, edge detection, and morphology. docs ⤴
scipy.special Contains special functions, such as Bessel functions, gamma functions, and error functions. docs ⤴
scipy.misc Offers various utility functions for scientific computing, such as image manipulation and numerical approximations. docs ⤴

scipy.io

The scipy.io module provides functions for reading and writing data in various file formats. Some of the most common functions in this module are:

  • loadmat() - oad data from MATLAB .mat files
  • savemat() - save data to a MATLAB .mat file

Other I/O functions refer to IDL files, Matrix Market files, Unformatted Fortran files, Netcdf, Harwell-Boeing files, Wav sound files, and Arff files. Learn more from the official documentation: scipy.io ⤴.

savemat() loadmat()

savemat()
function used to save data to a MATLAB .mat file

loadmat()
function used to load data from MATLAB .mat files

It takes a dictionary of variables and their values as input. It returns a dictionary containing the variables and their values.
import scipy.io as sio
import numpy as np

data = {'x': np.array([1, 2, 3]), 'y': np.array([4, 5, 6])}     # create data sample

sio.savemat('data.mat', data)                                   # save data to .mat file

data = sio.loadmat('data.mat')                                  # load data from .mat file

print(data.keys())                                              # output: ['x', 'y']
print(data['x'])                                                # output: [][1 2 3]]

In this example, we create a dictionary data that contains two arrays, x and y. We then use the savemat() function to save the data to a MATLAB .mat file called data.mat. Then, we use the loadmat function to load data from a MATLAB .mat file called data.mat. We print the keys of the dictionary that is returned by the function to see the variables that were loaded, and then we print the value of the x variable.

scipy.datasets

The scipy.datasets module provides a set of 3 datasets that can be used for testing, benchmarking, and other purposes:

  • ascent() - grayscale image
  • face() - color image
  • electrocardiogram() - medical recordings of the heart’s electrical activity

The module provides a set of built-in functions to load these datasets by their names.

import scipy

data = scipy.datasets.ascent()
print(data)
[[ 83  83  83 ... 117 117 117]
 [ 82  82  83 ... 117 117 117]
 [ 80  81  83 ... 117 117 117]
 ...
 [178 178 178 ...  57  59  57]
 [178 178 178 ...  56  57  57]
 [178 178 178 ...  57  57  58]]

If you want to download all datasets included in Scipy, you can use the download_all() function. This function will download all three datasets and save them to a selected location as .dat files.

from scipy.datasets import download_all

download_all('.', subok=True)

In this code, download_all() will download the three datasets to the current working directory, represented by '.'. The subok=True parameter ensures that any subdirectories needed to save the files will be created if they don’t already exist.

…from the official documentation: scipy.datasets ⤴.

scipy.constants

The scipy.constants module provides a collection of physical and mathematical constants. These constants are often used in scientific calculations and simulations. The full list of available constants can be explored in the official documentation: scipy.constants ⤴

Here are some of the most commonly used methods in this module along with their usage examples:

  • scipy.constants.value(constant_name)

Returns the numerical value of a given physical constant.

from scipy import constants

constants.value('Planck constant')
6.62607015e-3

Some common constants, such as R (molar gas constant), C (speed of light in vacuum), g (standard acceleration due to gravity), or atomic mass, can be called directly by their symbol or name:

from scipy import constants

constants.c                                      # output: 299792458.0
constants.g                                      # output: 9.80665
constants.R                                      # output: 8.31446261815324
constants.atomic_mass                            # output: 1.6605390666e-27
  • scipy.constants.find(substring)

Returns a list of constants whose names contain the given substring.

from scipy import constants

constants.find('Planck')
['Planck constant', 'Planck constant in eV/Hz', 'Planck length', 'Planck mass',
 'Planck mass energy equivalent in GeV', 'Planck temperature', 'Planck time',
 'molar Planck constant', 'reduced Planck constant', 'reduced Planck constant in eV s',
 'reduced Planck constant times c in MeV fm']

scipy.constants.physical_constants[constant_name]
Returns a tuple containing the value, unit, and uncertainty of a given physical constant.

from scipy import constants

constants.physical_constants['proton mass']      # output: (1.67262192369e-27, 'kg', 5.1e-37)
(1.67262192369e-27, 'kg', 5.1e-37)
  • scipy.constants.unit(constant_name)

Returns the unit of a given physical constant.

from scipy import constants

constants.unit['proton mass']                    # output: 'kg'
  • scipy.constants.convert_temperature(temp, from_unit, to_unit)

Returns the unit of a given physical constant.

from scipy import constants

constants.convert_temperature(100, 'Celsius', 'Kelvin')        # output: 373.15

scipy.stats

The scipy.stats module provides a wide range of distributions, including:

and statistical functions:

and other statistical functionality such as Transformations ⤴, Sampling ⤴, Resampling Methods ⤴, Quasi-Monte Carlo ⤴ and stats-specific Warnings / Errors ⤴.


Here are a few common methods with usage examples:

generate random samples

A. Returns a normal continuous random variable:
scipy.stats.norm
Here’s an example of generating 10 random samples from a normal distribution with a mean of 0 and a standard deviation of 1:

from scipy.stats import norm

data = norm.rvs(size=10)
data
array([-0.78903894, -1.55293905, -0.64965826, -1.05128015,  2.55342981,
       -0.2064151 , -0.60322765, -0.861767  , -0.75899174, -1.28693063])

print(data)
[-0.78903894 -1.55293905 -0.64965826 -1.05128015  2.55342981 -0.2064151
 -0.60322765 -0.861767   -0.75899174 -1.28693063]

B. Returns a binomial discrete random variable:
scipy.stats.binom
Here’s an example of generating 10 random samples from a binomial distribution with 10 trials and a success probability of 0.5:

from scipy.stats import binom

data = binom.rvs(n=10, p=0.5, size=10)
print(data)
[2 4 4 5 5 7 6 6 3 5]

C. Returns a chi-squared continuous random variable:
scipy.stats.chi2
Here’s an example of generating 10 random samples from a chi-squared distribution with 2 degrees of freedom:

from scipy.stats import chi2

data = chi2.rvs(df=2, size=10)
print(data)
[1.04467704 1.1461272  0.36506538 0.81107591 3.26317094 4.84727761
 4.52965288 0.93817042 8.51668502 3.2564631 ]

D. Returns a gamma continuous random variable:
scipy.stats.gamma
Here’s an example of generating 10 random samples from a gamma distribution with a shape parameter of 2 and a scale parameter of 2:

from scipy.stats import gamma

data = gamma.rvs(a=2, scale=2, size=10)
print(data)
[ 0.80623887  3.61599447  4.67765372  3.27401918  4.62340971  4.33309687
  1.51408293  1.31255128  3.13450136 10.43018603]

calculate T-test of one sample

scipy.stats.ttest_1samp
Calculates the T-test for the mean of one sample.
Here’s an example of performing a one-sample T-test on a set of data:

from scipy.stats import ttest_1samp

data = [1, 2, 3, 4, 5]

ttest_1samp(data, 3)
TtestResult(statistic=0.0, pvalue=1.0, df=4

calculate ANOVA test for multiple samples

scipy.stats.f_oneway
Performs a one-way ANOVA test for multiple samples.
Here’s an example of performing an ANOVA test on three sets of data:

from scipy.stats import f_oneway

data1 = [1, 2, 3, 4, 5]
data2 = [2, 4, 6, 8, 10]
data3 = [3, 6, 9, 12, 15]

f_oneway(data1, data2, data3)
F_onewayResult(statistic=3.857142857142857, pvalue=0.05086290933139865)

calculate a Pearson correlation coefficient

scipy.stats.pearsonr
Calculates a Pearson correlation coefficient and the associated p-value.
Here’s an example of calculating the correlation coefficient between two sets of data:

from scipy.stats import pearsonr

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

pearsonr(x, y)
PearsonRResult(statistic=1.0, pvalue=0.0)

calculate a linear least-squares regression

scipy.stats.linregress
Calculates a linear least-squares regression for two sets of measurements.
Here’s an example of performing a linear regression on two sets of data:

from scipy.stats import linregress

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

linregress(x, y)                        # output: LinregressResult(slope=2.0, intercept=0.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)
LinregressResult(slope=2.0, intercept=0.0, rvalue=1.0, pvalue=1.2004217548761408e-30, stderr=0.0, intercept_stderr=0.0)

scipy.cluster

The scipy.cluster module of Scipy provides algorithms for clustering, which involves grouping data points into clusters based on their similarities. The full list of available clustering methods can be explored in the official documentation: scipy.cluster ⤴

Here are some of the most common methods in this module along with usage examples:

hierarchical clustering with linkage matrix

scipy.cluster.hierarchy.linkage
Computes the hierarchical clustering of a dataset and returns a linkage matrix.

import numpy as np
from scipy.cluster.hierarchy import linkage

data = np.random.rand(5, 3)

linkage_matrix = linkage(data, method='single')
see the expected output
print(data)
[[0.97309739 0.14092786 0.73237976]
 [0.53102418 0.85463043 0.28841557]
 [0.25546178 0.51738497 0.28659921]
 [0.7780539  0.35929428 0.53852367]
 [0.60161033 0.82559442 0.34248004]]

print( linkage_matrix)
[[1.         4.         0.09353321 2.        ]
 [0.         3.         0.3511496  2.        ]
 [2.         5.         0.43551399 3.        ]
 [6.         7.         0.53572498 5.        ]]

Check if a linkage matrix is valid

scipy.cluster.hierarchy.is_valid_linkage
Here’s an example of using this method:

from scipy.cluster.hierarchy import is_valid_linkage

print(is_valid_linkage(linkage_matrix))                         # output: True

Get leaf nodes of a hierarchical clustering

scipy.cluster.hierarchy.leaves_list
Returns the leaf nodes of a hierarchical clustering.

from scipy.cluster.hierarchy import leaves_list

leaves = leaves_list(linkage_matrix)
print(leaves)
[0 3 2 1 4]

hierarchical clustering as a dendrogram

scipy.cluster.hierarchy.dendrogram
Plots the hierarchical clustering as a dendrogram.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

data = np.random.rand(5, 3)

linkage_matrix = linkage(data, method='single')
dendrogram(linkage_matrix)

plt.show()
dendrogram(linkage_matrix)
{'icoord': [[5.0, 5.0, 15.0, 15.0], [35.0, 35.0, 45.0, 45.0], [25.0, 25.0, 40.0, 40.0], [10.0, 10.0, 32.5, 32.5]],
 'dcoord': [[0.0, 0.4609265621656883, 0.4609265621656883, 0.0], [0.0, 0.28518916147512924, 0.28518916147512924, 0.0], [0.0, 0.491984094759486, 0.491984094759486, 0.28518916147512924], [0.4609265621656883, 0.5412089099877605, 0.5412089099877605, 0.491984094759486]],
 'ivl': ['1', '2', '3', '0', '4'],
 'leaves': [1, 2, 3, 0, 4],
 'color_list': ['C0', 'C1', 'C0', 'C0'],
 'leaves_color_list': ['C0', 'C0', 'C0', 'C1', 'C1']
}

plt.show()

scipy_clustering_dendrogram

k-means clustering with centroids

scipy.cluster.vq.kmeans
Performs k-means clustering on a dataset and returns the cluster centroids.

import numpy as np
from scipy.cluster.vq import kmeans

data = np.random.rand(5, 3)

centroids, _ = kmeans(data, 2)
print(centroids)
[[0.24676794 0.82368652 0.88983179]
 [0.6756976  0.53838277 0.16609091]]

Assigns point to the nearest cluster centroid

scipy.cluster.vq.vq
Assigns each data point to the nearest cluster centroid using vector quantization.

from scipy.cluster.vq import vq

labels, _ = vq(data, centroids)
print(labels)
[0 0 1 0 1]