Machine Learning for Engineering¶

High-Dimensional System: Spatial Decomposition - Linear¶

Instructor: Daning Huang¶

$$ \newcommand{\R}{\mathbb{R}} \renewcommand{\vec}[1]{\mathbf{#1}} \newcommand{\X}{\mathcal{X}} \newcommand{\D}{\mathcal{D}} $$

TODAY: Dimension Reduction - II¶

  • Principal Component Analysis
    • Classical view
    • Probabilistic view
    • Connection to other methods
  • Other dimension reduction methods

References:¶

  • [PRML] Chap. 12
  • [MLaPP] Chap. 27

Principal Components Analysis¶

Uses material from [MLAPP] and [PRML]

Dimensionality Reduction¶

High-dimensional data may have low-dimensional structure.

  • We only need two dimensions to describe a rotated plane in 3d!
  • We have seen a few examples in the previous lecture
In [3]:
plot_plane()

Dimensionality Reduction¶

Data may even be embedded in a low-dimensional nonlinear manifold.

  • How can we recover a low-dimensional representation?

Dimensionality Reduction¶

In physics, a seemingly high-dimensional problem, e.g. an unsteady flow field, might live in a low-dimensional subspace.

An interesting paper on modal analysis of fluid flow.

Principal Components Analysis¶

Given a set $X = \{x_n\}$ of observations

  • in a space of dimension $D$,
  • find a linear subspace of dimension $M < D$
  • that captures most of its variability.

There are perhaps 10+ ways to understand PCA, and perhaps 10+ names, e.g.

  • Proper Orthogonal Decomposition (POD), from the fluid dynamics community
  • Karhunen-Loeve Expansion (KLE), from the statistics/probability community

We will start with the classical deterministic approach.

PCA: Equivalent Descriptions¶

  • maximizing the variance of the projection, or
  • minimizing the squared approximation error.

With the mean at the origin, each data point satisfies $ c_i^2 = a_i^2 + b_i^2 $, where $c_i$ is its (fixed) distance from the origin, $a_i$ is the length of its projection onto the subspace, and $b_i$ is the approximation error.

  • Since $c_i^2$ is constant, minimizing $b_i^2$ maximizes $a_i^2$ and vice versa

PCA: First Principal Component¶

Given data points $\{x_n\}$ in $D$-dim space, define

  • Mean $\bar x = \frac{1}{N} \sum_{n=1}^{N} x_n $
  • Data covariance ($D \times D$ matrix):

    $$ S = \frac{1}{N} \sum_{n=1}^{N}(x_n - \bar x)(x_n - \bar x)^T$$

Let $u_1$ be the principal component we want.

  • Unit length $u_1^T u_1 = 1$
  • Projection of $x_n$ is $u_1^T x_n$

Goal: Maximize the projection variance over directions $u_1$: $$ \frac{1}{N} \sum_{n=1}^{N}\{u_1^Tx_n - u_1^T \bar x \}^2 = u_1^TSu_1$$

  • Use a Lagrange multiplier to enforce $u_1^T u_1 = 1$

    • Maximize: $u_1^T S u_1 + \lambda(1-u_1^T u_1)$
  • Derivative is zero when $ Su_1 = \lambda u_1$

    • That is, $u_1^T S u_1 = \lambda $
  • So $u_1$ is eigenvector of $S$ with largest eigenvalue.

  • Repeating the above procedure in the subspace orthogonal to the components already found, one obtains the remaining, less dominant principal components (a NumPy sketch follows).
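A minimal NumPy sketch of this recipe (the synthetic data below is an illustrative assumption): form the covariance, take its eigendecomposition, and keep the eigenvector with the largest eigenvalue.

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.1])   # N x D synthetic data

xbar = X.mean(axis=0)                      # sample mean
S = (X - xbar).T @ (X - xbar) / len(X)     # D x D covariance matrix
lam, U = np.linalg.eigh(S)                 # eigenvalues in ascending order
u1 = U[:, -1]                              # first principal component
print(u1, lam[-1])                         # direction and variance captured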

Probabilistic PCA¶

We can view PCA as solving a probabilistic latent variable problem. $$ \begin{align} x &= Wz + \mu + \epsilon \\ z &\sim \mathcal{N}(0, I_M) \end{align} $$ where we describe

  • data $x \in \R^D$ in terms of
  • latent variable $z \in \R^M$ in lower dimensional space, via
  • linear transformation $W \in \R^{D \times M}$ that maps $z \mapsto x$
  • and $\epsilon$ is a D-dimensional zero-mean Gaussian-distributed noise variable with covariance $\sigma^2 I$.

Given the generative model $$ x = Wz + \mu + \epsilon $$

we can infer

$$ \mathbb{E}[x] = \mathbb{E}[Wz + \mu + \epsilon] = \mu $$$$ \begin{align} \text{Cov}[x] &= \mathbb{E}[(Wz + \epsilon)(Wz + \epsilon)^T] \\ &= \mathbb{E}[Wzz^TW^T] + \mathbb{E}[ \epsilon \epsilon^T] \\ &= WW^T + \sigma^2 I \end{align} $$
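A quick numerical check of these two moments, sampling from the generative model with arbitrary (assumed) $W$, $\mu$, and $\sigma$:

In [ ]:
import numpy as np

rng = np.random.default_rng(1)
D, M, N = 5, 2, 50000
W, mu, sigma = rng.normal(size=(D, M)), rng.normal(size=D), 0.3

z = rng.normal(size=(N, M))                  # z ~ N(0, I_M)
eps = sigma * rng.normal(size=(N, D))        # eps ~ N(0, sigma^2 I_D)
x = z @ W.T + mu + eps                       # x = W z + mu + eps

print(np.abs(x.mean(axis=0) - mu).max())                              # E[x] ~ mu
print(np.abs(np.cov(x.T) - (W @ W.T + sigma**2 * np.eye(D))).max())   # Cov[x] ~ WW^T + sigma^2 I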

Maximum Likelihood¶

$$ \begin{align} \log p(X|\mu, W, \sigma^2) &= \sum_n \log p(x_n| W, \mu, \sigma^2) \\ &= -\frac{ND}{2} \log 2\pi - \frac{N}{2} \log |C| -\frac{1}{2} \sum_n (x_n - \mu)^TC^{-1}(x_n - \mu) \end{align} $$ where $C = WW^T + \sigma^2 I$.

We can simply maximize this likelihood function with respect to $\mu, W, \sigma$.

  • Mean: $\mu = \bar x$
  • Noise: $$ \sigma_{ML}^2 = \frac{1}{D-M} \sum_{i=M+1} ^{D} \lambda_i $$
  • Transform: $$W_{ML} = U_M (L_M - \sigma^2 I)^{\frac{1}{2}} R $$

The truncated dimensions manifest as uniform "noise" (see the sketch after this list), and

  • $L_M$ is diag with the $M$ largest eigenvalues; so noise is due to the discarded eigenvalues/components in the data.
  • $U_M$ is the $M$ corresponding eigenvectors, i.e. the PC's in standard PCA.
  • $R$ is an arbitrary $M\times M$ rotation (i.e., $z$ can be defined by rotating “back”), note for any such $R$, one can define $\tilde{W}=WR$, and $\tilde{W}\tilde{W}^T=WRR^TW^T=WW^T$. So there is redundancy in $W$!
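The maximum-likelihood quantities above are simple to compute once the covariance eigendecomposition is available; a sketch on synthetic data (taking $R=I$) follows.

In [ ]:
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))   # N x D synthetic data
N, D = X.shape
M = 2

S = np.cov(X.T, bias=True)                 # D x D covariance (1/N normalization)
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]             # descending eigenvalues

sigma2_ml = lam[M:].mean()                                  # average of the discarded eigenvalues
W_ml = U[:, :M] @ np.diag(np.sqrt(lam[:M] - sigma2_ml))     # U_M (L_M - sigma^2 I)^{1/2}, with R = I
print(sigma2_ml, W_ml.shape)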

Probabilistic PCA: Other Formulations¶

It is also possible to formulate PCA using (1) Expectation Maximization and (2) a Bayesian approach. See [PRML] Chap. 12 for more details.

e.g., Expectation Maximization¶

which alternately optimizes $W$ and $z$ in the probabilistic formulation.

PCA and SVD¶

The search of the principal components of the data ends up in solving an eigenvalue problem (EVP) associated with the empirical covariance matrix $$ S u_i=\lambda_i u_i $$

  • One would want to choose the largest $M$ eigenvalues and the associated eigenvectors.
  • These eigenpairs maximize the variances $\sigma_i^2 = u_i^T S u_i$ and minimize the projection error $J=\sum_{i=M+1}^D\lambda_i$.
  • One typically chooses $M$ such that, for a small number, e.g. $\epsilon=0.01$, $$ \frac{\sum_{i=1}^M\lambda_i}{\sum_{i=1}^D\lambda_i} = 1-\frac{J}{\sum_{i=1}^D\lambda_i} \geq 1-\epsilon $$ (a small helper implementing this criterion follows).
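A small helper implementing this criterion (the function name is just for illustration):

In [ ]:
import numpy as np

def choose_M(eigvals, eps=0.01):
    # smallest M whose leading eigenvalues capture a (1 - eps) fraction of the total variance
    lam = np.sort(eigvals)[::-1]
    ratio = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(ratio, 1.0 - eps) + 1)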

Now consider data matrix $X\in\mathbb{R}^{N\times D}$, whose rows are the data vectors, mean subtracted.

  • Suppose now $D\gg N$, so it is inefficient to do an EVP for $S=\frac{1}{N}X^TX$, which has a cost of $O(D^3)$.
  • Instead, one could do the following EVP at reduced cost $O(N^3)$, $$ \frac{1}{N}XX^T (X u_i)=\lambda_i (X u_i)\quad\Rightarrow\quad \tilde{S} v_i=\lambda_i v_i $$
  • The eigenvalues are the same as those associated with $u_i$.
  • And to recover $u_i$ from $v_i$, $$ u_i=\frac{1}{\sqrt{N\lambda_i}}X^T v_i $$ (see the sketch below).
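A sketch of this $D\gg N$ trick on random data, checking that the recovered $u_i$ indeed satisfies the original EVP:

In [ ]:
import numpy as np

rng = np.random.default_rng(3)
N, D = 50, 2000                        # D >> N
X = rng.normal(size=(N, D))
X -= X.mean(axis=0)                    # rows are mean-subtracted data vectors

lam, V = np.linalg.eigh(X @ X.T / N)   # N x N EVP instead of D x D
lam, V = lam[::-1], V[:, ::-1]

u = X.T @ V[:, 0] / np.sqrt(N * lam[0])            # recover the leading D-dim eigenvector
print(np.allclose(X.T @ (X @ u) / N, lam[0] * u))  # S u = lambda u, without forming S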

Finally, consider the singular value decomposition of the data matrix $$ X=U\Sigma V^T,\quad X^TX=V\Sigma^2 V^T,\quad XX^T=U\Sigma^2U^T $$

  • It is clear that $u_i$ and $v_i$ are the left and right singular vectors of $X$, and there is a one-to-one correspondence between the singular values and the eigenvalues.
  • Alternatively, one could do the SVD of $X$ to find the principal components of the dataset.
  • While full SVD is expensive, there is a randomized SVD algorithm that can find the leading components efficiently.
  • The SVD approach is also called the snapshot method (a brief sketch follows).
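A sketch of the SVD route; `randomized_svd` from scikit-learn is one readily available randomized solver (the random test matrix below is an illustrative assumption).

In [ ]:
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 500))
X -= X.mean(axis=0)

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # rows of Vt are the principal directions
lam = s**2 / len(X)                                # eigenvalues of S

Um, sm, Vtm = randomized_svd(X, n_components=10, random_state=0)   # leading components only
print(np.round(s[:3], 2), np.round(sm[:3], 2))     # leading singular values should agree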

Side note¶

It is also possible to hard-threshold the SVD truncation. For an $m\times n$ matrix (with $m\leq n$), let $\sigma_{med}$ be the median of the singular values; one can truncate all the singular values below $\omega(\beta)\,\sigma_{med}$, where $$ \beta=m/n,\quad \omega(\beta) \approx 0.56\beta^3 - 0.95\beta^2 + 1.82\beta + 1.43 $$ But I would always recommend looking at the data and the spectrum of singular values before truncating.

Ref: M. Gavish and D. Donoho, The Optimal Hard Threshold for Singular Values is $4/\sqrt{3}$, arXiv 1305.5870.
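A sketch of this rule (the helper name is illustrative); as noted above, inspect the spectrum before trusting any automatic cutoff.

In [ ]:
import numpy as np

def hard_threshold_rank(X):
    m, n = sorted(X.shape)                                 # aspect ratio beta = m/n <= 1
    beta = m / n
    omega = 0.56*beta**3 - 0.95*beta**2 + 1.82*beta + 1.43
    s = np.linalg.svd(X, compute_uv=False)
    return int((s > omega * np.median(s)).sum())           # keep SVs above omega(beta) * median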

PCA with Regularization¶

Now that we are talking about SVD, PCA can also be thought of as an optimization problem, $$ U,W = \mathrm{argmin}\ ||X-UW||_F^2 $$ where the data are now arranged column-wise, $X\in\R^{D\times N}$, with $U\in\R^{D\times M}$ and $W\in\R^{M\times N}$; the Frobenius norm (F-norm) is the square root of the sum of squares of all elements.

And it turns out the SVD of $X$ gives the solution to the above optimization problem, as before.

But in terms of optimization we could do more: for example, sparse PCA by applying some regularization terms (recall from the regression methods).

A quadratically regularized form is $$ U,W = \mathrm{argmin}\ ||X-UW||_F^2 + \gamma ||U||_F^2 + \gamma ||W||_F^2 $$ which tries to keep the magnitudes of the weights small.

  • Or replace the Frobenius-norm penalty on $W$ with a 1-norm penalty; then we are asking to describe the original data using as few modes as possible (a sparse-coding sketch follows this list).
  • More details in Ref: M. Udell, Generalized Low Rank Models
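As a sketch, scikit-learn's `DictionaryLearning` solves a factorization of this flavor, with a sparsity penalty on the codes so that each snapshot is described by few modes (the random data below is an illustrative stand-in).

In [ ]:
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 50))            # N snapshots x D features
X -= X.mean(axis=0)

dl = DictionaryLearning(n_components=10, alpha=1.0, random_state=0)   # alpha: sparsity weight
codes = dl.fit_transform(X)               # N x M, sparse coefficients
modes = dl.components_                    # M x D modes
print((codes != 0).sum(axis=1).mean())    # average number of active modes per snapshot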

Sparse PCA may capture "multi-scale" behaviors in data

Ref: R. Deshmukh et al., Basis Identification for Reduced Order Modeling of Unsteady Flows Using Sparse Coding

For another example, consider PCA of a dataset where the data in some dimensions are missing. The technique called Gappy POD can be used, by constructing another optimization problem $$ U,W = \mathrm{argmin}\ ||P\circ(X-UW)||_F^2 $$ where $P$ is a "mask" of 0s and 1s (applied elementwise) specifying where the data is known.

In this 2D airfoil example, the method reconstructs the mode with 30% of the data missing.

Ref: K. Lee, Unifying Perspective for Gappy Proper Orthogonal Decomposition and Probabilistic Principal Component Analysis
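One simple (assumed) way to attack the masked problem is a fixed-point iteration: fill the unknown entries with the current rank-$M$ guess, re-truncate the SVD, and repeat. A sketch:

In [ ]:
import numpy as np

def gappy_pod(X, mask, M, n_iter=50):
    # mask: boolean/0-1 array, True where the data is known; X assumed centered
    Y = np.where(mask, X, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        Y_low = (U[:, :M] * s[:M]) @ Vt[:M]      # rank-M approximation
        Y = np.where(mask, X, Y_low)             # keep known entries, update the gaps
    return Y_low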

Karhunen-Loeve Expansion¶

One can also think of a high-dimensional data point as a realization of a stochastic process; let's just look at a zero-mean Gaussian process, $y\sim GP(0,k(s,t))$.

Guaranteed by the so-called Mercer's theorem, the kernel can be expanded as $$ k(s,t) = \sum_{i=1}^\infty \lambda_i u_i(s)u_i(t) $$ where one needs to solve an integral eigenvalue problem (IEP) $$ \int k(s,t)u_i(s) ds = \lambda_i u_i(t) $$

Then the GP can be approximated as $$ y \approx \sum_{i=1}^M c_i \sqrt{\lambda_i} u_i(t) $$ where $c_i$ is a unit Gaussian variable.

In other words, the infinite-dimensional GP can be approximated using a finite number $M$ of properly chosen random variables.

In the practical sense, for example,

  • Recall that if one obtains $N$ realizations of a zero-mean GP on $[0,1]$ with a 1000-point grid, one gets $N$ 1000-dimensional vectors $\{y_1,y_2,\cdots,y_N\}$.
  • One can estimate the covariance matrix from the data $S=\frac{1}{N}\sum_{i=1}^N y_i y_i^T$
  • $S$ is essentially the discrete version of $k(s,t)$
    • And the eigenvalue decomposition of $S$ corresponds to the IEP
    • But as we have seen $S$ also produces the principal components
  • In this context the following three are equivalent
    • Eigenvectors of $S$
    • Principal components of data
    • Eigenfunctions of the stochastic process underlying the data

KLE of a GP with kernel $k(s,t)=\exp(-|s-t|)$ on domain $[0,1]$. The eigenfunctions are actually sinusoidal functions (left) and the GP is reconstructed well using >30 functions (right) - regardless of how many grid points are used in $[0,1]$

Ref: A. Alexanderian, A brief note on the Karhunen-Loeve expansion, arXiv 1509.07526.

Much more details: L. Wang, Karhunen-Loeve Expansions and their Applications, Ph.D. thesis, LSE.
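A minimal discrete sketch of this example: build the covariance matrix from $k(s,t)=\exp(-|s-t|)$ on a grid, eigendecompose, and draw a truncated sample (the grid size and $M$ are taken from the example above).

In [ ]:
import numpy as np

t = np.linspace(0, 1, 1000)
K = np.exp(-np.abs(t[:, None] - t[None, :]))   # discretized kernel = covariance matrix
lam, U = np.linalg.eigh(K)
lam, U = lam[::-1], U[:, ::-1]                 # descending eigenvalues

M = 30
rng = np.random.default_rng(6)
c = rng.normal(size=M)                         # independent unit Gaussians
y = U[:, :M] @ (c * np.sqrt(lam[:M]))          # approximate realization of the GP on the grid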

Kernel PCA¶

  • Suppose the regularity that allows dimensionality reduction is non-linear. Left: nonlinear space; Right: "straightened" linear space.
  • As with regression and classification, we can transform the raw input data $x_n$ to a set of feature values
$$ x_n \rightarrow \phi(x_n) $$
  • Linear PCA (on the nonlinear feature space) gives us a linear subspace in the feature value space, corresponding to nonlinear structure in the data space.
  • Define a kernel, to avoid having to evaluate the feature vectors explicitly. $$ \kappa (x, x') = \phi(x)^T \phi(x') $$

  • And recall that the kernel defines another type of distance (now think in terms of variance maximization) $$ ||\phi(x) - \phi(z) ||^2 = \kappa(x,x) - 2\kappa(x,z) + \kappa(z,z) $$

  • Define the Gram matrix K of pairwise similarities among the data points: $$ K_{nm} = \phi (x_n)^T \phi(x_m) = \kappa (x_n, x_m) $$
  • Express PCA in terms of the kernel,
    • Some care is required to center the data in feature space. $$ K a=\lambda N a $$ This is an eigenproblem for the vector $a$ of expansion coefficients, as if working with $S$! (A sketch follows.)
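A sketch using scikit-learn's `KernelPCA` (which handles the centering) on a toy nonlinear dataset:

In [ ]:
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, _ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)   # nonlinear structure

kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10.0)
Z = kpca.fit_transform(X)        # coordinates along the leading kernel principal components
print(Z.shape)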

Extending to Manifold Learning¶

There are many more dimension reduction algorithms that are nonlinear, classified under the name "manifold learning". Due to time constraints, these algorithms cannot be covered in class. However, the manifold module from the sklearn package is a good source for an introduction to (at least) the following algorithms (usage is sketched after the list):

  • Isometric Mapping (Isomap)
  • Locally Linear Embedding (LLE)
  • Multidimensional scaling (MDS)
  • t-distributed Stochastic Neighbor Embedding (t-SNE)
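Usage is nearly identical across these estimators; a sketch on the classic swiss-roll dataset:

In [ ]:
from sklearn import datasets, manifold

X, _ = datasets.make_swiss_roll(n_samples=1000, random_state=0)   # 3D data on a 2D manifold

methods = {
    "Isomap": manifold.Isomap(n_neighbors=10, n_components=2),
    "LLE":    manifold.LocallyLinearEmbedding(n_neighbors=10, n_components=2),
    "MDS":    manifold.MDS(n_components=2, random_state=0),
    "t-SNE":  manifold.TSNE(n_components=2, random_state=0),
}
Z = {name: est.fit_transform(X) for name, est in methods.items()}   # 2D embeddings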

There are also methods from a somewhat different world, such as Topological Data Analysis, for extracting nonlinear geometry from data.

Independent Components Analysis (Optional)¶

Uses material from [MLAPP] and [PRML]

Independent Component Analysis¶

Suppose N independent signals are mixed, and sensed by N independent sensors.

  • Cocktail party with speakers and microphones.
  • EEG with brain wave sources and sensors.

Can we reconstruct the original signals, given the mixed data from the sensors?

Independent Component Analysis¶

The sources must be independent.

  • And they must be non-Gaussian.
  • If Gaussian, then there is no way to find unique independent components.

Linear mixing to get the sensor signals x.

$x = As$ or $s = Wx$ (i.e., $W = A^{-1}$ )

$A$ are the bases, $W$ are the filters

ICA: Algorithm¶

  • There are several formulations of ICA:
    • Maximum likelihood
    • Maximizing non-Gaussianity (popular)
  • Common steps of ICA (e.g., FastICA):
    • Apply PCA whitening (aka sphering) to the data
    • Find orthogonal unit vectors along which the non-Gaussianity is maximized $$ \underset{W}{\max}\hspace{1em} f(W \tilde x) \hspace{2em} s.t.\hspace{1em} WW^T = I$$ where $f(\cdot)$ can be kurtosis, the L1 norm, etc. (A FastICA sketch follows this list.)
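A sketch using scikit-learn's `FastICA` on two synthetic non-Gaussian sources (the signals and mixing matrix are illustrative assumptions):

In [ ]:
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]   # two independent, non-Gaussian sources
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                         # mixing matrix
X = S @ A.T                                        # observed signals, x = A s

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)       # recovered sources (up to permutation and scaling)
W = ica.components_                # unmixing filters, s ~ W x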

ICA: Step 1, Preprocessing¶

We use PCA Whitening to preprocess the data:

  • Apply PCA: $ \Sigma = U \Lambda U^T $
  • Project (rotate) to the principal components
  • “Scale” each axis so that the transformed data has identity as covariance.

ICA: Step 2, Maximize Non-Gaussianity¶

Rotate to maximize non-Gaussianity

$$ \begin{align} x_{PCA} &= \Lambda^{-\frac{1}{2}} U^Tx \quad\text{(projected and scaled, i.e., whitened)} \\ x_{ICA} &= V \Lambda^{-\frac{1}{2}} U^Tx \end{align} $$

ICA vs PCA¶

In [5]:
plot_ica_pca()

ICA: Mixture Example¶

Input Signals and Density

ICA: Mixture Example¶

To whiten the input data:

  • We want a linear transformation $ y = V x $

  • So the components are uncorrelated: $\mathbb{E}[yy^T] = I $

  • Given the original covariance $ C = \mathbb{E}[xx^T]$

  • We can use $V = C^{-\frac {1}{2}} $

  • Because $ \mathbb{E}[yy^T] = \mathbb{E}[Vxx^TV^T] = C^{-\frac {1}{2}} C C^{-\frac {1}{2}} = I $ (a short numerical check follows)
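A short numerical check of this whitening recipe (the random mixing is an assumption for illustration):

In [ ]:
import numpy as np

rng = np.random.default_rng(8)
x = rng.normal(size=(5000, 3)) @ rng.normal(size=(3, 3)).T   # correlated data, rows are samples

C = np.cov(x.T)                             # original covariance
lam, U = np.linalg.eigh(C)
V = U @ np.diag(lam**-0.5) @ U.T            # C^{-1/2}

y = x @ V.T                                 # y = V x
print(np.round(np.cov(y.T), 2))             # approximately the identity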

ICA: Mixture Example¶

ICA: Summary¶

  • Learning can be done by PCA whitening followed by kurtosis maximization.
  • ICA is widely used for blind-source separation.
  • The ICA components can be used as features.
  • Limitation: difficult to learn overcomplete bases due to the orthogonality constraint