Uses material from [MLAPP] and [PRML]
High-dimensional data may have low-dimensional structure.
(Figure: plot_plane())
Data may even be embedded in a low-dimensional nonlinear manifold.
Given a set $X = \{x_n\}$ of observations, we seek a lower-dimensional representation of the data.
There are perhaps 10+ ways to understand PCA, and nearly as many names (e.g. proper orthogonal decomposition, Karhunen-Loeve expansion);
we will start with the classical deterministic approach.
With the mean at the origin, each point satisfies $c_i^2 = a_i^2 + b_i^2$, where $c_i$ is the distance of $x_i$ from the origin, $a_i$ the length of its projection onto the principal direction, and $b_i$ the residual; since $c_i^2$ is constant, maximizing the projected variance is equivalent to minimizing the reconstruction error.
Given data points $\{x_n\}$ in $D$-dim space, define
Data covariance ($D \times D$ matrix):
$$ S = \frac{1}{N} \sum_{n=1}^{N}(x_n - \bar x)(x_n - \bar x)^T$$
Let $u_1$ be the principal component we want.
Goal: Maximize the projection variance over directions $u_1$: $$ \frac{1}{N} \sum_{n=1}^{N}\{u_1^Tx_n - u_1^T \bar x \}^2 = u_1^TSu_1$$
Use a Lagrange multiplier to enforce $u_1^T u_1 = 1$, i.e. maximize $u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)$
The derivative with respect to $u_1$ is zero when $Su_1 = \lambda_1 u_1$
So $u_1$ is the eigenvector of $S$ with the largest eigenvalue, and the projected variance is $u_1^T S u_1 = \lambda_1$.
Repeating the procedure in the subspace orthogonal to the components already found yields the remaining, "less principal" components.
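As a minimal numerical sketch (not part of the original notes; names like `pca_eig` are mine), the principal components can be obtained directly from the eigendecomposition of the empirical covariance:

```python
import numpy as np

def pca_eig(X, M):
    """Top-M principal components of data X (N x D) via the covariance EVP."""
    Xc = X - X.mean(axis=0)              # subtract the mean
    S = Xc.T @ Xc / len(Xc)              # D x D empirical covariance
    lam, U = np.linalg.eigh(S)           # eigenvalues in ascending order
    order = np.argsort(lam)[::-1][:M]    # keep the M largest
    return lam[order], U[:, order]

# toy data: points near a 2-D plane embedded in 5-D, plus small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(500, 5))
lam, U = pca_eig(X, 2)
print(lam)   # two dominant eigenvalues; the rest are at the noise level
```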
We can view PCA as solving a probabilistic latent variable problem, $$ \begin{align} x &= Wz + \mu + \epsilon \\ z &\sim \mathcal{N}(0, I_M) \end{align} $$ where $z$ is an $M$-dimensional latent variable, $W$ is a $D \times M$ matrix mapping the latent space to the data space, $\mu$ is the data mean, and $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$ is isotropic noise.
Given the generative model $$ x = Wz + \mu + \epsilon $$
we can infer
$$ \mathbb{E}[x] = \mathbb{E}[Wz + \mu + \epsilon] = \mu $$$$ \begin{align} \text{Cov}[x] &= \mathbb{E}[(Wz + \epsilon)(Wz + \epsilon)^T] \\ &= \mathbb{E}[Wzz^TW^T] + \mathbb{E}[\epsilon \epsilon^T] \\ &= WW^T + \sigma^2 I \end{align} $$ (the cross terms vanish because $z$ and $\epsilon$ are independent with zero mean), so the marginal likelihood is $p(x) = \mathcal{N}(x \mid \mu, WW^T + \sigma^2 I)$. We can simply maximize this likelihood with respect to $\mu$, $W$, and $\sigma$.
The truncated dimensions manifest as isotropic "noise": the maximum-likelihood $\sigma^2$ is the average of the discarded eigenvalues.
It is also possible to formulate PCA using (1) Expectation-Maximization, which alternately optimizes $W$ and the latent variables $z$ in the probabilistic formulation, and (2) a Bayesian approach. See [PRML] Chap. 12 for more details.
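As an illustrative aside (not from the notes): scikit-learn's `PCA` exposes these probabilistic-PCA quantities, e.g. `noise_variance_` is the maximum-likelihood $\sigma^2$ and `score` evaluates the marginal log-likelihood above.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
Z = rng.normal(size=(1000, 3))                    # latent variables, M = 3
W = rng.normal(size=(3, 10))                      # mapping to D = 10
X = Z @ W + 0.1 * rng.normal(size=(1000, 10))     # x = Wz + noise (mu = 0)

pca = PCA(n_components=3).fit(X)
print(pca.noise_variance_)   # ML estimate of sigma^2, close to 0.1**2
print(pca.score(X))          # average log-likelihood under the PPCA model
```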
The search for the principal components of the data thus reduces to solving an eigenvalue problem (EVP) for the empirical covariance matrix $$ S u_i=\lambda_i u_i $$
Now consider data matrix $X\in\mathbb{R}^{N\times D}$, whose rows are the data vectors, mean subtracted.
Finally, consider the singular value decomposition of the data matrix $$ X=U\Sigma V^T,\quad X^TX=V\Sigma^2 V^T,\quad XX^T=U\Sigma^2U^T $$ Since $X^TX = NS$ for the centered data, the right singular vectors $V$ are exactly the principal directions, and the singular values satisfy $\sigma_i^2 = N\lambda_i$.
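A quick numerical check (a sketch, not from the notes) that the SVD route agrees with the covariance EVP:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6))
Xc = X - X.mean(axis=0)                               # mean-subtracted rows

lam, V_eig = np.linalg.eigh(Xc.T @ Xc / len(Xc))      # covariance eigenvalues
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)     # X = U Sigma V^T

# singular values vs covariance eigenvalues: sigma_i^2 / N = lambda_i
print(np.allclose(np.sort(s**2 / len(Xc)), np.sort(lam)))   # True
# rows of Vt (columns of V) are the principal directions, up to sign
```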
It is also possible to hard-threshold the SVD truncation. For an $m\times n$ matrix (with $m \le n$), let $\sigma_{\mathrm{med}}$ be the median of the singular values; one can truncate all singular values below $\omega(\beta)\,\sigma_{\mathrm{med}}$, where $$ \beta=m/n,\quad \omega(\beta) \approx 0.56\beta^3 - 0.95\beta^2 + 1.82\beta + 1.43 $$ But I would always recommend looking at the data and the spectrum of singular values before truncating.
Ref: M. Gavish and D. Donoho, The Optimal Hard Threshold for Singular Values is $4/\sqrt{3}$, arXiv 1305.5870.
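A minimal sketch of the hard-thresholding rule quoted above, using the approximate $\omega(\beta)$ polynomial (the function name and return values are my choices):

```python
import numpy as np

def svd_hard_threshold(X):
    """Low-rank reconstruction of X, keeping singular values above omega(beta) * median."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    m, n = sorted(X.shape)                     # aspect ratio uses m <= n
    beta = m / n
    omega = 0.56 * beta**3 - 0.95 * beta**2 + 1.82 * beta + 1.43
    tau = omega * np.median(s)                 # threshold on the singular values
    r = int(np.sum(s > tau))                   # retained rank
    return U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :], r
```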
Now that we are talking about SVD, PCA can also be posed as an optimization problem, $$ U,W = \operatorname{argmin}\ \|X-UW\|_F^2 $$ such that $U\in\mathbb{R}^{N\times M}$ and $W\in\mathbb{R}^{M\times D}$; the Frobenius norm is the square root of the sum of squares of all elements.
It turns out the truncated SVD of $X$ gives the solution to this optimization problem (the Eckart-Young theorem), as before.
But in terms of optimization we can do more: for example, sparse PCA, obtained by adding regularization terms (recall the regularized regression methods).
A quadratically regularized form is $$ U,W = \operatorname{argmin}\ \|X-UW\|_F^2 + \gamma \|U\|_F^2 + \gamma \|W\|_F^2 $$ which additionally penalizes the magnitudes of the factors; replacing the Frobenius penalties with $\ell_1$ penalties drives many entries to zero and gives sparse PCA.
Sparse PCA may capture "multi-scale" behaviors in data
Ref: R. Deshmukh et al., Basis Identification for Reduced Order Modeling of Unsteady Flows Using Sparse Coding.
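As a hedged illustration of the sparse variant, scikit-learn's `SparsePCA` uses an $\ell_1$ penalty on the components (not exactly the quadratic form above); the data here is synthetic:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 20))

# l1-penalized PCA: many loadings are driven exactly to zero
spca = SparsePCA(n_components=3, alpha=1.0, random_state=0).fit(X)
print(np.mean(spca.components_ == 0.0))   # fraction of exactly-zero loadings
codes = spca.transform(X)                 # N x 3 coefficients for each sample
```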
For another example, consider PCA of a dataset in which some entries are missing. The technique called Gappy POD handles this by solving another optimization problem, $$ U,W = \operatorname{argmin}\ \|P\odot(X-UW)\|_F^2 $$ where $P$ is a 0/1 "mask" matrix, applied element-wise, specifying where the data is known.
In this 2D airfoil example, the method reconstructs the mode with 30% of the data missing.
Ref: K. Lee, Unifying Perspective for Gappy Proper Orthogonal Decomposition and Probabilistic Principal Component Analysis
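A minimal sketch of the gappy reconstruction step, assuming the POD modes $U$ have already been computed from complete snapshots (function and variable names are mine):

```python
import numpy as np

def gappy_reconstruct(x_gappy, mask, U):
    """Fill in a snapshot from partial observations using given POD modes.

    x_gappy : length-D vector, values arbitrary where mask is False
    mask    : boolean length-D vector, True where the data is known
    U       : D x M matrix of POD modes (e.g. from SVD of complete snapshots)
    """
    # least-squares fit of the mode coefficients on the observed entries only
    c, *_ = np.linalg.lstsq(U[mask], x_gappy[mask], rcond=None)
    return U @ c          # full-field reconstruction
```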
One can also think of a high-dimensional data point as a realization of a stochastic process; let us consider a zero-mean Gaussian process, $y\sim GP(0,k(s,t))$.
By Mercer's theorem, the kernel can be expanded as $$ k(s,t) = \sum_{i=1}^\infty \lambda_i u_i(s)u_i(t) $$ where one needs to solve an integral eigenvalue problem (IEP) for the eigenpairs $(\lambda_i, u_i)$: $$ \int k(s,t)u_i(s)\, ds = \lambda_i u_i(t) $$
Then the GP can be approximated by the truncated Karhunen-Loeve expansion (KLE) $$ y \approx \sum_{i=1}^M c_i \sqrt{\lambda_i} u_i(t) $$ where the $c_i$ are independent standard Gaussian variables.
In other words, the infinite-dimensional GP can be approximated using a finite number $M$ of properly chosen random variables.
As a practical example, consider the KLE of a GP with kernel $k(s,t)=\exp(-|s-t|)$ on the domain $[0,1]$: the eigenfunctions are sinusoidal (left), and the GP is reconstructed well using more than 30 eigenfunctions (right), regardless of how many grid points are used in $[0,1]$.
Ref: A. Alexanderian, A brief note on the Karhunen-Loeve expansion, arXiv 1509.07526.
For many more details: L. Wang, Karhunen-Loeve Expansions and their Applications, Ph.D. thesis, LSE.
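A minimal numerical sketch of this example (simple rectangle-rule discretization of the IEP; the grid size and truncation level are my choices):

```python
import numpy as np

# discretize k(s,t) = exp(-|s-t|) on a uniform grid over [0, 1]
n = 200
t = np.linspace(0.0, 1.0, n)
w = 1.0 / n                                    # quadrature weight
K = np.exp(-np.abs(t[:, None] - t[None, :]))

# discrete analogue of the IEP:  sum_j k(t_i, t_j) u(t_j) w = lambda u(t_i)
lam, U = np.linalg.eigh(K * w)
lam, U = lam[::-1], U[:, ::-1]                 # sort in descending order

# truncated KLE sample: y(t) ~ sum_{i<=M} c_i sqrt(lambda_i) u_i(t), c_i ~ N(0, 1)
M = 30
c = np.random.default_rng(4).normal(size=M)
y = (U[:, :M] / np.sqrt(w)) @ (c * np.sqrt(lam[:M]))   # eigenvectors rescaled to unit L2 norm
```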
The kernel trick applies to PCA as well: define a kernel so that we never have to evaluate the feature vectors $\phi(x)$ explicitly, $$ \kappa (x, x') = \phi(x)^T \phi(x') $$
And recall that the kernel defines another type of distance (now think in terms of variance maximization) $$ ||\phi(x) - \phi(z) ||^2 = \kappa(x,x) - 2\kappa(x,z) + \kappa(z,z) $$
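A hedged sketch using scikit-learn's `KernelPCA` (the two-circles dataset and the RBF bandwidth are my choices): linear PCA only rotates the data, while the kernelized version separates the two rings along the leading component.

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# two concentric circles: no linear direction separates them
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_lin = PCA(n_components=2).fit_transform(X)                      # just a rotation
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10.0).fit_transform(X)
```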
There are many more dimension-reduction algorithms, which are nonlinear and classified under the name "manifold learning". Due to the limited time, these algorithms cannot be covered in class. However, the manifold module from the sklearn package is a good source for an introduction to (at least) the following algorithms:
There are also methods from a somewhat different world, such as Topological Data Analysis, for extracting nonlinear geometry from data.
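As one quick illustrative sketch from the sklearn manifold module mentioned above (Isomap on the digits dataset; parameter choices are mine):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, y = load_digits(return_X_y=True)       # 64-dimensional digit images

# nonlinear embedding of the 64-D digits into 2-D
emb = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
print(emb.shape)                          # (1797, 2)
```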
Uses material from [MLAPP] and [PRML]
Suppose N independent signals are mixed, and sensed by N independent sensors.
Can we reconstruct the original signals, given the mixed data from the sensors?
The sources must be statistically independent (and non-Gaussian).
The sensor signals $x$ are obtained by linear mixing of the sources $s$:
$x = As$, or $s = Wx$ (i.e., $W = A^{-1}$)
The columns of $A$ are the bases (mixing vectors); the rows of $W$ are the filters (unmixing vectors)
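A hedged toy sketch of this mixing/unmixing model using scikit-learn's `FastICA` (the sources and mixing matrix are made up):

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s = np.c_[np.sign(np.sin(3 * t)),        # source 1: square wave
          np.sin(2 * t)]                 # source 2: sinusoid (both non-Gaussian)

A = np.array([[1.0, 0.5],
              [0.4, 1.0]])               # mixing matrix ("bases")
x = s @ A.T                              # sensor signals, x = A s

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x)             # recovered sources, up to order/scale/sign
W_hat = ica.components_                  # unmixing "filters"
```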
We use PCA Whitening to preprocess the data:
Rotate to maximize non-Gaussianity
$$ \begin{align} x_{PCA} &= \Lambda^{-\frac{1}{2}} U^Tx \\ x_{ICA} &= V \Lambda^{-\frac{1}{2}} U^Tx \end{align} $$ where $C = U \Lambda U^T$ is the eigendecomposition of the data covariance and $V$ is the orthogonal rotation chosen to maximize non-Gaussianity.
(Figure: plot_ica_pca())
(Figure: input signals and their density.)
To whiten the input data:
We want a linear transformation $ y = V x $
So the components are uncorrelated: $\mathbb{E}[yy^T] = I $
Given the original covariance $ C = \mathbb{E}[xx^T]$
We can use $V = C^{-\frac {1}{2}} $
Because $ \mathbb{E}[yy^T] = \mathbb{E}[Vxx^TV^T] = C^{-\frac {1}{2}} C C^{-\frac {1}{2}} = I $
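A minimal sketch of this whitening step (a symmetric $C^{-\frac{1}{2}}$ built from the eigendecomposition; names are mine):

```python
import numpy as np

def whiten(x):
    """Whiten x (N x D): zero mean and identity covariance via V = C^{-1/2}."""
    xc = x - x.mean(axis=0)
    C = xc.T @ xc / len(xc)                 # empirical covariance
    lam, U = np.linalg.eigh(C)              # C = U diag(lam) U^T
    V = U @ np.diag(lam ** -0.5) @ U.T      # symmetric inverse square root
    return xc @ V.T

rng = np.random.default_rng(5)
y = whiten(rng.normal(size=(1000, 3)) @ np.diag([3.0, 1.0, 0.2]))
print(np.round(y.T @ y / len(y), 2))        # approximately the identity matrix
```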