Rethinking Linear Regression¶
Problem: Vanilla linear model $y(x) = w^T x$ can only produce straight lines/hyperplanes through the origin.
- Not very flexible / powerful
- Cannot handle nonlinearities
Solution: Regression with nonlinear features
- e.g., replace $x$ by $\phi(x) = [1, x, x^2, \cdots, x^p]^T$
- The feature mapping $\phi$ is nonlinear
- But in feature space the problem becomes linear!
Feature Mapping (or Lifting, Embedding, etc.)¶
Our models so far employ a nonlinear feature mapping $\phi : \mathcal{X} \mapsto \mathbb{R}^M$
- in general, features are nonlinear functions of the data
- each feature $\phi_j(x)$ extracts important properties from $x$
Ultimately, this allows us to use linear methods for nonlinear tasks.
Example¶
Problem: Find a perspective (i.e., a choice of coordinates) in which the following scattered data has linear structure.
- Left: Already linear in the current coordinates
- Right: Nonlinear - how to transform the coordinates (or, find the right features)
plot_circular_data(100)
Solution 1¶
Add the squared distance to the origin, $\phi_3(x) = x_1^2 + x_2^2$, as a third feature.
plot_circular_3d(100)
Solution 2¶
Replace $(x_1, x_2)$ with $(\phi_1,\phi_2) = (x_1^2, x_2^2)$
plot_circular_squared(100)
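The plotting helpers above are not defined in these notes. As a minimal, self-contained sketch (assuming the data are two noisy concentric rings, as the plots suggest; the radii and noise level here are made up), the lifting itself looks like this:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100

# Two concentric rings (assumed radii 1 and 3) with a little noise
theta = rng.uniform(0, 2 * np.pi, n)
radius = np.where(rng.random(n) < 0.5, 1.0, 3.0)
x1 = radius * np.cos(theta) + 0.1 * rng.standard_normal(n)
x2 = radius * np.sin(theta) + 0.1 * rng.standard_normal(n)
labels = (radius > 2).astype(int)        # inner ring = 0, outer ring = 1

# Solution 1: append the squared distance to the origin as a third feature
phi3 = x1**2 + x2**2                     # rings map to two well-separated "heights"

# Solution 2: replace (x1, x2) by (x1^2, x2^2)
phi_sq = np.column_stack([x1**2, x2**2])

# In the lifted space a single linear threshold separates the two rings
print(np.all((phi3 > 4) == (labels == 1)))   # True
```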
What feature mapping does¶
Data has been mapped via $\phi$ to a new, higher-dimensional space
- Certain representations are better for certain problems
Alternatively, the data still lives in the original space, but the definition of distance or inner product has changed.
- Certain inner products are better for certain problems
More on the second bullet soon.
Effective, but...¶
How do I choose a feature mapping $\phi(x)$?
- Manually? Learn features?
With $N$ examples and $M$ features,
- design matrix is $\Phi \in \mathbb{R}^{N \times M}$
- linear regression requires inverting $\Phi^T \Phi \in \mathbb{R}^{M \times M}$
- computational complexity of $O(M^3)$ - too high if there are too many features
- With many features, we also need much more data so that $N\gg M$ - not always possible!
In other words, higher-dimensional features come at a price:
- We cannot manage high/infinite-dimensional $\phi(x)$.
- Computational complexity blows up.
Kernel methods to the rescue!
Kernel Functions¶
Many algorithms depend on the data only through pairwise inner products between data points, $$ \langle x_1, x_2 \rangle = x_1^T x_2 $$
Inner products can be replaced by Kernel Functions, capturing more general notions of similarity.
- No longer need coordinates!
Definition¶
A kernel function $\kappa(x,x')$ is intended to measure the similarity between $x$ and $x'$.
So far, we have used the standard inner product in the transformed space, $$ \kappa(x, x') = \phi(x)^T \phi(x') $$
In general, $\kappa(x, x')$ is any symmetric positive-semidefinite function. This means that for every set $x_1, ..., x_n$, the matrix $[\kappa(x_i,x_j)]_{i,j\in[n]}$ is a PSD matrix.
Kernels as implicit feature map¶
For every valid kernel function $\kappa(x, x')$,
- there is an implicit feature mapping $\phi(x)$
- corresponding to an inner product $\phi(x)^T \phi(x')$ in some high-dimensional feature space
- And sometimes infinite dimensional!
This incredible result follows from Mercer's Theorem,
- Generalizes the fact that every positive-definite matrix corresponds to an inner product
- For more info, see Reproducing Kernel Hilbert Space (RKHS), and later in lecture.
Mapping: For $x \in \mathbb{R}^2$, the kernel $\kappa(x,x') = (x^T x')^2$ is equivalent to the standard inner product $\phi(x)^T\phi(x')$ under either mapping $$ \begin{gather*} \phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2) \\ \phi(x) = (x_1^2, x_1 x_2, x_1 x_2, x_2^2) \end{gather*} $$
The implicit mapping is not unique, but a standard choice is the one given by Mercer's theorem.
Monomials¶
Kernel: Homogeneous polynomial kernel of degree $p$, $$ \kappa(x,z) = (x^T z)^p = \left( \sum_{k=1}^M x_k z_k \right)^p $$
Mapping: Implicit feature vector $\phi(x)$ contains all monomials of degree $p$
Polynomials¶
Kernel: Inhomogeneous polynomial up to degree $p$, for $c > 0$, $$ \kappa(x,z) = (x^T z + c)^p = \left(c + \sum_{k=1}^M x_k z_k \right)^p $$
Mapping: Implicit feature vector $\phi(x)$ contains all monomials of degree $\leq p$
Applying to CFD solutions¶
Take the pixel values and compute $\kappa(x,z) = (x^Tz + 1)^p $
- here, $x$ holds the solution values at $449 \times 129 \approx 60\text{k}$ grid nodes (from NASA Langley)
You can compute the inner product in the space of all monomials up to degree $p$
- For $\dim(x)\approx 60\text{k}$ and $p=3$, that is a roughly $32$T-dimensional space!
Method 2¶
Explicitly define a kernel $\kappa(x,x')$ and identify the implicit feature map, e.g. $$ \kappa(x,z) = (x^Tz)^2 \quad \implies \quad \phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2) $$
Kernels help us avoid the complexity of explicit feature mappings.
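A quick numerical check of this correspondence (a small sketch, not part of the lecture code):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-D inputs."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

lhs = (x @ z) ** 2           # kernel evaluated directly in input space
rhs = phi(x) @ phi(z)        # inner product in the explicit feature space
print(np.isclose(lhs, rhs))  # True
```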
Method 3¶
Define a similarity function $\kappa(x, x')$ and use Mercer's Condition to verify that an implicit feature map exists without finding it.
- Define the Gram matrix $K$ to have elements $K_{nm} = \kappa(x_n,x_m)$
- $K$ must be positive semidefinite for all possible choices of the data set $\{x_n\}$ (see the numerical check below): $$ a^TKa \equiv \sum_{i}\sum_{j} a_i K_{ij} a_j \geq 0 \quad \forall\, a \in \mathbb{R}^n $$
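A numerical version of this check can be sketched as follows (the Gaussian kernel and random data are assumptions, used only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))   # 50 random 3-D points

def gaussian_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

# Gram matrix K_{nm} = kappa(x_n, x_m)
K = np.array([[gaussian_kernel(xn, xm) for xm in X] for xn in X])

# Symmetric, and all eigenvalues non-negative (up to round-off)
eigvals = np.linalg.eigvalsh(K)
print(np.allclose(K, K.T), eigvals.min() > -1e-8)   # True True
```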
Building Blocks¶
There are a number of rules that help us construct new, more complex kernels from simpler known kernels.
- For kernels $\kappa_1$ and $\kappa_2$ and a positive scalar $s$, the following are kernels too:
$$ \kappa_1+\kappa_2,\quad s\kappa_1,\quad \kappa_1\kappa_2 $$
You will prove some specific cases in homework
- For example, the following are kernels: $$ \begin{align*} \kappa(x, x') &= f(x) \kappa_1(x,x') f(x') \\ \kappa(x, x') &= \exp( \kappa_1(x, x') ) \end{align*} $$
Popular Kernel Functions¶
Simple Polynomial Kernel (terms of degree $2$) $$\kappa(x,z) = (x^Tz)^2 $$
Generalized Polynomial Kernel (degree $p$) $$\kappa(x,z) = (x^Tz + c)^p, \quad c > 0 $$
Gaussian Kernels $$ \kappa(x,z) = \exp \left\{-\frac{\|x-z\|^2}{2\sigma^2} \right\} $$
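For reference, vectorized NumPy versions of these kernels might look like the following (a sketch; the default parameter values are arbitrary):

```python
import numpy as np

def poly2_kernel(X, Z):
    """Simple polynomial kernel (x^T z)^2, evaluated pairwise."""
    return (X @ Z.T) ** 2

def poly_kernel(X, Z, c=1.0, p=3):
    """Inhomogeneous polynomial kernel (x^T z + c)^p with c > 0."""
    return (X @ Z.T + c) ** p

def gaussian_kernel(X, Z, sigma=1.0):
    """Gaussian (RBF) kernel exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                - 2 * X @ Z.T
                + np.sum(Z**2, axis=1)[None, :])
    return np.exp(-sq_dists / (2 * sigma**2))
```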
Particularly for Gaussian Kernel¶
Not related to Gaussian PDF!
Translation invariant (depends only on distance between points)
Corresponds to an infinite-dimensional feature space!
The Kernel Trick¶
What kernels essentially do
- Compute inner products in a high dimensional space (i.e., $\phi$).
- But with only the computational cost of a low dimensional space (i.e., $x$).
What kernel trick does
- Many algorithms can be expressed entirely in terms of kernels/inner products $\kappa(x,x')$, rather than other operations on $x$, e.g., $\phi(x)$.
- Huge reduction in cost ($x$ instead of $\phi$), but still with high expressiveness ($\phi$).
Example: Similarity¶
Conventionally, the similarity between two points $x$ and $z$ can be quantified by the cosine of the "angle" between the vectors $$ s(x,z) = \frac{x^T z}{\|x\|\|z\|} $$
In feature space, we can generalize as $$ s(x,z) = \frac{\phi(x)^T \phi(z)}{\|\phi(x)\|\|\phi(z)\|} $$
But this is simply $$ s(x,z) = \frac{\kappa(x,z)}{\sqrt{\kappa(x,x)\,\kappa(z,z)}} $$
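In code, this kernelized similarity never touches $\phi$ (a sketch using the Gaussian kernel; any valid kernel works):

```python
import numpy as np

def kappa(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma**2))

def kernel_cosine_similarity(x, z):
    """Cosine similarity between phi(x) and phi(z), without ever forming phi."""
    return kappa(x, z) / np.sqrt(kappa(x, x) * kappa(z, z))

x = np.array([1.0, 0.0])
z = np.array([0.5, 0.5])
print(kernel_cosine_similarity(x, z))   # in (0, 1] for the Gaussian kernel
```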
Example: Distance¶
Distance between samples can be expressed in inner products: $$ \begin{align*} \|\phi(x) - \phi(z) \|^2 &= \langle\phi(x)- \phi(z), \phi(x)- \phi(z) \rangle\\ &= \langle \phi(x), \phi(x)\rangle - 2\langle\phi(x), \phi(z)\rangle + \langle\phi(z), \phi(z) \rangle \\ &= \kappa(x,x) - 2\kappa(x,z) + \kappa(z,z) \end{align*} $$
Example: Distance to the Mean¶
Mean of the data points in feature space: $\phi_s = \frac{1}{N}\sum_{i=1}^{N}\phi(x_i)$
Distance to the mean: $$ \begin{align*} \|\phi(x) - \phi_s \|^2 &= \langle\phi(x)- \phi_s, \phi(x)- \phi_s \rangle\\ &= \langle \phi(x), \phi(x)\rangle - 2\langle\phi(x), \phi_s\rangle + \langle\phi_s, \phi_s \rangle \\ &= \kappa(x,x) - \frac{2}{N}\sum_{i=1}^{N}\kappa(x,x_i) + \frac{1}{N^2}\sum_{j=1}^{N}\sum_{i=1}^{N}\kappa(x_i,x_j) \end{align*} $$
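Both the pairwise distance and the distance to the mean reduce to kernel evaluations (a sketch; the Gaussian kernel and random data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))   # training points x_1, ..., x_N
x = np.array([0.5, -1.0])          # a new point

def kappa(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma**2))

# Squared feature-space distance between x and a single point z
z = X[0]
d2 = kappa(x, x) - 2 * kappa(x, z) + kappa(z, z)

# Squared distance between phi(x) and the feature-space mean phi_s
k_x = np.array([kappa(x, xi) for xi in X])
K = np.array([[kappa(xi, xj) for xj in X] for xi in X])
d2_mean = kappa(x, x) - 2 * k_x.mean() + K.mean()

print(d2, d2_mean)                 # both are non-negative
```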
Example: Ridge Regression¶
This is the main goal of this lecture; see next ...
Kernelizing the Ridge Regression¶
Recall ...¶
Linear Regression Model $$ y(x,w) = w^T \phi (x) = \sum_{j=0}^{M-1} w_j \phi_j(x) $$
Ridge Regression $$ J(w) = \frac{1}{2} \sum_{n=1}^{N}(w^T \phi (x_n) - t_n)^2 + \frac{\lambda}{2}w^Tw $$
Closed-form Solution $$ w = (\Phi^T \Phi + \lambda I)^{-1}\Phi^T t $$
Dualization¶
We want to minimize: $J(w) = \frac{1}{2} \sum_{n=1}^{N}(w^T \phi (x_n) - t_n)^2 + \frac{\lambda}{2}w^Tw$
Taking $\nabla J(w) = 0$, we obtain: $$ \begin{align*} w &= -\frac{1}{\lambda}\sum_{n=1}^{N}(w^T\phi(x_n) - t_n) \phi(x_n) \\ &\equiv \sum_{n=1}^{N}a_n\phi(x_n) = \Phi^Ta \end{align*} $$ Notice: $w$ can be written as a linear combination of $\phi(x_n)$!
Transform $J(w)$ to $J(a)$ by substituting $ w = \Phi^T a$
- Where $a_n = -\frac{1}{\lambda}\left(w^T\phi(x_n) - t_n\right)$
- Let $a = (a_1, \dots, a_N)^T$ be the parameter, instead of $w$.
Dual Representation¶
Substituting $w = \Phi^Ta$, $$ \begin{align*} J(a) &= \frac12 w^T \Phi^T \Phi w - w^T \Phi^T t + \frac12 t^T t + \frac{\lambda}{2} w^T w \\ &= \frac{1}{2} (a^T\Phi) \Phi^T \Phi (\Phi^Ta) - (a^T\Phi) \Phi^Tt + \frac{1}{2}t^Tt + \frac{\lambda}{2}(a^T\Phi)(\Phi^Ta) \\ &= \frac{1}{2} a^T {\color{red}{(\Phi \Phi^T) (\Phi \Phi^T)}} a - a^T {\color{red}{(\Phi\Phi^T)}} t + \frac{1}{2}t^Tt + \frac{\lambda}{2}a^T {\color{red}{(\Phi\Phi^T)}} a \\ &= \frac{1}{2}a^TKKa - a^T Kt + \frac{1}{2}t^Tt + \frac{\lambda}{2}a^TKa \end{align*} $$
More concisely, $$ J(a) = \frac{1}{2}\|Ka-t\|_2^2 + \frac{\lambda}{2} \|a\|_K^2, \qquad \text{where } \|a\|_K^2 \equiv a^T K a $$
- The first term is the usual prediction error
- The second term is a "K-norm", which is technically the RKHS norm - it controls smoothness.
- The solution is
$$ {a = (K+\lambda I_N)^{-1}t} $$
The predictive model becomes $$ y(x) = w^T\phi(x) = a^T \Phi\phi(x) = a^T k(x) $$ where $$ k(x) = \Phi \phi(x) = [\kappa(x_1,x), \kappa(x_2,x), \ldots, \kappa(x_N,x)]^T $$
- We have gotten rid of all $\phi$!
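A minimal NumPy sketch of the whole procedure (the 1-D toy data and the Gaussian kernel are assumptions, not lecture code):

```python
import numpy as np

rng = np.random.default_rng(0)
X_train = np.sort(rng.uniform(0, 6, 30))[:, None]                  # N inputs
t_train = np.sin(X_train[:, 0]) + 0.1 * rng.standard_normal(30)    # noisy targets

def rbf(A, B, gamma=4.0):
    """Gaussian kernel exp(-gamma ||a - b||^2), evaluated pairwise."""
    sq = (np.sum(A**2, axis=1)[:, None] - 2 * A @ B.T
          + np.sum(B**2, axis=1)[None, :])
    return np.exp(-gamma * sq)

lmbd = 0.01
K = rbf(X_train, X_train)                                       # Gram matrix, N x N
a = np.linalg.solve(K + lmbd * np.eye(len(X_train)), t_train)   # dual weights

X_test = np.linspace(0, 6, 200)[:, None]
y_pred = rbf(X_test, X_train) @ a                               # y(x) = a^T k(x)
```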
Comparison¶
Primal - Ridge Regression: $w = (\Phi^T\Phi + \lambda I_M)^{-1} \Phi^Tt $
- invert $\Phi^T \Phi \in \mathbb{R}^{M \times M}$
- cheaper since usually $N \gg M$
- must explicitly construct features
Dual - Kernel Ridge Regression: $ a = (K + \lambda I_N)^{-1}t $
- invert Gram matrix $K \in \mathbb{R}^{N \times N}$
- use kernel trick to avoid feature construction
- use kernels $\kappa(x, x')$ to represent similarity over vectors, images, sequences, graphs, text, etc.
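A quick numerical check that the primal and dual routes agree, reusing the explicit degree-2 feature map and its matching kernel $\kappa(x,z)=(x^Tz)^2$ from earlier (a sketch with random toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 2))     # N = 40 samples, 2-D inputs
t = rng.standard_normal(40)
lmbd = 0.1

def phi(x):                          # explicit features, M = 3
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

Phi = np.array([phi(x) for x in X])  # N x M design matrix
K = (X @ X.T) ** 2                   # Gram matrix of kappa(x, z) = (x^T z)^2

w = np.linalg.solve(Phi.T @ Phi + lmbd * np.eye(3), Phi.T @ t)   # primal
a = np.linalg.solve(K + lmbd * np.eye(40), t)                    # dual

x_new = np.array([0.3, -0.7])
y_primal = w @ phi(x_new)
y_dual = a @ ((X @ x_new) ** 2)      # a^T k(x_new)
print(np.isclose(y_primal, y_dual))  # True: same prediction either way
```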
Kernel Ridge Regression: Example¶
# Tries to match everything, but no severe overfitting
plot_kernel_ridge(X, y, gamma=4, lmbd=0.01, ker='rbf')
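The helper plot_kernel_ridge is not defined in these notes. A minimal sketch of what such a helper might do, assuming X and y are 1-D NumPy arrays and using scikit-learn's KernelRidge (its alpha plays the role of lmbd):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.kernel_ridge import KernelRidge

def plot_kernel_ridge(X, y, gamma=4, lmbd=0.01, ker='rbf'):
    """Fit kernel ridge regression and plot the fitted curve over the data."""
    model = KernelRidge(alpha=lmbd, kernel=ker, gamma=gamma)
    model.fit(X.reshape(-1, 1), y)
    xs = np.linspace(X.min(), X.max(), 300).reshape(-1, 1)
    plt.scatter(X, y, s=15, label='data')
    plt.plot(xs.ravel(), model.predict(xs), 'r', label=f'{ker} fit')
    plt.legend()
    plt.show()
```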
Larger penalty¶
# Smoother and less variation
plot_kernel_ridge(X, y, gamma=4, lmbd=0.5, ker='rbf')
Smaller bandwidth¶
$\kappa(x,y)=\exp(-\gamma\|x-y\|^2)$
# More localized
plot_kernel_ridge(X, y, gamma=10, lmbd=0.01)
Changing kernel¶
"linear": $\kappa(x,y)=x^Ty$
# Effectively a regularized linear fit
plot_kernel_ridge(X, y, gamma=4, lmbd=0.5, ker='linear')
Two More Perspectives¶
Kernel itself as features¶
For a dataset $\{x_i\}_{i=1}^N$, pick a kernel $\kappa$ and define features as $$ \tilde{\phi}_i(x) = \kappa(x,x_i) $$ i.e., each sample corresponds to one feature.
The KRR model is $$ y(x) = \sum_{i=1}^N a_i\kappa(x,x_i) \equiv \tilde{\phi}(x)^T a $$
So we could have treated KRR as a regular ridge regression.
- In fitting, the features $\tilde{\phi}_i$ form a design matrix that is exactly $K$.
- The difference: KRR regularizes with $\|a\|_K^2$, whereas plain ridge regression would use $\|a\|_2^2$ (see the sketch below).
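The sketch below makes the comparison concrete on toy data (the 1-D data and Gaussian kernel are assumptions): ridge regression on the "kernel features" uses the same design matrix $K$, only the regularizer differs.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 30))[:, None]
t = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

def rbf(A, B, gamma=4.0):
    sq = (np.sum(A**2, axis=1)[:, None] - 2 * A @ B.T
          + np.sum(B**2, axis=1)[None, :])
    return np.exp(-gamma * sq)

K = rbf(X, X)
lmbd = 0.01

# KRR: minimize (1/2)||K a - t||^2 + (lambda/2) a^T K a
a_krr = np.linalg.solve(K + lmbd * np.eye(30), t)

# Plain ridge regression on features phi_tilde(x) = k(x):
# minimize (1/2)||K a - t||^2 + (lambda/2) a^T a
a_rr = np.linalg.solve(K.T @ K + lmbd * np.eye(30), K.T @ t)
```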
This view has a deep connection to the Reproducing Kernel Hilbert Space (RKHS).
- The collection $\{f(x) \mid f(x)=\sum_{i=1}^N a_i\kappa(x,x_i),\, a_i\in\mathbb{R}\}$ is a function space
- Upon "completion", the space becomes RKHS
- The many nice properties of RKHS guarantee the nice properties of KRR
See the theoretical slides for more details.
Bias-variance trade-off¶
Recall that for ridge regression with $m$ features and $n$ data points, we had $$ \mathbb{E}[(y - \hat{f})^2] = \underbrace{O({\sigma^2})}_\text{irreducible error} + \underbrace{O\left(\sigma^2\frac{{\tilde{m}}}{n}\right)}_\text{Variance} + \underbrace{O({\tilde{m}}^{-p})}_{\text{Bias}^2} $$ where the effective number of features was introduced, $$ {\tilde{m}} = \sum_{i=1}^m \frac{\nu_i}{\nu_i+\lambda} $$ and the $\nu_i$'s are the eigenvalues of $\Phi^\top\Phi$.
- For $d$-dimensional input $x$ and $p$th-order polynomial features: $m\sim d^p$
- $m$ explodes for even moderate $d$ and $p$.
- Huge variance would be introduced.
Now, in the kernel version, we effectively have $m=n$ features instead
- so $\nu_i$'s are the eigenvalues of $K$.
- and $$ \text{Variance}\sim \frac{\tilde{m}}{n} < 1\quad\rightarrow\text{Under control!} $$
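The effective number of features is easy to compute from the Gram matrix; a sketch on random data (the kernel, the data, and $\lambda$ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))        # n = 200 points

def rbf(A, B, gamma=1.0):
    sq = (np.sum(A**2, axis=1)[:, None] - 2 * A @ B.T
          + np.sum(B**2, axis=1)[None, :])
    return np.exp(-gamma * sq)

K = rbf(X, X)
lmbd = 0.1

nu = np.clip(np.linalg.eigvalsh(K), 0, None)   # eigenvalues of K (round-off clipped)
m_eff = np.sum(nu / (nu + lmbd))               # effective number of features

# Each ratio is < 1, so m_eff < n and the variance term ~ m_eff / n stays below 1
print(m_eff, m_eff / len(X))
```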
Side Note: Locally-Weighted Linear Regression¶
LWLR is also frequently used. It "looks like" KRR but is not the same method.
Main Idea: When predicting $f(x)$, give high weights for neighbors of $x$.
Regular vs. Locally-Weighted Linear Regression¶
Linear Regression
1. Fit $w$ by minimizing $\sum_{n}\left(w^T \phi(x_n) - t_n\right)^2$
2. Output $w^T \phi(x)$ at the query point $x$
Locally-weighted Linear Regression
1. Fit $w$ by minimizing $\sum_{n} r_n \left(w^T \phi(x_n) - t_n\right)^2$, with weights $r_n$ centered on the query point $x$
2. Output $w^T \phi(x)$ at the query point $x$
- The standard choice for weights $r$ uses the Gaussian Kernel, with kernel width $\tau$ $$ r_k = \exp\left( -\frac{|| x_k - x ||^2}{2\tau^2} \right) $$
- Note $r_k$ depends on both the training point $x_k$ and the query point $x$; a separate weighted linear regression must be solved for each query point $x$.
- Can be reformulated as a modified version of least squares problem.
- Choice of kernel width matters.
- (requires hyperparameter tuning!)
The estimation error is minimized when the kernel width includes as many training points as can be accommodated by the model. Too large a kernel includes points that degrade the fit; too small a kernel neglects points that would increase confidence in the fit.
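A minimal sketch of LWLR on 1-D toy data (the data, the features $[1, x]$, and $\tau$ are assumptions); each prediction solves its own weighted least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, 50))
t = np.sin(X) + 0.1 * rng.standard_normal(50)

def lwlr_predict(x_query, X, t, tau=0.5):
    """Predict at x_query by solving a weighted least-squares problem."""
    Phi = np.column_stack([np.ones_like(X), X])        # features [1, x]
    r = np.exp(-(X - x_query) ** 2 / (2 * tau**2))     # Gaussian weights
    W = np.diag(r)
    # Weighted normal equations: (Phi^T W Phi) w = Phi^T W t
    w = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ t)
    return w @ np.array([1.0, x_query])

y_pred = np.array([lwlr_predict(xq, X, t) for xq in np.linspace(0, 6, 100)])
```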
KRR v.s. LWLR: Similarities¶
Both methods are "instance-based learning", i.e., non-parametric methods
- Only observations (training points) close to the query point are heavily weighted in the regression computation.
- The kernel determines how to assign weights to training examples (similarity to the query point $x$)
- Free to choose types of kernels
- Both can suffer when the dataset size is large.
KRR v.s. LWLR: Differences¶
- LWLR: Weighted regression; slow, but more accurate
- KRR: Weighted mean; faster, but less accurate