Main Idea: When predicting $f(x)$, give high weights to the neighbors of $x$.
Linear Regression
Locally-weighted Linear Regression
The formulation and implementation are left for HW1.
The estimation error is minimized when the kernel includes as many training points as the model can accommodate. Too large a kernel includes points that degrade the fit; too small a kernel neglects points that would increase confidence in the fit.
Our models so far employ a nonlinear feature mapping $\phi : \mathcal{X} \to \mathbb{R}^M$
Allows us to use linear methods for nonlinear tasks.
Problem: Linear model $y(x) = w^T x$ can only produce straight lines through the origin.
Solution: Polynomial Regression
import numpy as np

# ground truth: function to be approximated
def fn(x): return np.sin(x)
# plot
poly_reg_example(fn)
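Since `poly_reg_example` is a plotting helper defined elsewhere in the notebook, here is a minimal sketch of what such a helper might do, assuming a plain least-squares fit of a degree-$M$ polynomial to noisy samples of `fn` (the helper's actual implementation may differ):

import matplotlib.pyplot as plt

def poly_reg_sketch(fn, M=5, n=25, noise=0.1):
    # noisy training samples of the ground-truth function
    x = np.linspace(0, 2 * np.pi, n)
    t = fn(x) + noise * np.random.randn(n)
    # design matrix of polynomial features 1, x, ..., x^M
    Phi = np.vander(x, M + 1, increasing=True)
    # least-squares estimate of the weights
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    # plot the fit against the ground truth
    xs = np.linspace(0, 2 * np.pi, 200)
    plt.scatter(x, t, label="data")
    plt.plot(xs, fn(xs), label="ground truth")
    plt.plot(xs, np.vander(xs, M + 1, increasing=True) @ w, label=f"degree-{M} fit")
    plt.legend(); plt.show()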
Linear Regression Model $$ y(x,w) = w^T \phi (x) = \sum_{j=0}^{M} w_j \phi_j(x) $$
Ridge Regression $$ J(w) = \frac{1}{2} \sum_{n=1}^{N}(w^T \phi (x_n) - t_n)^2 + \frac{\lambda}{2}w^Tw $$
Closed-form Solution $$ w = (\Phi^T \Phi + \lambda I)^{-1}\Phi^T t $$
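A minimal numpy sketch of this closed-form solution (variable names are illustrative), solving the regularized normal equations rather than forming an explicit inverse:

import numpy as np

def ridge_fit(Phi, t, lam):
    # closed-form ridge solution: w = (Phi^T Phi + lam I)^{-1} Phi^T t
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)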
How do I choose a feature mapping $\phi(x)$?
With $N$ examples and $M$ features, the design matrix $\Phi \in \mathbb{R}^{N \times M}$ stacks the feature vectors $\phi(x_n)^T$ as its rows.
Problem: Find suitable features for the scattered data shown below
plot_circular_data(100)
Solution 1: Add the squared distance to the origin, $(x_1^2 + x_2^2)$, as a third feature.
plot_circular_3d(100)
Solution 2: Replace $(x_1, x_2)$ with $(x_1^2, x_2^2)$
plot_circular_squared(100)
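The `plot_circular_*` helpers above are not shown; a minimal sketch of the same idea, assuming the data consists of two noisy concentric rings, is:

import numpy as np

rng = np.random.default_rng(0)
n = 100
# two concentric rings of points
theta = rng.uniform(0, 2 * np.pi, n)
r = np.where(np.arange(n) < n // 2, 1.0, 3.0) + 0.1 * rng.standard_normal(n)
X = np.c_[r * np.cos(theta), r * np.sin(theta)]

# Solution 1: append the squared distance to the origin as a third feature
X3 = np.c_[X, X[:, 0]**2 + X[:, 1]**2]

# Solution 2: replace (x1, x2) with (x1^2, x2^2)
Xsq = X**2
# in either mapped space the two rings become (approximately) linearly separable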
The data has been mapped via $\phi$ to a new, higher-dimensional (possibly infinite-dimensional!) space
Alternatively, the data still lives in the original space, but the definition of distance or inner product has changed.
Unfortunately, higher-dimensional features come at a price!
Kernel methods to the rescue!
Many algorithms depend on the data only through pairwise inner products between data points, $$ \langle x_1, x_2 \rangle = x_2^T x_1 $$
Inner products can be replaced by Kernel Functions, capturing more general notions of similarity.
A kernel function $\kappa(x,x')$ is intended to measure the similarity between $x$ and $x'$.
So far, we have used the standard inner product in the transformed space, $$ \kappa(x, x') = \phi(x)^T \phi(x') $$
In general, $\kappa(x, x')$ is any symmetric positive-semidefinite function. This means that for every set $x_1, ..., x_n$, the matrix $[\kappa(x_i,x_j)]_{i,j\in[n]}$ is a PSD matrix.
For every valid kernel function $\kappa(x, x')$, there exists a feature map $\phi$ (possibly into an infinite-dimensional space) such that $\kappa(x, x') = \phi(x)^T \phi(x')$.
This incredible result follows from Mercer's Theorem.
Kernel: For $x = (x_1, x_2)$ and $z=(z_1, z_2)$, define $$ \begin{align} \kappa(x, z) &= (x^Tz)^2 \\ &= (x_1z_1 + x_2z_2)^2 \\ &= x_1^2z_1^2 + 2x_1z_1x_2z_2 + x_2^2z_2^2 \end{align} $$
Mapping: Equivalent to the standard inner product when either $$ \begin{gather} \phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2) \quad \text{or} \\ \phi(x) = (x_1^2, x_1 x_2, x_1 x_2, x_2^2) \end{gather} $$
Implicit mapping is not unique!
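A quick numerical check that the degree-2 polynomial kernel above matches the explicit feature map (the test points here are arbitrary):

import numpy as np

def kappa(x, z):
    # polynomial kernel of degree 2
    return (x @ z) ** 2

def phi(x):
    # one of the equivalent explicit feature maps
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
# both evaluate to (1*3 + 2*(-1))^2 = 1
assert np.isclose(kappa(x, z), phi(x) @ phi(z))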
Kernel: Higher-order polynomial of degree $p$, $$ \kappa(x,z) = (x^T z)^p = \left( \sum_{k=1}^M x_k z_k \right)^p $$
Mapping: Implicit feature vector $\phi(x)$ contains all monomials of degree $p$
Kernel: Inhomogeneous polynomial up to degree $p$, for $c > 0$, $$ \kappa(x,z) = (x^T z + c)^p = \left(c + \sum_{k=1}^M x_k z_k \right)^p $$
Mapping: Implicit feature vector $\phi(x)$ contains all monomials of degree $\leq p$
Example: take the pixel values of two images $x$ and $z$ and compute $\kappa(x,z) = (x^Tz + 1)^p $
You can compute the inner product in the space of all monomials up to degree $p$
The dual representation, and its solution, are written entirely in terms of kernels.
The elements of the Gram matrix $ K =\Phi \Phi^T $ are $$ K_{ij} = \kappa(x_i, x_j) = \phi(x_i)^T \phi(x_j) $$
These represent the pairwise similarities among all the observed feature vectors.
To use the kernel trick, we must formulate (training and test) algorithms purely in terms of inner products between data points
We cannot access the coordinates of points in the high-dimensional feature space
This seems like a huge limitation, but it turns out that quite a lot can still be done.
Distance between samples can be expressed in inner products: $$ \begin{align} ||\phi(x) - \phi(z) ||^2 &= \langle\phi(x)- \phi(z), \phi(x)- \phi(z) \rangle\\ &= \langle \phi(x), \phi(x)\rangle - 2\langle\phi(x), \phi(z)\rangle + \langle\phi(z), \phi(z) \rangle \\ &= \kappa(x,x) - 2\kappa(x,z) + \kappa(z,z) \end{align} $$
Question: Can you determine the mean of data in the mapped feature space through kernel operations only?
Answer: No, you cannot compute the mean point itself explicitly. You can, however, compute distances to it, as shown below.
Mean of the data points: $\phi_s = \frac{1}{N}\sum_{i=1}^{N}\phi(x_i)$
Distance to the mean: $$ \begin{align} ||\phi(x) - \phi_s ||^2 &= \langle\phi(x)- \phi_s, \phi(x)- \phi_s \rangle\\ &= \langle \phi(x), \phi(x)\rangle - 2\langle\phi(x), \phi_s\rangle + \langle\phi_s, \phi_s \rangle \\ &= \kappa(x,x) - \frac{2}{N}\sum_{i=1}^{N}\kappa(x,x_i) + \frac{1}{N^2}\sum_{j=1}^{N}\sum_{i=1}^{N}\kappa(x_i,x_j) \end{align} $$
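A numpy sketch of this distance-to-the-mean computation, using only kernel evaluations (a Gaussian kernel is used here just for concreteness):

import numpy as np

def dist_to_mean_sq(x, X, kappa):
    # ||phi(x) - phi_s||^2 computed from kernel evaluations only
    k_xx = kappa(x, x)
    k_xi = np.array([kappa(x, xi) for xi in X])
    K = np.array([[kappa(xi, xj) for xj in X] for xi in X])
    return k_xx - 2.0 * k_xi.mean() + K.mean()

gauss = lambda a, b, sigma=1.0: np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))
X = np.random.randn(10, 2)
print(dist_to_mean_sq(np.zeros(2), X, gauss))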
Method 1: Explicitly define a feature space mapping $\phi(x)$ and use inner product kernel $$ \kappa(x,x') = \phi(x)^T \phi(x') = \sum_{i=1}^{M}\phi_i(x) \phi_i(x') $$
Method 2: Explicitly define a kernel $\kappa(x,x')$ and identify the implicit feature map, e.g. $$ \kappa(x,z) = (x^Tz)^2 \quad \implies \quad \phi(x) = (x_1^2, \sqrt{2} x_1 x_2, x_2^2) $$
Kernels help us avoid the complexity of explicit feature mappings.
Method 3: Define a similarity function $\kappa(x, x')$ and use Mercer's Condition to verify that an implicit feature map exists without finding it.
There are a number of closure rules that help us construct new, more complex kernels from simpler known kernels.
Simple Polynomial Kernel (terms of degree $2$) $$\kappa(x,z) = (x^Tz)^2 $$
Generalized Polynomial Kernel (degree $p$) $$\kappa(x,z) = (x^Tz + c)^p, \quad c > 0 $$
Gaussian Kernels $$ \kappa(x,z) = \exp \left\{-\frac{||x-z||^2}{2\sigma^2} \right\} $$
Not related to Gaussian PDF!
Translation invariant (depends only on distance between points)
Corresponds to an infinite-dimensional feature space!
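A small sketch computing a Gaussian Gram matrix for a set of points, vectorized via pairwise squared distances (parameter names are illustrative):

import numpy as np

def gaussian_gram(X, sigma=1.0):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-d2 / (2 * sigma**2))

X = np.random.randn(5, 3)
K = gaussian_gram(X)
# K is symmetric PSD: its eigenvalues are (numerically) nonnegative
print(np.linalg.eigvalsh(K).min())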
We want to minimize: $J(w) = \frac{1}{2} \sum_{n=1}^{N}(w^T \phi (x_n) - t_n)^2 + \frac{\lambda}{2}w^Tw$
Taking $\nabla J(w) = 0$, we obtain: $$ \begin{align} w &= -\frac{1}{\lambda}\sum_{n=1}^{N}(w^T\phi(x_n) - t_n) \phi(x_n) \\ &= \sum_{n=1}^{N}a_n\phi(x_n) = \Phi^Ta, \qquad a_n = -\frac{1}{\lambda}\left(w^T\phi(x_n) - t_n\right) \end{align} $$ Notice: $w$ can be written as a linear combination of the $\phi(x_n)$!
Transform $J(w)$ to $J(a)$ by substituting $ w = \Phi^T a$
Dual: Substituting $w = \Phi^Ta$ into $$ J(w) = \frac12 w^T \Phi^T \Phi w - w^T \Phi^T t + \frac12 t^T t + \frac{\lambda}{2} w^T w, $$ we obtain $$ \begin{align} J(a) &= \frac{1}{2} (a^T\Phi) \Phi^T \Phi (\Phi^Ta) - (a^T\Phi) \Phi^Tt + \frac{1}{2}t^Tt + \frac{\lambda}{2}(a^T\Phi)(\Phi^Ta) \\ &= \frac{1}{2} a^T \color{red}{(\Phi \Phi^T) (\Phi \Phi^T)} a - a^T \color{red}{(\Phi\Phi^T)} t + \frac{1}{2}t^Tt + \frac{\lambda}{2}a^T \color{red}{(\Phi\Phi^T)} a \\ &= \frac{1}{2}a^TKKa - a^T Kt + \frac{1}{2}t^Tt + \frac{\lambda}{2}a^TKa \end{align} $$
Dual: Substituting $w = \Phi^Ta$, $$ J(a) = \frac{1}{2}a^TKKa - a^T Kt + \frac{1}{2}t^Tt + \frac{\lambda}{2}a^TKa $$ Prediction: Hypothesis becomes $$ y(x) = w^T\phi(x) = a^T \Phi\phi(x) = a^T \mathbf{k}(x) $$ where $$ \mathbf{k}(x) = \Phi \phi(x) = [\kappa(x_1,x), \ldots, \kappa(x_N,x)]^T $$ We still need to solve for $a$!
Primal: $w = (\Phi^T\Phi + \lambda I_M)^{-1} \Phi^Tt $
Dual: $ a = (K + \lambda I_N)^{-1}t $
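A quick numerical check, under the linear-kernel assumption $\kappa(x,z) = \phi(x)^T\phi(z)$ with an explicit design matrix $\Phi$, that the primal and dual solutions give identical predictions (the dual solves an $N \times N$ system instead of an $M \times M$ one):

import numpy as np

N, M, lam = 50, 5, 0.1
Phi = np.random.randn(N, M)
t = np.random.randn(N)

# primal: solve an M x M system
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)

# dual: solve an N x N system with the Gram matrix K = Phi Phi^T
K = Phi @ Phi.T
a = np.linalg.solve(K + lam * np.eye(N), t)

# both give the same predictions y(x) = w^T phi(x) = a^T k(x)
assert np.allclose(Phi @ w, K @ a)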
$\kappa(x,y)=\exp(-\gamma||x-y||^2)$
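The demo cells below call a `plot_kernel_ridge` helper that is not shown; a minimal sketch of what such a helper might look like, assuming scikit-learn's `KernelRidge` with the RBF kernel above:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.kernel_ridge import KernelRidge

def plot_kernel_ridge(X, y, gamma, alpha):
    # fit kernel ridge regression with kappa(x, z) = exp(-gamma ||x - z||^2)
    model = KernelRidge(kernel="rbf", gamma=gamma, alpha=alpha)
    model.fit(X, y)
    # plot the data and the fitted function on a dense grid
    xs = np.linspace(X.min(), X.max(), 300).reshape(-1, 1)
    plt.scatter(X, y, label="data")
    plt.plot(xs, model.predict(xs), color="red", label="kernel ridge fit")
    plt.legend(); plt.show()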
n = 20
# linear + noise
X = np.linspace(0,10,n).reshape(-1,1)
y = (X + np.random.randn(n,1)).reshape(-1)
# plot
plot_kernel_ridge(X, y, gamma=4, alpha=0.01)
n = 30
# Sine + linear + noise
X = np.linspace(0,10,n).reshape(-1,1)
y = (2*np.sin(2*X) + X + np.random.randn(n,1)).reshape(-1)
# plot
plot_kernel_ridge(X, y, gamma=0.5, alpha=0.01)
n = 30
# completely random
X = np.linspace(0,10,n).reshape(-1,1)
y = np.random.randn(n,)
# plot
plot_kernel_ridge(X, y, gamma=0.5, alpha=0.01)
Locally-weighted Linear Regression
Use a Gaussian kernel for the weights: $r_n(x) = \exp \left\{-\frac{||x- x_n||^2}{2\sigma^2} \right\}$
Similarities: Both methods are “instance-based learning”, i.e., non-parametric methods
Differences: