TODAY: Kernels and GP¶
Theoretical perspective on kernel ridge regression: Reproducing Kernel Hilbert Space (RKHS)
- Preliminaries and definitions
- Moore-Aronszajn Theorem (kernel-RKHS correspondence)
- Mercer's Theorem (kernel construction)
- Examples of RKHS for specific kernels
- RKHS-view of KRR
Hilbert Space¶
To understand RKHS, we certainly need to understand the "HS" first, i.e., the Hilbert space.
To do so, let's review a hierarchy of spaces:
$$ \begin{align*} &\ \boxed{\text{Vector Space}} \\ &\xrightarrow{\text{Add }\langle \cdot, \cdot \rangle} \boxed{\text{Inner Product Space}} \\ &\xrightarrow{\text{Induce }d(\cdot,\cdot)} \boxed{\text{Pre-Hilbert Space}} \\ &\xrightarrow{\text{Completion}} \boxed{\text{Hilbert Space}} \end{align*} $$
Vector Space¶
A vector space consists of four ingredients:
- A nonempty set $V$ of vectors (e.g., conventional "vectors" $\mathbb{R}^n$ in undergrad linear algebra)
- A field $F$ of scalars (e.g., conventional scalars like $\mathbb{R}$)
- An operation: vector addition $$ + : V \times V \to V $$
- A function: scalar multiplication $$ \cdot : F \times V \to V $$
For a combination $(V,F,+,\cdot)$ to be a vector space, the following axioms must hold for all $u,v,w \in V$ and all $a,b \in F$.
For vector addition:
- Associativity $$ u + (v + w) = (u + v) + w. $$
- Commutativity $$ u + v = v + u. $$
- Identity element - exists a zero element $0 \in V$ $$ v + 0 = v \quad \text{for all } v \in V. $$
- Inverse elements - exists an "additive inverse" $-v \in V$ $$ v + (-v) = 0. $$
For scalar multiplication:
- Compatibility with field multiplication $$ a(bv) = (ab)v. $$
- Identity element - with $1$ as the multiplicative identity in $F$ $$ 1v = v. $$
- Distributivity with respect to vector addition $$ a(u + v) = au + av. $$
- Distributivity with respect to field addition $$ (a + b)v = av + bv. $$
Examples¶
One of the simplest vector spaces is the usual Euclidean space that you have seen in undergrad linear algebra.
- For example, let $V=\mathbb{R}^d$ and $F=\mathbb{R}$.
A more non-trivial example is when $V$ is a set of functions, say polynomials up to $p$-th order $P^p$ with real coefficients. Choosing $F=\mathbb{R}$ and the usual $+$ and $\cdot$ of polynomials, one can verify that $P^p$ is a vector space.
- This space is still finite-dimensional - we can use $(p+1)$ basis functions $\{1,x,\cdots,x^p\}$ to represent any element in $P^p$.
An even more non-trivial example is picking $V$ as the square-integrable functions on the interval $[0,1]$, denoted $L^2[0,1]$.
- These are functions such that $\int_0^1 |f(x)|^2 dx < \infty$.
- This space is infinite-dimensional: an infinite number of basis functions is needed to represent all elements.
Inner Product Space¶
The next level up is to equip a vector space with an inner product, $$ \langle \cdot , \cdot \rangle : V \times V \to F $$
The operation satisfies the following three properties for all vectors $x,y,z \in V$ and all scalars $a,b \in F$.
- Conjugate symmetry $$ \langle x, y \rangle = \overline{\langle y, x \rangle}. $$ Or ordinary symmetry $\langle x, y \rangle = {\langle y, x \rangle}$ if $F = \mathbb{R}$.
- Linearity in the first argument $$ \langle ax + by, z \rangle = a \langle x, z \rangle + b \langle y, z \rangle. $$
- Positive-definiteness - for $x \neq 0$, $$ \langle x, x \rangle > 0. $$
Examples¶
One of the simplest inner product spaces is again the finite-dimensional vector space, where $\langle x, y \rangle = x^T y$.
A more interesting example is $L^2[0,1]$ with an inner product $$ \langle f, g \rangle = \int_0^1 f(x)\overline{g(x)} dx $$
Pre-Hilbert Space¶
Over a vector space, one can define a norm $$ \|\cdot\| : V \to \mathbb{R}, $$ satisfying, for all $x,y \in V$ and all $\lambda \in F$:
Non-negativity $$ \|x\| \ge 0. $$
Positive definiteness $$ \|x\| = 0 \quad \text{iff} \quad x = 0. $$
Absolute homogeneity $$ \|\lambda x\| = |\lambda|\,\|x\|. $$
Triangle inequality $$ \|x + y\| \le \|x\| + \|y\|. $$
The inner product is a natural way to define a norm, or "induce" a norm, by $$ \|\cdot\| = \sqrt{\langle \cdot, \cdot \rangle} $$
With a norm one can define a distance function, or a metric $$ d : V \times V \to \mathbb{R} $$ satisfying, for all $x,y,z \in V$:
- Identity of indiscernibles: $d(x,x) = 0.$
- Positivity: For $x \neq y$, $d(x,y) > 0.$
- Symmetry: $d(x,y) = d(y,x).$
- Triangle inequality
$$d(x,z) \le d(x,y) + d(y,z).$$
The norm induced by the inner product is a natural way to define a metric, $$ d(x,y) = \|x-y\| = \sqrt{\langle x-y, x-y \rangle} $$
Examples¶
- The Euclidean space, with $d(x,y)=\|x-y\|$, i.e., the Euclidean distance.
- $L^2[0,1]$ space, with $$ d(f,g) = \left(\int (f(x)-g(x))^2 dx\right)^{1/2} $$
The pre-Hilbert space, denoted $H_0$, is an inner product space with a metric induced by the inner product.
- So technically every inner product space is a pre-Hilbert space.
- All previous examples of inner product spaces are thus also examples of pre-Hilbert spaces.
Caveat:
- The above means an inner product space is a normed vector space, which is a metric space.
- But the reverse is not necessarily true, as one can define metrics and norms in many different ways.
For example,
- $d(x,y)=1$ if $x\neq y$ otherwise $0$ is a metric, but cannot be a norm (fails homogeneity).
- $\|x\|_1=\sum_i |x_i|$ is a norm, but does not correspond to an inner product.
Hilbert Space¶
Lastly, we complete the pre-Hilbert space to obtain a Hilbert space.
To understand completeness, let's first define a Cauchy sequence $\{f_n\} \subset H_0$, which satisfies
- for any $\epsilon > 0$, there exists an integer $N$ such that for all $m,n > N$, $\|f_n - f_m\| < \epsilon.$
The completeness means:
- Every Cauchy sequence converges with respect to the chosen $\|\cdot\|$ to an element in the space.
- Hilbert space $H = \overline{H_0}$ is $H_0$ plus all the limits of all Cauchy sequences in $H_0$.
Further implication:
- The requirement effectively says the space has no "holes": limits of Cauchy sequences always exist within the space, so that one can do the usual (functional) analysis.
- We can perform analysis in $H_0$ and extend results to $H$.
Example:
- A space $V_0$ of piecewise linear functions on an interval $[0,1]$, with an arbitrary number of nodes and the inner product for $L^2[0,1]$.
- The function $f(x)=x^2$ is smooth and clearly not in $V_0$.
- Now we approximate $f$ by a piecewise linear function $f_n$ using $n$ evenly distributed points over $[0,1]$.
- All $f_n\in V_0$, but $\lim_{n\rightarrow\infty} f_n =f \notin V_0$.
Hence, $V_0$ is only a pre-Hilbert space, not a Hilbert space, as the limit of a Cauchy sequence like $\{f_n\}$ need not be in $V_0$.
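As a quick numerical illustration of this example, here is a minimal sketch (Python/NumPy; the grid sizes and node counts are arbitrary choices) that measures the $L^2[0,1]$ distance between $f(x)=x^2$ and its piecewise linear interpolant $f_n$: the error shrinks as $n$ grows, so $\{f_n\}$ converges to a limit outside $V_0$.

```python
import numpy as np

f = lambda x: x ** 2                    # target function, smooth, not in V_0
xx = np.linspace(0.0, 1.0, 100_001)     # fine grid used for numerical integration

for n in [2, 4, 8, 16, 32]:
    nodes = np.linspace(0.0, 1.0, n + 1)             # n + 1 evenly spaced nodes (n segments)
    f_n = np.interp(xx, nodes, f(nodes))             # piecewise linear interpolant, an element of V_0
    err = np.sqrt(np.trapz((f(xx) - f_n) ** 2, xx))  # L^2[0,1] distance ||f - f_n||
    print(f"n = {n:3d},  ||f - f_n||_L2 ≈ {err:.2e}")
```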
Summary: What each space gives us¶
(Raw) vector space¶
- We have Vector addition and Scalar multiplication
- Gives us linearity
- We can
- Form linear combinations and linear superposition
- Talk about linear dependence / independence
- Define span, subspaces, basis, dimension
- Solve linear equations (e.g., determine coefficients of linear combinations)
Then we can express data/signal/function/etc. as a linear combination of basis, e.g.,
$$ x=\sum_{i=1}^N c_i v_i $$
Vector space with metric¶
- We add a distance function as metric
- Gives us distance between vectors
- We can
- Define convergence of sequences, and then continuity and limits
- Discuss completeness (Cauchy sequences)
Then we can talk about, e.g., convergence, say $$ x_k = \sum_{i=1}^k c_iv_i $$ and as $k\rightarrow\infty$, does $x_k\rightarrow x^*$? This means
$$d(x_k,x^*)\rightarrow 0$$
Normed vector space¶
- We add a norm, which induces metric
- Gives us magnitude of a vector
- We can
- Quantify approximation error
- Define convergence and stability consistent with linearity
- Compare rates of convergence
- Induce operator norms and define boundedness
Then we can talk about, e.g., approximation errors - such as $$ \|x^*-x_k\|=\epsilon_k $$ and then convergence: how does $\epsilon_k$ decrease as $k$ increases?
Inner product space¶
- We add an inner product, which induces norm and then metric
- Gives us angle between two vectors
- We can
- Define orthogonality and build orthonormal bases
- Perform projections and decompositions (would give us, e.g., least-squares solution)
- Use spectral theory (eigenvalues, modes)
Then we can try to find coefficients in an approximation, e.g., $$ x_k = \sum_{i=1}^k c_iv_i $$ by choosing some dual basis $u_j$ such that $\langle u_j, v_i \rangle=1$ if $i=j$ else $0$. This way $$ c_i = \langle u_i, x_k \rangle $$
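A minimal sketch of this idea in $\mathbb{R}^3$ (an assumed toy setting with a randomly generated, non-orthogonal basis): the dual basis is built explicitly, and the coefficients are recovered purely through inner products.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(3, 3))        # columns v_i: an arbitrary (non-orthogonal) basis of R^3
U = np.linalg.inv(V).T             # columns u_j: dual basis, <u_j, v_i> = 1 if i == j else 0

c_true = np.array([1.0, -2.0, 0.5])
x = V @ c_true                     # x = sum_i c_i v_i

c = U.T @ x                        # c_i = <u_i, x>: coefficients recovered via inner products
print(np.allclose(c, c_true))      # True
```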
Hilbert Space (Complete Inner Product Space)¶
- We add completeness with respect to the induced norm
- Gives us "well-posedness", or "limits stay in space"
- We have
- Every Cauchy sequence converges
- Limits of approximations stay inside the space: Fourier, eigenfunction, kernel expansions, etc.
- Closed subspaces admit unique orthogonal projections, and thus guarantee existence and uniqueness of projections
- Spectral theory is fully valid
This makes sure, e.g., for $$ x_k = \sum_{i=1}^k c_iv_i \in V $$ the limit is also in $V$ $$ x^* = \lim_{k\rightarrow\infty} x_k \in V $$ In other words, $x^*$ has the same (nice) properties as $x_k$ (e.g., maintaining continuity/smoothness/etc.).
Reproducing Kernel Hilbert Space¶
Kernel¶
Consider input space $X$. A kernel function is a map $$ \kappa(\cdot,\cdot) : X \times X \to \mathbb{R}. $$ By fixing one argument $x_1 \in X$, the function $$ \kappa(\cdot, x_1) : X \to \mathbb{R} $$ defines a real-valued function on $X$.
Towards RKHS, we impose two additional requirements on the kernel:
- Symmetry: $\kappa(x_i,x_j) = \kappa(x_j,x_i)$
- Positive-definiteness: for any $n \in \mathbb{N}$, any $\{x_1,\dots,x_n\} \subset X$, and any $\{\alpha_1,\dots,\alpha_n\} \subset \mathbb{R}$, $$ \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \kappa(x_i,x_j) \ge 0. $$
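As a numerical sanity check of these two requirements, here is a minimal sketch (using a Gaussian/RBF kernel as an arbitrary concrete choice; the points, coefficients, and lengthscale are made up): build the Gram matrix on random points, then inspect symmetry, the eigenvalues, and the quadratic form.

```python
import numpy as np

def kappa(x, y, ell=0.5):
    # Gaussian (RBF) kernel on R; an arbitrary symmetric positive-definite choice
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
K = kappa(x[:, None], x[None, :])             # Gram matrix, K_ij = kappa(x_i, x_j)

print(np.allclose(K, K.T))                    # symmetry
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # eigenvalues nonnegative up to round-off

alpha = rng.normal(size=20)
print(alpha @ K @ alpha)                      # the quadratic form in the definition; nonnegative
```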
Construction of RKHS¶
Given a kernel $\kappa$, we define the space $$ H_0 = \left\{ f : f(\cdot) = \sum_{i=1}^n \alpha_i \kappa(\cdot,x_i), \ \alpha_i \in \mathbb{R},\ x_i \in X,\ n \in \mathbb{N} \right\}. $$ This space consists of all finite linear combinations of kernel functions centered at points in $X$.
One can verify this is a vector space.
To make $H_0$ a pre-Hilbert space, let's first define an inner product.
For two functions $$ f(\cdot) = \sum_{i=1}^n \alpha_i \kappa(\cdot,x_i), \quad g(\cdot) = \sum_{j=1}^m \beta_j \kappa(\cdot,x_j), $$ define $$ \langle f,g \rangle_{H_0} \equiv \sum_{i=1}^n \sum_{j=1}^m \alpha_i \beta_j \kappa(x_i,x_j) \equiv \alpha^T K_{nm}\beta. $$
In particular, $$ \langle \kappa(\cdot,x_i), \kappa(\cdot,x_j) \rangle_{H_0} = \kappa(x_i,x_j). $$
- With symmetry and positive-definiteness, one can verify $\langle f,g \rangle_{H_0}$ is indeed an inner product.
- Here, the kernel $\kappa$ itself effectively defines the inner product.
Then, let $H = \overline{H_0}$ be the completion of $H_0$ under the inner product $\langle \cdot,\cdot \rangle_{H_0}$.
The space $H$ is the desired reproducing kernel Hilbert space (RKHS).
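Before moving on, a minimal numerical sketch of the $H_0$ inner product defined above (again with a Gaussian kernel as an arbitrary choice, and made-up coefficients and centers): represent $f$ and $g$ by $(\alpha, \{x_i\})$ and $(\beta, \{x_j\})$, and evaluate $\langle f,g\rangle_{H_0} = \alpha^T K_{nm}\beta$.

```python
import numpy as np

def kappa(x, y, ell=0.5):
    # Gaussian kernel; an arbitrary symmetric positive-definite choice
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(1)
xf, alpha = rng.uniform(0, 1, 4), rng.normal(size=4)   # f = sum_i alpha_i kappa(., xf_i)
xg, beta = rng.uniform(0, 1, 3), rng.normal(size=3)    # g = sum_j beta_j  kappa(., xg_j)

K_nm = kappa(xf[:, None], xg[None, :])                 # cross Gram matrix, [K_nm]_ij = kappa(xf_i, xg_j)
print(alpha @ K_nm @ beta)                             # <f, g>_{H_0} = alpha^T K_nm beta
```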
Theoretically ...¶
Moore-Aronszajn Theorem establishes the existence and uniqueness of RKHS, and its correspondence to kernels.
- For any symmetric positive-definite kernel $\kappa : X \times X \to \mathbb{R}$, there exists a unique RKHS $H$ with reproducing kernel $\kappa$.
- Conversely, any RKHS admits a unique symmetric positive-definite reproducing kernel.
RKHS Properties¶
The Reproducing Property¶
First is the namesake of RKHS.
For any $f \in H$ and any $x \in X$, the reproducing property holds: $$ f(x) = \langle f, \kappa(\cdot,x) \rangle_H. $$
- This means that, for $f\in H$, the pointwise evaluation $f(x)$ can be obtained as an inner product of $f$ with the kernel section $\kappa(\cdot,x)$.
- This is beneficial if $f$ is expensive or inconvenient to evaluate directly.
- Furthermore, this inner product can be expressed entirely through $\kappa$, and thus an explicit formula for $f$ is no longer needed.
For $f(\cdot) = \sum_{i=1}^n \alpha_i \kappa(\cdot,x_i) \in H_0$, $$ \begin{aligned} f(x) &= \sum_{i=1}^n \alpha_i \kappa(x_i,x) \\ &= \sum_{i=1}^n \alpha_i \langle \kappa(\cdot,x_i), \kappa(\cdot,x) \rangle_{H_0} \\ &= \left\langle \sum_{i=1}^n \alpha_i \kappa(\cdot,x_i), \kappa(\cdot,x) \right\rangle_{H_0} \\ &= \langle f, \kappa(\cdot,x) \rangle_{H_0}. \end{aligned} $$ By continuity, this extends to all $f \in H$.
- The above means the direct evaluation of $f$ is avoided once $f$ is represented by a set of coefficients $\alpha_1,\alpha_2,\cdots,\alpha_n$ together with kernel evaluations (as sketched below).
- This is the basis for kernel ridge regression!
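A minimal sketch of this point (Gaussian kernel, arbitrary coefficients and centers): the function $f\in H_0$ is stored only as its coefficients and centers, and any evaluation $f(x) = \langle f, \kappa(\cdot,x)\rangle_{H_0} = \sum_i \alpha_i \kappa(x_i,x)$ reduces to kernel calls.

```python
import numpy as np

def kappa(x, y, ell=0.5):
    # Gaussian kernel; an arbitrary symmetric positive-definite choice
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(2)
centers = rng.uniform(0, 1, 5)      # x_1, ..., x_n
alpha = rng.normal(size=5)          # coefficients: this pair fully represents f in H_0

def f(x):
    # f(x) = <f, kappa(., x)>_H = sum_i alpha_i kappa(x_i, x): only kernel evaluations needed
    return alpha @ kappa(centers, x)

print(f(0.3), f(0.7))
```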
Norm and Metric¶
With inner product we can define norm and metric as discussed before.
Norm¶
For $f(\cdot)=\kappa(\cdot,x_i)$,
$$ \|f\|_H = \sqrt{\langle f, f \rangle_H} = \sqrt{\kappa(x_i,x_i)} $$
For $f(\cdot) = \sum_{i=1}^n \alpha_i \kappa(\cdot,x_i)$, one can show $$ \|f\|_H = \sqrt{\alpha^T K\alpha} $$ where $[K]_{ij} = \kappa(x_i,x_j)$.
Proof left as exercise
Metric¶
For $f(\cdot)=\kappa(\cdot,x_i)$ and $g(\cdot)=\kappa(\cdot,x_j)$,
$$ \begin{align*} d_H(f,g)^2 &= \|f-g\|_H^2 = \langle f-g, f-g \rangle \\ &= \langle f, f \rangle + \langle g, g \rangle - 2\langle f, g \rangle \\ &= \kappa(x_i,x_i) + \kappa(x_j,x_j) - 2 \kappa(x_i,x_j) \end{align*} $$
More generally, for $f(\cdot) = \sum_{i=1}^n \alpha_i \kappa(\cdot,x_i)$ and $g(\cdot) = \sum_{j=1}^m \beta_j \kappa(\cdot,x_j)$
$$ d_H(f,g) = \sqrt{\alpha^T K_n\alpha + \beta^T K_m \beta - 2 \alpha^T K_{nm}\beta} $$
Proof left as exercise
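A sketch of these formulas in code (Gaussian kernel; the coefficients and centers are arbitrary): $\|f\|_H = \sqrt{\alpha^T K_n\alpha}$ and the metric $d_H(f,g)$ from the Gram and cross-Gram matrices.

```python
import numpy as np

def kappa(x, y, ell=0.5):
    # Gaussian kernel; an arbitrary symmetric positive-definite choice
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(3)
xf, alpha = rng.uniform(0, 1, 4), rng.normal(size=4)   # f = sum_i alpha_i kappa(., xf_i)
xg, beta = rng.uniform(0, 1, 3), rng.normal(size=3)    # g = sum_j beta_j  kappa(., xg_j)

K_n = kappa(xf[:, None], xf[None, :])                  # Gram matrix of f's centers
K_m = kappa(xg[:, None], xg[None, :])                  # Gram matrix of g's centers
K_nm = kappa(xf[:, None], xg[None, :])                 # cross Gram matrix

norm_f = np.sqrt(alpha @ K_n @ alpha)                  # ||f||_H
d_fg = np.sqrt(alpha @ K_n @ alpha + beta @ K_m @ beta - 2 * alpha @ K_nm @ beta)  # d_H(f, g)
print(norm_f, d_fg)
```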
Smoothness¶
For any $f \in H$ and any $x,x' \in X$, $$ \begin{aligned} |f(x) - f(x')| &= |\langle f, \kappa(\cdot,x) - \kappa(\cdot,x') \rangle_H|\quad \text{(Reproducing prop.)} \\ &\le \|f\|_H \, \|\kappa(\cdot,x) - \kappa(\cdot,x')\|_H\quad \text{(Cauchy–Schwarz inequality)} \\ &= \|f\|_H \, d_H(x,x').\quad \text{(Kernel-induced metric)} \end{aligned} $$
- In other words, the RKHS norm controls smoothness and stability.
- This motivates the use of RKHS norm to promote smoothness in kernel ridge regression.
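The bound can be spot-checked numerically, as in the sketch below (Gaussian kernel, an arbitrary $f\in H_0$, random pairs $x,x'$): compare $|f(x)-f(x')|$ against $\|f\|_H\, d_H(x,x')$, where $d_H(x,x') = \sqrt{\kappa(x,x)+\kappa(x',x')-2\kappa(x,x')}$.

```python
import numpy as np

def kappa(x, y, ell=0.5):
    # Gaussian kernel; an arbitrary symmetric positive-definite choice
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(4)
centers, alpha = rng.uniform(0, 1, 5), rng.normal(size=5)   # f = sum_i alpha_i kappa(., x_i)
f = lambda x: alpha @ kappa(centers, x)
norm_f = np.sqrt(alpha @ kappa(centers[:, None], centers[None, :]) @ alpha)   # ||f||_H

for _ in range(5):
    x, xp = rng.uniform(0, 1, 2)
    lhs = abs(f(x) - f(xp))
    d = np.sqrt(kappa(x, x) + kappa(xp, xp) - 2 * kappa(x, xp))  # kernel-induced metric d_H(x, x')
    print(lhs <= norm_f * d + 1e-12)                             # the smoothness bound holds
```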
More on Kernel: Mercer's theorem¶
Before we look at examples of RKHS, it is beneficial to introduce Mercer's theorem to provide a deeper understanding of kernels.
A kernel $\kappa : X\times X \rightarrow \mathbb{R}$ induces an integral operator $T_\kappa : L^2(X) \to L^2(X)$ $$ [T_\kappa f](x) = \int_X \kappa(x,z) f(z)\, dz, $$ where $L^2(X)$ denotes the space of square-integrable functions on $X$.
To be rigorous, we need compact $X$ and continuous $\kappa$ for Mercer.
Note that $T_\kappa$ is a linear operator, so it has a spectral decomposition similar to that of a matrix.
- The difference is that, instead of eigenvectors, $T_\kappa$ admits eigenfunctions, e.g., $$ T_\kappa\varphi = \lambda \varphi,\quad\text{or}\quad [T_\kappa \varphi](x) = \int_X \kappa(x,z) \varphi(z)\, dz = \lambda \varphi(x), $$
- General linear operators can have generalized versions of eigenvalues and eigenfunctions, which do not have direct analogy in matrices.
- But $T_\kappa$ happens to be "compact, continuous, self-adjoint, and positive", so generalized eigenpairs are not a concern.
By the spectral theorem for compact self-adjoint operators, the eigenpairs of $T_\kappa$ satisfy $$ [T_\kappa \varphi_i](x) = \sigma_i \varphi_i(x), $$ such that
- The (nonzero) eigenvalues form an at most countable decreasing sequence $\{\sigma_i\}_{i \ge 1}$ such that $\lim_{i \to \infty} \sigma_i = 0$,
- and $\sigma_i > 0$ for all $i$.
- The eigenfunctions $\{\varphi_i\}$ are orthonormal in $L^2(X)$; for a strictly positive-definite kernel they form an orthonormal basis of $L^2(X)$.
Mercer's theorem leads to the Mercer representation of the kernel, $$ \kappa(x,y) = \sum_{i=1}^{\infty} \sigma_i \varphi_i(x)\varphi_i(y), $$ which converges uniformly, $$ \lim_{n \to \infty} \sup_{u,v \in X} \left| \kappa(u,v) - \sum_{i=1}^{n} \sigma_i \varphi_i(u)\varphi_i(v) \right| = 0. $$
Furthermore, the RKHS $H$ associated with $\kappa$ is given by $$ H = \left\{ f \in L^2(X) \;\middle|\; \sum_{i=1}^{\infty} \frac{\langle f, \varphi_i \rangle_{L^2}^2}{\sigma_i} < \infty \right\}, $$
with inner product $$ \langle f, g \rangle_H = \sum_{i=1}^{\infty} \frac{\langle f, \varphi_i \rangle_{L^2} \langle g, \varphi_i \rangle_{L^2}}{\sigma_i}. $$
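A sketch of how the Mercer expansion can be approximated numerically (a Nyström-style discretization of $T_\kappa$ on a uniform grid; the Gaussian kernel, grid size, and truncation level are arbitrary choices): eigenvectors of the scaled Gram matrix approximate the eigenfunctions, and the truncated expansion reconstructs $\kappa$.

```python
import numpy as np

def kappa(x, y, ell=0.3):
    # Gaussian kernel on [0, 1]; an arbitrary continuous positive-definite choice
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

m = 400
x = (np.arange(m) + 0.5) / m                  # quadrature nodes on [0, 1], each with weight 1/m
K = kappa(x[:, None], x[None, :])

# Discretized integral operator: [T_kappa f](x_i) ≈ (1/m) sum_j K_ij f(x_j)
sig, V = np.linalg.eigh(K / m)
sig, V = sig[::-1], V[:, ::-1]                # sort eigenvalues in decreasing order
phi = V * np.sqrt(m)                          # rescale so (1/m) sum_i phi[i, k]**2 = 1 (discrete L^2 normalization)

# Truncated Mercer reconstruction: kappa(x, y) ≈ sum_{k < r} sigma_k phi_k(x) phi_k(y)
r = 20
K_r = (phi[:, :r] * sig[:r]) @ phi[:, :r].T
print(np.abs(K - K_r).max())                  # uniform error of the truncated expansion (very small here)
```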
Examples of RKHS¶
One-Dimensional Linear Kernel¶
Starting from a trivial case: $$ \kappa(x,y) = xy, \quad x,y \in [0,1]. $$
Check if kernel¶
- Symmetry is apparent (we will skip this for the other examples)
- Positive definiteness $$ \sum_{i,j=1}^n \alpha_i \alpha_j \kappa(x_i,x_j) = \left( \sum_{i=1}^n \alpha_i x_i \right)^2 \ge 0, $$
So $\kappa$ is a kernel.
What space is the RKHS¶
By construction, the RKHS contains $$ f(x') = \sum_{i=1}^n \alpha_i\kappa(x',x_i) = \left(\sum_{i=1}^n \alpha_ix_i\right)x' \equiv a x' $$ Due to arbitrariness of $\alpha_i$ and $x_i$, we see $$ H = \{ f : f(x) = ax,\ a \in \mathbb{R} \}. $$
Reproducing property¶
For $f(x)=ax$, which is $\kappa(\cdot,a)$ $$ \langle f, \kappa(\cdot,x) \rangle_H = \langle \kappa(\cdot,a), \kappa(\cdot,x) \rangle_H =\kappa(a,x) = ax = f(x). $$
The second equality is the defining inner product between kernel sections, $\langle \kappa(\cdot,x_i), \kappa(\cdot,x_j) \rangle_H = \kappa(x_i,x_j)$.
Mercer expansion¶
The integral operator is $$ (T_\kappa f)(x) = \int_0^1 xy f(y)\,dy = \left(\int_0^1 y f(y)\,dy\right)x \equiv \mu x, $$ where $\mu$ is a constant depending on $f$. This indicates that any eigenfunction with a nonzero eigenvalue must be proportional to $x$, say $\varphi(x)=cx$.
We also know $\varphi(x)$ is normalized w.r.t. the $L^2$ norm, so $$ \|\varphi\|_{L^2}^2 = \int_0^1 (c x)^2 dx = 1\quad \Rightarrow\quad c=\sqrt{3}, $$ and $\varphi(x)=\sqrt{3}x$. The eigenvalue follows from $$ [T_\kappa \varphi](x) = \left(\int_0^1 y\cdot\sqrt{3}y\, dy\right)x = \frac{\sqrt{3}}{3}x = \frac{1}{3}\varphi(x), $$ i.e., $\lambda = 1/3$.
And this is the only eigenfunction with a nonzero eigenvalue, i.e., this RKHS is one-dimensional.
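This can be cross-checked by discretizing $T_\kappa$ on a uniform grid of $[0,1]$, as in the sketch below (the grid size is an arbitrary choice): the only nonzero eigenvalue is close to $1/3$, and the corresponding eigenfunction is close to $\sqrt{3}\,x$.

```python
import numpy as np

m = 1000
x = (np.arange(m) + 0.5) / m                 # quadrature nodes on [0, 1]
K = np.outer(x, x)                           # kappa(x, y) = x * y

sig, V = np.linalg.eigh(K / m)               # discretized integral operator (eigenvalues ascending)
print(sig[-1])                               # ≈ 1/3 (the only nonzero eigenvalue)
print(sig[-2])                               # ≈ 0

phi = V[:, -1] * np.sqrt(m)                  # discretely L^2-normalized eigenfunction
print(np.allclose(np.abs(phi), np.sqrt(3) * x, atol=1e-2))   # ≈ sqrt(3) x (up to sign)
```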
Second-Order Polynomial Kernel¶
Slightly more complex, and generalizable to kernels built from a finite number of features:
$$ \kappa(x,y) = (1 + xy)^2, \quad x,y \in [-1,1]. $$
Check if kernel (positive definiteness)¶
Using the feature map approach, and let $$ \phi(x) = [1, \sqrt{2}x, x^2]^T, $$ then $$ \kappa(x,y) = \phi(x)^T\phi(y) $$ and $$ \sum_{i,j=1}^n \alpha_i \alpha_j \kappa(x_i,x_j) = \sum_{i,j=1}^n \alpha_i \alpha_j \phi(x_i)^T\phi(x_j) = \left\| \sum_{i=1}^n \alpha_i \phi(x_i) \right\|^2 \ge 0, $$ So $\kappa$ is a kernel.
What space is the RKHS¶
By construction, the RKHS contains $$ f(x') = \sum_{i=1}^n \alpha_i\kappa(x',x_i) = \left(\sum_{i=1}^n \alpha_i\phi(x_i)\right)^T\phi(x') \equiv a^T \phi(x') $$ Due to arbitrariness of $\alpha_i$ and $x_i$, we see that $H$ is the set of all polynomials of degree at most two on $[-1,1]$, $$ H = \{ f : f(x) = a^T\phi(x),\ a \in \mathbb{R}^3 \}. $$
Reproducing property¶
For $f(x)=ax^2+bx+c$ with arbitrary $a,b,c\in\mathbb{R}$, one can choose $x_i$, $i=1,2,3$, such that $\phi(x_i)$ are linearly independent, and represent
$$ [c, b/\sqrt{2}, a]^T = \sum_{i=1}^3\alpha_i \phi(x_i). $$
Then $$ f(x) = [c, b/\sqrt{2}, a]\,\phi(x) = \sum_{i=1}^3\alpha_i \phi(x_i)^T\phi(x) = \sum_{i=1}^3\alpha_i \kappa(x,x_i), $$ and $$ \langle f, \kappa(\cdot,x) \rangle_H = \left\langle \sum_{i=1}^3\alpha_i \kappa(\cdot,x_i), \kappa(\cdot,x) \right\rangle_H = \sum_{i=1}^3\alpha_i \kappa(x,x_i) = f(x). $$
Mercer expansion¶
The integral operator is $$ (T_\kappa f)(x) = \int_{-1}^1 \phi(x)^T\phi(y) f(y)\,dy = \phi(x)^T\left(\int_{-1}^1 \phi(y) f(y)\,dy\right) \equiv w^T \phi(x), $$ which indicates that the eigenfunctions must be linear combinations of the components of $\phi(x)$, i.e., quadratic polynomials.
Assuming $\varphi_i(x) = \phi(x)^T w_i$, the eigenvalue problem reduces to that of a $3\times 3$ matrix $$ \int_{-1}^1 \phi(x)^T\phi(y) \phi(y)^T w_i\,dy = \lambda_i \phi(x)^T w_i\,\Rightarrow\, \left(\int_{-1}^1 \phi(y) \phi(y)^T\,dy\right) w_i = \lambda_i w_i $$ The solution is not worked out here, but one obtains three positive eigenvalues and three polynomial eigenfunctions of degree at most two (see the sketch below).
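Here is a minimal sketch of the reduced $3\times 3$ eigenproblem (numerical quadrature on $[-1,1]$; the grid resolution is an arbitrary choice): form the moment matrix $\int_{-1}^1\phi(y)\phi(y)^T dy$ and read off its eigenpairs.

```python
import numpy as np

phi = lambda y: np.stack([np.ones_like(y), np.sqrt(2) * y, y ** 2])   # feature map, shape (3, len(y))

y = np.linspace(-1, 1, 20001)
P = phi(y)
M = np.trapz(P[:, None, :] * P[None, :, :], y, axis=-1)   # M = \int_{-1}^{1} phi(y) phi(y)^T dy

lam, W = np.linalg.eigh(M)       # eigenvalues lambda_i and coefficient vectors w_i (columns of W)
print(lam)                       # three positive eigenvalues
# The eigenfunctions are varphi_i(x) = phi(x)^T w_i, i.e., polynomials of degree at most two.
```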
Brownian Motion Kernel¶
A more interesting kernel having infinitely many features, $$ \kappa(x,x') = \min(x,x'), \quad x,x' \in [0,1]. $$
Check if kernel (positive definiteness)¶
The trick is to use an indicator function $\mathbb{1}_x(t)$ that is $1$ when $t<x$ and $0$ otherwise.
Then the kernel is represented as an inner product in $L^2$ $$ \kappa(x,x') = \int_0^1 \mathbb{1}_x(t)\mathbb{1}_{x'}(t) dt = \langle\mathbb{1}_x, \mathbb{1}_{x'}\rangle_{L^2} $$ and $$ \sum_{i,j=1}^n \alpha_i \alpha_j \kappa(x_i,x_j) = \sum_{i,j=1}^n \alpha_i \alpha_j \langle\mathbb{1}_{x_i}, \mathbb{1}_{x_j}\rangle = \left\| \sum_{i=1}^n \alpha_i \mathbb{1}_{x_i} \right\|^2 \ge 0, $$ So $\kappa$ is a kernel.
What space is the RKHS¶
Here we do not prove in detail, but the RKHS is $$ H = \left\{ f : [0,1] \to \mathbb{R} \mid f,f' \in L^2[0,1], \ f(0)=0 \right\}, $$ with inner product $$ \langle f,g \rangle_H = \int_0^1 f'(x) g'(x)\,dx. $$
Reproducing property¶
Note that the derivative of a kernel section with respect to its first argument is $\frac{\partial}{\partial t}\kappa(t,x)=\mathbb{1}_x(t)$. Then, between two kernel sections,
$$ \langle \kappa(\cdot,x), \kappa(\cdot,x') \rangle_H = \int_0^1 \mathbb{1}_{x}(t)\mathbb{1}_{x'}(t) dt = \min(x,x') = \kappa(x,x') $$
and for a function $f\in H$ (so $f(0)=0$)
$$ \langle f, \kappa(\cdot,x) \rangle_H = \int_0^1 f'(t)\mathbb{1}_{x}(t)\,dt = \int_0^x f'(t) dt = f(x) $$
Mercer expansion¶
The eigenproblem of the integral operator is $$ (T_\kappa \varphi)(x) = \int_0^1 \min(x,y) \varphi(y)\,dy = \int_0^x y\varphi(y)\, dy + \int_x^1 x\varphi(y)\, dy = \lambda \varphi(x). $$ Differentiating twice converts the integral equation to an ODE with boundary conditions, $$ \lambda \varphi''(x) + \varphi(x) = 0,\quad \varphi(0)=0,\quad \varphi'(1) = 0. $$ The eigenpairs are thus, with normalizing factors $C_i$, $$ \varphi_i(x) = C_i\sin \frac{x}{\sqrt{\lambda_i}},\quad \lambda_i = \left(\frac{2}{(2i-1)\pi}\right)^2. $$
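A numerical cross-check of these eigenvalues, as a sketch (discretize $T_\kappa$ on a uniform grid of $[0,1]$; the grid size is arbitrary), comparing the leading eigenvalues of the discretized operator to $\left(\frac{2}{(2i-1)\pi}\right)^2$:

```python
import numpy as np

m = 1000
x = (np.arange(m) + 0.5) / m
K = np.minimum(x[:, None], x[None, :])        # Brownian motion kernel min(x, x')

sig = np.linalg.eigvalsh(K / m)[::-1]         # discretized integral operator, eigenvalues decreasing
i = np.arange(1, 6)
analytic = (2 / ((2 * i - 1) * np.pi)) ** 2   # lambda_i from the ODE solution
print(sig[:5])
print(analytic)                               # the two should agree closely for large m
```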
Translation-Invariant Periodic Kernel¶
This is a large class of kernels of the form $$ \kappa(x,y) = \psi(x-y), $$ for $x,y\in \mathbb{R}$ and an even and $p$-periodic function $\psi$. For simplicity, let $p=2\pi$.
Mercer expansion¶
Because this is a generic family of kernels, we just look at the eigenpairs. Over one period, the integral operator is $$ \int_{-\pi}^\pi \psi(x-y)\varphi(y)dy = \lambda\varphi(x) $$
One can verify that $\sin$/$\cos$ functions are the eigenfunctions, and the kernel expansion turns out to be precisely the cosine series $$ \kappa(x,y) = \psi(x-y) = \sum_{n=0}^{\infty} \lambda_n \cos(n(x-y)) $$ where the coefficients $\lambda_n$ are the cosine-series coefficients of $\psi$, which coincide (up to normalization) with the eigenvalues of $T_\kappa$.
The RKHS¶
From Mercer's theorem,
- $\kappa(x,y)$ is a (positive semi-definite) kernel if and only if $\lambda_n \ge 0$ for all $n$.
- In this case, the RKHS is the set of $p$-periodic functions whose Fourier coefficients are square-summable when weighted by $\lambda_n^{-1}$.
Kernel Ridge Regression Revisited¶
Now we look at KRR from the RKHS view.
Assume a dataset $\{(x_i,y_i)\}_{i=1}^n$ and a kernel $\kappa$.
Kernel Regression¶
We consider a model from the RKHS, $$ f(\cdot)=\sum_{i=1}^n \alpha_i \kappa(\cdot,x_i) $$
By the reproducing property, we have $$ \langle f, \kappa(\cdot, x_j) \rangle_H = \sum_{i=1}^n \alpha_i \kappa(x_i,x_j) = f(x_j) $$
We would like to solve $$ \min_f \sum_{j=1}^n (f(x_j)-y_j)^2, $$ which translates to $$ \min_\alpha \|K\alpha-y\|^2 $$
Adding the Ridge¶
From the smoothness property, we know the smoothness is controlled by $\|f\|_H$. To regularize the regression, it is natural to try $$ \min_\alpha \|K\alpha-y\|^2 + \lambda \|f\|_H^2 $$
Note that $$ \|f\|_H^2 = \langle f, f\rangle_H = \alpha^T K\alpha $$
Then one can recover the KRR solution $$ \alpha = (K+\lambda I)^{-1}y $$
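Finally, a minimal end-to-end sketch of KRR from this RKHS view (Gaussian kernel and synthetic noisy data; the lengthscale, ridge parameter $\lambda$, and data-generating function are arbitrary choices):

```python
import numpy as np

def kappa(x, y, ell=0.2):
    # Gaussian kernel; an arbitrary symmetric positive-definite choice
    return np.exp(-(x - y) ** 2 / (2 * ell ** 2))

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=30)    # synthetic noisy training data

lam = 1e-2
K = kappa(x[:, None], x[None, :])
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)      # alpha = (K + lambda I)^{-1} y

f = lambda xs: kappa(xs[:, None], x[None, :]) @ alpha     # f(.) = sum_i alpha_i kappa(., x_i)
print(np.sqrt(alpha @ K @ alpha))                         # ||f||_H, the quantity penalized by the ridge term
print(f(np.array([0.25, 0.5, 0.75])))                     # predictions at a few test points
```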