We say a function follows a GP, $$ f(\vx) \sim \GP(m(\vx),k(\vx,\vx')) $$ with mean function $m(\vx)$ and covariance function $k(\vx,\vx')$, when $$ \begin{aligned} m(\vx) &= \mathrm{E}[f(\vx)] \\ k(\vx,\vx') &= \mathrm{E}[(f(\vx)-m(\vx))(f(\vx')-m(\vx'))] \end{aligned} $$ The joint distribution of a set of points $\cD=\{ \vX,\vy\}$ sampled from the GP is Gaussian, $$ \vy\sim\cN(\vm,\vK) $$ where $[\vm]_i=m(\vX_i)$ and $[\vK]_{ij}=k(\vX_i,\vX_j)$.
A covariance function of a Gaussian process is commonly written as, $$ k(\vx_i,\vx_j) = \sfs R(\vx_i,\vx_j) $$ where $\sfs$ is the signal variance and $R(\vx_i,\vx_j)$ is a correlation function.
When the observations are noisy, the covariance function is modified as, $$ k(\vx_i,\vx_j) = \sfs R(\vx_i,\vx_j) + \sns \delta_{ij} \equiv \sfs [R(\vx_i,\vx_j) + \tsn \delta_{ij}] $$ where $\sns$ is the noise variance and $\tsn=\sns/\sfs$.
Some possible covariance functions, written as functions of the distance $r=||\vx-\vx'||$, are compared below.
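The kernel helpers used below (SEKernel, M3Kernel, PPKernel, CosKernel) are defined elsewhere in these notes; as a rough stand-in, here is one plausible set of definitions, each taking an array of distances and a length scale (the piecewise-polynomial and cosine forms in particular are assumptions, and the imports are included in case they were not loaded earlier).
# Hypothetical stand-ins for the kernel helpers used below
import numpy as np
import matplotlib.pyplot as plt

def SEKernel(r, l):    # squared exponential
    return np.exp(-0.5 * (r / l)**2)

def M3Kernel(r, l):    # Matern 3/2
    a = np.sqrt(3.0) * np.abs(r) / l
    return (1.0 + a) * np.exp(-a)

def PPKernel(r, l):    # compactly supported piecewise polynomial (assumed form)
    return np.maximum(0.0, 1.0 - np.abs(r) / l)

def CosKernel(r, l):   # cosine (assumed form)
    return np.cos(r / l)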
# Example covariance functions, plotted against the distance r = ||x - x'||
N = 1001
r = 0.1                      # length scale used by all kernels
X = np.linspace(0, 1, N)     # grid of distances (reused later as the input grid)
figure = plt.figure(figsize=(6,4))
plt.plot(X, SEKernel(X,r), 'b-', label='Squared Exponential')
plt.plot(X, M3Kernel(X,r), 'r-', label='Matern 3/2')
plt.plot(X, PPKernel(X,r), 'k-', label='Piecewise Polynomial')
plt.plot(X, CosKernel(X,r), 'm--', label='Cosine')
plt.xlabel('r')
plt.ylabel('Cov')
_ = plt.legend(loc=0)
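The realizations S_SE, S_MT, S_PP, S_CS plotted next are generated elsewhere in these notes; one minimal way to draw such samples, assuming the kernel sketches above and a zero mean, is shown below.
# Hypothetical sketch: draw one realization per kernel on the grid X
D = np.abs(X[:, None] - X[None, :])       # pairwise distances between grid points
z = np.random.randn(N)                    # shared standard-normal draws for comparability
def drawSample(kernel):
    K = kernel(D, r) + 1e-8 * np.eye(N)   # covariance matrix plus a small jitter
    return np.linalg.cholesky(K) @ z
S_SE = drawSample(SEKernel)
S_MT = drawSample(M3Kernel)
S_PP = drawSample(PPKernel)
S_CS = drawSample(CosKernel)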
# Example realizations of GPs with different covariance functions
# (S_SE, S_MT, S_PP, S_CS are sample paths drawn on the grid X)
figure = plt.figure(figsize=(6,4))
plt.plot(X, S_SE, 'b-', label='Squared Exponential')
plt.plot(X, S_MT, 'r-', label='Matern 3/2')
plt.plot(X, S_PP, 'k-', label='Piecewise Polynomial')
plt.plot(X, S_CS, 'm--', label='Cosine')
plt.xlabel('X')
plt.ylabel('Sample')
_ = plt.legend(loc=0)
A typical choice of $R$ is the squared exponential function, $$ R(\vx_i,\vx_j) = \exp[-(\vx_i-\vx_j)^T\vM(\vx_i-\vx_j)] $$ In the isotropic case, $\vM=\frac{1}{2l^2}\vI$, i.e. the length scale is the same in every dimension; in the anisotropic case, $\vM=\half \mathrm{diag}(\vl)^{-2}$ is a diagonal matrix with positive entries, giving a separate length scale for each dimension.
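As a small illustration (a hypothetical helper, not part of these notes), the anisotropic squared exponential with a length-scale vector $\vl$ can be coded as below.
# Hypothetical sketch of the anisotropic squared exponential,
# with one length scale per input dimension (M = 0.5 * diag(l)^{-2})
def SEKernelAniso(xi, xj, l):
    d = (np.asarray(xi) - np.asarray(xj)) / np.asarray(l)
    return np.exp(-0.5 * np.dot(d, d))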
Hyperparameters so far: the signal variance $\sfs$, the noise variance $\sns$ (or the ratio $\tsn$), and the length scales $\vl$.
Return to the dataset $\cD=\{ \vX,\vy\}$ sampled from GP: $$ \vy\sim\cN(\vm,\vK) $$ where $[\vm]_i=m(\vX_i)$ and $[\vK]_{ij}=k(\vX_i,\vX_j)$. Note that scalar output is assumed.
Now split the points into two sets: the training set $\vX_s$ with known outputs $\vy_s$, and the target set $\vX_u$ with unknown outputs $\vy_u$. The joint distribution is still Gaussian, with the following mean, $$ \vm^T = [\vm(\vX_s)^T, \vm(\vX_u)^T] \equiv [\vm_s^T, \vm_u^T] $$ and the block covariance matrix, $$ \vK = \begin{bmatrix} \vK_{ss} & \vK_{su} \\ \vK_{us} & \vK_{uu} \end{bmatrix} $$ where the $(i,j)$th element of each block is built from the corresponding rows of the inputs, $$ [\vK_{ab}]_{ij} = k([\vX_a]_i, [\vX_b]_j) $$
When $\vy_s$ is noisy, a white noise term with variance $\sigma_n^2$ is added to the Gaussian distribution, adding a diagonal matrix to $\vK_{ss}$ $$ \vK_y = \vK_{ss} + \sigma_n^2 \vI_{ss} $$
The distribution of $\vy_u$ given $\vy_s$ is found by conditional probability, $$ \vy_u|\vy_s \sim \cN(\vm_p, \vK_p) $$ where the predictive mean is, $$ \vm_p = \vm_u + \vK_{us}\vK_y^{-1}(\vy_s-\vm_s) $$ and the (predictive) covariance is, $$ \vK_p = \vK_{uu} - \vK_{us}\vK_y^{-1}\vK_{su} $$
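A minimal sketch of these two formulas, assuming zero mean functions and precomputed covariance blocks (this is not the notebook's genGP helper), is given below.
# Hypothetical sketch of the conditional (predictive) distribution,
# assuming zero mean functions m_s = m_u = 0
def predictGP(Kss, Ksu, Kuu, ys, sn2=0.0):
    Ky = Kss + sn2 * np.eye(Kss.shape[0])         # add the noise variance on the diagonal
    mp = Ksu.T @ np.linalg.solve(Ky, ys)          # predictive mean
    Kp = Kuu - Ksu.T @ np.linalg.solve(Ky, Ksu)   # predictive covariance
    return mp, Kp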
What are the differences between the predictive mean and the kernel ridge formulation?
# Posterior GPs conditioned on a few data points
# genGP and plotGP are helpers defined earlier in these notes, presumably returning
# (and plotting) the predictive mean, standard deviation, and nsmp posterior samples
data = np.array([
    [0.2, 0.4, 0.8], # Input
    [0.8,-0.2, 0.2]  # Output
])
mp, std, fp = genGP(X, data, SEKernel, r, nsmp=5)
plotGP(X, data, mp, std, fp)
The form of the mean function is usually unknown beforehand, so it is more common to use a set of basis functions, $$ m(\vx) = \vh(\vx)^T\cb $$ The coefficients $\cb$ have to be fitted from the sample data. Under a Bayesian framework, $\cb$ is assumed to follow, $$ \cb \sim \cN(\vb, \vB) $$ Taking $\cb$ into account, the GP is modified as, $$ f(\vx) \sim \GP(\vh(\vx)^T\vb,k(\vx,\vx')+\vh(\vx)^T\vB\vh(\vx')) $$ Extra terms due to the basis functions must then be added to the covariance matrices, $$ \vK_{ab} = \vK(\vX_a, \vX_b) + \vH_a \vB \vH_b^T,\quad [\vH_*]_i=\vh([\vX_*]_i)^T $$
The extra terms make the predictive mean and covariance from the previous section extremely cumbersome to compute. The needed simplifications are enabled by the matrix inversion lemma (Woodbury identity), $$ (\vA+\vU\vC\vV^T)^{-1} = \vA^{-1} - \vA^{-1}\vU (\vC^{-1}+\vV^T\vA^{-1}\vU)^{-1} \vV^T\vA^{-1} $$
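The lemma is easy to verify numerically with random stand-in matrices (purely illustrative).
# Numerical check of the matrix inversion (Woodbury) lemma
rng = np.random.default_rng(0)
n, m = 6, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
A = A @ A.T                                   # make A symmetric positive definite
U = rng.standard_normal((n, m))
C = np.eye(m)
V = U.copy()
lhs = np.linalg.inv(A + U @ C @ V.T)
rhs = np.linalg.inv(A) - np.linalg.inv(A) @ U @ np.linalg.inv(
    np.linalg.inv(C) + V.T @ np.linalg.inv(A) @ U) @ V.T @ np.linalg.inv(A)
print(np.allclose(lhs, rhs))                  # True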
[Skip in class]
In the current case, the lemma is applied as follows, $$ (\vK_y + \vH \vB \vH^T)^{-1} = \vKyi-\vKyi\vH \vC^{-1} \vH^T \vKyi \equiv \vKyi-\vA $$
where $\vC=\vB^{-1} + \vH^T \vKyi \vH$ and the subscript $s$ is dropped from $\vH_s$. Furthermore, the following identities can be derived, $$ \begin{aligned} \vB\vH^T(\vKyi-\vA) &= \vC^{-1}\vH^T\vKyi \\ (\vKyi-\vA)\vH\vB &= \vKyi\vH\vC^{-1} \end{aligned} $$ The simplification procedure is basically to cancel out matrices by combining the $\vH$ and $(\vKyi-\vA)$ terms.
[Skip in class]
The predictive mean is simplified to, $$ \begin{aligned} \vm_p^* &= \vH_u\vb + (\vK_{us} + \vH_u\vB\vH^T)(\vK_y + \vH\vB\vH^T)^{-1}(\vy_s-\vH\vb) \\ &= \vK_{us}\vKyi\vy_s + \vD^T \vC^{-1} (\vH^T \vKyi \vy_s + \vB^{-1}\vb) \end{aligned} $$ where $\vD=\vH_u^T-\vH^T \vKyi \vK_{su}$. The covariance matrix is simplified to, $$ \begin{aligned} \vK_p^* &= \vK_{uu} + \vH_u\vB\vH_u^T - (\vK_{us}+\vH_u\vB\vH^T) (\vK_y + \vH\vB\vH^T)^{-1} (\vK_{su}+\vH\vB\vH_u^T) \\ &= \vK_p + \vD^T \vC^{-1} \vD \end{aligned} $$
Finally, consider the case where the coefficients are distributed uniformly rather than Gaussian, i.e. the prior becomes vague: $\vb\rightarrow\mathrm{0}$ and $\vB^{-1}\rightarrow \vO$.
With that, we arrive at the commonly used form of GPR, $$ \begin{aligned} \vm_p^* &= \vK_{us}\vKyi\vy_s + \vD^T (\vH^T\vKyi\vH)^{-1} \vH^T\vKyi\vy_s \\ &= \vK_{us}\bar\vg + \vH_u\bar\vb \\ \vK_p^* &= \vK_p + \vD^T (\vH^T\vKyi\vH)^{-1} \vD \end{aligned} $$ where $\bar\vb=(\vH^T\vKyi\vH)^{-1}\vH^T\vKyi\vy_s$ and $\bar\vg=\vKyi(\vy_s-\vH\bar\vb)$. The term $\bar\vb$ is essentially the vector of mean-function coefficients fitted from the sample data. Also, note that some people argue that the last term in $\vK_p^*$ can be neglected for simplicity [Sasena2002].
The computation of $\vm_p^*$ and $\vK_p^*$ can be tricky due to ill-conditioned matrix inversions. The strategy is to combine Cholesky and QR decompositions to stabilize the two inversions, i.e. $\vK_y^{-1}$ and $(\vH^T\vKyi\vH)^{-1}$.
For the inversion of $\vK_y$, use its Cholesky decomposition $\vK_y=\vL\vL^T$, $$ \vKyi\vx = (\vL\vL^T)^{-1}\vx = \vL^{-T} (\vL^{-1}\vx) $$ where $\vL$ is a lower triangular matrix, so the matrix inversion is converted to two consecutive triangular solves.
To invert $\vH^T\vKyi\vH$, $$ \vH^T\vKyi\vH = (\vL^{-1}\vH)^T (\vL^{-1}\vH) \equiv \vF^T\vF = \vR^T\vR $$ where the QR decomposition $\vQ\vR=\vF$ is used, $\vQ$ is an orthogonal matrix, and $\vR$ is an upper triangular matrix. $\vQ$ is tall and thin, because its number of rows equals the number of samples, while its number of columns equals the number of basis functions.
After the stabilization, the quantities $\bar\vb$ and $\bar\vg$ in $\vm_p^*$ are computed as follows, $$ \begin{aligned} \bar\vb &= (\vH^T\vKyi\vH)^{-1}\vH^T\vKyi\vy_s = \vR^{-1} [\vQ^T(\vL^{-1}\vy_s)] \\ \bar\vg &= \vKyi(\vy_s-\vH\bar\vb) = \vL^{-T} [\vL^{-1} (\vy_s-\vH\bar\vb)] \end{aligned} $$
The covariance matrix is computed as follows, $$ \vK_p^* = \vK_{uu} - (\vL^{-1}\vK_{su})^2 + (\vR^{-T} [\vH_u^T - \vF^T(\vL^{-1}\vK_{su})])^2 $$ where $(\Box)^2=\Box^T\Box$.
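Putting the pieces together, here is a minimal sketch of the stabilized prediction (the helper name gprPredict and the use of scipy's triangular solver are illustrative; all input arrays are assumed given).
# Hypothetical sketch of the stabilized GPR prediction via Cholesky + QR
from scipy.linalg import solve_triangular

def gprPredict(Ky, Ksu, Kuu, H, Hu, ys):
    L = np.linalg.cholesky(Ky)                              # Ky = L L^T
    F = solve_triangular(L, H, lower=True)                  # F = L^{-1} H
    Q, R = np.linalg.qr(F)                                  # thin QR, so H^T Ky^{-1} H = R^T R
    Ly = solve_triangular(L, ys, lower=True)                # L^{-1} ys
    b_bar = solve_triangular(R, Q.T @ Ly, lower=False)      # R^{-1} [Q^T (L^{-1} ys)]
    res = solve_triangular(L, ys - H @ b_bar, lower=True)
    g_bar = solve_triangular(L.T, res, lower=False)         # Ky^{-1} (ys - H b_bar)
    mp = Ksu.T @ g_bar + Hu @ b_bar                         # predictive mean
    V = solve_triangular(L, Ksu, lower=True)                # L^{-1} K_su
    W = solve_triangular(R.T, Hu.T - F.T @ V, lower=True)   # R^{-T} [Hu^T - F^T (L^{-1} K_su)]
    Kp = Kuu - V.T @ V + W.T @ W                            # predictive covariance
    return mp, Kp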
So are we done? Not yet: what about the hyperparameters?
For easier treatment, we non-dimensionalize some quantities in the GPR model.
The covariance matrices $\vK_y$ and $\vK_{su}$ can be "non-dimensionalized" by $\sfs$, $$ \begin{aligned} \vK_y &\equiv \sfs \tvK_y \equiv \sfs (\tvK_{ss} + \tsn\vI) \\ \vK_{su} &\equiv \sfs \tvK_{su} \end{aligned} $$
Subsequently, for the predictive mean, $$ \vm_p^* = \vK_{us}\bar\vg + \vH_u\bar\vb = \tvK_{us}\tvg + \vH_u\tvb $$
[Skip in class]
where, $$ \begin{aligned} \bar\vb &= (\vH^T\vKyi\vH)^{-1}\vH^T\vKyi\vy_s = (\vH^T\tKyi\vH)^{-1}\vH^T\tKyi\vy_s \equiv \tvb\\ \bar\vg &= \vKyi(\vy_s-\vH\bar\vb) = \sigma_f^{-2} \tKyi(\vy_s-\vH\tvb) \equiv \sigma_f^{-2} \tvg \end{aligned} $$ And the covariance, $$ \begin{aligned} \vK_p^* &= \vK_{uu} - \vK_{us}\vK_y^{-1}\vK_{su} + \vD^T (\vH^T\vKyi\vH)^{-1} \vD \\ &= \sfs [\tvK_{uu} - \tvK_{us}\tKyi\tvK_{su} + \tvD^T (\vH^T\tKyi\vH)^{-1} \tvD] \end{aligned} $$ where, $$ \vD = \vH_u^T-\vH^T \vKyi \vK_{su} = \vH_u^T-\vH^T \tKyi \tvK_{su} \equiv \tvD $$ In sum, $\sfs$ has no direct effect on the mean, but the covariance is proportional to $\sfs$, once the length scales are given.
The hyperparameters are determined using maximum likelihood estimation (MLE). With the joint Gaussian distribution, the log marginal likelihood of the training data is, $$ \begin{aligned} \cL &= \log p(\vy_s|\vX_s,\vb,\vB) \\ &= - \half(\vy_s-\vH\vb)^T (\vK_y+\vH\vB\vH^T)^{-1} (\vy_s-\vH\vb) \\ &- \half \log |\vK_y+\vH\vB\vH^T| - \frac{n}{2}\log 2\pi \end{aligned} $$ where $n$ is the number of training data points.
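For reference, a minimal sketch of this expression (a hypothetical helper; all arrays are assumed given) is shown below.
# Hypothetical sketch: log marginal likelihood with the Gaussian prior N(b, B)
# on the basis-function coefficients
def logMarginalLikelihood(ys, Ky, H, b, B):
    n = ys.shape[0]
    Kb = Ky + H @ B @ H.T                 # covariance including the basis-function prior
    res = ys - H @ b                      # residual about the prior mean
    _, logdet = np.linalg.slogdet(Kb)
    return -0.5 * (res @ np.linalg.solve(Kb, res) + logdet + n * np.log(2.0 * np.pi))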
[Skip in class]
Utilizing the determinant counterpart of the matrix inversion lemma, $$ |\vA+\vU\vC\vV^T| = |\vA| |\vC| |\vC^{-1} + \vV^T\vA^{-1}\vU| $$ and letting $\vb=0$ and $\vB^{-1}\rightarrow\vO$ as before, $$ -2\cL = \vy_s^T [\vKyi - \vKyi\vH (\vH^T\vKyi\vH)^{-1} \vH^T \vKyi] \vy_s + \log|\vK_y| + \log|\vH^T\vKyi\vH| + (n-m)\log 2\pi $$ where $m$ is the number of basis functions; the shift from $n$ to $n-m$ comes from discarding the singular $\log|\vB|$ term in this limit.
[Skip in class]
An interesting simplification can be done to the first term in the above expression, $$ \begin{aligned} I_1 &= \vy_s^T [\vKyi - \vKyi\vH (\vH^T\vKyi\vH)^{-1} \vH^T \vKyi] \vy_s \\ &= \vy_s^T \vKyi (\vy_s - \vH\bar\vb) \equiv I_2 \\ I_1 &= (\vy_s - \vH\bar\vb)^T \vKyi \vy_s \equiv I_3 \\ I_1 &= \vy_s^T \vKyi \vy_s - \bar\vb^T\vH^T \vKyi \vH\bar\vb \equiv I_4 \\ I_1 &= I_2+I_3-I_4 = (\vy_s - \vH\bar\vb)^T \vKyi (\vy_s - \vH\bar\vb) \end{aligned} $$ This indicates that the term $I_1$ behaves as if the data were centered at $\vH\bar\vb$, with covariance matrix $\vK_y$.
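These identities are easy to check numerically with random stand-ins (purely illustrative).
# Numerical check that I1 equals the centered quadratic form
rng = np.random.default_rng(1)
n, m = 8, 2
G = rng.standard_normal((n, n))
Ky = G @ G.T + n * np.eye(n)              # a well-conditioned SPD stand-in for Ky
H = rng.standard_normal((n, m))
ys = rng.standard_normal(n)
Kyi = np.linalg.inv(Ky)
b_bar = np.linalg.solve(H.T @ Kyi @ H, H.T @ Kyi @ ys)
I1 = ys @ (Kyi - Kyi @ H @ np.linalg.inv(H.T @ Kyi @ H) @ H.T @ Kyi) @ ys
Ic = (ys - H @ b_bar) @ Kyi @ (ys - H @ b_bar)
print(np.allclose(I1, Ic))                # True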
Factoring $\sfs$ out of $\cL$, $$ -2\tcL_1 = \sigma_f^{-2} (\vy_s - \vH\tvb)^T \tKyi (\vy_s - \vH\tvb) + \log|\tvK_y| + \log|\vH^T\tKyi\vH| + (n-m)\log\sfs $$ where the constant term is neglected.
If, instead, the prior on the coefficients of the basis functions is ignored, i.e. the coefficients are simply plugged in rather than marginalized out, the log-likelihood simplifies to, $$ -2\tcL_2 = \sigma_f^{-2} (\vy_s - \vH\tvb)^T \tKyi (\vy_s - \vH\tvb) + \log|\tvK_y| + n\log\sfs $$
A natural next step is to find $\sfs$ by setting $\partial(-2\tcL)/\partial\sfs=0$ and solving for $\sfs$, $$ \sfs = \frac{1}{N}(\vy_s - \vH\tvb)^T \tKyi (\vy_s - \vH\tvb) $$ where $N=n-m$ for $\tcL_1$, and $N=n$ for $\tcL_2$. The former is the degrees-of-freedom-corrected estimate of the variance, while the latter is the plain MLE. When there are multiple outputs, the output variables are usually assumed to be independent, and $\sfs$ is computed separately for each output.
Plugging this value back into the likelihood and removing the constant terms [Welch1992], one obtains the reduced log likelihood, $$ \begin{aligned} -\cF_1 &= \log|\tvK_y| + \log|\vH^T\tKyi\vH| + (n-m)\log\sfs \\ -\cF_2 &= \log|\tvK_y| + n\log\sfs \end{aligned} $$ where the only remaining unknown hyperparameters are the length scales $\vl$. Note that $\cF_2$ should be used if a general mean function, without priors on its coefficients, is employed.
The minimization of $-\cF_2$ is equivalent to the minimization of $$ -\cF^* = \sfs|\tvK_y|^{1/n} $$ where $|\tvK_y|=|\tilde{\vL}|^2$ using Cholesky decomposition $\tvK_y=\tilde{\vL}\tilde{\vL}^T$.
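Tying the pieces together for the $\cF_2$ case, here is a minimal sketch of the criterion as a function of the length scale, assuming 1-D inputs, the SEKernel sketch from above, and a given basis matrix $\vH$ (names and defaults are illustrative).
# Hypothetical sketch of the reduced-likelihood criterion sigma_f^2 |K~_y|^{1/n}
def reducedLikelihood(l, Xs, ys, H, tsn=1e-8):
    n = Xs.shape[0]
    D = np.abs(Xs[:, None] - Xs[None, :])              # pairwise distances
    Kyt = SEKernel(D, l) + tsn * np.eye(n)             # tilde{K}_y = R + tsn * I
    Lc = np.linalg.cholesky(Kyt)
    KiH = np.linalg.solve(Kyt, H)
    Kiy = np.linalg.solve(Kyt, ys)
    b_t = np.linalg.solve(H.T @ KiH, H.T @ Kiy)        # tilde{b}
    res = ys - H @ b_t
    sf2 = res @ np.linalg.solve(Kyt, res) / n          # MLE of sigma_f^2
    detn = np.exp(2.0 * np.mean(np.log(np.diag(Lc))))  # |tilde{K}_y|^{1/n} from the Cholesky factor
    return sf2 * detn                                  # minimize this over l
The criterion can then be minimized over $l$ with a scalar optimizer, e.g. scipy.optimize.minimize_scalar.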