Machine Learning for Engineering¶

Gaussian Process Regression: Probabilistic Formulation¶

Instructor: Daning Huang¶

$$ \newcommand{\GP}{\mathcal{GP}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cb}{\mathcal{b}} \newcommand{\tcL}{\tilde{\mathcal{L}}} \newcommand{\vA}{\mathbf{A}} \newcommand{\vB}{\mathbf{B}} \newcommand{\vC}{\mathbf{C}} \newcommand{\vD}{\mathbf{D}} \newcommand{\vF}{\mathbf{F}} \newcommand{\vH}{\mathbf{H}} \newcommand{\vI}{\mathbf{I}} \newcommand{\vK}{\mathbf{K}} \newcommand{\vL}{\mathbf{L}} \newcommand{\vM}{\mathbf{M}} \newcommand{\vO}{\mathbf{O}} \newcommand{\vQ}{\mathbf{Q}} \newcommand{\vR}{\mathbf{R}} \newcommand{\vU}{\mathbf{U}} \newcommand{\vV}{\mathbf{V}} \newcommand{\vX}{\mathbf{X}} \newcommand{\vb}{\mathbf{b}} \newcommand{\vg}{\mathbf{g}} \newcommand{\vh}{\mathbf{h}} \newcommand{\vg}{\mathbf{g}} \newcommand{\vl}{\mathbf{l}} \newcommand{\vm}{\mathbf{m}} \newcommand{\vx}{\mathbf{x}} \newcommand{\vy}{\mathbf{y}} \newcommand{\tvD}{\tilde{\mathbf{D}}} \newcommand{\tvK}{\tilde{\mathbf{K}}} \newcommand{\tvb}{\tilde{\mathbf{b}}} \newcommand{\tvg}{\tilde{\mathbf{g}}} \newcommand{\vKyi}{\mathbf{K}_y^{-1}} \newcommand{\tKyi}{\tilde{\mathbf{K}}_y^{-1}} \newcommand{\sfs}{\sigma_f^2} \newcommand{\sns}{\sigma_n^2} \newcommand{\tsn}{\tilde{\sigma}_n^2} \newcommand{\half}{\frac{1}{2}} $$

TODAY: Gaussian Process Regression - II¶

  • Gaussian process
  • Probabilistic formulation
  • Computational aspects
  • Determination of hyperparameters

References¶

  • GPML Chps. 2, 4, 5
  • DACE Toolbox Manual

Gaussian Process¶

From wikipedia:

  • A stochastic process is a collection of random variables indexed by time or space
  • A Gaussian process (GP) is a stochastic process, such that every finite collection of those random variables has a multivariate Gaussian distribution

We say a function follows a GP, $$ f(\vx) \sim \GP(m(\vx),k(\vx,\vx')) $$ where $m(\vx)$ is the mean function and $k(\vx,\vx')$ is the covariance function, when $$ \begin{aligned} m(\vx) &= \mathrm{E}[f(\vx)] \\ k(\vx,\vx') &= \mathrm{E}[(f(\vx)-m(\vx))(f(\vx')-m(\vx'))] \end{aligned} $$ The joint distribution of the outputs at a set of points $\cD=\{ \vX,\vy\}$ sampled from the GP is Gaussian, $$ \vy\sim\cN(\vm,\vK) $$ where $[\vm]_i=m(\vX_i)$ and $[\vK]_{ij}=k(\vX_i,\vX_j)$.

Covariance function¶

A covariance function of a Gaussian process is defined as, $$ k(\vx_i,\vx_j) = \sfs R(\vx_i,\vx_j) $$ where

  • $\sfs$ is the process variance
  • $R$ is typically stationary, i.e. $R(\vx,\vx')=R(\vx-\vx')$, and monotonically decreasing.
  • We also assume $R(0)=1$, meaning that a point correlates most strongly with itself.

When the observation is noisy, the covariance function is modified as, $$ k(\vx_i,\vx_j) = \sfs R(\vx_i,\vx_j) + \sns \delta_{ij} \equiv \sfs [R(\vx_i,\vx_j) + \tsn \delta_{ij}] $$ where $\tsn=\sns/\sfs$ is the noise-to-process variance ratio.

Some possible covariance functions ($r=||\vx-\vx'||$)

  • Squared exponential (More discussion later) $$ k(r) = \exp\left(-\frac{r^2}{l^2}\right) $$
  • Matérn class: $$ k(r) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}r}{l}\right)^\nu K_\nu \left(\frac{\sqrt{2\nu}r}{l}\right) $$ where $K_\nu$ is the modified Bessel function of the second kind and $\nu$ controls the smoothness of the process.
  • Piecewise polynomial, e.g. $$ k(r) = (1-r)_+^j[(j+1)r+1] $$ where $j$ depends on dimension, and it has a "compact support", i.e. non-zero within limited range.
  • Periodic kernels, e.g. $$ k(r) = \cos\left(\frac{r}{l}\right) $$
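The plotting cell below calls the kernel helpers SEKernel, M3Kernel, PPKernel, and CosKernel, which are defined in an earlier cell of the notebook that is not shown in this section. A minimal sketch of one possible implementation, following the formulas above with an explicit length scale l (in the piecewise-polynomial kernel the distance is scaled by l only for consistency with the others; the in-class definitions may differ), is:

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

def SEKernel(r, l):
    # Squared exponential: k(r) = exp(-r^2 / l^2)
    return np.exp(-(r / l)**2)

def M3Kernel(r, l):
    # Matern class with nu = 3/2: k(r) = (1 + sqrt(3) r/l) exp(-sqrt(3) r/l)
    s = np.sqrt(3.0) * r / l
    return (1.0 + s) * np.exp(-s)

def PPKernel(r, l, j=2):
    # Piecewise polynomial with compact support: k(r) = (1 - r/l)_+^j [(j+1) r/l + 1]
    s = np.clip(1.0 - r / l, 0.0, None)
    return s**j * ((j + 1) * r / l + 1.0)

def CosKernel(r, l):
    # Periodic (cosine) kernel: k(r) = cos(r / l)
    return np.cos(r / l)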
In [5]:
# Example covariance functions plotted against the distance r
N = 1001
r = 0.1                  # length scale of the kernels
X = np.linspace(0,1,N)   # distances at which the kernels are evaluated
figure = plt.figure(figsize=(6,4))
plt.plot(X, SEKernel(X,r), 'b-', label='Squared Exponential')
plt.plot(X, M3Kernel(X,r), 'r-', label='Matern 3/2')
plt.plot(X, PPKernel(X,r), 'k-', label='Piecewise Polynomial')
plt.plot(X, CosKernel(X,r), 'm--', label='Cosine')
plt.xlabel('r')
plt.ylabel("Cov")
_=plt.legend(loc=0)
In [7]:
# Example realizations of GPs with different covariance functions
figure = plt.figure(figsize=(6,4))
plt.plot(X, S_SE, 'b-', label='Squared Exponential')
plt.plot(X, S_MT, 'r-', label='Matern 3/2')
plt.plot(X, S_PP, 'k-', label='Piecewise Polynomial')
plt.plot(X, S_CS, 'm--', label='Cosine')
plt.xlabel("X")
plt.ylabel("SAMPLE")
_=plt.legend(loc=0)
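The realizations S_SE, S_MT, S_PP, and S_CS plotted above are generated in an earlier cell that is not shown in this section. A minimal sketch of one way to draw such samples, assuming the kernel helpers sketched earlier and treating X as the input grid (sampleGP is an illustrative name, not the course code):

In [ ]:
def sampleGP(X, kernel, l, jitter=1e-8):
    # Build the covariance matrix from pairwise distances and draw one
    # zero-mean sample; the jitter guards against numerical loss of
    # positive definiteness (e.g. for the rank-deficient cosine kernel).
    R = np.abs(X[:, None] - X[None, :])
    K = kernel(R, l) + jitter * np.eye(len(X))
    return np.random.multivariate_normal(np.zeros(len(X)), K)

S_SE = sampleGP(X, SEKernel, r)
S_MT = sampleGP(X, M3Kernel, r)
S_PP = sampleGP(X, PPKernel, r)
S_CS = sampleGP(X, CosKernel, r)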

A typical choice of $R$ is the squared exponential function, $$ R(\vx_i,\vx_j) = \exp[-(\vx_i-\vx_j)^T\vM(\vx_i-\vx_j)] $$ where in the isotropic case $\vM=\frac{1}{2l^2}\vI$, i.e. the length scale is the same in all dimensions, while in the anisotropic case $\vM=\half\mathrm{diag}(\vl)^{-2}$ is a diagonal matrix with positive entries, encoding a separate length scale for each dimension.
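For multi-dimensional inputs, the anisotropic (ARD) form can be written compactly. A minimal sketch, with an illustrative function name and array shapes:

In [ ]:
def ARDSEKernel(Xa, Xb, lengthscales):
    # Anisotropic squared exponential R(x, x') = exp(-(x - x')^T M (x - x')),
    # with M = 0.5 * diag(lengthscales)^(-2).  Shapes: Xa (na, d), Xb (nb, d).
    diff = (Xa[:, None, :] - Xb[None, :, :]) / np.asarray(lengthscales)
    return np.exp(-0.5 * np.sum(diff**2, axis=-1))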

Hyperparameters so far:

  • Length scales $\vl$ - to be found by automatic relevance determination (ARD)
  • Variances $\sfs$ and $\sns$

The basic form¶

Return to the dataset $\cD=\{ \vX,\vy\}$ sampled from GP: $$ \vy\sim\cN(\vm,\vK) $$ where $[\vm]_i=m(\vX_i)$ and $[\vK]_{ij}=k(\vX_i,\vX_j)$. Note that scalar output is assumed.

Now split the points into two sets: the training set $\vX_s$ with known output $\vy_s$, and the target set $\vX_u$ with unknown output $\vy_u$. The joint distribution is still Gaussian, with the following mean, $$ \vm^T = [\vm(\vX_s)^T, \vm(\vX_u)^T] \equiv [\vm_s^T, \vm_u^T] $$ and the block covariance matrix, $$ \vK = \begin{bmatrix} \vK_{ss} & \vK_{su} \\ \vK_{us} & \vK_{uu} \end{bmatrix} $$ where the $i,j$th element of $\vK_{ab}$ is associated with the $i$th row of $\vX_a$ and the $j$th row of $\vX_b$, $$ [\vK_{ab}]_{ij} = k([\vX_a]_i, [\vX_b]_j) $$

When $\vy_s$ is noisy, a white noise term with variance $\sigma_n^2$ is added to the Gaussian distribution, adding a diagonal matrix to $\vK_{ss}$ $$ \vK_y = \vK_{ss} + \sigma_n^2 \vI_{ss} $$

Prediction and error estimate¶

The distribution of $\vy_u$ given $\vy_s$ is found by conditional probability, $$ \vy_u|\vy_s \sim \cN(\vm_p, \vK_p) $$ where the predictive mean is, $$ \vm_p = \vm_u + \vK_{us}\vK_y^{-1}(\vy_s-\vm_s) $$ and the (predictive) covariance is, $$ \vK_p = \vK_{uu} - \vK_{us}\vK_y^{-1}\vK_{su} $$

What are the differences between the predictive mean and the kernel ridge formulation?
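The cell below calls genGP and plotGP, helpers defined earlier in the notebook and not shown here. As a point of reference, a minimal sketch of what the predictive equations above amount to in code, for a 1-D GP with zero mean and a small noise variance added for stability (all names here are illustrative), is:

In [ ]:
def gpPredict(Xs, ys, Xu, kernel, l, sn2=1e-10):
    # Conditional (predictive) mean and covariance of y_u given y_s,
    # assuming a zero mean function m(x) = 0.
    Kss = kernel(np.abs(Xs[:, None] - Xs[None, :]), l) + sn2 * np.eye(len(Xs))
    Kus = kernel(np.abs(Xu[:, None] - Xs[None, :]), l)
    Kuu = kernel(np.abs(Xu[:, None] - Xu[None, :]), l)
    L = np.linalg.cholesky(Kss)                           # K_y = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, ys))  # K_y^{-1} y_s
    mp = Kus @ alpha                                      # predictive mean
    V = np.linalg.solve(L, Kus.T)                         # L^{-1} K_su
    Kp = Kuu - V.T @ V                                    # predictive covariance
    return mp, Kp

# e.g. mp, Kp = gpPredict(data[0], data[1], X, SEKernel, r)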

In [8]:
# Posterior GP's
data = np.array([
    [0.2, 0.4, 0.8],   # Input
    [0.8,-0.2, 0.2]    # Output
])
mp, std, fp = genGP(X, data, SEKernel, r, nsmp=5)
plotGP(X, data, mp, std, fp)

Using basis functions for the mean function¶

The form of the mean function is usually unknown beforehand, so it is more common to use a set of basis functions, $$ m(\vx) = \vh(\vx)^T\cb $$ The coefficients $\cb$ have to be fitted from the sample data. Under a Bayesian framework, $\cb$ is assumed to follow, $$ \cb \sim \cN(\vb, \vB) $$ Taking $\cb$ into account, the GP is modified as, $$ f(\vx) \sim \GP(\vh(\vx)^T\vb,k(\vx,\vx')+\vh(\vx)^T\vB\vh(\vx')) $$ Extra terms due to the basis functions are then added to the covariance matrix, $$ \vK_{ab} = \vK(\vX_a, \vX_b) + \vH_a \vB \vH_b^T,\quad [\vH_*]_i=\vh([\vX_*]_i)^T $$
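For concreteness, with a linear trend basis $\vh(\vx)=[1, x_1, \dots, x_d]^T$, the matrix $\vH_*$ could be assembled as follows (a hypothetical helper, not the course code):

In [ ]:
def basisMatrix(X):
    # One row h(x)^T = [1, x_1, ..., x_d] per sample; X has shape (n, d)
    X = np.atleast_2d(X)
    return np.hstack([np.ones((X.shape[0], 1)), X])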

Simplification¶

The extra terms make the predictive mean and covariance from the previous section extremely cumbersome to compute. Simplifications are needed and enabled by invoking the matrix inversion lemma, $$ (\vA+\vU\vC\vV^T)^{-1} = \vA^{-1} - \vA^{-1}\vU (\vC^{-1}+\vV^T\vA^{-1}\vU)^{-1} \vV^T\vA^{-1} $$
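A quick numerical sanity check of the lemma on random, well-conditioned matrices (illustrative only):

In [ ]:
rng = np.random.default_rng(0)
n, m = 6, 2
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))
A = A @ A.T                                  # symmetric, well-conditioned A
U = rng.standard_normal((n, m))
V = rng.standard_normal((n, m))
C = np.eye(m) + 0.1 * rng.standard_normal((m, m))
Ai = np.linalg.inv(A)
lhs = np.linalg.inv(A + U @ C @ V.T)
rhs = Ai - Ai @ U @ np.linalg.inv(np.linalg.inv(C) + V.T @ Ai @ U) @ V.T @ Ai
print(np.allclose(lhs, rhs))                 # True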

[Skip in class]

In the current case, the lemma is applied as follows, $$ (\vK_y + \vH \vB \vH^T)^{-1} = \vKyi-\vKyi\vH \vC^{-1} \vH^T \vKyi \equiv \vKyi-\vA $$

where $\vC=\vB^{-1} + \vH^T \vKyi \vH$ and the subscript $s$ is dropped in $\vH_s$. Furthermore, the following identities can be derived, $$ \begin{aligned} \vB\vH^T(\vKyi-\vA) &= \vC^{-1}\vH^T\vKyi \\ (\vKyi-\vA)\vH\vB &= \vKyi\vH\vC^{-1} \end{aligned} $$ The simplification procedure is basically to cancel out matrices by combining the $\vH$ and $(\vKyi-\vA)$ terms.

[Skip in class]

The predictive mean is simplified to, $$ \begin{aligned} \vm_p^* &= \vH_u\vb + (\vK_{us} + \vH_u\vB\vH^T)(\vK_y + \vH\vB\vH^T)^{-1}(\vy_s-\vH\vb) \\ &= \vK_{us}\vKyi\vy_s + \vD^T \vC^{-1} (\vH^T \vKyi \vy_s + \vB^{-1}\vb) \end{aligned} $$ where $\vD=\vH_u^T-\vH^T \vKyi \vK_{su}$. The covariance matrix is simplified to, $$ \begin{aligned} \vK_p^* &= \vK_{uu} + \vH_u\vB\vH_u^T - (\vK_{us}+\vH_u\vB\vH^T) (\vK_y + \vH\vB\vH^T)^{-1} (\vK_{su}+\vH\vB\vH_u^T) \\ &= \vK_p + \vD^T \vC^{-1} \vD \end{aligned} $$

Finally, consider the case where the prior on the coefficients is uniform (vague) rather than Gaussian, i.e. $\vb\rightarrow\mathrm{0}$ and $\vB^{-1}\rightarrow \vO$.

With that, we arrive at the commonly used form of GPR, $$ \begin{aligned} \vm_p^* &= \vK_{us}\vKyi\vy_s + \vD^T (\vH^T\vKyi\vH)^{-1} \vH^T\vKyi\vy_s \\ &= \vK_{us}\bar\vg + \vH_u\bar\vb \\ \vK_p^* &= \vK_p + \vD^T (\vH^T\vKyi\vH)^{-1} \vD \end{aligned} $$ where $\bar\vb=(\vH^T\vKyi\vH)^{-1}\vH^T\vKyi\vy_s$ and $\bar\vg=\vKyi(\vy_s-\vH\bar\vb)$. The term $\bar\vb$ is essentially the vector of coefficients of the mean function fitted from the sample data (a generalized least-squares fit). Also, note that some people argue that the last term in $\vK_p^*$ can be neglected for simplicity [Sasena2002].

Computational aspects¶

The computation of $\vm_p^*$ and $\vK_p^*$ can be tricky due to ill-conditioned matrix inversions. The strategy is to combine Cholesky and QR decompositions to stabilize the two inversions, $\vK_y^{-1}$ and $(\vH^T\vKyi\vH)^{-1}$.

For inversion of $\vK_y$, $$ \vKyi\vx = (\vL\vL^T)^{-1}\vx = \vL^{-T} (\vL^{-1}\vx) $$ where $\vL$ is a lower triangular matrix, and the matrix inversion is converted to two consecutive triangular solves.

To invert $\vH^T\vKyi\vH$, $$ \vH^T\vKyi\vH = (\vL^{-1}\vH)^T (\vL^{-1}\vH) \equiv \vF^T\vF = \vR^T\vR $$ where the (thin) QR decomposition $\vQ\vR=\vF$ is used, $\vQ$ has orthonormal columns, and $\vR$ is an upper triangular matrix. $\vQ$ is tall and slim: its number of rows equals the number of samples, while its number of columns equals the number of basis functions.

After the stabilization, the quantities $\bar\vb$ and $\bar\vg$ in $\vm_p^*$ are computed as follows, $$ \begin{aligned} \bar\vb &= (\vH^T\vKyi\vH)^{-1}\vH^T\vKyi\vy_s = \vR^{-1} [\vQ^T(\vL^{-1}\vy_s)] \\ \bar\vg &= \vKyi(\vy_s-\vH\bar\vb) = \vL^{-T} [\vL^{-1} (\vy_s-\vH\bar\vb)] \end{aligned} $$

The covariance matrix is computed as follows, $$ \vK_p^* = \vK_{uu} - (\vL^{-1}\vK_{su})^2 + (\vR^{-T} [\vH_u^T - \vF^T(\vL^{-1}\vK_{su})])^2 $$ where $(\Box)^2=\Box^T\Box$.
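A minimal sketch of this Cholesky + QR scheme, with matrix names following the equations above (the function and its argument list are illustrative, not the course implementation):

In [ ]:
from scipy.linalg import cholesky, qr, solve_triangular

def gprPredictStable(Ky, Ksu, Kuu, H, Hu, ys):
    # Ky : (n, n) noisy training covariance K_ss + sigma_n^2 I
    # Ksu: (n, u) training/target cross covariance;  Kuu: (u, u)
    # H  : (n, m) basis functions at training points; Hu: (u, m) at targets
    L = cholesky(Ky, lower=True)                       # K_y = L L^T
    Ly = solve_triangular(L, ys, lower=True)           # L^{-1} y_s
    F = solve_triangular(L, H, lower=True)             # F = L^{-1} H
    Q, R = qr(F, mode='economic')                      # thin QR, F = Q R
    b_bar = solve_triangular(R, Q.T @ Ly)              # R^{-1} [Q^T (L^{-1} y_s)]
    g_bar = solve_triangular(L.T, solve_triangular(L, ys - H @ b_bar, lower=True))
    mp = Ksu.T @ g_bar + Hu @ b_bar                    # predictive mean
    V = solve_triangular(L, Ksu, lower=True)           # L^{-1} K_su
    W = solve_triangular(R.T, Hu.T - F.T @ V, lower=True)  # R^{-T} [H_u^T - F^T (L^{-1} K_su)]
    Kp = Kuu - V.T @ V + W.T @ W                       # predictive covariance
    return mp, Kp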

Are we done yet?¶

Yes...?

No, how about hyperparameters?

  • Length scales $\vl$ in the covariance functions
  • Process variance $\sfs$
  • Process noise $\sns$
    • Determined by measurement
    • Zero for deterministic computer simulations, but typically a small value is used for numerical stability

Preprocessing - Non-dimensionalization¶

For easier treatment, we non-dimensionalize some quantities in the GPR model.

The covariance matrices $\vK_y$ and $\vK_{su}$ can be "non-dimensionalized" by $\sfs$, $$ \begin{aligned} \vK_y &\equiv \sfs \tvK_y \equiv \sfs (\tvK_{ss} + \tsn\vI) \\ \vK_{su} &\equiv \sfs \tvK_{su} \end{aligned} $$

Subsequently, for the predictive mean, $$ \vm_p^* = \vK_{us}\bar\vg + \vH_u\bar\vb = \tvK_{us}\tvg + \vH_u\tvb $$

[Skip in class]

where, $$ \begin{aligned} \bar\vb &= (\vH^T\vKyi\vH)^{-1}\vH^T\vKyi\vy_s = (\vH^T\tKyi\vH)^{-1}\vH^T\tKyi\vy_s \equiv \tvb\\ \bar\vg &= \vKyi(\vy_s-\vH\bar\vb) = \sigma_f^{-2} \tKyi(\vy_s-\vH\tvb) \equiv \sigma_f^{-2} \tvg \end{aligned} $$ And the covariance, $$ \begin{aligned} \vK_p^* &= \vK_{uu} - \vK_{us}\vK_y^{-1}\vK_{su} + \vD^T (\vH^T\vKyi\vH)^{-1} \vD \\ &= \sfs [\tvK_{uu} - \tvK_{us}\tKyi\tvK_{su} + \tvD^T (\vH^T\tKyi\vH)^{-1} \tvD] \end{aligned} $$ where, $$ \vD = \vH_u^T-\vH^T \vKyi \vK_{su} = \vH_u^T-\vH^T \tKyi \tvK_{su} \equiv \tvD $$ In sum, $\sfs$ has no direct effect on the mean, but the covariance is proportional to $\sfs$, once the length scales are given.

Learning the hyperparameters¶

Log-likelihood¶

The hyperparameters are determined using maximum likelihood estimation (MLE). With the joint Gaussian distribution, the log marginal likelihood of the training data is, $$ \begin{aligned} \cL &= \log p(\vy_s|\vX_s,\vb,\vB) \\ &= - \half(\vy_s-\vH\vb)^T (\vK_y+\vH\vB\vH^T)^{-1} (\vy_s-\vH\vb) \\ &- \half \log |\vK_y+\vH\vB\vH^T| - \frac{n}{2}\log 2\pi \end{aligned} $$ where $n$ is the number of training data points.

[Skip in class]

Utilizing the determinant counterpart of the matrix inversion lemma, $$ |\vA+\vU\vC\vV^T| = |\vA| |\vC| |\vC^{-1} + \vV^T\vA^{-1}\vU| $$ and letting $\vb=0$ and $\vB^{-1}\rightarrow\vO$ as before, $$ -2\cL = \vy_s^T [\vKyi - \vKyi\vH (\vH^T\vKyi\vH)^{-1} \vH^T \vKyi] \vy_s + \log|\vK_y| + \log|\vH^T\vKyi\vH| + (n-m)\log 2\pi $$ where $m$ is the number of basis functions; the $-m$ correction appears because the divergent $|\vB|$ factor is removed in the limit.

[Skip in class]

An interesting simplification can be applied to the first term in the above expression, $$ \begin{aligned} I_1 &= \vy_s^T [\vKyi - \vKyi\vH (\vH^T\vKyi\vH)^{-1} \vH^T \vKyi] \vy_s \\ &= \vy_s^T \vKyi (\vy_s - \vH\bar\vb) \equiv I_2 \\ I_1 &= (\vy_s - \vH\bar\vb)^T \vKyi \vy_s \equiv I_3 \\ I_1 &= \vy_s^T \vKyi \vy_s - \bar\vb^T\vH^T \vKyi \vH\bar\vb \equiv I_4 \\ I_1 &= I_2+I_3-I_4 = (\vy_s - \vH\bar\vb)^T \vKyi (\vy_s - \vH\bar\vb) \end{aligned} $$ This indicates that the term $I_1$ behaves as if the data were centered at $\vH\bar\vb$, with covariance matrix $\vK_y$.

Factoring $\sfs$ out of $\cL$, $$ -2\tcL_1 = \sigma_f^{-2} (\vy_s - \vH\tvb)^T \tKyi (\vy_s - \vH\tvb) + \log|\tvK_y| + \log|\vH^T\tKyi\vH| + (n-m)\log\sfs $$ where the constant term is neglected.

If the prior on the coefficients of the basis functions is ignored, the log-likelihood simplifies to, $$ -2\tcL_2 = \sigma_f^{-2} (\vy_s - \vH\tvb)^T \tKyi (\vy_s - \vH\tvb) + \log|\tvK_y| + n\log\sfs $$

Process variances¶

A natural next step is to find $\sfs$ by setting $\partial(-2\tcL)/\partial\sfs=0$ and solving for $\sfs$, $$ \sfs = \frac{1}{N}(\vy_s - \vH\tvb)^T \tKyi (\vy_s - \vH\tvb) $$ where $N=n-m$ for $\tcL_1$, and $N=n$ for $\tcL_2$. The former is analogous to the unbiased (degree-of-freedom-corrected) estimate of the variance, while the latter is the MLE. When there are multiple outputs, the output variables are usually assumed to be independent, and $\sfs$ can be computed separately for each output.

Plugging the value back into the likelihood and removing the constant terms [Welch1992], one obtains the reduced log-likelihoods, $$ \begin{aligned} -\cF_1 &= \log|\tvK_y| + \log|\vH^T\tKyi\vH| + (n-m)\log\sfs \\ -\cF_2 &= \log|\tvK_y| + n\log\sfs \end{aligned} $$ where the only remaining unknown hyperparameters are the length scales $\vl$. Note that $\cF_2$ should be used if a general mean function, without priors on its coefficients, is employed.

The minimization of $-\cF_2$ is equivalent to the minimization of $$ -\cF^* = \sfs|\tvK_y|^{1/n} $$ where $|\tvK_y|=|\tilde{\vL}|^2$ using Cholesky decomposition $\tvK_y=\tilde{\vL}\tilde{\vL}^T$.
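A minimal sketch of how the reduced criterion $-\cF^*$ could be evaluated and minimized over a single length scale, for a constant-mean, squared-exponential model (the function name, bounds, and kernel scaling are illustrative assumptions):

In [ ]:
from scipy.optimize import minimize_scalar

def reducedCriterion(log_l, Xs, ys, tsn=1e-8):
    # Evaluates -F* = sigma_f^2 * |K~_y|^(1/n), with sigma_f^2 in closed form
    l = np.exp(log_l)
    n = len(Xs)
    R = np.abs(Xs[:, None] - Xs[None, :])
    tKy = np.exp(-(R / l)**2) + tsn * np.eye(n)         # K~_y = R + sigma_n~^2 I
    L = np.linalg.cholesky(tKy)
    H = np.ones((n, 1))                                 # constant mean basis
    Ly = np.linalg.solve(L, ys)
    LH = np.linalg.solve(L, H)
    b_bar = np.linalg.lstsq(LH, Ly, rcond=None)[0]      # (H^T K~y^-1 H)^-1 H^T K~y^-1 y_s
    res = Ly - LH @ b_bar
    sf2 = (res @ res) / n                               # closed-form process variance (MLE)
    # |K~_y|^(1/n) via the Cholesky factor, evaluated in log form for safety
    return sf2 * np.exp(2.0 * np.mean(np.log(np.diag(L))))

# Two-step idea: find l by minimizing -F*, then recover sigma_f^2 at the optimum, e.g.
# opt = minimize_scalar(reducedCriterion, bounds=(-6, 2), method='bounded', args=(Xs, ys))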

Finding the hyperparameters¶

  • Two-step method
    • Utilizes the reduced log likelihood, $\cF_2$
    • Primary unknown: length scales $\vl$
    • Dependent hyperparameters: $\sfs$ and $\sns$
  • One-step method
    • Determine all hyperparameters, even including $\bar\vb$, simultaneously
    • Typically relies on gradient-based algorithms
    • Efficient if implemented under a differentiable programming framework