So far, the classical formulation of Gaussian process regression (GPR) has been presented. GPR is non-parametric, and thus very flexible - in principle it can handle any dataset. Moreover, it provides an error estimate of the prediction. However, a non-parametric model like GPR has inherent disadvantages. The prediction depends on the entire dataset; the model does not distill the data into a compact set of parameters, so it is essentially brute-force curve fitting. Furthermore, the model is dense. The prediction requires access to all the training data, resulting in expensive dense matrix operations.
In fact, given $N$ training data points, the training and prediction stages incur $O(N^3)$ and $O(N^2)$ cost, respectively. $$ \begin{align} \vm_p &= \vm_u + \vK_{us}\vK_y^{-1}(\vy_s-\vm_s) \\ \vK_p &= \vK_{uu} - \vK_{us}\vK_y^{-1}\vK_{su} \end{align} $$ During training, every time the length scales change, a new dense matrix inversion (or Cholesky decomposition) has to be carried out; during prediction, dense matrix multiplications have to be carried out.
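To make these costs concrete, below is a minimal NumPy/SciPy sketch of the dense prediction equations above, assuming a squared-exponential kernel and a zero prior mean; the function names and hyperparameter values are illustrative only, not a reference implementation.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def rbf(A, B, ell=1.0, sf2=1.0):
    """Squared-exponential kernel k(a, b) = sf2 * exp(-|a - b|^2 / (2 ell^2))."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def gpr_predict(Xs, ys, Xu, ell=1.0, sf2=1.0, sn2=1e-2):
    """Dense GPR prediction with zero prior mean:
    m_p = K_us K_y^{-1} y_s,  K_p = K_uu - K_us K_y^{-1} K_su."""
    K_y = rbf(Xs, Xs, ell, sf2) + sn2 * np.eye(len(Xs))   # K_ss + sn2 * I
    K_us = rbf(Xu, Xs, ell, sf2)
    c = cho_factor(K_y, lower=True)                       # O(N^3), redone whenever hyperparameters change
    m_p = K_us @ cho_solve(c, ys)                         # O(N^2) dense products for prediction
    K_p = rbf(Xu, Xu, ell, sf2) - K_us @ cho_solve(c, K_us.T)
    return m_p, K_p
```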
There are at least two scenarios for a large dataset:

- Case I: the dataset is given "as is".
- Case II: one actually gets to choose where and what to sample.

For case I, one needs to tackle the large dataset issue directly. For case II, the large dataset issue can be avoided by using clever GP variants.
In an effort to cure these weaknesses of GPR, many modern variants have been developed. The basic idea is to reduce the effective sample size $N$ in the model. The classification of the variants has been well discussed in Chap. 8 of GPML and in Quin2005.
The variants can be roughly classified into three types.
One low-rank approximation is called the Nyström method. We choose $M$ samples from the dataset, and let $$ \vK_N \approx \vQ_N = \vK_{NM}\vK_M^{-1}\vK_{MN} $$ which is equivalent to an approximation of the kernel, also known as the subset of regressors (SR), $$ k(\vx,\vx') \approx Q(\vx,\vx') = k(\vx,\vx_M)\vK_M^{-1}k(\vx_M,\vx') $$ This way $\vm_p$ and $\vK_p$ can be simplified using the matrix inversion lemma, $$ \begin{align} \vm_p &= \vm_u + \vQ_{us}\vQ_y^{-1}(\vy_s-\vm_s) \\ \vK_p &= \vQ_{uu} - \vQ_{us}\vQ_y^{-1}\vQ_{su} \end{align} $$ where $\vQ_y=\vQ_N+\sns\vI$. However, during prediction, as the test input moves away from the $M$ samples, $k(\vx,\vx_M)$, and thus the predicted variance, approaches zero.
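For reference, the SR predictive mean can be written down in a few lines, reusing the `rbf` helper from the dense GPR sketch above. This is a sketch under the same assumptions (squared-exponential kernel, zero prior mean); after applying the matrix inversion lemma, only an $M\times M$ system has to be solved, at $O(NM^2)$ cost.

```python
# Subset of regressors (SR) predictive mean. The matrix inversion lemma
# collapses the N x N solve into the M x M system A = sn2*K_M + K_MN K_NM.
def sr_predict_mean(Xs, ys, Xm, Xu, ell=1.0, sf2=1.0, sn2=1e-2):
    K_nm = rbf(Xs, Xm, ell, sf2)                     # K_NM
    K_m = rbf(Xm, Xm, ell, sf2)                      # K_M
    K_um = rbf(Xu, Xm, ell, sf2)                     # k(x_u, x_M)
    A = sn2 * K_m + K_nm.T @ K_nm                    # M x M system
    return K_um @ np.linalg.solve(A, K_nm.T @ ys)    # equals Q_us Q_y^{-1} y_s
```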
Comparison of several GP approximations from [Snel2008].
One interesting approach is the relevance vector machine (RVM). It applies the idea of automatic relevance determination (ARD) and finds the training data points that are of the most "relevance" to the model. Subsequently, a kernel is constructed from these "relevance vectors" to form the final prediction model. Due to the construction of the kernel, RVM is a degenerate GP, which is criticized in Rasm2005. A fun fact about RVM is that its 2001 version of the algorithm is patented, which partially discouraged its integration into some mainstream packages. The faster 2003 version of the algorithm, though, is not patented. One good Python implementation of RVM can be found here.
Another approach of the second type is GPR with pseudo-inputs Snel2008, also termed the fully independent training conditional (FITC). This approach determines a set of inducing points in the input space, not necessarily drawn from the dataset itself, and builds the GPR over these "pseudo-inputs". The GP is still non-degenerate, but $N$ can be significantly reduced, sometimes by two orders of magnitude, while the prediction deteriorates only slightly.
In Quin2005, a unified framework for sparse GPR (and GPR itself) is proposed, from which the variants discussed above can be derived.
In the framework, one has to differentiate between two types of variables: the observed noisy output $\vy$ and the underlying noise-free latent output $\vf$, $$ \vy = \vf + \epsilon $$ where $\epsilon\sim\cN(0,\sns)$ is the noise, assumed to be additive and independent. The conditional satisfies $\vy|\vf \sim \cN(\vf,\sns)$. The training data is considered noisy, i.e. $\vy_s$ is provided for training, whereas the prediction is expected to be noise-free, i.e. $\vf_u$ is desired. Using a Bayesian approach, the prediction can be formulated as follows, $$ \begin{aligned} p(\vf_u|\vy_s) &= \int p(\vf_s,\vf_u|\vy_s) d\vf_s \\ &= \int \frac{p(\vy_s|\vf_s,\vf_u)p(\vf_s,\vf_u)}{p(\vy_s)} d\vf_s \\ &= \frac{1}{p(\vy_s)} \int p(\vy_s|\vf_s)p(\vf_s,\vf_u) d\vf_s \\ \end{aligned} $$ where $p(\vy_s|\vf_s,\vf_u)=p(\vy_s|\vf_s)$ because $\vy_s$ depends only on its own latent values, and $\vf_s$ is integrated out, or "marginalized", as it is of no interest to the prediction. The term $p(\vy_s)$ is essentially a normalizing factor and will not be computed explicitly. The remaining term $p(\vf_s,\vf_u)$ in the integration satisfies the joint Gaussian distribution as in the previous derivations. This is where the large dense covariance matrix is introduced, and hence where the simplification comes in.
In the framework, a set of inducing points $\vf_i$ is assumed to be sampled from the GP like $\vf_s$ and $\vf_u$. Therefore, the probability should satisfy, $$ p(\vf_s,\vf_u) = \int p(\vf_s,\vf_u,\vf_i) d\vf_i = \int p(\vf_s,\vf_u|\vf_i)p(\vf_i) d\vf_i $$ where $\vf_i\sim\cN(\vm_i,\vK_{ii})$. The formulation up to here is still exact. The integration for $p(\vf_u|\vy_s)$ would result in the regular GPR model.
Now comes the key assumption in the framework: $\vf_s$ and $\vf_u$ are assumed to be conditionally independent given $\vf_i$, $$ p(\vf_s,\vf_u) \approx q(\vf_s,\vf_u) = \int q(\vf_s|\vf_i) q(\vf_u|\vf_i) p(\vf_i) d\vf_i $$ This means the dependency of $\vf_u$ on $\vf_s$ is induced indirectly through $\vf_i$. As a reference, the exact forms of the conditionals and the covariance matrix are, $$ \left\{ \begin{aligned} \vf_s|\vf_i &\sim \cN(\vm_s+\vK_{si}\vK_{ii}^{-1}(\vf_i-\vm_i), \vK_{ss}-\vQ_{ss}) \\ \vf_u|\vf_i &\sim \cN(\vm_u+\vK_{ui}\vK_{ii}^{-1}(\vf_i-\vm_i), \vK_{uu}-\vQ_{uu}) \end{aligned} \right. \quad\vK= \begin{bmatrix} \vK_{ss} & \vK_{su} \\ \vK_{us} & \vK_{uu} \end{bmatrix} $$ where $\vQ_{ab}=\vK_{ai}\vK_{ii}^{-1}\vK_{ib}$.
Under the above framework, all the approximations amount to simplifying or approximating the term $\vK_{**}-\vQ_{**}$. For example, the low-rank approximation approach, SR, is effectively using, $$ \left\{ \begin{aligned} \vf_s|\vf_i &\sim \cN(\vm_s+\vK_{si}\vK_{ii}^{-1}(\vf_i-\vm_i), \vO) \\ \vf_u|\vf_i &\sim \cN(\vm_u+\vK_{ui}\vK_{ii}^{-1}(\vf_i-\vm_i), \vO) \end{aligned} \right. \quad\vK= \begin{bmatrix} \vQ_{ss} & \vQ_{su} \\ \vQ_{us} & \vQ_{uu} \end{bmatrix} $$ The covariance matrices are neglected and the "distributions" become deterministic, hence the term Deterministic Inducing Conditional (DIC) in the framework. The zero covariance matrix is exactly the cause of the vanishing error estimate in this type of approach.
Another similar, but better, approximation is called the projected process (PP). It essentially employs the following covariance matrix $$ \vK= \begin{bmatrix} \vQ_{ss} & \vQ_{su} \\ \vQ_{us} & \vK_{uu} \end{bmatrix} $$ In regions away from the dataset, the variance no longer collapses to zero. However, the PP model does not work well with low-noise datasets, see [Quin2005] for more discussion.
The approximation of interest in this article, the GPRFITC model, uses the following conditionals, $$ \left\{ \begin{aligned} \vf_s|\vf_i &\sim \cN(\vm_s+\vK_{si}\vK_{ii}^{-1}(\vf_i-\vm_i), \vQ_{ss} + diag(\vK_{ss}-\vQ_{ss})) \\ \vf_u|\vf_i &\sim \text{exact} \end{aligned} \right. \quad\vK= \begin{bmatrix} \vQ_{ss} + diag(\vK_{ss}-\vQ_{ss}) & \vQ_{su} \\ \vQ_{us} & \vK_{uu} \end{bmatrix} $$ Note that $q(\vf_u|\vf_i)$ is exact only when a single test output is requested; otherwise, the covariance matrix is treated like that of $q(\vf_s|\vf_i)$.
In GPRFITC, the predictive mean becomes, $$ \begin{aligned} \vm_p &= \vm_u + \vQ_{us}(\vQ_{ss}+\vV)^{-1}(\vy_s-\vm_s) \\ &= \vm_u + \vK_{ui}\vM^{-1}\vK_{is}\vV^{-1}(\vy_s-\vm_s) \end{aligned} $$ where $\vV=diag(\vK_{ss}-\vQ_{ss})+\sns\vI$, and $\vM=\vK_{ii}+\vK_{is}\vV^{-1}\vK_{si}$. The covariance is, $$ \begin{aligned} \vK_p &= \vK_{uu} - \vQ_{us}(\vQ_{ss}+\vV)^{-1}\vQ_{su} \\ &= \vK_{uu} - \vQ_{uu} + \vK_{ui}\vM^{-1}\vK_{iu} \end{aligned} $$ Note that, in terms of computational cost, the effective $N$ is reduced to the number of inducing points: training scales as $O(NM^2)$ and prediction as $O(M^2)$ per test point, with $M$ the number of inducing points.
There are three matrix inversions involved in the prediction. $\vV$ is diagonal, and its inversion is trivial. Inversion inside $\vQ_{ss}$ is done using Cholesky decomposition (again), $$ \begin{aligned} \vQ_{ss} &= \vK_{si}\vK_{ii}^{-1}\vK_{is} = \vK_{si}(\vL_i\vL_i^T)^{-1}\vK_{is} \\ &\equiv \bar\vK_{is}^T\bar\vK_{is} \end{aligned} $$ where $\bar\vK_{is}=\vL_i^{-1}\vK_{is}$. Subsequently, inversion of $\vM$ utilizes the decomposition of $\vK_{ii}$, $$ \begin{aligned} \vM^{-1} &= (\vK_{ii}+\vK_{is}\vV^{-1}\vK_{si})^{-1} = \vL_i^{-T} (\vI+\bar\vK_{is}\vV^{-1}\bar\vK_{is}^T)^{-1} \vL_i^{-1} \\ &\equiv \vL_i^{-T} (\vL_m\vL_m^T)^{-1} \vL_i^{-1} \\ &\equiv \vL^{-T}\vL^{-1} \end{aligned} $$ where Cholesky decomposition is applied in the middle. The last two expressions are then used in the computation of $\vm_p$ and $\vK_p$.
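Putting the pieces together, a minimal sketch of the FITC prediction (zero prior mean, squared-exponential kernel, `rbf` reused from the earlier dense GPR sketch) might look as follows; variable names mirror the symbols above, and a small jitter is added to $\vK_{ii}$ for numerical stability.

```python
from scipy.linalg import cholesky, solve_triangular

def fitc_predict(Xs, ys, Xi, Xu, ell=1.0, sf2=1.0, sn2=1e-2):
    """FITC prediction; Xi holds the M inducing inputs."""
    K_ii = rbf(Xi, Xi, ell, sf2) + 1e-10 * np.eye(len(Xi))  # jitter for stability
    K_is = rbf(Xi, Xs, ell, sf2)
    K_iu = rbf(Xi, Xu, ell, sf2)

    L_i = cholesky(K_ii, lower=True)
    Kb_is = solve_triangular(L_i, K_is, lower=True)      # L_i^{-1} K_is
    q_ss = np.sum(Kb_is**2, axis=0)                      # diag(Q_ss)
    v = sf2 - q_ss + sn2                                 # diag(V) = diag(K_ss - Q_ss) + sn2

    L_m = cholesky(np.eye(len(Xi)) + (Kb_is / v) @ Kb_is.T, lower=True)
    L = L_i @ L_m                                        # M = L L^T

    b = K_is @ (ys / v)                                  # K_is V^{-1} (y_s - m_s)
    alpha = solve_triangular(L.T, solve_triangular(L, b, lower=True), lower=False)
    m_p = K_iu.T @ alpha                                 # K_ui M^{-1} K_is V^{-1} (y_s - m_s)

    Kb_iu = solve_triangular(L_i, K_iu, lower=True)      # for Q_uu
    Lb_iu = solve_triangular(L, K_iu, lower=True)        # for K_ui M^{-1} K_iu
    K_p = rbf(Xu, Xu, ell, sf2) - Kb_iu.T @ Kb_iu + Lb_iu.T @ Lb_iu
    return m_p, K_p
```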
The hyperparameters of GPRFITC can be trained by maximizing the marginal likelihood, $$ -2\cL = (\vy_s-\vm_s)^T (\vQ_{ss} + \vV)^{-1} (\vy_s-\vm_s) + \log|\vQ_{ss} + \vV| + N\log 2\pi $$ where the matrix inversion is handled as before. For the determinant, $$ \begin{aligned} \log|\vQ_{ss} + \vV| &= \log|\vV||\vI+\bar\vK_{is}\vV^{-1}\bar\vK_{is}^T| \\ &= \log|\vV| + 2\log|\vL_m| \end{aligned} $$ The hyperparameters in $\cL$ include those of the regular GPR model, i.e. the length scales and the process and noise variances, as well as the extra parameters: the locations of the inducing points, whose count is the number of inducing points times the input dimension. Due to the large number of hyperparameters, the only viable training approach is a gradient-based method. The gradients of $\cL$ can be found in Appendix C of Snel2008. Alternatively, the gradients can be handled automatically in a differentiable programming framework, such as TensorFlow and PyTorch.
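The same Cholesky factors also give the objective $-2\cL$ without ever forming an $N\times N$ matrix. Below is a hedged sketch under the same assumptions as before (zero prior mean; `rbf`, `cholesky`, and `solve_triangular` reused from the sketches above); in practice one would write this in a differentiable framework so that the gradients come for free.

```python
def fitc_neg2_loglik(Xs, ys, Xi, ell=1.0, sf2=1.0, sn2=1e-2):
    """-2 log marginal likelihood of the FITC model with zero prior mean."""
    N, M = len(Xs), len(Xi)
    K_ii = rbf(Xi, Xi, ell, sf2) + 1e-10 * np.eye(M)
    K_is = rbf(Xi, Xs, ell, sf2)
    L_i = cholesky(K_ii, lower=True)
    Kb = solve_triangular(L_i, K_is, lower=True)            # L_i^{-1} K_is
    v = sf2 - np.sum(Kb**2, axis=0) + sn2                   # diag(V)
    L_m = cholesky(np.eye(M) + (Kb / v) @ Kb.T, lower=True)

    # quadratic term via the matrix inversion lemma:
    # (Q_ss + V)^{-1} y = V^{-1} y - V^{-1} Kb^T (L_m L_m^T)^{-1} Kb V^{-1} y
    r = ys / v
    c = solve_triangular(L_m, Kb @ r, lower=True)
    quad = ys @ r - c @ c
    logdet = np.sum(np.log(v)) + 2.0 * np.sum(np.log(np.diag(L_m)))
    return quad + logdet + N * np.log(2.0 * np.pi)
```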
When there are multiple outputs associated with one input, another way to simplify the GPR is to exploit the correlation between the outputs, leading to multi-variate GPRs (MVGPRs).
If we model $m$ outputs $f_i$ independently using $m$ univariate GPs (UVGP's), $$ f_i\sim\GP(m_i(\vx),k_i(\vx,\vx')) $$ then the $m$ UVGP's are equivalent to a special $m$-dimensional multivariate GP (MVGP) with mean and covariance $$ \vm_U=\begin{bmatrix}m_1\\m_2\\\cdots\\m_m\end{bmatrix},\quad \vK_U=\begin{bmatrix}k_1 & 0 & \cdots & 0 \\ 0 & k_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & k_m\end{bmatrix} $$
Alternatively, suppose $$ \vg = \vV\vf $$ where $\vf\in\bR^m$, $\vg\in\bR^n$, and $\vf$ consists of $m$ independent UVGP's. Then $$ \vg\sim\GP(\vV\vm_U,\vV\vK_U\vV^T) $$
This leads to a common approach to constructing matrix covariance functions, i.e. taking $\vK(\vx,\vx')=\sum_{i=1}^m k_i(\vx,\vx')\vv_i\vv_i^T$, where $\vv_i\in\bR^n$.
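A hedged sketch of this construction for a single pair of inputs is given below, reusing the `rbf` helper from earlier; the names and the choice of scalar kernels are illustrative only.

```python
# Matrix-valued covariance K(x, x') = sum_i k_i(x, x') v_i v_i^T, built from m
# scalar RBF kernels with different length scales and m mixing vectors v_i in R^n.
def matrix_kernel(x, xp, ells, V):
    """x, xp: 1-D input arrays; ells: m length scales; V: n x m mixing matrix."""
    n, m = V.shape
    K = np.zeros((n, n))
    for i in range(m):
        k_i = rbf(x[None, :], xp[None, :], ell=ells[i])[0, 0]  # scalar k_i(x, x')
        K += k_i * np.outer(V[:, i], V[:, i])
    return K
```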
In many cases MVGPR requires fewer samples than independent UVGPRs to achieve good accuracy; but if there is no limit on the dataset size, then both UVGPR and MVGPR will eventually converge.
Now we turn to the case where one can determine what to sample. Two techniques will be briefly discussed: gradient-enhanced kriging and multi-fidelity surrogates. These techniques approximate complex high-dimensional functions with fewer samples, so that the large dataset issue can be avoided.
This formulation assumes that one can obtain not only the function value but also the gradient at a given sample point, without much extra computational cost. This is typically the case when an automatic differentiation implementation of the function is available (e.g. the adjoint module of a CFD solver). The gradient encodes additional information about the smoothness of the function into the GPR and helps it approximate the function better.
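As an illustration of how the gradient enters, the sketch below builds the joint covariance of a function value and its derivative for a 1-D squared-exponential kernel, which is the basic building block when derivative observations are included in the covariance matrix; it is only a sketch of the covariance blocks, not a full gradient-enhanced GPR.

```python
# Covariance blocks between (f(x), f'(x)) and (f(x'), f'(x')) for a 1-D
# squared-exponential kernel k = sf2 * exp(-(x - x')^2 / (2 ell^2)).
def gek_joint_cov(x, xp, ell=1.0, sf2=1.0):
    r = x - xp
    e = np.exp(-0.5 * r**2 / ell**2)
    k = sf2 * e                                      # cov(f(x), f(x'))
    dk = sf2 * (r / ell**2) * e                      # cov(f(x), f'(x')) = dk/dx'
    d2k = sf2 / ell**2 * (1.0 - r**2 / ell**2) * e   # cov(f'(x), f'(x'))
    return np.array([[k, dk], [-dk, d2k]])           # cov(f'(x), f(x')) = dk/dx = -dk/dx'
```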
There are three flavors:
Suppose we have a high-fidelity (HF) model $y=f_{HF}(\vx)$, and a low-fidelity (LF) model $y=f_{LF}(\vx)$. Typically the HF model is computationally more expensive, but more accurate, than the LF model.
Suppose now we have one dataset of HF data $$ \cD_{HF} = \{\vx_i^{HF},y_i^{HF}\}_{i=1}^M $$ and a dataset of LF data $$ \cD_{LF} = \{\vx_i^{LF},y_i^{LF}\}_{i=1}^N $$ Typically $M\ll N$, since the HF data is more expensive to obtain.
The idea of the multi-fidelity (MF) surrogate is to construct an LF surrogate with the relatively ample LF dataset, and then correct it using the sparse HF dataset. $$ \hat{y}_{MF}(\vx) = \rho \hat{y}_{LF}(\vx) + \hat{\delta}(\vx) $$ where $\rho$ is a multiplicative correction factor, and $\hat{\delta}$ is called the discrepancy function.
One can think of this as a special case of MVGPR: two independent UVGPR's, $\hat{y}_{LF}$ and $\hat{\delta}(\vx)$, are combined linearly to generate another GPR.
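A minimal sketch of this two-stage construction is shown below, assuming a fixed scalar $\rho$, independence between the LF surrogate and the discrepancy GP, and a discrepancy GP trained on the residuals at the HF points; `gpr_predict` is reused from the dense GPR sketch, and in practice $\rho$ and the hyperparameters would be estimated jointly.

```python
def mf_predict(X_lf, y_lf, X_hf, y_hf, Xu, rho=1.0):
    """Two-fidelity prediction y_MF = rho * y_LF + delta at test inputs Xu."""
    m_lf, K_lf = gpr_predict(X_lf, y_lf, Xu)          # LF surrogate at test points
    m_lf_hf, _ = gpr_predict(X_lf, y_lf, X_hf)        # LF surrogate at HF points
    d_hf = y_hf - rho * m_lf_hf                       # discrepancy data
    m_d, K_d = gpr_predict(X_hf, d_hf, Xu)            # discrepancy GP
    # under independence, the means add and the covariances add (scaled by rho^2)
    return rho * m_lf + m_d, rho**2 * K_lf + K_d
```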