Machine Learning for Engineering¶

Gaussian Process Regression: Model Selection¶

Instructor: Daning Huang¶

$$ \newcommand{\bE}{\mathbb{E}} \newcommand{\bR}{\mathbb{R}} \newcommand{\vu}{\mathbf{u}} \newcommand{\vz}{\mathbf{z}} \newcommand{\cD}{\mathcal{D}} \newcommand{\cM}{\mathcal{M}} $$

TODAY: Gaussian Process Regression - III¶

  • Monte Carlo simulation
  • Basics of Markov chain
  • Metropolis-Hastings sampling
  • Gibbs sampling

References¶

  • [PRML] Chp. 11
  • twiecki.io

Model Selection Revisited¶

Model selection refers to two tasks of interest:

  • Task 1: Given a model $\cM$ with hyperparameters $\theta$ and a dataset $\cD$, find $\theta$ such that $\cM$ fits $\cD$ the best.
    • e.g. $\cM$: A GPR with a correlation function; $\theta$: Length scales in the correlation function.
  • Task 2: Given several models $\cM_1, \cM_2, \cdots$ and a dataset $\cD$, find $\cM_j$ that fits $\cD$ the best.
    • e.g. GPR's with different correlation/trend functions.
  • For Task 1, one can determine $\theta$ by maximizing the likelihood $$ \theta_{MLE} = \mathrm{arg\ max}_\theta p(\cD|\theta,\cM) $$

    As disscused in the previous lectures for GPR.

  • For Task 2, one could use cross validation, or certain information criteria, to assess the fitness of different models.
  • For Task 1, one could also employ a Bayesian approach by assigning priors to $\theta$, and evaluate the posterior, $$ p(\theta|\cD,\cM)=\frac{p(\cD|\theta,\cM)p(\theta|\cM)}{p(\cD|\cM)}\propto p(\cD|\theta,\cM)p(\theta|\cM) $$
    • The posterior $p(\theta|\cD,\cM)$ provides an estimation of $\theta$ given the dataset and model, with a confidence bound.
  • For Task 2, the evidence $p(\cD|\cM)$ provides a "rigorous" and quantitative assessment of the model fitness.

Sounds like better than MLE+CV?

Two issues with the Bayesian approach:

  • For Task 1, the posterior $p(\theta|\cD,\cM)$ typically do not have a closed-form expression. For example, in GPR
    • $p(\cD|\theta,\cM)$ is typically Gaussian - but in some fancier models it is non-Gaussian
    • $p(\theta|\cM)$ could be a Beta distribution, if $\theta$ are the length scales (positive valued)
  • As a result, for Task 2, it is non-trivial to evaluate the integral for the evidence: $$ p(\cD|\cM) = \int_\Theta p(\cD|\theta,\cM)p(\theta|\cM) d\theta $$

Solution - Markov chain Monte Carlo algorithm

  • A procedure to generate samples $\{\theta^{(i)}\}_{i=1}^N$ from the posterior distribution $p(\theta|\cD,\cM)\propto p(\cD|\theta,\cM)p(\theta|\cM)$.
    • Then one can use a histogram, or a kernel density estimation, to approximate the posterior distribution.
    • With the posterior distribution, you can locate the best parameters.
    • You can also evaluate the evidence to assess the fitness of the model.
  • Challenge: How to generate samples satisfying an unnormalized distribution $p(\cD|\theta,\cM)p(\theta|\cM)$?

Monte Carlo Simulation¶

  • "Integration by Darts": $$ \int f(\vz)p(\vz) d\vz \approx \frac{1}{L}\sum_{l=1}^L f(\vz^{(l)}) $$
  • where the samples $\vz^{(l)}$ are drawn from $p(\vz)$.

  • For the evidence, one can draw $\theta^{(i)}$ from $p(\theta|\cM)$, and do: $$ p(\cD|\cM) = \int_\Theta p(\cD|\theta,\cM)p(\theta|\cM) d\theta \approx \frac{1}{N}\sum_{i=1}^N p(\cD|\theta^{(i)},\cM) $$

    So the issue for Task 2 is somehow resolved.

In [15]:
# Estimating area of circle by MC.
# Very simple as p(z) is a uniform distribution defined on unit square.
# and f(z)=1 in circle and f(z)=0 out of circle.

N = 10000
s = np.random.uniform(size=(2,N))
x, y = s[0], s[1]
r2 = (x-0.5)**2 + (y-0.5)**2
m = r2 <= 0.25

f = plt.figure(figsize=(6,4))
plt.plot(x[m], y[m], 'b.')
plt.plot(x[~m], y[~m], 'r.')
plt.title('Area: Accurate = {0:4.3f}, Approx. = {1:4.3f}'.format(np.pi*0.25, np.sum(m)/float(N)))
a = plt.axis('equal')
In [16]:
area = np.cumsum(m)/np.arange(1.0,N+1.0)
f = plt.figure(figsize=(6,4))
_=plt.loglog(np.abs(area-np.pi*0.25))

Importance Sampling¶

For a distribution $p(\vz)$ that is hard to sample, one can introduce a simpler distribution $q(\vz)$, $$ \int f(\vz)p(\vz) d\vz = \int f(\vz)\frac{p(\vz)}{q(\vz)}q(\vz) d\vz \approx \frac{1}{L}\sum_{l=1}^L f(\vz^{(l)})\frac{p(\vz^{(l)})}{q(\vz^{(l)})} $$

But maybe the target distribution cannot even be easily normalized, one can do, $$ \int f(\vz)p(\vz) d\vz \approx \frac{Z_q}{Z_p} \frac{1}{L}\sum_{l=1}^L \tilde{r}^{(l)}f(\vz^{(l)}) $$ where $Z_q$ and $Z_p$ are normalizing factors for $\tilde{q}$ and $\tilde{p}$, respectively, and $$ \tilde{r}^{(l)}=\frac{\tilde{p}(\vz^{(l)})}{\tilde{q}(\vz^{(l)})} $$

But one also has, $$ \frac{Z_p}{Z_q} = \int \frac{\tilde{p}(\vz)}{\tilde{q}(\vz)}q(\vz) d\vz \approx \frac{1}{L}\sum_{l=1}^L \tilde{r}^{(l)} $$ so the integral simplifies to $$ \int f(\vz)p(\vz) d\vz \approx \sum_{l=1}^L w^{(l)}f(\vz^{(l)}),\quad w^{(l)}=\tilde{r}^{(l)}\left/\left(\sum_l \tilde{r}^{(l)}\right)\right. $$

Accept-Reject Sampling¶

Given a difficult, maybe unnormalized, distribution $p(\vz)$:

  • Pick $q(\vz)$ and a constant $k$ such that $p(\vz)<kq(\vz)$.
  • Sample $\vz_0$ from $q(\vz)$
  • Sample $u\sim \mathrm{Uniform}[0,kq(\vz_0)]$
  • If $u\leq p(\vz_0)$, accept $\vz_0$, otherwise reject.

The samples of $p(\vz)$ is represented by a carefully selected subset of samples of $q(\vz)$.

In [18]:
f = plt.figure(figsize=(6,4))
plt.plot(x, g, 'b-', label='g(x)')
plt.plot(x, c*g, 'b--', label='c*g(x)')
plt.plot(x, y, 'k-', label='f(x)')
plt.hist(ss_all, bins=50, density=True, color='y', label='All samples')
plt.hist(ss_acc, bins=50, density=True, color='r', label='Accepted')
plt.legend()
_=plt.title('# of Samples: {0}'.format(Nsmp))

Markov Models¶

Uses material from [MLAPP]

Markov Models¶

A Markov chain is a series of random variables $X_1, \dots, X_N$ satisfying the Markov property:

The future is independent of the past, given the present. $$ p(x_n | x_1, \dots, x_{n-1}) = p(x_n | x_{n-1}) $$

A chain is stationary if the transition probabilities do not change with time.

Markov Models: Joint Distribution¶

If a sequence has the Markov property, then the joint distribution factorizes according to $$ p(x_1, \dots, x_N) = p(x_1) \prod_{n=2}^N p(x_n | x_{n-1}) $$

Markov Models: Transition Matrix¶

Suppose $X_t \in \{1,\dots,K\}$ is discrete. Then, a stationary chain with $N$ states can be described by a transition matrix, $A \in \bR^{N \times N}$ where $$ a_{ij} = p(X_t=j \mid X_{t-1}=i) $$

is the probability of transitioning from state $i$ to $j$.

Each row sums to one, $\sum_{j=1}^K A_{ij} = 1$, so $A$ is a row-stochastic matrix.

Markov Models: Transition Diagram¶

Transitions between states can be represented as a graph:

Markov Models: State Vectors¶

Consider a row vector $x_t \in \bR^{K \times 1}$ with entries $x_{tj} = p(X_t = j)$. Then, $$ \begin{align} p(X_t = j) &= \sum_{i=1}^K p(X_t = j \mid X_{t-1} = i) p(X_{t-1} = i) \\ &= \sum_{i=1}^K A_{ij} x_{t-1,i} \\ \end{align} $$ Therefore, we conclude $x_t = x_{t-1} A$. Note $\sum_{j=1}^K x_{tj} = 1$.

Markov Models: Matrix Powers¶

Since $x_t = x_{t-1} A$, this suggests that in general, $$ x_{t} = x_{t-1} A = x_{t-2} A^2 = \cdots x_{0} A^t $$ If we know the initial state probabilities $x_0$, we can find the probabilities of landing in any state at time $t > 0$.

Example: Weather¶

Suppose the weather is either $R=\text{Rainy}$ or $S=\text{Sunny}$, $$ A = \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} $$

Example: Weather¶

Suppose today is sunny, $x_0 = \begin{bmatrix} 1 & 0 \end{bmatrix}$. We can predict tomorrow's weather, $$ x_1 = x_0 A = \begin{bmatrix} 0.9 & 0.1 \end{bmatrix} $$ The weather over the next several days will be $$ \begin{align} x_2 &= x_1 A = \begin{bmatrix} 0.86 & 0.14 \end{bmatrix} \\ x_3 &= x_2 A = \begin{bmatrix} 0.844 & 0.156 \end{bmatrix} \\ x_4 &= x_3 A = \begin{bmatrix} 0.8376 & 0.1624 \end{bmatrix} \end{align} $$

Question: What happens to $x_0 A^n$ as $n \rightarrow \infty$?

Markov Chains: Stationary Distribution¶

If we ever reach a stage $x$ where $$ x = xA $$ then we have reached the stationary distribution of the chain.

  • To find $x=v^T$, solve the eigenvalue problem $A^T v = v$
  • Under certain conditions, the limiting distribution $\lim_{n \rightarrow \infty} x_0 A^n = x$
  • Stationary distribution $x$ does not depend on the starting state $x_0$
In [20]:
f = plt.figure(figsize=(6,4))
for _s in history:
    plt.plot(_s[0])
plt.xlabel('No. of transitions')
_=plt.ylabel('Probability of Sunny')

MCMC¶

Detailed Balance¶

  • A stationary distribution $\pi(x)$ satisfies the detailed balance condition: $$ \pi(i)P(i,j) = \pi(j)P(j,i) $$

Proof: $$ \sum\limits_{i=1}^{\infty}\pi(i)P(i,j) = \sum\limits_{i=1}^{\infty} \pi(j)P(j,i) = \pi(j)\sum\limits_{i=1}^{\infty} P(j,i) = \pi(j) $$ i.e. $\pi P = \pi$

  • So it is possible to find the transition matrix from the stationary distribution, which could be the distribution to be sampled.
  • Yet it is nontrivial to satisfy detailed balance. With a random guess $Q(x,y)$, one gets $$ \pi(i)Q(i,j) \neq \pi(j)Q(j,i) $$

Similar to the idea from accept-reject sampling, let's introduce a correction $\alpha(i,j)$, called acceptance rate, such that: $$ \pi(i)Q(i,j)\alpha(i,j) = \pi(j)Q(j,i)\alpha(j,i) $$

To make the above happen, $\alpha$ should satisfy, $$ \begin{align} \alpha(i,j) &= \pi(j)Q(j,i) \\ \alpha(j,i) &= \pi(i)Q(i,j) \end{align} $$

And we have, $$ P(i,j) = Q(i,j)\alpha(i,j) $$

Basic MCMC¶

We do a loop of sampling:

  • Sample $x_*$ from $Q(x,x_{t-1})$
  • Sample $u \sim \mathrm{Uniform}[0,1]$
  • If $u < \alpha(x_{t-1},x_{*}) = \pi(x_{*})Q(x_{*},x_{t-1})$, then $x_t= x_{*}$, else $x_t= x_{t-1}$ The set of samples $\{x_0, x_1, \cdots, x_n\}$ are then the MCMC samples.

Issue: $\alpha$ could be low, resulting in many rejected samples.

Metropolis-Hastings Sampling¶

Note that $\alpha$ is insusceptible to scales: $$ \pi(i)Q(i,j)\alpha(i,j) = \pi(j)Q(j,i)\alpha(j,i) \ \Leftrightarrow\ \pi(i)Q(i,j)(5\alpha(i,j)) = \pi(j)Q(j,i)(5\alpha(j,i)) $$

So one could define $$ \alpha(i,j) = \min\left\{ \frac{\pi(j)Q(j,i)}{\pi(i)Q(i,j)},1 \right\} $$ to obtain an acceptance rate $\alpha$ that is maximum possible.

Proof that the definition works: $$ \begin{align} &\quad \pi(i)Q(i,j)\alpha(i,j) \\ &= \pi(i)Q(i,j)\min\left\{ \frac{\pi(j)Q(j,i)}{\pi(i)Q(i,j)},1 \right\} \\ &= \min\{\pi(j)Q(j,i), \pi(i)Q(i,j)\} = \pi(j)Q(j,i) \min\left\{ 1,\frac{\pi(i)Q(i,j)}{\pi(j)Q(j,i)} \right\} \\ &= \pi(j)Q(j,i)\alpha(j,i) \end{align} $$

In [22]:
f = plt.figure(figsize=(6,4))
plt.hist(p, 50, density=True, facecolor='red', alpha=0.7, label='Sample PDF')
plt.scatter(p, [56.818*targetPDF(_) for _ in p], label='Target PDF')
_=plt.legend()

Issues¶

Curse of dimensionality:

  • For example, volume of high prod. region of high-dim Gaussian ~ $\sigma^{-D}$
  • Computational cost: $\frac{\pi(j)Q(j,i)}{\pi(i)Q(i,j)}$, and the proposed sample could be rejected
  • Sometimes it is even hard to get joint distribution.

One of the cures is the Gibbs sampling method.

Gibbs Sampling¶

Detailed balance requires: $$ \pi(i)P(i,j) = \pi(j)P(j,i) $$

Consider a 2D case $\pi(x_1,x_2)$ and two points: $A(x_1^{(1)},x_2^{(1)})$ and $B(x_1^{(1)},x_2^{(2)})$. One has

$$ \begin{align} \pi(x_1^{(1)},x_2^{(1)}) \pi(x_2^{(2)} | x_1^{(1)}) &= \pi(x_1^{(1)})\pi(x_2^{(1)}|x_1^{(1)}) \pi(x_2^{(2)} | x_1^{(1)}) \\ \pi(x_1^{(1)},x_2^{(2)}) \pi(x_2^{(1)} | x_1^{(1)}) &= \pi(x_1^{(1)}) \pi(x_2^{(2)} | x_1^{(1)})\pi(x_2^{(1)}|x_1^{(1)}) \end{align} $$

This is saying: $$ \begin{align} \pi(x_1^{(1)},x_2^{(1)}) \pi(x_2^{(2)} | x_1^{(1)}) &= \pi(x_1^{(1)},x_2^{(2)}) \pi(x_2^{(1)} | x_1^{(1)}) \\ \mbox{or } \pi(A) \pi(x_2^{(2)} | x_1^{(1)}) &= \pi(B) \pi(x_2^{(1)} | x_1^{(1)}) \end{align} $$

Or, $\pi(x_2| x_1^{(1)})$ can be used to satisfy detailed balance of Markov chain!

Algorithm - 2D version

  • Given stationary distribution $\pi(x_1,x_2)$
  • Set initial states $x_1^{(0)}$ and $x_2^{(0)}$
  • Loop for sampling:
  • (a) Sample $x_2^{t+1}$ from $P(x_2|x_1^{(t)})$
  • (b) Sample $x_1^{t+1}$ from $P(x_1|x_2^{(t)})$

Then, samples are generated from rotating the coordinate axes $$ (x_1^{(1)}, x_2^{(1)}) \to (x_1^{(1)}, x_2^{(2)}) \to (x_1^{(2)}, x_2^{(2)}) \to ... \to (x_1^{(n)}, x_2^{(n)}) $$

In [24]:
f = plt.figure(figsize=(6,4))
plt.hist(p[0], 50, density=True, facecolor='green', alpha=0.5, label='x')
plt.hist(p[1], 50, density=True, facecolor='red', alpha=0.5, label='y')
_=plt.legend()
In [25]:
Nb = 1000
Ng = 101
x = np.linspace(2, 8, Ng)
y = np.linspace(-7, 5, Ng)
X, Y = np.meshgrid(x, y)
Z = target.pdf(np.vstack([X.reshape(-1), Y.reshape(-1)]).T).reshape(Ng, Ng)
f = plt.figure(figsize=(6,4))
plt.scatter(p[0][Nb:], p[1][Nb:], c=p[2][Nb:])
_=plt.contour(X, Y, Z, colors='w', linewidth=2)
In [26]:
Nb = 1000
M1 = np.mean(p[0][Nb:])
M2 = np.mean(p[1][Nb:])
dx = p[0][Nb:]-M1
dy = p[1][Nb:]-M2
S1 = np.mean(dx**2)
S2 = np.mean(dy**2)
RR = np.mean(dx*dy)
print([M1, M2, S1, S2, RR])
[4.968675138645328, -1.023543892624309, 1.0023609697090448, 4.0025096384248515, 1.0018483992282012]

There are many other methods¶

One example - auxiliary variable method:

  • Introduce a new variable $\vu$, $\vz, \vu\sim p(\vz, \vu)$, so that $$ \int f(\vz)p(\vz)d\vz = \int\int f(\vz)p(\vz,\vu)d\vz d\vu = \frac{1}{N}\sum_{i=1}^N f(x^{(i)}) $$
  • Efficient if $p(\vz|\vu)$ or $p(\vu|\vz)$ is easy to sample, or if $p(\vz, \vu)$ is easy to explore.

One specific method is Hamiltonian Monte Carlo

  • Let $p(\vz) \propto \exp[-E(\vz)]$, and $$ p(\vz, \vu) \propto \exp[-E(\vz)]\exp(-\vu^T\vu/2)\equiv \exp[-H(\vz,\vu)] $$
  • $E(\vz)$ and $\frac{1}{2}\vu^T\vu$ can be considered potential and kinetic energy, respectively. Due to energy conservation, $$ p(\vz,\vu) = p(\vz',\vu') $$
  • So one can solve Hamitonian dynamics to do sampling!
    • A practical variant: HMC with No U-Turns Sampling (NUTS)
    • A challenge: Hamitonian dynamics requires the evaluation of $\frac{\partial E}{\partial \vz}$ - not always available.
    • Not a problem for gradient-enabled frameworks, e.g. TensorFlow, pyTorch, FluxML, etc.