Model selection refers to two tasks of interest:
As discussed in the previous lectures on GPR.
Sounds better than MLE+CV?
Two issues with the Bayesian approach:
Solution - Markov chain Monte Carlo algorithm
The basic Monte Carlo estimate of an expectation is $$ \int f(\vz)p(\vz)\, d\vz \approx \frac{1}{L}\sum_{l=1}^L f(\vz^{(l)}) $$ where the samples $\vz^{(l)}$ are drawn from $p(\vz)$.
For the evidence, one can draw $\theta^{(i)}$ from $p(\theta|\cM)$, and do: $$ p(\cD|\cM) = \int_\Theta p(\cD|\theta,\cM)p(\theta|\cM) d\theta \approx \frac{1}{N}\sum_{i=1}^N p(\cD|\theta^{(i)},\cM) $$
This resolves the issue for Task 2, at least in principle.
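As a sanity check, here is a minimal sketch with an assumed conjugate toy model (a standard normal prior on $\theta$, a unit-variance Gaussian likelihood, and a single datum $y$; none of this is from the lecture), for which the Monte Carlo estimate of the evidence can be compared to the closed-form value, a normal density with mean 0 and variance 2 evaluated at $y$:

# Toy Monte Carlo evidence estimate (illustrative assumptions):
# prior theta ~ N(0,1), likelihood y | theta ~ N(theta,1), single datum y.
import numpy as np
from scipy import stats

y = 0.7
N = 100000
theta = np.random.randn(N)                              # theta^(i) ~ p(theta|M)
mc_evidence = np.mean(stats.norm(theta, 1.0).pdf(y))    # (1/N) sum_i p(D|theta^(i),M)
exact_evidence = stats.norm(0.0, np.sqrt(2.0)).pdf(y)   # closed form: N(y; 0, 2)
print(mc_evidence, exact_evidence)                      # the two should agree closely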
# Estimating the area of a circle by Monte Carlo.
# Very simple: p(z) is a uniform distribution on the unit square,
# and f(z)=1 inside the circle, f(z)=0 outside.
import numpy as np
import matplotlib.pyplot as plt
N = 10000
s = np.random.uniform(size=(2,N))
x, y = s[0], s[1]
r2 = (x-0.5)**2 + (y-0.5)**2
m = r2 <= 0.25
f = plt.figure(figsize=(6,4))
plt.plot(x[m], y[m], 'b.')
plt.plot(x[~m], y[~m], 'r.')
plt.title('Area: Accurate = {0:4.3f}, Approx. = {1:4.3f}'.format(np.pi*0.25, np.sum(m)/float(N)))
a = plt.axis('equal')
# Running estimate of the area and its absolute error vs. number of samples
area = np.cumsum(m)/np.arange(1.0,N+1.0)
f = plt.figure(figsize=(6,4))
_=plt.loglog(np.abs(area-np.pi*0.25))
For a distribution $p(\vz)$ that is hard to sample, one can introduce a simpler distribution $q(\vz)$, $$ \int f(\vz)p(\vz) d\vz = \int f(\vz)\frac{p(\vz)}{q(\vz)}q(\vz) d\vz \approx \frac{1}{L}\sum_{l=1}^L f(\vz^{(l)})\frac{p(\vz^{(l)})}{q(\vz^{(l)})} $$
If the target distribution cannot even be easily normalized, one can write, $$ \int f(\vz)p(\vz) d\vz \approx \frac{Z_q}{Z_p} \frac{1}{L}\sum_{l=1}^L \tilde{r}^{(l)}f(\vz^{(l)}) $$ where $Z_q$ and $Z_p$ are the normalizing factors of $\tilde{q}$ and $\tilde{p}$, respectively, and $$ \tilde{r}^{(l)}=\frac{\tilde{p}(\vz^{(l)})}{\tilde{q}(\vz^{(l)})} $$
But one also has, $$ \frac{Z_p}{Z_q} = \int \frac{\tilde{p}(\vz)}{\tilde{q}(\vz)}q(\vz) d\vz \approx \frac{1}{L}\sum_{l=1}^L \tilde{r}^{(l)} $$ so the integral simplifies to $$ \int f(\vz)p(\vz) d\vz \approx \sum_{l=1}^L w^{(l)}f(\vz^{(l)}),\quad w^{(l)}=\tilde{r}^{(l)}\left/\left(\sum_l \tilde{r}^{(l)}\right)\right. $$
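A minimal sketch of this self-normalized importance sampling estimate, under assumed choices (an unnormalized 1D Gaussian target $\tilde{p}$ and a broad Gaussian proposal $q$; the quantity estimated here is the target mean):

# Self-normalized importance sampling (illustrative assumptions):
import numpy as np
from scipy import stats

def p_tilde(z):
    return np.exp(-0.5*(z - 1.0)**2)         # unnormalized target, proportional to N(1,1)

q = stats.norm(0.0, 3.0)                     # easy-to-sample proposal q(z)
L = 100000
z = q.rvs(size=L)                            # z^(l) ~ q
r = p_tilde(z)/q.pdf(z)                      # unnormalized weights r~^(l)
w = r/np.sum(r)                              # self-normalized weights w^(l)
print(np.sum(w*z))                           # estimate of E[z]; should be close to 1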
Given a difficult, maybe unnormalized, distribution $p(\vz)$:
Samples of $p(\vz)$ are obtained as a carefully selected subset of samples drawn from $q(\vz)$.
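A minimal accept-reject sketch that could produce the variables used in the plotting cell below; the target (a Beta(3,4) density), the uniform proposal on $[0,1]$, and the envelope constant are assumptions, as are the variable names:

# Accept-reject sampling sketch (illustrative assumptions):
import numpy as np
from scipy import stats

f_target = stats.beta(3, 4).pdf              # target density f(z), max approx. 2.07
Nsmp = 20000
c = 2.2                                      # envelope constant: c*q(z) >= f(z) for q = Uniform(0,1)

ss_all = np.random.uniform(size=Nsmp)        # candidates drawn from the proposal q
u = np.random.uniform(size=Nsmp)
ss_acc = ss_all[u < f_target(ss_all)/c]      # accept with probability f(z)/(c*q(z))

x = np.linspace(0.0, 1.0, 200)
g = np.ones_like(x)                          # proposal density q(z) = 1 for plotting
y = f_target(x)                              # target density for plotting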
f = plt.figure(figsize=(6,4))
plt.plot(x, g, 'b-', label='g(x)')
plt.plot(x, c*g, 'b--', label='c*g(x)')
plt.plot(x, y, 'k-', label='f(x)')
plt.hist(ss_all, bins=50, density=True, color='y', label='All samples')
plt.hist(ss_acc, bins=50, density=True, color='r', label='Accepted')
plt.legend()
_=plt.title('# of Samples: {0}'.format(Nsmp))
Uses material from [MLAPP]
A Markov chain is a series of random variables $X_1, \dots, X_N$ satisfying the Markov property:
The future is independent of the past, given the present. $$ p(x_n | x_1, \dots, x_{n-1}) = p(x_n | x_{n-1}) $$
A chain is stationary if the transition probabilities do not change with time.
If a sequence has the Markov property, then the joint distribution factorizes according to $$ p(x_1, \dots, x_N) = p(x_1) \prod_{n=2}^N p(x_n | x_{n-1}) $$
Suppose $X_t \in \{1,\dots,K\}$ is discrete. Then, a stationary chain with $K$ states can be described by a transition matrix $A \in \bR^{K \times K}$, where $$ A_{ij} = p(X_t=j \mid X_{t-1}=i) $$
is the probability of transitioning from state $i$ to state $j$.
Each row sums to one, $\sum_{j=1}^K A_{ij} = 1$, so $A$ is a row-stochastic matrix.
Transitions between states can be represented as a graph:
Consider a row vector $x_t \in \bR^{1 \times K}$ with entries $x_{tj} = p(X_t = j)$. Then, $$ \begin{align} p(X_t = j) &= \sum_{i=1}^K p(X_t = j \mid X_{t-1} = i) p(X_{t-1} = i) \\ &= \sum_{i=1}^K A_{ij} x_{t-1,i} \\ \end{align} $$ Therefore, we conclude $x_t = x_{t-1} A$. Note $\sum_{j=1}^K x_{tj} = 1$.
Since $x_t = x_{t-1} A$, this suggests that in general, $$ x_{t} = x_{t-1} A = x_{t-2} A^2 = \cdots = x_{0} A^t $$ If we know the initial state probabilities $x_0$, we can find the probabilities of landing in any state at time $t > 0$.
Suppose the weather is either $S=\text{Sunny}$ or $R=\text{Rainy}$, with the states ordered $(S, R)$: $$ A = \begin{bmatrix} 0.9 & 0.1 \\ 0.5 & 0.5 \end{bmatrix} $$
Suppose today is sunny, $x_0 = \begin{bmatrix} 1 & 0 \end{bmatrix}$. We can predict tomorrow's weather, $$ x_1 = x_0 A = \begin{bmatrix} 0.9 & 0.1 \end{bmatrix} $$ The weather over the next several days will be $$ \begin{align} x_2 &= x_1 A = \begin{bmatrix} 0.86 & 0.14 \end{bmatrix} \\ x_3 &= x_2 A = \begin{bmatrix} 0.844 & 0.156 \end{bmatrix} \\ x_4 &= x_3 A = \begin{bmatrix} 0.8376 & 0.1624 \end{bmatrix} \end{align} $$
Question: What happens to $x_0 A^n$ as $n \rightarrow \infty$?
If we ever reach a distribution $x$ where $$ x = xA $$ then we have reached the stationary distribution of the chain. For the weather example, solving $x = xA$ with $\sum_j x_j = 1$ gives $x = \begin{bmatrix} 5/6 & 1/6 \end{bmatrix} \approx \begin{bmatrix} 0.833 & 0.167 \end{bmatrix}$, consistent with the iterates above.
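A minimal sketch (an assumption about how the `history` used in the plot below might be built): iterate $x_t = x_{t-1}A$ for the weather chain from several initial distributions and record each trajectory.

# Iterate the weather chain from several initial distributions (illustrative):
import numpy as np
import matplotlib.pyplot as plt

A = np.array([[0.9, 0.1],
              [0.5, 0.5]])                   # transition matrix, states (Sunny, Rainy)
T = 30
history = []
for x0 in [np.array([1.0, 0.0]),             # start sunny
           np.array([0.0, 1.0]),             # start rainy
           np.array([0.5, 0.5])]:            # start uncertain
    traj = [x0]
    for _ in range(T):
        traj.append(traj[-1] @ A)            # x_t = x_{t-1} A
    history.append(np.array(traj).T)         # row 0 holds P(Sunny) over time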
f = plt.figure(figsize=(6,4))
for _s in history:
plt.plot(_s[0])
plt.xlabel('No. of transitions')
_=plt.ylabel('Probability of Sunny')
Proof (using detailed balance, $\pi(i)P(i,j) = \pi(j)P(j,i)$): $$ \sum\limits_{i=1}^{\infty}\pi(i)P(i,j) = \sum\limits_{i=1}^{\infty} \pi(j)P(j,i) = \pi(j)\sum\limits_{i=1}^{\infty} P(j,i) = \pi(j) $$ i.e., $\pi$ is stationary: $\pi P = \pi$.
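As a quick numerical check (a sketch), the weather chain above satisfies both detailed balance and stationarity with $\pi = [5/6,\ 1/6]$:

import numpy as np

A = np.array([[0.9, 0.1],
              [0.5, 0.5]])                        # weather chain from above
pi = np.array([5/6, 1/6])                         # candidate stationary distribution
print(np.isclose(pi[0]*A[0, 1], pi[1]*A[1, 0]))   # detailed balance: pi(i)P(i,j) = pi(j)P(j,i)
print(np.allclose(pi @ A, pi))                    # stationarity: pi P = pi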
Similar to the idea of accept-reject sampling, let's introduce a correction $\alpha(i,j)$, called the acceptance rate, such that: $$ \pi(i)Q(i,j)\alpha(i,j) = \pi(j)Q(j,i)\alpha(j,i) $$
To make the above hold, one can simply take, $$ \begin{align} \alpha(i,j) &= \pi(j)Q(j,i) \\ \alpha(j,i) &= \pi(i)Q(i,j) \end{align} $$
And we have, $$ P(i,j) = Q(i,j)\alpha(i,j) $$
We do a loop of sampling:
Issue: $\alpha$ could be low, resulting in many rejected samples.
Note that detailed balance is unaffected when both acceptance rates are scaled by the same constant (here 5, for example): $$ \pi(i)Q(i,j)\alpha(i,j) = \pi(j)Q(j,i)\alpha(j,i) \ \Leftrightarrow\ \pi(i)Q(i,j)(5\alpha(i,j)) = \pi(j)Q(j,i)(5\alpha(j,i)) $$
So one could define $$ \alpha(i,j) = \min\left\{ \frac{\pi(j)Q(j,i)}{\pi(i)Q(i,j)},1 \right\} $$ to obtain the largest acceptance rate possible (capped at 1).
Proof that the definition works: $$ \begin{align} &\quad \pi(i)Q(i,j)\alpha(i,j) \\ &= \pi(i)Q(i,j)\min\left\{ \frac{\pi(j)Q(j,i)}{\pi(i)Q(i,j)},1 \right\} \\ &= \min\{\pi(j)Q(j,i), \pi(i)Q(i,j)\} = \pi(j)Q(j,i) \min\left\{ 1,\frac{\pi(i)Q(i,j)}{\pi(j)Q(j,i)} \right\} \\ &= \pi(j)Q(j,i)\alpha(j,i) \end{align} $$
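The sampling cell that produces `p` and `targetPDF` for the plot below is not shown; a minimal Metropolis-Hastings sketch with assumed choices (a 1D unnormalized target and a symmetric Gaussian random-walk proposal, so the $Q$ factors cancel in $\alpha$) could look like the following. The constant 56.818 in the plotting cell is presumably the normalization constant of the original target, which is not reproduced here.

# Metropolis-Hastings sketch (target and proposal are illustrative assumptions):
import numpy as np

def targetPDF(z):
    # Unnormalized 1D target (assumed for illustration only)
    return np.exp(-0.5*z**2)*(2.0 + np.sin(5.0*z) + np.sin(2.0*z))

Nsmp = 50000
step = 1.0                                            # std of the random-walk proposal Q
p = np.zeros(Nsmp)
z = 0.0                                               # initial state
for i in range(Nsmp):
    z_new = z + step*np.random.randn()                # propose j from Q(i, .)
    alpha = min(1.0, targetPDF(z_new)/targetPDF(z))   # symmetric Q: ratio of target densities
    if np.random.rand() < alpha:
        z = z_new                                     # accept the move
    p[i] = z                                          # otherwise keep the current state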
f = plt.figure(figsize=(6,4))
plt.hist(p, 50, density=True, facecolor='red', alpha=0.7, label='Sample PDF')
plt.scatter(p, [56.818*targetPDF(_) for _ in p], label='Target PDF')
_=plt.legend()
Curse of dimensionality:
One remedy is the Gibbs sampling method.
Detailed balance requires: $$ \pi(i)P(i,j) = \pi(j)P(j,i) $$
Consider a 2D case $\pi(x_1,x_2)$ and two points: $A(x_1^{(1)},x_2^{(1)})$ and $B(x_1^{(1)},x_2^{(2)})$. One has
$$ \begin{align} \pi(x_1^{(1)},x_2^{(1)}) \pi(x_2^{(2)} | x_1^{(1)}) &= \pi(x_1^{(1)})\pi(x_2^{(1)}|x_1^{(1)}) \pi(x_2^{(2)} | x_1^{(1)}) \\ \pi(x_1^{(1)},x_2^{(2)}) \pi(x_2^{(1)} | x_1^{(1)}) &= \pi(x_1^{(1)}) \pi(x_2^{(2)} | x_1^{(1)})\pi(x_2^{(1)}|x_1^{(1)}) \end{align} $$ This is saying: $$ \begin{align} \pi(x_1^{(1)},x_2^{(1)}) \pi(x_2^{(2)} | x_1^{(1)}) &= \pi(x_1^{(1)},x_2^{(2)}) \pi(x_2^{(1)} | x_1^{(1)}) \\ \mbox{or } \pi(A) \pi(x_2^{(2)} | x_1^{(1)}) &= \pi(B) \pi(x_2^{(1)} | x_1^{(1)}) \end{align} $$
In other words, the conditional $\pi(x_2 \mid x_1^{(1)})$ can be used as the transition probability to satisfy detailed balance of the Markov chain!
Algorithm - 2D version
Then, samples are generated by sweeping through the coordinates in turn (see the sketch below), $$ (x_1^{(1)}, x_2^{(1)}) \to (x_1^{(1)}, x_2^{(2)}) \to (x_1^{(2)}, x_2^{(2)}) \to \dots \to (x_1^{(n)}, x_2^{(n)}) $$
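A minimal 2D Gibbs sampling sketch; the target (a bivariate normal with mean $[5, -1]$ and covariance $\begin{bmatrix}1 & 1\\ 1 & 4\end{bmatrix}$) and the layout of `p` are assumptions inferred from the plotting and moment-estimation cells below.

# 2D Gibbs sampling sketch (target and variable layout are assumptions):
import numpy as np
from scipy import stats

mu = np.array([5.0, -1.0])
Sig = np.array([[1.0, 1.0],
                [1.0, 4.0]])
target = stats.multivariate_normal(mean=mu, cov=Sig)

Nsmp = 20000
x1, x2 = mu[0], mu[1]                        # initial state
p = [np.zeros(Nsmp), np.zeros(Nsmp), np.zeros(Nsmp)]
for i in range(Nsmp):
    # sample x1 | x2 from its exact Gaussian conditional
    m1 = mu[0] + Sig[0, 1]/Sig[1, 1]*(x2 - mu[1])
    s1 = np.sqrt(Sig[0, 0] - Sig[0, 1]**2/Sig[1, 1])
    x1 = m1 + s1*np.random.randn()
    # sample x2 | x1 from its exact Gaussian conditional
    m2 = mu[1] + Sig[1, 0]/Sig[0, 0]*(x1 - mu[0])
    s2 = np.sqrt(Sig[1, 1] - Sig[1, 0]**2/Sig[0, 0])
    x2 = m2 + s2*np.random.randn()
    p[0][i], p[1][i] = x1, x2
    p[2][i] = target.pdf([x1, x2])           # used as the scatter color below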
f = plt.figure(figsize=(6,4))
plt.hist(p[0], 50, density=True, facecolor='green', alpha=0.5, label='x')
plt.hist(p[1], 50, density=True, facecolor='red', alpha=0.5, label='y')
_=plt.legend()
Nb = 1000   # number of burn-in samples to discard
Ng = 101    # grid resolution for the contour plot of the target density
x = np.linspace(2, 8, Ng)
y = np.linspace(-7, 5, Ng)
X, Y = np.meshgrid(x, y)
Z = target.pdf(np.vstack([X.reshape(-1), Y.reshape(-1)]).T).reshape(Ng, Ng)
f = plt.figure(figsize=(6,4))
plt.scatter(p[0][Nb:], p[1][Nb:], c=p[2][Nb:])
_=plt.contour(X, Y, Z, colors='w', linewidths=2)
Nb = 1000                  # burn-in samples to discard
M1 = np.mean(p[0][Nb:])    # sample mean of x1
M2 = np.mean(p[1][Nb:])    # sample mean of x2
dx = p[0][Nb:]-M1
dy = p[1][Nb:]-M2
S1 = np.mean(dx**2)        # sample variance of x1
S2 = np.mean(dy**2)        # sample variance of x2
RR = np.mean(dx*dy)        # sample covariance of x1 and x2
print([M1, M2, S1, S2, RR])
[4.968675138645328, -1.023543892624309, 1.0023609697090448, 4.0025096384248515, 1.0018483992282012]
One example - auxiliary variable method:
One specific method is Hamiltonian Monte Carlo (HMC).