Goal: Partition data $\X = \{ x_1, \dots, x_n \} \subset \R^d$ into $K$ disjoint clusters.
Usually, we fix $K$ beforehand; choosing $K$ from the data instead requires model selection.
First, pick random cluster centers $\mu_k$. Then, repeat until convergence:
E-Step: Assign $x_j$ to the nearest cluster center $\mu_k$, $$ z_j = \arg\min_k || x_j - \mu_k ||^2 $$
M-Step: Re-estimate cluster centers by averaging over assignments: $$ \mu_k = \frac{1}{\sum_{j=1}^N z_{jk}} \sum_{j=1}^N x_j z_{jk} $$ where $z_{jk} = 1$ if $z_j = k$ and $0$ otherwise, and $N_k = \sum_{j=1}^N z_{jk}$ is the number of samples in cluster $k$.
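As a concrete reference, here is a minimal NumPy sketch of this procedure (the function name `kmeans`, the initialization, and the stopping rule are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means sketch: alternate assignment (E-like) and mean (M-like) steps."""
    rng = np.random.default_rng(seed)
    # Pick K random data points as the initial cluster centers.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iters):
        # E-step: assign each point to its nearest center.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (N, K)
        z = dists.argmin(axis=1)                                  # (N,)
        # M-step: re-estimate each center as the mean of its assigned points.
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # no change => converged
            break
        mu = new_mu
    return mu, z
```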
We will explain the "E-step" and "M-step" later.
Clustering problems motivate the definition of a mixture model.
In clustering, we represent mixtures via latent cluster indicators, $$ \begin{align} z_n &\sim \Cat[\pi] & \forall\, n = 1,\dots,N \\ x_n \mid z_n &\sim p_{z_n}(\cdot \mid \theta_{z_n}) & \forall\, n = 1,\dots,N \end{align} $$ where $\pi = (\pi_1, \dots, \pi_K)$ are the mixture weights and $p_k(\cdot \mid \theta_k)$, $k = 1,\dots,K$, are the component (base) distributions with parameters $\theta_k$.
In a Gaussian Mixture Model, the base distributions are Gaussian.
Here are the contours of each base distribution plotted together:
plot_gmm()
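The `plot_gmm()` call above refers to a plotting helper that is not shown here; a minimal sketch of what such a helper might look like, using matplotlib and scipy.stats with made-up example parameters, is:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

def plot_gmm(mus, Sigmas, lim=5.0, ngrid=200):
    """Overlay the density contours of each Gaussian component on one set of axes."""
    xs = np.linspace(-lim, lim, ngrid)
    xx, yy = np.meshgrid(xs, xs)
    grid = np.dstack([xx, yy])                       # (ngrid, ngrid, 2)
    for mu, Sigma in zip(mus, Sigmas):
        plt.contour(xx, yy, multivariate_normal(mu, Sigma).pdf(grid))
    plt.gca().set_aspect("equal")
    plt.show()

# Example with made-up component parameters:
plot_gmm(mus=[np.zeros(2), np.array([2.0, 2.0])],
         Sigmas=[np.eye(2), np.array([[1.0, 0.6], [0.6, 1.0]])])
```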
Model: Specification $$ \begin{align} \theta &= (\vec\pi, \vec\mu, \vec\Sigma) && \text{model parameters} \\ z_n &\sim \Cat[\pi] && \text{cluster indicators} \\ x_n | z_n, \theta &\sim \mathcal{N}(\mu_{z_n}, \Sigma_{z_n}) && \text{base distribution} \end{align} $$
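To make the generative story concrete, here is a minimal sketch of sampling from this model (the helper name `sample_gmm` is ours):

```python
import numpy as np

def sample_gmm(pi, mus, Sigmas, N, seed=0):
    """Draw N samples: z_n ~ Cat(pi), then x_n | z_n ~ N(mu_{z_n}, Sigma_{z_n})."""
    rng = np.random.default_rng(seed)
    z = rng.choice(len(pi), size=N, p=pi)                                  # cluster indicators
    x = np.stack([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])  # one Gaussian draw per point
    return x, z
```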
Goal: Since only the data points $x_n$ are observed, we wish to infer the cluster indicators $z_n$ and estimate the model parameters $\theta = (\vec\pi, \vec\mu, \vec\Sigma)$.
We will use the Expectation Maximization algorithm.
The complete data log-likelihood for a single datapoint $(x_n, z_n)$ is $$ \begin{align} \log p(x_n, z_n | \theta) &= \log \prod_{k=1}^K \left[\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right]^{z_{nk}} \\ &= \sum_{k=1}^K z_{nk} \log \left[\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right] \end{align} $$
If we knew $z_{nk}$, then the MLEs of $\pi_k$, $\mu_k$, and $\Sigma_k$ would be easy to obtain. (Hint: to find $\pi_k$, one needs to incorporate the constraint $\sum_k \pi_k = 1$.)
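As a quick sketch of that hint: add a Lagrange multiplier $\lambda$ for the constraint and set the derivative with respect to $\pi_k$ to zero, $$ \frac{\partial}{\partial \pi_k}\left[ \sum_{n=1}^N \sum_{k'=1}^K z_{nk'} \log \pi_{k'} + \lambda\Big(\sum_{k'=1}^K \pi_{k'} - 1\Big) \right] = \frac{N_k}{\pi_k} + \lambda = 0 \quad\Longrightarrow\quad \pi_k = \frac{N_k}{N}, $$ where $N_k = \sum_{n=1}^N z_{nk}$ and the constraint forces $\lambda = -N$.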
Conversely, if we knew the parameters $\pi_k$, $\mu_k$, and $\Sigma_k$ but not $z_{nk}$, we can try to eliminate the latter by reasoning about its posterior.
The "hidden" posterior for $z_n$ of a single point can be found using Bayes' rule: $$ \begin{align} p(z_{nk} = 1 | x_n, \theta) &= \frac{p(x_n | z_{nk} = 1, \theta) p(z_{nk} = 1 | \theta)}{p(x_n | \theta)} \\ &= \frac{\pi_k \mathcal{N}(x_n | \mu_k, \Sigma_k)}{\sum_{k'=1}^K \pi_{k'} \mathcal{N}(x_n | \mu_{k'}, \Sigma_{k'})} \equiv r_{nk} \end{align} $$
The quantity $r_{nk}$ is the responsibility that cluster $k$ takes for datapoint $x_n$. Moreover, since $z_{nk}$ is a binary indicator, its expectation w.r.t. the posterior is $$ E_z[z_{nk}]=r_{nk} $$
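A minimal sketch of this E-step computation (the helper name `e_step` is ours; in practice one would usually work in log-space with a log-sum-exp for numerical stability, but this keeps the correspondence with the formula direct):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mus, Sigmas):
    """Compute responsibilities r[n, k] = p(z_n = k | x_n, theta) via Bayes' rule."""
    # Unnormalized: pi_k * N(x_n | mu_k, Sigma_k)
    r = np.stack([pi[k] * multivariate_normal(mus[k], Sigmas[k]).pdf(X)
                  for k in range(len(pi))], axis=1)        # (N, K)
    return r / r.sum(axis=1, keepdims=True)                # normalize over k
```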
The expected complete likelihood w.r.t. the hidden posterior is $$ \begin{align} E_z[ \log p(\X,Z|\theta)] &= \sum_{n=1}^N \sum_{k=1}^K E_z\big[ z_{nk} \big] \log \left[\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right] \\ &= \sum_{n=1}^N \sum_{k=1}^K r_{nk} \log \pi_k + \sum_{n=1}^N \sum_{k=1}^K r_{nk} \log \mathcal{N}(x_n \mid \mu_k, \Sigma_k) \end{align} $$
Now the unknown $z_{nk}$ is gone! (Though $r_{nk}$ still depends on the current parameter estimate $\theta$, which is why the procedure will be iterative.)
During the M-Step, we optimize the lower bound with respect to $\theta=(\pi,\mu,\Sigma)$. The updates are $$ \begin{gather} \pi_k = \frac{1}{N} \sum_{n=1}^N r_{nk} = \frac{r_k}{N} \quad \mu_k = \frac{\sum_{n=1}^N r_{nk} x_n}{r_k} \\ \Sigma_k = \frac{\sum_{n=1}^N r_{nk} x_n x_n^T}{r_k} - \mu_k \mu_k^T \end{gather} $$
where $r_k = \sum_{n=1}^N r_{nk}$ is the effective sample size for cluster $k$.
Verify these updates as an exercise.
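A minimal sketch of these M-step updates (the helper name `m_step` is ours):

```python
import numpy as np

def m_step(X, r):
    """Update (pi, mu, Sigma) from responsibilities r, following the formulas above."""
    N, _ = X.shape
    r_k = r.sum(axis=0)                                    # effective sample sizes r_k
    pi = r_k / N
    mus = (r.T @ X) / r_k[:, None]                         # responsibility-weighted means
    # Sigma_k = (sum_n r_nk x_n x_n^T) / r_k - mu_k mu_k^T
    Sigmas = np.stack([(r[:, k, None] * X).T @ X / r_k[k] - np.outer(mus[k], mus[k])
                       for k in range(len(pi))])
    return pi, mus, Sigmas
```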
EM for the GMM is then an iterative procedure: alternate between computing responsibilities $r_{nk}$ (E-step) and updating $\theta = (\pi, \mu, \Sigma)$ (M-step) until convergence, as sketched below.
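Putting the pieces together, a sketch of the full loop, assuming the `e_step` and `m_step` helpers above (the initialization and stopping rule are illustrative choices):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood(X, pi, mus, Sigmas):
    """Observed-data log-likelihood: sum_n log sum_k pi_k N(x_n | mu_k, Sigma_k)."""
    log_p = np.stack([np.log(pi[k]) + multivariate_normal(mus[k], Sigmas[k]).logpdf(X)
                      for k in range(len(pi))], axis=1)
    return logsumexp(log_p, axis=1).sum()

def fit_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Illustrative initialization: uniform weights, random data points as means, identity covariances.
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(len(X), size=K, replace=False)]
    Sigmas = np.stack([np.eye(X.shape[1]) for _ in range(K)])
    prev_ll = -np.inf
    for _ in range(n_iters):
        r = e_step(X, pi, mus, Sigmas)          # E-step: responsibilities
        pi, mus, Sigmas = m_step(X, r)          # M-step: parameter updates
        ll = log_likelihood(X, pi, mus, Sigmas)
        if ll - prev_ll < tol:                  # likelihood is monotone, so this detects convergence
            break
        prev_ll = ll
    return pi, mus, Sigmas, r
```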
Compare to the K-means algorithm: the GMM is "fuzzier", replacing hard assignments $z_{jk} \in \{0, 1\}$ with soft responsibilities $r_{nk} \in [0, 1]$.
The GMM is one of the few generative models covered in this class; the previous models we have learned were discriminative.
Suppose we want $p(Y|X)$, where $Y$ is the output and $X$ is the input. For regression models, $p(Y|X)$ is the predictive distribution, from which one obtains a prediction together with an error estimate.
Uses material from [Neal & Hinton 1998], [Do 2008], and [MLAPP]
CAUTION: This part is a general discussion of the EM algorithm and not a core requirement of this course.
Question: How can we perform MLE/MAP in latent variable models?
Suppose we observe both $X$ and $Z$. The complete data log-likelihood is $$ \ell_c(\theta) = \log p(X,Z | \theta) = \sum_{n=1}^N \log p(x_n, z_n | \theta) $$
[MLAPP]: This likelihood function is convex when $p(x_n,z_n | \theta)$ is an exponential family distribution.
In reality, we only observe $\X=(x_1,\dots,x_N)$. The observed data log-likelihood is given by $$ \begin{align} \ell(\theta | \X) &= \log p(\X | \theta) = \sum_{n=1}^N \log p(x_n | \theta) \\ &= \sum_{n=1}^N \left[ \log \sum_{z_n} p(x_n, z_n | \theta) \right] \end{align} $$
[MLAPP]: This likelihood function is non-convex even assuming exponential family distributions. Multiple modes make inference (NP) hard!
How can we optimize the observed likelihood $\ell(\theta|\X)$?
We need a different approach...
Observation: In most models, MLE/MAP is easy with fully observed data. We will alternate between inferring the hidden variables given the current parameters, and re-estimating the parameters as if the data were fully observed.
Note the similarity to K-Means!
From a theoretical standpoint, we will alternate between constructing a tight lower bound on the observed log-likelihood (E-step) and maximizing that bound with respect to the parameters (M-step).
(Figure 9.14 of **[PRML]**)
Our general approach will be to reason about the hidden variables through a proxy distribution $q(Z)$.
The $q$ distribution will disappear later, but this perspective is useful for extending EM to new settings.
For a convex function $f$, $$ \sum_{k=1}^K \lambda_k f(x_k) \geq f\left( \sum_{k=1}^K \lambda_k x_k \right) $$ whenever $\lambda_k \geq 0$ and $\sum_{k=1}^K \lambda_k = 1$. For a concave function such as $\log$, the inequality is reversed; this is the form we use below.
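A quick numerical sanity check of both directions (all numbers made up):

```python
import numpy as np

# Tiny sanity check of Jensen's inequality.
lam = np.array([0.2, 0.3, 0.5])          # weights: nonnegative, summing to 1
x = np.array([1.0, 4.0, 9.0])

# Convex f (here f(x) = x**2): weighted average of f >= f of weighted average.
assert (lam * x**2).sum() >= (lam * x).sum() ** 2
# Concave f (here log): the inequality reverses, which is what the ELBO derivation uses.
assert (lam * np.log(x)).sum() <= np.log((lam * x).sum())
```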
Through Jensen's inequality (in its concave form, applied to $\log$), we obtain the evidence lower bound (ELBO) on the log-likelihood: $$ \begin{align} \ell(\theta | \X) &= \log \sum_z p(\X, z | \theta) \\ &= \log \sum_z q(z) \frac{p(\X, z | \theta)}{q(z)} \\ &\geq \sum_z q(z) \log \frac{p(\X, z | \theta)}{q(z)} \equiv \L(q, \theta) \end{align} $$ Important: the choice of distribution $q(z)$ is arbitrary; the bound above holds for every $q$.
We can rewrite the lower bound $\L(q, \theta)$ as follows: $$ \begin{align} \ell(\theta | \X) \geq \L(q,\theta) &= \sum_z q(z) \log \frac{p(\X,z|\theta)}{q(z)} \\ &= \sum_z q(z) \log p(\X,z|\theta) - \sum_z q(z) \log q(z) \\ &= E_q[\log p(\X,Z|\theta)] - E_q[\log q(Z)] \end{align} $$
Though not discussed in this course, it is possible to define the Kullback–Leibler divergence, or relative entropy, between two probability distributions $p$ and $q$, $$ \begin{align} D_{KL}(q || p) &= E_q[-\log p(Z)] + E_q[\log q(Z)] \\ &= H[q,p] - H[q] \geq 0 \end{align} $$ which quantifies how much $q$ differs from $p$ (it is zero if and only if $p = q$).
It is easy to show that $\L(q,\theta)$ differs from such a relative entropy only by a term that is constant with respect to $q$: $$ \begin{align*} D_{KL}(q || p(Z|\X,\theta)) &= H(q, p(Z|\X,\theta)) - H(q) \\ &= E_q[ -\log p(Z|\X,\theta) ] - H(q) \\ &= E_q[ -\log p(Z,\X | \theta) ] - E_q[ -\log p(\X|\theta) ] - H(q) \\ &= E_q[ -\log p(Z,\X | \theta) ] + \log p(\X|\theta) - H(q) \\ &= -\mathcal{L}(q,\theta) + \log p(\X|\theta) \end{align*} $$
In other words, we are essentially trying to iteratively update the distribution $q$ to match the true (unknown) hidden posterior $p(Z|\X,\theta)$.
This yields a second proof of the ELBO, since $D_{KL} \geq 0$, $$ \boxed{ \log p(\X | \theta) = D_{KL}(q || p(Z | \X, \theta)) + \mathcal{L}(q,\theta) \geq \mathcal{L}(q,\theta) } $$
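A quick numerical check of the boxed identity on a toy discrete model (all numbers made up; `p_xz` plays the role of $p(\X, z \mid \theta)$ for a single observed $\X$):

```python
import numpy as np

# Toy joint p(x, z | theta) over one observed x and a latent z in {0, 1, 2}.
p_xz = np.array([0.10, 0.25, 0.15])          # p(x, z | theta) for each z
p_x = p_xz.sum()                              # p(x | theta) = sum_z p(x, z | theta)
post = p_xz / p_x                             # hidden posterior p(z | x, theta)

q = np.array([0.5, 0.3, 0.2])                 # an arbitrary proxy distribution q(z)
elbo = (q * np.log(p_xz / q)).sum()           # L(q, theta)
kl = (q * np.log(q / post)).sum()             # D_KL(q || p(z | x, theta))

assert np.isclose(np.log(p_x), kl + elbo)     # log p(x | theta) = KL + ELBO
assert elbo <= np.log(p_x)                    # hence the ELBO lower-bounds log p(x | theta)
```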
The quality of our bound $\L(q,\theta)$ depends heavily on the choice of proxy distribution $q(Z)$.
Remark: Maximizing $\L(q, \theta)$ with respect to $q$ is equivalent to minimizing the relative entropy between $q$ and the hidden posterior $P(Z|\X,\theta)$. Hence, the optimal $q$ is exactly the hidden posterior, for which $\log p(\X|\theta) = \L(q,\theta)$.
The bound is tight, i.e. we can make it touch the log-likelihood. Refer back to the PRML plot.
Finding a global maximum of the likelihood is difficult in the presence of latent variables. $$ \hat\theta_{ML} = \arg\max_\theta \ell(\theta | \X) = \arg\max_\theta \log p(\X|\theta) $$
Instead, the Expectation Maximization algorithm gives an iterative procedure for finding a local maximum of the likelihood.
Consider only proxy distributions of the form $$ q_\vartheta(Z) = p(Z | \X, \vartheta) $$
Suppose we have an estimate $\theta_t$ of the parameters. To improve it, we first tighten the lower bound on the observed log-likelihood by choosing the best proxy parameter, $$ \vartheta_{t+1} = \arg\max_\vartheta \L(\vartheta, \theta_t) = \theta_t $$
Since $q_\vartheta(Z) = p(Z|\X,\vartheta)$, we know from the remark above that $\vartheta=\theta_t$ maximizes $\L(\vartheta, \theta_t)$.
Next, we estimate new parameters by optimizing over the lower bound. $$ \begin{align} \theta_{t+1} &= \arg\max_\theta \L(\vartheta_{t+1}, \theta) \\ &= \arg\max_\theta E_q[\log p(\X,Z|\theta)] - E_q[\log q_\vartheta(z)] \\ &= \arg\max_\theta E_q[ \log p(\X,Z | \theta) ] \end{align} $$
The last line optimizes over the expected complete data log-likelihood.
Expectation maximization performs coordinate ascent on $\L(\vartheta,\theta)$.
Theorem: After a single iteration of Expectation Maximization, the observed data log-likelihood of the parameters has not decreased, that is, $$ \ell(\theta_t | \X) \leq \ell(\theta_{t+1} | \X) $$
Proof: This result is a simple consequence of all our hard work so far! $$ \begin{align} \ell(\theta_t | \X) &= \mathcal{L}(q_{\vartheta_{t+1}}, \theta_t) & \text{(Tightness)} \\ &\leq \mathcal{L}(q_{\vartheta_{t+1}}, \theta_{t+1}) & \text{(M-Step)} \\ &\leq \ell(\theta_{t+1} | \X) & \text{(ELBO)} \end{align} $$
Theorem: (Neal & Hinton 1998, Thm. 2) Every local maximum of $\L(q, \theta)$ is a local maximum of $\ell(\theta | \X)$.