Machine Learning for Engineering¶

Neural Networks: Variational Inference (Draft)¶

Instructor: Daning Huang¶

$$ \newcommand\ddf[2]{\dfrac{\mathrm{d} #1}{\mathrm{d} #2}} \newcommand\ppf[2]{\dfrac{\partial #1}{\partial #2}} \newcommand{\bE}{\mathbb{E}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cU}{\mathcal{U}} \newcommand{\vA}{\mathbf{A}} \newcommand{\vI}{\mathbf{I}} \newcommand{\vK}{\mathbf{K}} \newcommand{\vM}{\mathbf{M}} \newcommand{\vQ}{\mathbf{Q}} \newcommand{\vR}{\mathbf{R}} \newcommand{\vU}{\mathbf{U}} \newcommand{\vb}{\mathbf{b}} \newcommand{\vc}{\mathbf{c}} \newcommand{\vd}{\mathbf{d}} \newcommand{\vf}{\mathbf{f}} \newcommand{\vg}{\mathbf{g}} \newcommand{\vm}{\mathbf{m}} \newcommand{\vu}{\mathbf{u}} \newcommand{\vw}{\mathbf{w}} \newcommand{\vtt}{\boldsymbol{\theta}} \newcommand{\vtvt}{\boldsymbol{\vartheta}} \newcommand{\vtl}{\boldsymbol{\lambda}} \newcommand{\vtf}{\boldsymbol{\phi}} \newcommand{\vty}{\boldsymbol{\psi}} $$

TODAY: Neural Networks¶

Basic principles of variational inference

In modern ML applications, the NN approximates a distribution, rather than a deterministic function. This happens in, e.g.,

  • Bayesian Inference and Uncertainty Quantification
  • Latent Variable Models and Generative Models
  • Policy Optimization in Reinforcement Learning
  • etc.

This set of slides is still under construction.

Route 2.5: Gibbs Distribution¶

Now return to the general problem $$ \min_\vtt \cJ(\cF_{ND}(\vtt),\vtt) \equiv \cJ(\vtt) $$ and let's try treating $\cJ$ directly.

  • Consider a feasible set $\cU$ for $\vtt$.
  • Over $\cU$, the cost $\cJ(\vtt)$ takes a spread of values.

We can convert $\cJ$ to a probability distribution by defining $$ \tilde{p}(\vtt) = \exp(-\cJ(\vtt)/\tau) $$

  • Smaller $\cJ$ means larger $\tilde{p}$
  • $\tau$ is a "temperature" parameter, which controls the spread of the distribution
    • For smaller $\tau$, only smaller $\cJ$'s have non-negligible $\tilde{p}$

Define a probability distribution $$ p(\vtt) = \frac{1}{\int_\cU \tilde{p}(\vtt) d\vtt}\tilde{p}(\vtt) \equiv \frac{1}{Z}\tilde{p}(\vtt) $$

  • The integral normalizes $\tilde{p}$ to produce a probability distribution
  • Minimizer of $\cJ$ is the maximizer of $p$
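As a minimal numerical sketch (the 1D cost $\cJ(\theta)=(\theta-1)^2$ and the grid standing in for $\cU$ are hypothetical choices for illustration), the following converts $\cJ$ into a normalized Gibbs distribution and shows the effect of $\tau$:

```python
import numpy as np

# Hypothetical 1D cost and a grid standing in for the feasible set U
J = lambda theta: (theta - 1.0)**2
theta = np.linspace(-3.0, 5.0, 2001)

for tau in (1.0, 0.1):
    p_tilde = np.exp(-J(theta) / tau)                 # unnormalized Gibbs weights
    Z = np.sum(p_tilde) * (theta[1] - theta[0])       # normalizing constant (Riemann sum)
    p = p_tilde / Z                                   # proper probability density on the grid
    # The maximizer of p coincides with the minimizer of J (theta = 1);
    # a smaller tau concentrates p more tightly around it.
    print(tau, theta[np.argmax(p)], p.max())
```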

Then we can sample from $p(\vtt)$ to find the maximizer.

  • But how do we generate samples of $p(\vtt)$, which are needed, e.g., by policy gradient?
  • Technically we can use sampling methods such as Markov chain Monte Carlo (MCMC); a minimal sketch follows below.
  • Caveat: MCMC is usually slow and likely more suitable for lower-dimensional problems.
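A minimal random-walk Metropolis sketch for drawing samples from $p(\vtt)\propto\exp(-\cJ(\vtt)/\tau)$; the quadratic cost, step size, and temperature below are hypothetical choices for illustration:

```python
import numpy as np

def metropolis(J, theta0, tau=0.1, n_samples=5000, step=0.5, rng=None):
    """Random-walk Metropolis targeting p(theta) ~ exp(-J(theta)/tau)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        # Accept with probability min(1, p(proposal)/p(theta))
        if np.log(rng.uniform()) < (J(theta) - J(proposal)) / tau:
            theta = proposal
        samples.append(theta.copy())
    return np.array(samples)

# Usage: samples concentrate near the minimizer [1, -2] of the hypothetical cost
J = lambda th: np.sum((th - np.array([1.0, -2.0]))**2)
samples = metropolis(J, theta0=np.zeros(2))
print(samples[1000:].mean(axis=0))
```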

Side Note¶

In a similar vein, but via a different route, one arrives at the Simulated Annealing algorithm for optimization.
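A rough sketch of that connection (the cooling schedule, step size, and initial temperature are hypothetical choices): simulated annealing runs the same accept/reject step while gradually lowering $\tau$, so the chain increasingly concentrates near the minimizer.

```python
import numpy as np

def simulated_annealing(J, theta0, tau0=1.0, cooling=0.995, n_iter=2000, step=0.5, rng=None):
    """Metropolis-style search with a decreasing temperature schedule."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    best, tau = theta.copy(), tau0
    for _ in range(n_iter):
        proposal = theta + step * rng.standard_normal(theta.shape)
        if np.log(rng.uniform()) < (J(theta) - J(proposal)) / tau:
            theta = proposal
        if J(theta) < J(best):
            best = theta.copy()
        tau *= cooling  # lower the temperature, narrowing the Gibbs distribution
    return best

# Usage with the same hypothetical quadratic cost as above
J = lambda th: np.sum((th - np.array([1.0, -2.0]))**2)
print(simulated_annealing(J, np.zeros(2)))
```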

Route 3: Variational Approach¶

Building on the previous general case, an alternative:

  • If $p$ is hard to sample, how about replacing it with a simpler distribution $q$?
  • e.g., a multivariate Gaussian $q(\vtt;\vtvt)=\cN(\mu,\Sigma)$, with $\vtvt=[\mu,\Sigma]$
  • Then we optimize $\vtvt$ instead of $\vtt$.

Problem: how to choose $\vtvt$ so that $p\approx q$?

Otherwise one gets the wrong optima!

Use Kullback-Leibler (KL) divergence to quantify the difference $$ KL(q||p) = \int_\vtt q(\vtt;\vtvt)\log \frac{q(\vtt;\vtvt)}{p(\vtt)} d\vtt = \underbrace{\bE_q[\log q(\vtt;\vtvt)]}_{\equiv -H[q]} - \bE_q[\log p(\vtt)] $$

  • If $p=q$, the ratio inside the $\log$ is $1$, so the $\log$ is $0$ and $KL=0$
  • If $p$ and $q$ are more different, $KL$ is larger
  • $H[q]$ is called the entropy - to be discussed soon
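To make the definition concrete, here is a small Monte Carlo check of $KL(q||p)$ for two 1D Gaussians (hypothetical parameters), compared against the known closed-form value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical 1D Gaussians: q = N(0, 1), p = N(1, 2^2)
mu_q, s_q = 0.0, 1.0
mu_p, s_p = 1.0, 2.0

def log_gauss(x, mu, s):
    """Log-density of a 1D Gaussian N(mu, s^2)."""
    return -0.5 * ((x - mu) / s)**2 - np.log(s * np.sqrt(2 * np.pi))

# Monte Carlo estimate: KL(q||p) = E_q[log q - log p]
x = rng.normal(mu_q, s_q, size=200_000)
kl_mc = np.mean(log_gauss(x, mu_q, s_q) - log_gauss(x, mu_p, s_p))

# Closed form for two 1D Gaussians
kl_exact = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

print(kl_mc, kl_exact)  # the two estimates agree closely; KL = 0 only when p = q
```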

We know $$ p(\vtt) = \frac{1}{Z}\tilde{p}(\vtt) = \frac{1}{Z} \exp(-\cJ(\vtt)/\tau) $$ so in KL divergence $$ \bE_q[\log p(\vtt)] = \bE_q[-\cJ(\vtt)/\tau - \log Z] \propto \frac{1}{\tau}\bE_q[-\cJ(\vtt)] $$ where $Z$ is a constant that can be ignored.

Then, up to the constant $\log Z$ and the positive scaling $1/\tau$, minimizing KL is equivalent to the following $$ \min_\vtvt KL(q||p) \quad\Leftrightarrow\quad \min_\vtvt \underbrace{\bE_q[\cJ(\vtt)]}_{\text{Cost}} - \tau H[q] $$

  • Now the variable is $\vtvt$ and $KL$ is differentiable!
  • Cost is for minimizing $\cJ$ (weighted by $q$)
  • Entropy is for regularization
    • e.g., for a 1D Gaussian, $H = \log\sigma + \text{const}$ - so larger $\sigma$ means larger $H$
    • If $q$ is "narrow" ($\sigma$ small), the minimizer found is more "local"
    • Larger entropy means larger $\sigma$ and promotes exploration
    • Lowering $\tau$ de-emphasizes the entropy term and narrows down $q$ (see the numerical check after this list)
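As a quick numerical check of the last point, consider a hypothetical 1D quadratic cost $\cJ(\theta)=\tfrac{1}{2}a(\theta-\theta^*)^2$ with $q=\cN(\mu,\sigma^2)$; then $\bE_q[\cJ]=\tfrac{1}{2}a[(\mu-\theta^*)^2+\sigma^2]$ and $H=\log\sigma+\text{const}$, so the optimal spread is $\sigma=\sqrt{\tau/a}$, which shrinks as $\tau$ is lowered:

```python
import numpy as np

# Hypothetical 1D quadratic cost J(theta) = 0.5 * a * (theta - theta_star)**2
a, theta_star = 2.0, 1.0

def objective(mu, sigma, tau):
    """E_q[J] - tau*H[q] for q = N(mu, sigma^2), dropping the constant in H."""
    cost = 0.5 * a * ((mu - theta_star)**2 + sigma**2)  # closed-form E_q[J]
    neg_entropy = -np.log(sigma)                        # -H[q] up to a constant
    return cost + tau * neg_entropy

sigmas = np.linspace(1e-3, 2.0, 4000)
for tau in (1.0, 0.1, 0.01):
    best_sigma = sigmas[np.argmin(objective(theta_star, sigmas, tau))]
    # Grid minimizer vs. the analytic value sqrt(tau/a): lowering tau narrows q
    print(tau, best_sigma, np.sqrt(tau / a))
```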

Reparameterization trick¶

A technical detail is in the gradient of cost. $$ \bE_q[\cJ(\vtt)] = \int_\vtt q(\vtt;\vtvt)\cJ(\vtt) d\vtt \approx \frac{1}{N} \sum_{i=1}^N \cJ(\vtt_i),\quad \vtt_i\sim q(\vtt_i;\vtvt) $$

How do we take the gradient of a sampled term $\cJ(\vtt_i)$ with respect to $\vtvt$, given that $\vtt_i$ is obtained by random sampling?

With the typical choice $q=\cN(\mu,\Sigma)$ and the factorization $\Sigma=LL^\top$, one can write $$ \cJ(\vtt_i) = \cJ(\mu+L\epsilon_i),\quad \epsilon_i\sim\cN(0,I) $$ With $\vtvt=[\mu,L]$, the gradient $\nabla_\vtvt\cJ$ can then be computed naturally via the chain rule, since the randomness now enters only through $\epsilon_i$, which does not depend on $\vtvt$.
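A minimal sketch of the resulting reparameterized gradient, for a hypothetical quadratic cost: in practice the per-sample gradient $\nabla_\vtt\cJ$ would come from automatic differentiation, but here it is written analytically and the chain rule $\nabla_\mu\cJ=\nabla_\vtt\cJ$, $\nabla_L\cJ=\nabla_\vtt\cJ\,\epsilon^\top$ is applied by hand, together with the entropy term $-\tau H[q]$ whose gradient in $L$ is $-\tau L^{-\top}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, lr, tau = 2, 256, 0.1, 0.1

# Hypothetical quadratic cost J(theta) = 0.5 * (theta - t_star)^T A (theta - t_star)
A = np.array([[2.0, 0.3], [0.3, 1.0]])
t_star = np.array([1.0, -2.0])
grad_J = lambda th: A @ (th - t_star)  # gradient of J w.r.t. theta (autodiff in general)

# Variational parameters vartheta = [mu, L], with Sigma = L L^T
mu = np.zeros(d)
L = np.eye(d)

for it in range(200):
    eps = rng.standard_normal((N, d))            # eps_i ~ N(0, I)
    thetas = mu + eps @ L.T                      # theta_i = mu + L eps_i (reparameterization)
    g = np.array([grad_J(th) for th in thetas])  # dJ/dtheta at each sample

    # Chain rule through the reparameterization, averaged over samples
    grad_mu = g.mean(axis=0)                     # E[dJ/dmu] = E[dJ/dtheta]
    grad_L = np.einsum('ni,nj->ij', g, eps) / N  # E[dJ/dL]  = E[dJ/dtheta * eps^T]
    grad_L -= tau * np.linalg.inv(L).T           # gradient of -tau*H[q], H = log|det L| + const

    mu -= lr * grad_mu
    L -= lr * grad_L

print(mu)        # approaches t_star = [1, -2]
print(L @ L.T)   # approaches tau * inv(A), the Gibbs covariance for this quadratic J
```

Autodiff frameworks support exactly this pattern: sample $\epsilon_i$, form $\vtt_i=\mu+L\epsilon_i$, evaluate the Monte Carlo objective, and backpropagate to $[\mu,L]$.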