Machine Learning for Engineering¶

Neural Networks: Variational Inference (Draft)¶

Instructor: Daning Huang¶

$$ \newcommand\ddf[2]{\dfrac{\mathrm{d} #1}{\mathrm{d} #2}} \newcommand\ppf[2]{\dfrac{\partial #1}{\partial #2}} \newcommand{\bE}{\mathbb{E}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cU}{\mathcal{U}} \newcommand{\vA}{\mathbf{A}} \newcommand{\vI}{\mathbf{I}} \newcommand{\vK}{\mathbf{K}} \newcommand{\vM}{\mathbf{M}} \newcommand{\vQ}{\mathbf{Q}} \newcommand{\vR}{\mathbf{R}} \newcommand{\vU}{\mathbf{U}} \newcommand{\vb}{\mathbf{b}} \newcommand{\vc}{\mathbf{c}} \newcommand{\vd}{\mathbf{d}} \newcommand{\vf}{\mathbf{f}} \newcommand{\vg}{\mathbf{g}} \newcommand{\vm}{\mathbf{m}} \newcommand{\vu}{\mathbf{u}} \newcommand{\vw}{\mathbf{w}} \newcommand{\vtt}{\boldsymbol{\theta}} \newcommand{\vtvt}{\boldsymbol{\vartheta}} \newcommand{\vtl}{\boldsymbol{\lambda}} \newcommand{\vtf}{\boldsymbol{\phi}} \newcommand{\vty}{\boldsymbol{\psi}} $$

TODAY: Neural Networks¶

Basic principles of variational inference

In modern ML applications, the NN approximates a distribution, rather than a deterministic function. This happens in, e.g.,

  • Bayesian Inference and Uncertainty Quantification
  • Latent Variable Models and Generative Models
  • Policy Optimization in Reinforcement Learning
  • etc.

This set of slides is still under construction.

Route 2.5: Gibbs Distribution¶

Now return to the general problem $$ \min_\vtt \cJ(\cF_{ND}(\vtt),\vtt) \equiv \cJ(\vtt) $$ and let's try treating $\cJ$ directly.

  • Consider a feasible set $\cU$ for $\vtt$.
  • Over $\cU$, the cost $\cJ(\vtt)$ takes a spread of values.

We can convert $\cJ$ to a probability distribution by defining $$ \tilde{p}(\vtt) = \exp(-\cJ(\vtt)/\tau) $$

  • Smaller $\cJ$ means larger $\tilde{p}$
  • $\tau$ is a "temperature" parameter, which controls the spread of the distribution
    • For smaller $\tau$, only smaller $\cJ$'s have non-negligible $\tilde{p}$

Define a probability distribution $$ p(\vtt) = \frac{1}{\int_\cU \tilde{p}(\vtt) d\vtt}\tilde{p}(\vtt) \equiv \frac{1}{Z}\tilde{p}(\vtt) $$

  • The integral normalizes $\tilde{p}$ to produce a probability distribution
  • Minimizer of $\cJ$ is the maximizer of $p$
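As a minimal numerical sketch (the 1D cost $\cJ(\theta)=(\theta-1)^2$ and the grid standing in for $\cU$ are hypothetical choices for illustration), the following converts $\cJ$ into a normalized Gibbs distribution and shows the effect of $\tau$:

```python
import numpy as np

# Hypothetical 1D cost and a grid standing in for the feasible set U
J = lambda theta: (theta - 1.0)**2
theta = np.linspace(-3.0, 5.0, 2001)

for tau in (1.0, 0.1):
    p_tilde = np.exp(-J(theta) / tau)                 # unnormalized Gibbs weights
    Z = np.sum(p_tilde) * (theta[1] - theta[0])       # normalizing constant (Riemann sum)
    p = p_tilde / Z                                   # proper probability density on the grid
    # The maximizer of p coincides with the minimizer of J (theta = 1);
    # a smaller tau concentrates p more tightly around it.
    print(tau, theta[np.argmax(p)], p.max())
```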

Then we can sample from $p(\vtt)$ to find the maximizer.

  • But how do we generate samples of $p(\vtt)$, which are needed, e.g., by policy gradient?
  • Technically we can use sampling methods such as Markov chain Monte Carlo (MCMC); a minimal sketch follows below.
  • Caveat: MCMC is usually slow and likely more suitable for lower-dimensional problems.
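A minimal random-walk Metropolis sketch for drawing samples from $p(\vtt)\propto\exp(-\cJ(\vtt)/\tau)$; the quadratic cost, step size, and temperature below are hypothetical choices for illustration:

```python
import numpy as np

def metropolis(J, theta0, tau=0.1, n_samples=5000, step=0.5, rng=None):
    """Random-walk Metropolis targeting p(theta) ~ exp(-J(theta)/tau)."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        # Accept with probability min(1, p(proposal)/p(theta))
        if np.log(rng.uniform()) < (J(theta) - J(proposal)) / tau:
            theta = proposal
        samples.append(theta.copy())
    return np.array(samples)

# Usage: samples concentrate near the minimizer [1, -2] of the hypothetical cost
J = lambda th: np.sum((th - np.array([1.0, -2.0]))**2)
samples = metropolis(J, theta0=np.zeros(2))
print(samples[1000:].mean(axis=0))
```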

Side Note¶

In a similar vein, but via a different route, one arrives at the Simulated Annealing algorithm for optimization.
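A rough sketch of that connection (the cooling schedule, step size, and initial temperature are hypothetical choices): simulated annealing runs the same accept/reject step while gradually lowering $\tau$, so the chain increasingly concentrates near the minimizer.

```python
import numpy as np

def simulated_annealing(J, theta0, tau0=1.0, cooling=0.995, n_iter=2000, step=0.5, rng=None):
    """Metropolis-style search with a decreasing temperature schedule."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta0, dtype=float)
    best, tau = theta.copy(), tau0
    for _ in range(n_iter):
        proposal = theta + step * rng.standard_normal(theta.shape)
        if np.log(rng.uniform()) < (J(theta) - J(proposal)) / tau:
            theta = proposal
        if J(theta) < J(best):
            best = theta.copy()
        tau *= cooling  # lower the temperature, narrowing the Gibbs distribution
    return best

# Usage with the same hypothetical quadratic cost as above
J = lambda th: np.sum((th - np.array([1.0, -2.0]))**2)
print(simulated_annealing(J, np.zeros(2)))
```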

Route 3: Variational Approach¶

Building on the previous general case, an alternative:

  • If $p$ is hard to sample, how about replacing it with a simpler distribution $q$?
  • e.g., a multivariate Gaussian $q(\vtt;\vtvt)=\cN(\mu,\Sigma)$, with $\vtvt=[\mu,\Sigma]$
  • Then we optimize $\vtvt$ instead of $\vtt$.

Problem: how to choose $\vtvt$ so that $p\approx q$?

Otherwise one gets the wrong optima!

Use Kullback-Leibler (KL) divergence to quantify the difference $$ KL(q||p) = \int_\vtt q(\vtt;\vtvt)\log \frac{q(\vtt;\vtvt)}{p(\vtt)} d\vtt = \underbrace{\bE_q[\log q(\vtt;\vtvt)]}_{\equiv -H[q]} - \bE_q[\log p(\vtt)] $$

  • If $p=q$, the ratio inside the $\log$ is $1$, so the $\log$ is $0$ and $KL=0$
  • If $p$ and $q$ are more different, $KL$ is larger
  • $H[q]$ is called the entropy - to be discussed soon
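To make the definition concrete, here is a small Monte Carlo check of $KL(q||p)$ for two 1D Gaussians (hypothetical parameters), compared against the known closed-form value:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical 1D Gaussians: q = N(0, 1), p = N(1, 2^2)
mu_q, s_q = 0.0, 1.0
mu_p, s_p = 1.0, 2.0

def log_gauss(x, mu, s):
    """Log-density of a 1D Gaussian N(mu, s^2)."""
    return -0.5 * ((x - mu) / s)**2 - np.log(s * np.sqrt(2 * np.pi))

# Monte Carlo estimate: KL(q||p) = E_q[log q - log p]
x = rng.normal(mu_q, s_q, size=200_000)
kl_mc = np.mean(log_gauss(x, mu_q, s_q) - log_gauss(x, mu_p, s_p))

# Closed form for two 1D Gaussians
kl_exact = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5

print(kl_mc, kl_exact)  # the two estimates agree closely; KL = 0 only when p = q
```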

We know $$ p(\vtt) = \frac{1}{Z}\tilde{p}(\vtt) = \frac{1}{Z} \exp(-\cJ(\vtt)/\tau) $$ so in KL divergence $$ \bE_q[\log p(\vtt)] = \bE_q[-\cJ(\vtt)/\tau - \log Z] \propto \frac{1}{\tau}\bE_q[-\cJ(\vtt)] $$ where $Z$ is a constant that can be ignored.

Then, up to the constant $\log Z$ and the positive scaling $1/\tau$, minimizing KL is equivalent to the following $$ \min_\vtvt KL(q||p) \quad\Leftrightarrow\quad \min_\vtvt \underbrace{\bE_q[\cJ(\vtt)]}_{\text{Cost}} - \tau H[q] $$

  • Now the variable is $\vtvt$ and $KL$ is differentiable!
  • Cost is for minimizing $\cJ$ (weighted by $q$)
  • Entropy is for regularization
    • e.g., for a 1D Gaussian, $H = \log\sigma + \text{const}$ - so larger $\sigma$ means larger $H$
    • If $q$ is "narrow" ($\sigma$ small), the minimizer found is more "local"
    • Larger entropy means larger $\sigma$ and promotes exploration
    • Lowering $\tau$ de-emphasizes the entropy term and narrows down $q$ (see the numerical check after this list)
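As a quick numerical check of the last point, consider a hypothetical 1D quadratic cost $\cJ(\theta)=\tfrac{1}{2}a(\theta-\theta^*)^2$ with $q=\cN(\mu,\sigma^2)$; then $\bE_q[\cJ]=\tfrac{1}{2}a[(\mu-\theta^*)^2+\sigma^2]$ and $H=\log\sigma+\text{const}$, so the optimal spread is $\sigma=\sqrt{\tau/a}$, which shrinks as $\tau$ is lowered:

```python
import numpy as np

# Hypothetical 1D quadratic cost J(theta) = 0.5 * a * (theta - theta_star)**2
a, theta_star = 2.0, 1.0

def objective(mu, sigma, tau):
    """E_q[J] - tau*H[q] for q = N(mu, sigma^2), dropping the constant in H."""
    cost = 0.5 * a * ((mu - theta_star)**2 + sigma**2)  # closed-form E_q[J]
    neg_entropy = -np.log(sigma)                        # -H[q] up to a constant
    return cost + tau * neg_entropy

sigmas = np.linspace(1e-3, 2.0, 4000)
for tau in (1.0, 0.1, 0.01):
    best_sigma = sigmas[np.argmin(objective(theta_star, sigmas, tau))]
    # Grid minimizer vs. the analytic value sqrt(tau/a): lowering tau narrows q
    print(tau, best_sigma, np.sqrt(tau / a))
```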

Reparameterization trick¶

A technical detail is in the gradient of cost. $$ \bE_q[\cJ(\vtt)] = \int_\vtt q(\vtt;\vtvt)\cJ(\vtt) d\vtt \approx \frac{1}{N} \sum_{i=1}^N \cJ(\vtt_i),\quad \vtt_i\sim q(\vtt_i;\vtvt) $$

How do we take the gradient of a sampled term $\cJ(\vtt_i)$ with respect to $\vtvt$, given that $\vtt_i$ is obtained by random sampling?

With the typical choice $q=\cN(\mu,\Sigma)$ and the factorization $\Sigma=LL^\top$, one can write $$ \cJ(\vtt_i) = \cJ(\mu+L\epsilon_i),\quad \epsilon_i\sim\cN(0,I) $$ With $\vtvt=[\mu,L]$, the gradient $\nabla_\vtvt\cJ$ can then be computed naturally via the chain rule, since the randomness now enters only through $\epsilon_i$, which does not depend on $\vtvt$.
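A minimal sketch of the resulting reparameterized gradient, for a hypothetical quadratic cost: in practice the per-sample gradient $\nabla_\vtt\cJ$ would come from automatic differentiation, but here it is written analytically and the chain rule $\nabla_\mu\cJ=\nabla_\vtt\cJ$, $\nabla_L\cJ=\nabla_\vtt\cJ\,\epsilon^\top$ is applied by hand, together with the entropy term $-\tau H[q]$ whose gradient in $L$ is $-\tau L^{-\top}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, lr, tau = 2, 256, 0.1, 0.1

# Hypothetical quadratic cost J(theta) = 0.5 * (theta - t_star)^T A (theta - t_star)
A = np.array([[2.0, 0.3], [0.3, 1.0]])
t_star = np.array([1.0, -2.0])
grad_J = lambda th: A @ (th - t_star)  # gradient of J w.r.t. theta (autodiff in general)

# Variational parameters vartheta = [mu, L], with Sigma = L L^T
mu = np.zeros(d)
L = np.eye(d)

for it in range(200):
    eps = rng.standard_normal((N, d))            # eps_i ~ N(0, I)
    thetas = mu + eps @ L.T                      # theta_i = mu + L eps_i (reparameterization)
    g = np.array([grad_J(th) for th in thetas])  # dJ/dtheta at each sample

    # Chain rule through the reparameterization, averaged over samples
    grad_mu = g.mean(axis=0)                     # E[dJ/dmu] = E[dJ/dtheta]
    grad_L = np.einsum('ni,nj->ij', g, eps) / N  # E[dJ/dL]  = E[dJ/dtheta * eps^T]
    grad_L -= tau * np.linalg.inv(L).T           # gradient of -tau*H[q], H = log|det L| + const

    mu -= lr * grad_mu
    L -= lr * grad_L

print(mu)        # approaches t_star = [1, -2]
print(L @ L.T)   # approaches tau * inv(A), the Gibbs covariance for this quadratic J
```

Autodiff frameworks support exactly this pattern: sample $\epsilon_i$, form $\vtt_i=\mu+L\epsilon_i$, evaluate the Monte Carlo objective, and backpropagate to $[\mu,L]$.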