$$ \newcommand\ddf[2]{\dfrac{\mathrm{d} #1}{\mathrm{d} #2}} \newcommand\ppf[2]{\dfrac{\partial #1}{\partial #2}} \newcommand{\bE}{\mathbb{E}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cU}{\mathcal{U}} \newcommand{\vA}{\mathbf{A}} \newcommand{\vI}{\mathbf{I}} \newcommand{\vK}{\mathbf{K}} \newcommand{\vM}{\mathbf{M}} \newcommand{\vQ}{\mathbf{Q}} \newcommand{\vR}{\mathbf{R}} \newcommand{\vU}{\mathbf{U}} \newcommand{\vb}{\mathbf{b}} \newcommand{\vc}{\mathbf{c}} \newcommand{\vd}{\mathbf{d}} \newcommand{\vf}{\mathbf{f}} \newcommand{\vg}{\mathbf{g}} \newcommand{\vm}{\mathbf{m}} \newcommand{\vu}{\mathbf{u}} \newcommand{\vw}{\mathbf{w}} \newcommand{\vtt}{\boldsymbol{\theta}} \newcommand{\vtvt}{\boldsymbol{\vartheta}} \newcommand{\vtl}{\boldsymbol{\lambda}} \newcommand{\vtf}{\boldsymbol{\phi}} \newcommand{\vty}{\boldsymbol{\psi}} $$
In modern ML applications, the NN approximates a distribution, rather than a deterministic function. This happens in, e.g.,
- Bayesian Inference and Uncertainty Quantification
- Latent Variable Models and Generative Models
- Policy Optimization in Reinforcement Learning
- etc.
This set of slides is still under construction.
Route 2.5: Gibbs Distribution
Now return to the general problem $$ \min_\vtt \cJ(\cF_{ND}(\vtt),\vtt) \equiv \cJ(\vtt) $$ and let's try treating $\cJ$ directly.
- Consider a feasible set $\cU$ for $\vtt$.
- $\cJ(\vtt)$ assigns a cost value to each $\vtt\in\cU$, i.e., a landscape of values over $\cU$.
We can convert $\cJ$ to an (unnormalized) probability density by defining $$ \tilde{p}(\vtt) = \exp(-\cJ(\vtt)/\tau) $$
- Smaller $\cJ$ means larger $\tilde{p}$
- $\tau$ is "temperature" parameter, which control the spread of distribution
- For smaller $\tau$, only smaller $\cJ$'s have non-negligible $\tilde{p}$
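A short numerical sketch of the temperature effect; the 1D cost $\cJ(\theta)=\theta^2$, the grid of $\theta$ values, and the two $\tau$ values are illustrative assumptions, not part of the slides:

```python
import numpy as np

# Hypothetical 1D cost, minimized at theta = 0 (illustration only)
J = lambda theta: theta**2

theta = np.linspace(-2.0, 2.0, 5)           # a few sample points
for tau in (1.0, 0.1):
    p_tilde = np.exp(-J(theta) / tau)       # unnormalized Gibbs weights
    p = p_tilde / p_tilde.sum()             # normalize over the grid
    print(f"tau={tau}: p ~ {np.round(p, 3)}")
# For tau=0.1 essentially all mass sits on the grid point nearest theta = 0.
```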
Define a probability distribution $$ p(\vtt) = \frac{1}{\int_\cU \tilde{p}(\vtt) d\vtt}\tilde{p}(\vtt) \equiv \frac{1}{Z}\tilde{p}(\vtt) $$
- The integral normalizes $\tilde{p}$ to produce a probability distribution
- Minimizer of $\cJ$ is the maximizer of $p$
Then we can sample from $p(\vtt)$ to find the maximizer
- But how do we generate the samples from $p(\vtt)$ that are needed, e.g., by the policy gradient?
- Technically we can use sampling methods, such as Markov chain Monte Carlo (MCMC); a minimal sketch follows below.
- Caveat: MCMC is usually slow and likely more suitable for lower-dimensional problems.
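A minimal sketch of random-walk Metropolis sampling from $p(\vtt)\propto\exp(-\cJ(\vtt)/\tau)$; the quadratic cost, step size, temperature, and the helper name `metropolis_gibbs` are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def metropolis_gibbs(J, theta0, tau=0.1, step=0.5, n_samples=5000):
    """Random-walk Metropolis targeting p(theta) proportional to exp(-J(theta)/tau)."""
    theta = np.asarray(theta0, dtype=float)
    samples = []
    for _ in range(n_samples):
        proposal = theta + step * rng.standard_normal(theta.shape)
        # Accept with probability min(1, p(proposal) / p(theta))
        log_accept = (J(theta) - J(proposal)) / tau
        if np.log(rng.uniform()) < log_accept:
            theta = proposal
        samples.append(theta.copy())
    return np.array(samples)

# Hypothetical 2D cost with minimizer at (1, -2)
J = lambda th: np.sum((th - np.array([1.0, -2.0]))**2)
samples = metropolis_gibbs(J, theta0=np.zeros(2))
print(samples[-1000:].mean(axis=0))  # concentrates near the minimizer for small tau
```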
Side Note
Along a similar vein, but via a different route, one arrives at the Simulated Annealing algorithm for optimization.
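A rough sketch of that route (reusing the hypothetical quadratic cost above, with an assumed geometric cooling schedule): the same Metropolis-style updates are run while the temperature is gradually lowered, so the chain settles into a minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulated_annealing(J, theta0, tau0=1.0, tau_min=1e-3, decay=0.995, step=0.5):
    """Metropolis-style updates with a geometrically decreasing temperature."""
    theta, tau = np.asarray(theta0, dtype=float), tau0
    best, best_val = theta.copy(), J(theta)
    while tau > tau_min:
        proposal = theta + step * rng.standard_normal(theta.shape)
        if np.log(rng.uniform()) < (J(theta) - J(proposal)) / tau:
            theta = proposal
            if J(theta) < best_val:
                best, best_val = theta.copy(), J(theta)
        tau *= decay  # cooling schedule
    return best

# Hypothetical quadratic cost with minimizer at (1, -2)
J = lambda th: np.sum((th - np.array([1.0, -2.0]))**2)
print(simulated_annealing(J, theta0=np.zeros(2)))  # approximately [1, -2]
```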
Route 3: Variational Approach
Building on the previous general case, an alternative:
- If $p$ is hard to sample, how about replacing it with a simpler distribution $q$?
- e.g., a multivariate Gaussian $q(\vtt;\vtvt)=\cN(\mu,\Sigma)$, with $\vtvt=[\mu,\Sigma]$
- Then we optimize $\vtvt$ instead of $\vtt$.
Problem: how to choose $\vtvt$ so that $p\approx q$?
Otherwise one gets the wrong optima!
Use Kullback-Leibler (KL) divergence to quantify the difference $$ KL(q||p) = \int_\vtt q(\vtt;\vtvt)\log \frac{q(\vtt;\vtvt)}{p(\vtt)} d\vtt = \underbrace{\bE_q[\log q(\vtt;\vtvt)]}_{\equiv -H[q]} - \bE_q[\log p(\vtt)] $$
- If $p=q$, the ratio inside the $\log$ is $1$, so $\log 1 = 0$ and $KL=0$
- The more $p$ and $q$ differ, the larger $KL$ becomes (a quick numerical check follows after this list)
- $H[q]$ is called the entropy - to be discussed soon
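A quick numerical check with a hypothetical pair of 1D Gaussians (not from the slides), comparing a Monte Carlo estimate of $KL(q||p)$ against the known Gaussian closed form:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two 1D Gaussians: q = N(0, 1), p = N(1, 2^2) (illustrative choice)
mu_q, s_q = 0.0, 1.0
mu_p, s_p = 1.0, 2.0

log_q = lambda x: -0.5 * ((x - mu_q) / s_q)**2 - np.log(s_q * np.sqrt(2 * np.pi))
log_p = lambda x: -0.5 * ((x - mu_p) / s_p)**2 - np.log(s_p * np.sqrt(2 * np.pi))

# Monte Carlo estimate: KL(q||p) = E_q[log q - log p]
x = rng.normal(mu_q, s_q, size=100_000)
kl_mc = np.mean(log_q(x) - log_p(x))

# Closed form for two Gaussians, for comparison
kl_exact = np.log(s_p / s_q) + (s_q**2 + (mu_q - mu_p)**2) / (2 * s_p**2) - 0.5
print(kl_mc, kl_exact)  # both are about 0.44; KL would be 0 only if p = q
```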
We know $$ p(\vtt) = \frac{1}{Z}\tilde{p}(\vtt) = \frac{1}{Z} \exp(-\cJ(\vtt)/\tau) $$ so in the KL divergence $$ \bE_q[\log p(\vtt)] = \bE_q[-\cJ(\vtt)/\tau - \log Z] = -\frac{1}{\tau}\bE_q[\cJ(\vtt)] - \log Z $$ where $\log Z$ is a constant (independent of $\vtvt$) that can be ignored.
Then, since $\log Z$ is constant and $\tau>0$, minimizing KL is equivalent to $$ \min_\vtvt\; \underbrace{\bE_q[\cJ(\vtt)]}_{\text{Cost}} - \tau H[q] $$
- Now the variable is $\vtvt$ and $KL$ is differentiable!
- Cost is for minimizing $\cJ$ (weighted by $q$)
- Entropy is for regularization
- e.g., for a 1D Gaussian, $H[q] = \tfrac{1}{2}\log(2\pi e\sigma^2) = \log\sigma + \text{const}$ - so larger $\sigma$ means larger $H$
- If $q$ is "narrow" ($\sigma$ small), the minimizer found is more "local"; but then $-H$ is large, so the entropy term penalizes overly narrow $q$
- Larger entropy means larger $\sigma$ and promotes exploration (a small worked example follows after this list)
- Lowering $\tau$ narrows down $q$
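A small worked example with a hypothetical 1D quadratic cost (not from the slides): take $\cJ(\theta)=\theta^2/2$ and $q=\cN(\mu,\sigma^2)$, so that $$ \bE_q[\cJ(\theta)] - \tau H[q] = \frac{\mu^2+\sigma^2}{2} - \tau\log\sigma + \text{const} $$ Setting the $\sigma$-derivative $\sigma - \tau/\sigma$ to zero gives $\sigma^* = \sqrt{\tau}$ (and $\mu^*=0$, the true minimizer): lowering $\tau$ indeed narrows $q$ around the minimizer.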
Reparameterization trick
A technical detail is in the gradient of the cost. $$ \bE_q[\cJ(\vtt)] = \int_\vtt q(\vtt;\vtvt)\cJ(\vtt) d\vtt \approx \frac{1}{N} \sum_{i=1}^N \cJ(\vtt_i),\quad \vtt_i\sim q(\vtt;\vtvt) $$
How do we take the gradient of a sample $\cJ(\vtt_i)$ with respect to $\vtvt$?
For the typical choice $q=\cN(\mu,\Sigma)$ with $\Sigma=LL^\top$, one can write $$ \cJ(\vtt_i) = \cJ(\mu+L\epsilon_i),\quad \epsilon_i\sim\cN(0,I) $$ With $\vtvt=[\mu,L]$, the randomness $\epsilon_i$ no longer depends on $\vtvt$, so the gradient $\nabla_\vtvt\cJ$ can be computed naturally by differentiating through $\mu+L\epsilon_i$.
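A minimal numpy sketch of this reparameterized gradient estimator; the quadratic cost `J`, its analytic gradient `grad_J`, and the helper name `reparam_gradients` are illustrative assumptions (in practice an autodiff framework would differentiate through $\mu+L\epsilon_i$):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical quadratic cost with a known gradient (illustration only)
target = np.array([1.0, -2.0])
J      = lambda th: 0.5 * np.sum((th - target)**2)
grad_J = lambda th: th - target

def reparam_gradients(mu, L, n_samples=1000):
    """Monte Carlo estimate of the gradients of E_q[J] w.r.t. mu and L,
    using theta = mu + L @ eps with eps ~ N(0, I) (reparameterization trick)."""
    d = mu.size
    g_mu, g_L = np.zeros(d), np.zeros((d, d))
    for _ in range(n_samples):
        eps = rng.standard_normal(d)
        theta = mu + L @ eps
        g = grad_J(theta)          # dJ/dtheta at the sample
        g_mu += g                  # chain rule: dtheta/dmu = I
        g_L  += np.outer(g, eps)   # chain rule: dtheta/dL_jk = eps_k
    return g_mu / n_samples, g_L / n_samples

mu, L = np.zeros(2), np.eye(2)
g_mu, g_L = reparam_gradients(mu, L)
print(g_mu)  # close to mu - target, i.e., approximately [-1, 2]
```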