Machine Learning for Engineering¶

Neural Networks: More on Back Propagation¶

Instructor: Daning Huang¶

$$ \newcommand\ddf[2]{\dfrac{\mathrm{d} #1}{\mathrm{d} #2}} \newcommand\ppf[2]{\dfrac{\partial #1}{\partial #2}} \newcommand{\bE}{\mathbb{E}} \newcommand{\cE}{\mathcal{E}} \newcommand{\cF}{\mathcal{F}} \newcommand{\cJ}{\mathcal{J}} \newcommand{\cL}{\mathcal{L}} \newcommand{\cM}{\mathcal{M}} \newcommand{\cN}{\mathcal{N}} \newcommand{\cR}{\mathcal{R}} \newcommand{\cU}{\mathcal{U}} \newcommand{\vA}{\mathbf{A}} \newcommand{\vI}{\mathbf{I}} \newcommand{\vK}{\mathbf{K}} \newcommand{\vM}{\mathbf{M}} \newcommand{\vQ}{\mathbf{Q}} \newcommand{\vR}{\mathbf{R}} \newcommand{\vU}{\mathbf{U}} \newcommand{\vb}{\mathbf{b}} \newcommand{\vc}{\mathbf{c}} \newcommand{\vd}{\mathbf{d}} \newcommand{\vf}{\mathbf{f}} \newcommand{\vg}{\mathbf{g}} \newcommand{\vm}{\mathbf{m}} \newcommand{\vu}{\mathbf{u}} \newcommand{\vw}{\mathbf{w}} \newcommand{\vtt}{\boldsymbol{\theta}} \newcommand{\vtvt}{\boldsymbol{\vartheta}} \newcommand{\vtl}{\boldsymbol{\lambda}} \newcommand{\vtf}{\boldsymbol{\phi}} \newcommand{\vty}{\boldsymbol{\psi}} $$

TODAY: Neural Networks - V¶

Automatic Differentiation for

  • Implicit functions
  • Non-differentiable functions

References:¶

  • [Strang]
  • ADCME

Motivation¶

Previously ...

  • a Newton solve for $x$, $f(x,b)=x^8-x^2-b=0$, given $b$

    • Implicit, because we cannot write $x$ as an analytical function of $b$.
  • an argmax function, e.g., $x^\star=\arg\max_x f(x)$

    • Non-differentiable, because an infinitesimal change in $f$ can result in a large, finite change in $x^\star$.

How to backprop these functions?

But before that, when do we need to deal with these?

Example 1: All models are wrong, but some are useful.¶

  • In traditional computational analysis, the physics cannot be precisely captured
    • e.g., turbulence closure, constitutive relation modeling
    • Uncertainty in prediction/design $\rightarrow$ Large safety factor and performance penalty
  • ML can be embedded into traditional analysis for better predictive capability.

(Figure from Holland, Baeder, et al. 2019)

Example 2: Don't learn what you already know¶

  • Some ML tasks involve steps that can be done by traditional analysis
    • e.g., path planning and trajectory tracking in robotics
    • Replacing these steps with pure NNs takes extra effort and usually does not extrapolate well
  • Traditional analysis can be embedded into ML for better generalizability and robustness.

(Figure from [1].)

[1] Romero et al. Actor-Critic Model Predictive Control: Differentiable Optimization meets Reinforcement Learning. 2023.

Posing the Problem¶

From an NN perspective, a model problem is

$$ \boxed{\text{Given }\vtt}\rightarrow \boxed{\text{Solve }\min_{\vu} \cE(\vu;\vtt)\text{ or }\arg\min_{\vu} \cE(\vu;\vtt)}\rightarrow \boxed{\text{Compute }\cJ(\vu;\vtt)} $$

  • $\vtt$ is the learnable parameter and $\cJ$ is the loss
  • The middle step is an optimization of an implicit function $\cE$
  • Let's see some examples next

Example 1: Model Calibration¶

Consider a solid mechanics problem. The solution is found by minimizing the energy $$ \cE(\vu) = \frac{1}{2}\vu^\top\vK\vu - \vf^\top\vu $$ where $\vu$ is the displacement and $\vf$ is the load.

  • But now the stiffness matrix $\vK$ is not fully known, so we assume $\vK=\vK_0+\delta\vK_{NN}(\vtt)$ and denote $\cE(\vu;\vtt)$.
  • Given data set $\{(\vf_i,\vu_i)\}_{i=1}^N$ we want to learn $\delta\vK_{NN}$.

Formally we can write it as a nested optimization problem, $$ \begin{align} \min_{\vtt} &\quad \cJ(\hat{\vu};\vtt) = \sum_{i=1}^N \lVert\vu_i-\hat{\vu}_i\rVert^2 \\ \mathrm{s.t.} &\quad \hat{\vu}_i = \arg\min_{\vu}\ \cE(\vu;\vtt) = \frac{1}{2}\vu^\top(\vK_0+\delta\vK_{NN}(\vtt))\vu - \vf_i^\top\vu,\quad i=1,2,\cdots,N \end{align} $$ ... and the computation graph is what we have shown earlier.

The same idea applies to many other scenarios, e.g.,

  • Fluid dynamics:
    • $\min\cE(\vu)$ solves Navier-Stokes equation, $\vu$ - density, momentum, energy
    • $\cJ$ to minimize difference in simulation vs. experiment for, e.g., pressure data on an airfoil
    • $\min\cE(\vu;\vtt)$ with correction in turbulence production term in the fluid solver
  • System dynamics:
    • $\min\cE(\vu)$ solves flight dynamics equation, $\vu$ - linear & angular displacement/velocity
    • $\cJ$ to minimize difference in sim. vs. exp. for, e.g., dynamical response
    • $\min\cE(\vu;\vtt)$ with correction in unknown external forcing term

Example 2: Imitation Learning¶

Consider a model predictive controller (MPC). $$ \begin{align} \min_{\vu,\vf} &\quad \cE(\vu,\vf;\vtt) = \sum_{i=1}^N (\vu_i-\vu_{r,i})^\top\vQ(\vu_i-\vu_{r,i}) + \vf_i^\top\vR\vf_i \\ \mathrm{s.t.} &\quad \vu_{i+1}=\vg(\vu_i,\vf_i),\quad i=1,2,\cdots,N-1 \end{align} $$ with states $\vu$, inputs $\vf$, and dynamics $\vg$; $\vu_r$ is the trajectory to track.

  • In practice, the choice of $\vQ$ and $\vR$ may require extensive tuning.
  • We are given a data set of "expert demonstrations" (e.g., by a human), with optimal inputs $\{\vf_i\}_{i=1}^N$ and states $\{\vu_i\}_{i=1}^N$.
  • How to learn $\vtt=[\vQ,\vR]$ from the dataset?

This is again a nested optimization problem,

$$ \begin{align} \min_{\vtt} &\quad \cJ(\hat{\vu},\hat{\vf};\vtt) = \sum_{i=1}^N \lVert\vu_i-\hat{\vu}_i\rVert^2 + \lVert\vf_i-\hat{\vf}_i\rVert^2 \\ \mathrm{s.t.} &\quad \min_{\hat{\vu},\hat{\vf}} \ \cE(\hat{\vu},\hat{\vf};\vtt) = \sum_{i=1}^N (\hat{\vu}_i-\vu_{r,i})^\top\vQ(\hat{\vu}_i-\vu_{r,i}) + \hat{\vf}_i^\top\vR\hat{\vf}_i \\ &\quad \ \mathrm{s.t.}\quad \hat{\vu}_{i+1}=\vg(\hat{\vu}_i,\hat{\vf}_i),\quad i=1,2,\cdots,N-1 \end{align} $$

... except that now there are constraints in the inner problem.

General Form of Problem¶

The above examples suggest the following general form to tackle: $$ \begin{align} \min_{\vtt} &\quad \cJ(\vu;\vtt) \\ \mathrm{s.t.} &\quad \min_{\vu} \cE(\vu;\vtt) \\ &\quad \mathrm{s.t.} \quad \vc(\vu;\vtt) = 0 \end{align} $$

  • $\cJ$: The outer objective - typically the loss for learning
  • $\cE$ and $\vc$: The inner objective and constraints
    • The inner optimization problem
    • Implicitly define $\vu$ as a function of $\vtt$
    • Inequality constraints can be added too

For the purpose of ML, we still want $\ddf{\cJ}{\vtt}$.

Strategies for BP¶

Two cases for the inner problem:

  • Non-differentiable, or a black box
    • e.g., sampling-based, or an external solver not in a learning framework
  • The remaining (easier) case
    • This typically means the evaluations of $\cE$ and $\vc$ can be backpropagated

Easier Case First¶

Since all the terms are differentiable, consider the equivalent form via Lagrange multipliers, $$ L_i(\vu,\vtl;\vtt) = \cE(\vu;\vtt) + \vtl^\top \vc(\vu;\vtt) $$ The necessary condition for optimality is $$ \ppf{L_i}{\vu} = 0,\quad \ppf{L_i}{\vtl} = 0 $$ So solving the inner problem is equivalent to solving the above nonlinear algebraic equations, denoted as $$ \cR(\vu;\vtt) = 0 $$ (dropping $\vtl$ from the arguments for conciseness)

The inner problem is then solved by, e.g., Newton iterations

  • Starting from initial guess $$ \vu^{(0)},\quad \cR^{(0)} = \cR(\vu^{(0)};\vtt) $$
  • Iterate $$ \vu^{(k+1)} = \vu^{(k)} - (\nabla_{\vu}\cR^{(k)})^{-1} \cR^{(k)} $$
  • Until convergence, e.g., $$ \lVert \vu^{(n)}-\vu^{(n-1)} \rVert \leq \epsilon $$
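As a minimal sketch (assuming NumPy; `newton_solve` and its tolerances are illustrative, not the course's solver), the iteration applied to the motivating scalar example looks like:

import numpy as np

def newton_solve(R, dRdu, u0, tol=1e-10, maxit=50):
    # Solve R(u) = 0 by Newton iterations from the initial guess u0
    u = np.atleast_1d(np.asarray(u0, dtype=float))
    for _ in range(maxit):
        du = np.linalg.solve(np.atleast_2d(dRdu(u)), -R(u))
        u = u + du
        if np.linalg.norm(du) <= tol:   # convergence check
            break
    return u

# The motivating example: given b, solve f(x, b) = x^8 - x^2 - b = 0 for x
b = 2.0
x = newton_solve(lambda u: u**8 - u**2 - b,
                 lambda u: np.diag(8*u**7 - 2*u),   # 1x1 Jacobian here
                 1.5)

Each loop iteration is one explicit update; this is exactly the chain of operations that unrolling (next) would backprop through.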

Brute force: Unrolling¶

While the inner problem is an implicit function, the iterations are explicit. $$ \boxed{\vu^{(0)}; \vtt} \rightarrow \boxed{\cR^{(0)}} \rightarrow \boxed{\vu^{(1)} = \vu^{(0)} - (\nabla_{\vu}\cR^{(0)})^{-1} \cR^{(0)}} \rightarrow \boxed{\vu^{(2)} = \vu^{(1)} - ...} \rightarrow \boxed{...} \rightarrow \boxed{\vu=\vu^{(n)}} $$ We can unroll and BP the entire iteration process - each iteration is effectively one layer.

  • Pros: Easy to implement - as long as the iterations are implemented in a ML package, the package can do BP for you
  • Cons: Slow - need to BP through all iterations (similar to BP through an LSTM)

Unrolling - a fix¶

The initial guess is free to choose - so what if it is exactly the solution? How about the following:

First a regular solve, but do not retain gradients $$ \boxed{\vu^{(0)}; \vtt} \rightarrow \boxed{\cR^{(0)}} \rightarrow \boxed{\vu^{(1)} = \vu^{(0)} - (\nabla_{\vu}\cR^{(0)})^{-1} \cR^{(0)}} \rightarrow \boxed{\vu^{(2)} = \vu^{(1)} - ...} \rightarrow \boxed{...} \rightarrow \boxed{\vu=\vu^{(n)}} $$ Then use $\vu^{(n)}$ as the initial guess and "pretend" to solve the problem in one iteration; retain gradients here $$ \boxed{\vu^{(n)}; \vtt} \rightarrow \boxed{\cR^{(n)}} \rightarrow \boxed{\vu^{(n+1)} = \vu^{(n)} - (\nabla_{\vu}\cR^{(n)})^{-1} \cR^{(n)}} \rightarrow \boxed{\vu=\vu^{(n+1)}} $$ BP the second round only (see the sketch after the pros/cons below)

  • Pros: Barely any more effort to implement, and much faster
  • Cons: Could still be slow, as the BP does not leverage problem structure
    • e.g., BP through a linear solve with $\nabla_{\vu}\cR^{(n)}$, which can be expensive
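A hedged sketch of the fix, assuming PyTorch and a user-supplied `newton_solver` helper (both names are assumptions, not the course's code):

import torch

def one_step_bp_solve(R, newton_solver, u0, theta):
    # Phase 1: converge with autograd off - no graph is stored for the n iterations
    with torch.no_grad():
        u_n = newton_solver(lambda u: R(u, theta), u0)
    # Phase 2: one "pretend" Newton step from the converged point;
    # only this step is recorded for BP w.r.t. theta
    u_n = u_n.detach()
    r = R(u_n, theta)   # tracked: carries the theta-dependence
    # The Jacobian is treated as a constant; its theta-dependence multiplies r,
    # which is ~0 at convergence, so dropping it costs little accuracy
    J = torch.autograd.functional.jacobian(lambda u: R(u, theta), u_n)
    return u_n - torch.linalg.solve(J, r)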

(Possibly) Better solution: Adjoint¶

To be discussed in a minute.

  • Pros: Fast and accurate
  • Cons: Takes time to implement

Non-differentiable Case¶

Essentially, one can only do forward evaluation for the inner problem, i.e.,

  • A function that computes $\vu$ given $\vtt$; denote it $\vu=\cF_{ND}(\vtt)$
  • Rewrite the nested problem as $$ \min_\vtt \cJ(\cF_{ND}(\vtt),\vtt) $$

Whenever possible, avoid getting trapped in this scenario. But if you have to deal with it...

Brute force: Penalty/Surrogate¶

Define a surrogate NN to replace $\cF_{ND}$ and estimate $\vu$ $$ \hat{\vu} = \cF_{NN}(\vtt;\vartheta) $$ Learn the original model and the surrogate simultaneously with two losses $$ \min_{\vtt,\vartheta} \cJ(\hat{\vu};\vtt) + \lVert \cF_{NN}(\vtt;\vartheta) - \cF_{ND}(\vtt) \rVert^2 $$ Here $\cF_{ND}(\vtt)$ is used as if it were a dataset generated on the fly

  • Pros: Easy to implement; (typically) fast to evaluate in prediction
  • Cons: Additional computational cost; $\cF_{NN}$ can be hard to train; only weak enforcement of $\cF_{ND}$ in prediction
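For concreteness, a minimal PyTorch sketch of this penalty/surrogate training loop; the dimensions, network size, and the callables `J` and `F_nd` (the loss and the black box) are all assumptions for illustration:

import torch

F_nn = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(),
                           torch.nn.Linear(32, 4))   # surrogate for F_ND
theta = torch.zeros(2, requires_grad=True)
opt = torch.optim.Adam([theta] + list(F_nn.parameters()), lr=1e-3)

def train_step(J, F_nd):
    u_hat = F_nn(theta)        # gradients w.r.t. theta flow through the surrogate
    with torch.no_grad():
        u_nd = F_nd(theta)     # black-box call: the "dataset generated on the fly"
    loss = J(u_hat, theta) + ((u_hat - u_nd)**2).sum()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()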

(Possibly) Better solution: Probabilistic formulation¶

To be discussed in a minute.

  • Pros: Opposite of the previous cons
  • Cons: Effective only for low-dim $\vtt$

Recap¶

Next we will discuss two methods for backpropagating through implicit functions:

  • Adjoint method for differentiable cases
  • Probabilistic method for non-differentiable cases

Adjoint Method for Differentiable Cases¶

Consider the optimization problem in general, $$ \begin{align} \min_{\vtt} &\quad \cJ(\vu;\vtt) \\ \mathrm{s.t.} &\quad \cR(\vu;\vtt) = 0 \end{align} $$ where $\cR(\vu;\vtt)=0$ represents the inner problem.

Formulation by adjoint variables¶

For optimization, one wants the full derivative of $\cJ(\vu;\vtt)$ w.r.t. $\vtt$, i.e. $$ \ddf{\cJ(\vu;\vtt)}{\vtt} = \ppf{\cJ}{\vu}\ppf{\vu}{\vtt} + \ppf{\cJ}{\vtt} $$ where $\ppf{\vu}{\vtt}$ is non-trivial to compute.

From the constraint, one knows, $$ \ddf{\cR(\vu;\vtt)}{\vtt} = \ppf{\cR}{\vu}\ppf{\vu}{\vtt} + \ppf{\cR}{\vtt} = 0 $$ or $$ \ppf{\vu}{\vtt} = \left(-\ppf{\cR}{\vu}\right)^{-1}\ppf{\cR}{\vtt} $$

Therefore, $$ \ddf{\cJ(\vu;\vtt)}{\vtt} = \ppf{\cJ}{\vu}\left(-\ppf{\cR}{\vu}\right)^{-1}\ppf{\cR}{\vtt} + \ppf{\cJ}{\vtt} $$

Now let $$ \vtf^T = -\ppf{\cJ}{\vu}\left(\ppf{\cR}{\vu}\right)^{-1} $$ or $$ \left(\ppf{\cR}{\vu}\right)^T\vtf = -\left(\ppf{\cJ}{\vu}\right)^T $$

This last equation is called the Adjoint Equation, and $\vtf$ is called the Adjoint Variable.

As a result, the full derivative of $\cJ(\vu;\vtt)$ w.r.t. $\vtt$ is computed as follows,

  • Given $\vtt$, solve $\cR(\vu;\vtt)=0$ to find $\vu$
    • In this process one obtains $\ppf{\cR}{\vu}$.
  • Given $\vtt$ and $\vu$, find $\ppf{\cJ}{\vu}$, $\ppf{\cJ}{\vtt}$, and $\ppf{\cR}{\vtt}$.
  • Solve the adjoint equation and find $\vtf$. $$ \left(\ppf{\cR}{\vu}\right)^T\vtf = -\left(\ppf{\cJ}{\vu}\right)^T $$
  • The gradient of $\cJ$ $$ \ddf{\cJ(\vu;\vtt)}{\vtt} = \vtf^T\ppf{\cR}{\vtt} + \ppf{\cJ}{\vtt} $$
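A minimal NumPy sketch of this recipe, specialized to the linear-quadratic case used in the numerical example below ($\cR=\vA\vu-\vb(\vtt)$, $\cJ=\vu^T\vM\vu-2\vu^T\vm$, so $\ppf{\cJ}{\vtt}=0$); `bfun` and `dbdt` are assumed callables returning $\vb(\vtt)$ and $\ppf{\vb}{\vtt}$:

import numpy as np

def adjoint_gradient(A, Mmat, mvec, bfun, dbdt, theta):
    u = np.linalg.solve(A, bfun(theta))   # 1) forward solve: R(u; theta) = 0
    dJdu = 2*Mmat @ u - 2*mvec            # 2) partials of J at the solution
    phi = np.linalg.solve(A.T, -dJdu)     # 3) adjoint solve: (dR/du)^T phi = -(dJ/du)^T
    dRdt = -dbdt(theta)                   #    dR/dtheta = -db/dtheta
    return phi @ dRdt                     # 4) dJ/dtheta = phi^T dR/dtheta (+ dJ/dtheta = 0)

Note that the forward and adjoint solves share the same matrix (one with $\vA$, one with $\vA^T$), so a single factorization can serve both.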

Alternative: Formulation by Lagrange multipliers¶

Introduce a vector of Lagrange multipliers $\vtf$ $$ \cL(\vu;\vtt) = \cJ(\vu;\vtt) + \vtf^T \cR(\vu;\vtt) $$

Now one wants the full derivative of $\cL(\vu;\vtt)$ w.r.t. $\vtt$, i.e. $$ \ddf{\cL(\vu;\vtt)}{\vtt} = \ppf{\cJ}{\vu}\ppf{\vu}{\vtt} + \ppf{\cJ}{\vtt} + \vtf^T\left( \ppf{\cR}{\vu}\ppf{\vu}{\vtt} + \ppf{\cR}{\vtt} \right) $$

The RHS is $$ \ppf{\cJ}{\vtt} + \vtf^T\ppf{\cR}{\vtt} + \left(\ppf{\cJ}{\vu} + \vtf^T\ppf{\cR}{\vu}\right)\ppf{\vu}{\vtt} $$

One can eliminate $\ppf{\vu}{\vtt}$ by setting the bracket to zero, $$ \ppf{\cJ}{\vu} + \vtf^T\ppf{\cR}{\vu} = 0 $$ or the adjoint equation $$ \left(\ppf{\cR}{\vu}\right)^T\vtf = -\left(\ppf{\cJ}{\vu}\right)^T $$

Extra: Multiple inner problems¶

It is not uncommon for an ML model to involve multiple implicit functions. The optimization problem becomes $$ \begin{align} \min_{\vtt} &\quad \cJ(\vU;\vtt) \\ \mathrm{s.t.} &\quad \cR_j(\vU;\vtt) = 0,\quad j=1,\cdots,m \end{align} $$ where $m$ inner problems are involved, with unknowns $\vU=\{\vu_1,\cdots,\vu_m\}$.

Introduce a set of Lagrange multipliers, or the adjoint variables, $$ \cL(\vU;\vtt) = \cJ(\vU;\vtt) + \sum_{j=1}^m \vtf_j^T \cR_j(\vU;\vtt) $$

Take the full derivative, $$ \begin{align} \ddf{\cL(\vU;\vtt)}{\vtt} &= \ppf{\cJ}{\vU}\ppf{\vU}{\vtt} + \ppf{\cJ}{\vtt} + \sum_{j=1}^m \vtf_j^T\left( \ppf{\cR_j}{\vU}\ppf{\vU}{\vtt} + \ppf{\cR_j}{\vtt} \right) \\ &= \ppf{\cJ}{\vtt} + \sum_{j=1}^m \vtf_j^T\ppf{\cR_j}{\vtt} + \sum_{j=1}^m \left(\vtf_j^T\ppf{\cR_j}{\vU} + \ppf{\cJ}{\vU}\right)\ppf{\vU}{\vtt} \end{align} $$

A set of $m$ coupled adjoint equations is identified as $$ \sum_{j=1}^m \left(\ppf{\cR_j}{\vu_i}\right)^T\vtf_j = -\left(\ppf{\cJ}{\vu_i}\right)^T,\quad i=1,\cdots,m $$

Numerical Example¶

We solve a 1D advection-diffusion equation for $x\in [0,1]$ $$ p u' - u'' = f(x;\vtt),\quad u(0)=u_0,\ u'(1)=n_1 $$ where $p$ is a constant advection (wave) speed.

The RHS contains some unknown parameters $$ f(x;\vtt) = x^2 + \theta_1 x + \theta_2 $$

Suppose we know a solution $u^*$ found at the true parameters $\theta^*$. To infer $\theta^*$ from $u^*$, we solve $$ \begin{align*} \min_{\vtt} &\quad J(u;\vtt) \equiv \int_0^1 (u-u^*)^2 dx \\ \mathrm{s.t.} &\quad p u' - u'' = f(x;\vtt),\quad u(0)=u_0,\ u'(1)=n_1 \end{align*} $$

In [2]:
import numpy as np
import matplotlib.pyplot as plt

# `fu` evaluates the reference solution u* (defined in setup cells not shown here)
xx = np.linspace(0,1,41)
f = plt.figure(figsize=(6,4))
plt.plot(xx, fu(xx));
(Figure: the reference solution $u^*(x)$ on $[0,1]$.)

System Discretization¶

We use a generic Galerkin method to discretize the differential equation and the objective.

Specifically, we pick $N$ Legendre polynomials as basis functions, and represent the solution as $$ u(x) = \sum_{i=1}^N u_i\psi_i(x) \equiv \vty^T(x) \vu $$

In [3]:
# eval_leg0 / eval_leg1 (from setup cells): Legendre polynomials and their derivatives
f, ax = plt.subplots(ncols=2, figsize=(8,4))
for _i in range(0,5):
    ax[0].plot(xx, eval_leg0(_i, xx), label=f"Order={_i}")
    ax[1].plot(xx, eval_leg1(_i, xx))
ax[0].legend()
ax[0].set_title("Legendre polynomials")
ax[1].set_title("Derivatives");
(Figure: the first five Legendre polynomials and their derivatives.)

The objective is converted to a quadratic form, $$ \cJ(\vu;\vtt) = \int_0^1 \left( \vty^T\vu - u^* \right)^2 dx \equiv \vu^T \vM \vu - 2\vu^T \vm + \text{const} $$ where $$ \vM = \int_0^1 \vty\vty^T dx,\quad \vm = \int_0^1 \vty u^* dx $$

The equation is converted to a linear system, $$ \cR(\vu;\vtt) = \vA\vu - \vb(\vtt) = 0 $$ where for the formulation of $\vA$ see the Differential Equations slides in Math review, and $$ \vb(\vtt) = \int_0^1 \vty f(x,\vtt) dx $$
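The course helpers (`galerkin`, `obj`) are not shown; under the stated choices, an assembly of $\vM$, $\vm$, and $\vb(\vtt)$ by Gauss-Legendre quadrature might look like the following sketch (all names here are assumptions, and `ustar` is a stand-in for the data $u^*$):

import numpy as np
from numpy.polynomial.legendre import Legendre, leggauss

xq, wq = leggauss(20)               # quadrature rule on [-1, 1] ...
xq, wq = 0.5*(xq + 1.0), 0.5*wq     # ... mapped to [0, 1]

Nb = 5                              # number of basis functions
Psi = np.array([Legendre.basis(i, domain=[0, 1])(xq) for i in range(Nb)])

ustar = lambda x: np.sin(np.pi*x)   # placeholder for the known solution u*
Mmat = (Psi * wq) @ Psi.T           # M_ij = int psi_i psi_j dx
mvec = (Psi * wq) @ ustar(xq)       # m_i  = int psi_i u* dx
bvec = lambda th: (Psi * wq) @ (xq**2 + th[0]*xq + th[1])   # b_i = int psi_i f dx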

Adjoint-based Optimization¶

This brings us to the standard form: $$ \begin{align*} \min_{\vtt} &\quad \cJ(\vu;\vtt) \\ \mathrm{s.t.} &\quad \cR(\vu;\vtt) = 0 \end{align*} $$ Now suppose we start with an initial guess, say $\vtt^{(0)}=[10.0, 10.0]$.

Then, we use the adjoint formulation to obtain the gradient for optimization, $\ddf{\cL(\vu;\vtt)}{\vtt}$ at $\vtt^{(0)}$ -

First, we solve for $\vu^{(0)}$ at $\vtt^{(0)}$. Clearly the result is totally off.

In [5]:
theta0 = [10.0, 10.0]
sol, mats = galerkin(M, N, p, ub, nb, theta0)
print(obj(sol)[0])
compare_sol(sol, fu, 41);
0.6717431265076316
(Figure: the solution at $\vtt^{(0)}$ vs. the reference $u^*$.)

Then, we solve the adjoint problem $$ \left(\ppf{\cR}{\vu}\right)^T\vtf = -\left(\ppf{\cJ}{\vu}\right)^T $$ where $\ppf{\cR}{\vu}=\vA$, $\ppf{\cJ}{\vu}=(2\vM\vu-2\vm)^T$

And evaluate the gradients $$ \ddf{\cL(\vu;\vtt)}{\vtt} = \ppf{\cJ}{\vtt} + \vtf^T \ppf{\cR}{\vtt} $$ where $\ppf{\cJ}{\vtt}=0$ and $\ppf{\cR}{\vtt}=-\ppf{\vb}{\vtt}$

Adjoint-based gradients, verified by finite difference:

In [6]:
# Adjoint
sol, mats = galerkin(M, N, p, ub, nb, theta0)  # Forward solve
_, gda = galerkin_da(sol, mats, obj)           # Adjoint/Backward solve
In [7]:
# Finite difference, using a SciPy function
from scipy.optimize import approx_fprime, minimize

def Jfd(qs):
    sol, _ = galerkin(M, N, p, ub, nb, qs)
    d, _ = obj(sol)
    return d
gfd = approx_fprime(theta0, Jfd)

print(f"Adjoint: {gda}, FD: {gfd}")
Adjoint: [0.04238442 0.09196361], FD: [0.04238444 0.0919636 ]

Now let's use a vanilla optimizer from SciPy to solve the problem (i.e., train the model)

In [8]:
def Jda(qs):
    sol, mats = galerkin(M, N, p, ub, nb, qs)
    fun, grd = galerkin_da(sol, mats, obj)
    return fun, grd
res = minimize(Jda, theta0, method="l-bfgs-b", jac=True)
print(res)
  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: 7.840073109219852e-09
        x: [ 3.404e-04 -2.283e-04]
      nit: 4
      jac: [-2.017e-08 -4.432e-08]
     nfev: 8
     njev: 8
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
In [9]:
# Check the converged solutions, i.e., "learned" model
sol, mats = galerkin(M, N, p, ub, nb, res.x)
compare_sol(sol, fu, 41);
(Figure: the converged solution vs. the reference $u^*$.)
In [10]:
# Note that much more cost would be incurred if we optimized
# by BRUTE force (i.e., finite difference)
res = minimize(Jfd, theta0, method="l-bfgs-b", jac=False)
print(res)  # Look at `nfev`: the number of function evaluations
  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: 7.84006204764976e-09
        x: [ 4.063e-04 -2.550e-04]
      nit: 4
      jac: [-7.499e-09 -2.315e-08]
     nfev: 24
     njev: 8
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>

Key take-away: The adjoint formulation enables machine learning with large-scale simulations.


[Left] Neustock et al. Nature 2019; [Right] Amos et al. NeurIPS 2018

Probabilistic Method for Non-Differentiable Cases¶

Recall now the problem has to be written as $$ \min_\vtt \cJ(\cF_{ND}(\vtt),\vtt) $$ where $\cF_{ND}$ might not be differentiable.

We assume scalar $\cF$ here.

Next we present two routes for finding equivalents or approximations to $$ \ppf{\cF_{ND}(\vtt)}{\vtt}\text{, or in short, }\nabla_\vtt\cF $$ so that we can work with non-differentiable functions in a differentiable framework.

Route 1: Zeroth-order Estimate¶

For differentiable cases, one can use finite differences to estimate gradients $$ (\nabla_\vtt \cF)\vw = \frac{\cF(\vtt+\epsilon\vw)-\cF(\vtt)}{\epsilon} + O(\epsilon),\quad \text{for sufficiently small }\epsilon $$ Looping over the unit vectors $\vw=[1,0,\cdots,0],[0,1,\cdots,0]$, etc. gives the entries of $\nabla_\vtt \cF$

Now define a probabilistic generalization of the finite difference $$ \tilde{\nabla}_\vtt\cF = \bE_\vw\left[ \frac{\cF(\vtt+\epsilon\vw)-\cF(\vtt)}{\epsilon}\vw^\top \right],\quad \text{for small }\epsilon,\ \vw \sim \cN(0,\vI) $$

Does not require $\cF$ to be differentiable

Practically, we sample, say, $N$ $\vw_i$ from $\cN(0,\vI)$, and estimate $$ \tilde{\nabla}_\vtt\cF = \frac{1}{N} \sum_{i=1}^N \frac{\cF(\vtt+\epsilon\vw_i)-\cF(\vtt)}{\epsilon}\vw_i^\top $$

If $\cF$ were differentiable, $$ \cF(\vtt+\epsilon\vw) \approx \cF(\vtt) + \epsilon(\nabla_\vtt\cF)\vw $$ Then we recover finite difference when $\epsilon\rightarrow 0$ $$ \tilde{\nabla}_\vtt\cF = \bE_\vw\left[ \frac{\cF(\vtt+\epsilon\vw)-\cF(\vtt)}{\epsilon}\vw^\top \right] = \bE_\vw\left[ (\nabla_\vtt\cF)\vw\vw^\top \right] = (\nabla_\vtt\cF) \underbrace{\bE_\vw\left[ \vw\vw^\top \right]}_{=\vI} = \nabla_\vtt\cF $$
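A quick numerical sanity check of this claim on a smooth toy function (everything below is illustrative):

import numpy as np

rng = np.random.default_rng(0)
F = lambda t: t[..., 0]**2 + 3*t[..., 1]   # smooth toy F; true gradient at (1,2) is [2, 3]
t, eps = np.array([1.0, 2.0]), 1e-3
w = rng.standard_normal((50000, 2))        # w_i ~ N(0, I)
est = np.mean((F(t + eps*w) - F(t))[:, None] / eps * w, axis=0)
print(est)                                 # close to [2.0, 3.0]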

A discussion on variance¶

You might notice that the probabilistic version could be simplified: $$ \tilde{\nabla}_\vtt\cF = \bE_\vw\left[ \frac{\cF(\vtt+\epsilon\vw)-\cF(\vtt)}{\epsilon}\vw^\top \right] = \frac{1}{\epsilon}\bE_\vw\left[ \cF(\vtt+\epsilon\vw)\vw^\top \right] - \frac{1}{\epsilon}\bE_\vw\left[\cF(\vtt)\vw^\top \right] = \frac{1}{\epsilon}\bE_\vw\left[ \cF(\vtt+\epsilon\vw)\vw^\top \right] $$ because, in the second $\bE_\vw$, $\cF(\vtt)$ is independent of $\vw$ and $\bE[\vw]=0$ by definition.

Why do people not use this simpler version?

To quantify the quality of an estimation, we consider

  • Bias: Average deviation of the mean from truth
    • Non-zero bias gives erroneous gradients - bad for optimization
  • Variance: Spread of deviation
    • Higher variance gives larger randomness of gradients - bad for optimization
  • Ideally we want zero bias and low variance

Both probabilistic versions already have zero bias; how about the variance?

Consider a general case - find $b$ to minimize the variance of $$ \vg(b) = \frac{\cF(\vtt+\epsilon\vw)-b}{\epsilon}\vw $$ We already know $\bE_\vw[\vg(b)]$ does not depend on $b$ (the $b$-term has zero mean), so minimizing the variance reduces to minimizing the second moment $$ \min_b \lVert\bE_{\vw}[\vg\vg^\top]\rVert,\text{ or loosely: } \min_b \bE_{\vw}[\lVert\vg\rVert^2] $$

Note that the objective is simply a quadratic function of $b$ $$ \bE_{\vw}[\lVert\vg\rVert^2] = \bE_{\vw}\left[ \left(\frac{\cF(\vtt+\epsilon\vw)-b}{\epsilon}\right)^2 \lVert\vw\rVert^2 \right] = \frac{1}{\epsilon^2}\bE_{\vw}[\lVert\vw\rVert^2] b^2 + [blah] b + [blah] $$ so the minimizer can be found analytically $$ b^* = \frac{\bE_{\vw}[\cF(\vtt+\epsilon\vw)\lVert\vw\rVert^2]}{\bE_{\vw}[\lVert\vw\rVert^2]} $$ One can either estimate $b^*$ at each iteration, or simply choose $b^*\approx \cF(\vtt)$

  • This brings back our initial probabilistic formulation!

Side Note¶

Along a similar vein, but via a different route, one arrives at "Evolution Strategy" (ES) type algorithms for optimization.

  • ES algorithms have decades of development behind them. One popular variant is CMA-ES.

One notable instance of ES in ML is the following work from OpenAI:

  • Salimans et al. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. ArXiv 2017.

Route 2: Policy Gradient¶

Let's consider a special case, $$ \cF(\vtt) = \arg\max_\vu p(\vu;\vtt) $$ i.e., finding the $\vu$ that is "the most probable" in a probability distribution.

  • Key idea: $\cF$ is non-differentiable, but $p$ is.
  • This setup is common in reinforcement learning - $p$ is the "policy".
  • We will return to the general case in the next route.

This route involves several transformations - First an approximate objective (using $\max$ for consistency) $$ \max_\vtt \cJ(\cF(\vtt))=\cJ(\arg\max_\vu p(\vu;\vtt)) \Rightarrow \max_\vtt \int_\vu \cJ(\vu)p(\vu;\vtt) d\vu $$

The new objective is differentiable $$ \nabla_\vtt \left( \int_\vu \cJ(\vu)p(\vu;\vtt) d\vu \right) = \int_\vu \cJ(\vu) \nabla_\vtt p(\vu;\vtt) d\vu $$

But it requires an expensive integral over $\vu$ - So we do the log-trick $$ RHS = \int_\vu \cJ(\vu) \left(\frac{\nabla_\vtt p(\vu;\vtt)}{p(\vu;\vtt)}\right)p(\vu;\vtt) d\vu = \int_\vu \cJ(\vu) \nabla_\vtt \log p(\vu;\vtt) p(\vu;\vtt) d\vu $$

Now the gradient can be written as an expectation - which can be estimated from $N$ samples $$ \nabla_\vtt \cJ \approx \bE_\vu[\cJ(\vu) \nabla_\vtt \log p(\vu;\vtt)] \approx \frac{1}{N}\sum_{i=1}^N \cJ(\vu_i) \nabla_\vtt \log p(\vu_i;\vtt) $$
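A minimal sketch of this estimator for a Gaussian $p(\vu;\vtt)=\cN(\vtt,\sigma^2\vI)$, for which $\nabla_\vtt\log p(\vu;\vtt)=(\vu-\vtt)/\sigma^2$ (the distribution choice, $\sigma$, and the sample size are assumptions for illustration):

import numpy as np

def pg_grad(J_fun, theta, sigma=0.1, N=1000, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    u = theta + sigma * rng.standard_normal((N, theta.size))   # u_i ~ p(u; theta)
    J = np.array([J_fun(ui) for ui in u])                      # J(u_i); may be non-differentiable
    score = (u - theta) / sigma**2                             # grad_theta log p(u_i; theta)
    return (J[:, None] * score).mean(axis=0)                   # Monte Carlo average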

Variance¶

Just like the previous route, here we can adjust the gradient estimation to reduce the variance.

First note that subtracting a constant does not change the estimation $$ \bE_\vu[(\cJ(\vu)-b) \nabla_\vtt \log p(\vu;\vtt)] = \bE_\vu[\cJ(\vu) \nabla_\vtt \log p(\vu;\vtt)] - b \underbrace{\bE_\vu[\nabla_\vtt \log p(\vu;\vtt)]}_{=0} $$ The last term is 0 because $$ \bE_\vu[\nabla_\vtt \log p(\vu;\vtt)] = \int_\vu \nabla_\vtt p(\vu;\vtt) d\vu = \nabla_\vtt \int_\vu p(\vu;\vtt) d\vu = \nabla_\vtt 1 = 0 $$

Then, we choose the best $b$ to minimize the variance $$ \min_b \bE_\vu[(\cJ(\vu)-b)^2 \lVert\nabla_\vtt \log p(\vu;\vtt)\rVert^2] $$ and the solution is similar as before $$ b^* = \frac{\bE_\vu[\cJ(\vu)\lVert\cdot\rVert^2]}{\bE_\vu[\lVert\cdot\rVert^2]} \approx \bE_\vu[\cJ(\vu)] $$

So a low-variance gradient estimate would be $$ \nabla_\vtt \cJ \approx \frac{1}{N}\sum_{i=1}^N (\cJ(\vu_i)-\bar{\cJ}) \nabla_\vtt \log p(\vu_i;\vtt),\quad \bar{\cJ} = \frac{1}{N} \sum_{i=1}^N \cJ(\vu_i) $$

The technical term for $\bar{\cJ}$ is "baseline".
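In the `pg_grad` sketch above, adding the baseline is a one-line change (same mean, lower variance):

def pg_grad_baseline(J_fun, theta, sigma=0.1, N=1000, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    u = theta + sigma * rng.standard_normal((N, theta.size))
    J = np.array([J_fun(ui) for ui in u])
    score = (u - theta) / sigma**2
    return ((J - J.mean())[:, None] * score).mean(axis=0)   # subtract the baseline J-bar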

Numerical Example - Cont'd¶

Recall we were looking at a differential equation with unknown parameters $$ p u' - u'' = f(x;\vtt) = x^2 + \theta_1 x + \theta_2,\quad u(0)=u_0,\ u'(1)=n_1 $$ and have discretized and formulated the problem as $$ \begin{align*} \min_{\vtt} &\quad \cJ(\vu;\vtt) = \vu^T \vM \vu - 2\vu^T \vm \\ \mathrm{s.t.} &\quad \cR(\vu;\vtt) = \vA\vu - \vb(\vtt) = 0 \end{align*} $$

Apply zeroth-order optimization¶

Since the formulation was for scalar output, here we reformulate the problem as $$ \min_\vtt\quad \cJ(\cF(\vtt)),\quad \cF(\vtt)=\vA^{-1}\vb(\vtt) $$ and directly work with $\cJ$.

We estimate the gradient as $$ \nabla_\vtt\cJ \approx \frac{1}{N}\sum_{i=1}^N \frac{\cJ(\vtt+\epsilon\vw_i)-\cJ(\vtt)}{\epsilon}\vw_i^\top,\quad \text{for small }\epsilon,\ \vw_i \sim \cN(0,\vI) $$

First, try the original continuous problem, so that the adjoint solution is available as a benchmark.

In [12]:
def zoo(func, qs, Nq, eps=1e-3):   # Zeroth-Order Optimization
    ws = np.random.randn(Nq, len(qs))
    J0 = func(qs)
    tmp = 0
    for _i in range(Nq):
        J = func(qs + eps*ws[_i])
        tmp += (J-J0) / eps * ws[_i]
    return tmp / Nq
print(f"Adjoint: {gda}, Zeroth: {zoo(Jfd, theta0, 100)}")
Adjoint: [0.04238442 0.09196361], Zeroth: [0.03181694 0.09244476]
In [13]:
Ns = [20, 50, 100, 200, 500, 1000, 10000]  # Check convergence
for _n in Ns:
    _g = zoo(Jfd, theta0, _n, eps=1e-3)
    print(f"{_n:<5}: {_g}, Error: {np.linalg.norm(gda-_g) / np.linalg.norm(gda)}")
20   : [0.03850679 0.10418842], Error: 0.12665369540944582
50   : [0.03966067 0.10514643], Error: 0.13293654133796545
100  : [0.07316111 0.11569813], Error: 0.38381656015138965
200  : [0.04563858 0.0786218 ], Error: 0.13561943280468086
500  : [0.04716712 0.09594641], Error: 0.06146403440542132
1000 : [0.04682252 0.10215667], Error: 0.10978914707587059
10000: [0.04242091 0.09247902], Error: 0.005102673084574341
In [14]:
# Zeroth-order optimization - Note # of iterations (nit)
res_zo = minimize(Jfd, theta0, method="l-bfgs-b", jac=lambda q: zoo(Jfd, q, 100))
print(res_zo)
  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: 2.427189391111174e-08
        x: [-2.630e-02  1.212e-02]
      nit: 10
      jac: [-1.306e-07  2.061e-06]
     nfev: 16
     njev: 16
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
In [15]:
# Adjoint solution as reference
res_da = minimize(Jda, theta0, method="l-bfgs-b", jac=True)
print(res_da)
  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: 7.840073109219852e-09
        x: [ 3.404e-04 -2.283e-04]
      nit: 4
      jac: [-2.017e-08 -4.432e-08]
     nfev: 8
     njev: 8
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
In [16]:
# Check the converged solutions, i.e., "learned" model
sol, _ = galerkin(M, N, p, ub, nb, res_zo.x)
compare_sol(sol, fu, 41);
(Figure: the converged solution vs. the reference $u^*$.)

Next, modify the parameters to $\tilde{\vtt}=[\tilde{\theta}_1,\tilde{\theta}_2]$ with $$ \tilde{\theta}_1=\theta_1,\quad \tilde{\theta}_2=\frac{1}{100}\mathrm{int}(100\,\theta_2) $$ i.e., $\theta_2$ is quantized in steps of $0.01$, so the objective becomes a staircase (piecewise-constant, hence non-differentiable) function of $\theta_2$.

In [17]:
# Non-differentiable version; no useful gradient for theta_2!
def Jnd(qs):
    q0 = qs[0]
    q1 = int(qs[1]*100)/100
    sol, _ = galerkin(M, N, p, ub, nb, np.array([q0, q1]))
    d, _ = obj(sol)
    return d
gnd = approx_fprime(theta0, Jnd)

print(f"Diff ver.: {gda}, ND: {gnd}")
Diff ver.: [0.04238442 0.09196361], ND: [0.04238444 0.        ]
In [18]:
# Finite difference: No longer works
res_fd = minimize(Jnd, theta0, method="l-bfgs-b", jac=False)
print(res_fd)
  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: 0.010686658899366581
        x: [-2.119e+01  1.000e+01]
      nit: 2
      jac: [ 5.829e-08  0.000e+00]
     nfev: 12
     njev: 4
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
In [19]:
# Zeroth-order optimization - Note the larger eps
res_zo = minimize(Jnd, theta0, method="l-bfgs-b",
                  jac=lambda q: zoo(Jnd, q, 100, eps=0.01))
print(res_zo)
  message: CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL
  success: True
   status: 0
      fun: 1.6598598437503237e-07
        x: [-7.315e-02  3.889e-02]
      nit: 17
      jac: [ 7.673e-07  1.196e-06]
     nfev: 43
     njev: 43
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
In [20]:
# Check the converged solutions, i.e., "learned" model
sol, _ = galerkin(M, N, p, ub, nb, res_zo.x)
compare_sol(sol, fu, 41);
No description has been provided for this image