DE - differential equations
So far we have focused on objectives and constraints that are ordinary functions, which involve no derivatives of the unknowns.
Now we turn to the case where the objectives and constraints are functionals, i.e., "functions" of functions.
Formally, define $$ \begin{array}{ll} \mbox{minimize}_{\theta} & \cJ[u;x;\theta] \\ \mbox{subject to} & \cD[u;x;\theta]=f(x;\theta) \\ & \cB[u;x;\theta]=g(x;\theta)\mbox{ on }\partial\Omega \\ & x\in\Omega \end{array} $$ where $x$ is the coordinate, $u(x)$ is a function, and $\theta$ is the design variable.
$\cJ$, $\cD$, $\cB$ are functionals.
Suppose $u(x)$ is the solution of the following DE on $\Omega=[0,1]$ $$ p u' - u'' = x^2+\theta_1 x+\theta_2,\quad u(0)=u_0,\ u'(1)=n_1 $$ and the goal is to minimize an integral on $\Omega$ $$ \int_0^1 u(x)^2 dx $$
Then
$\cJ[u;x;\theta]=\int_0^1 u(x)^2 dx$
$\cD[u;x;\theta]=p u' - u''$, $f(x;\theta)=x^2+\theta_1 x+\theta_2$
$\cB_1[u;x;\theta]=u(0)$, $g_1(x;\theta)=u_0$
$\cB_2[u;x;\theta]=u'(1)$, $g_2(x;\theta)=n_1$
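As a sanity check on the setup, the forward problem above can be solved numerically, e.g., with SciPy's generic BVP solver. This is only a sketch: the values of $p$, $\theta$, $u_0$, $n_1$ below are placeholder assumptions, not values from the text.

```python
import numpy as np
from scipy.integrate import solve_bvp

# Sketch: solve p*u' - u'' = x^2 + theta1*x + theta2 on [0,1]
# with u(0) = u0 and u'(1) = n1. All parameter values are assumed.
p, theta1, theta2 = 1.0, 0.0, 0.0
u0, n1 = 0.0, 0.0

def rhs(x, y):
    # y[0] = u, y[1] = u'; rearranged: u'' = p*u' - (x^2 + theta1*x + theta2)
    return np.vstack([y[1], p * y[1] - (x**2 + theta1 * x + theta2)])

def bc(ya, yb):
    return np.array([ya[0] - u0,   # u(0) = u0
                     yb[1] - n1])  # u'(1) = n1

x = np.linspace(0.0, 1.0, 41)
sol = solve_bvp(rhs, bc, x, np.zeros((2, x.size)))
print(sol.status)  # 0 means the solver converged
```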
Recall in regular equality-constrained optimization, $$ \begin{array}{ll} \mbox{minimize}_{\theta} & J(\theta) \\ \mbox{subject to} & D(\theta)=0 \end{array} $$ with Lagrangian $L(\theta,\lambda)=J(\theta)+\lambda D(\theta)$, the gradient $$ \dLdq = J_\theta(\theta)+\lambda D_\theta(\theta) $$ gives the direction to decrease $J$ while satisfying the constraints.
Then with $\dLdq$ we can do gradient descent to minimize $J(\theta)$.
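To make the finite-dimensional case concrete, here is a hypothetical toy instance: minimize $J(\theta)=\theta_1^2+\theta_2^2$ subject to $D(\theta)=\theta_1+\theta_2-1=0$. Stationarity of $L=J+\lambda D$ gives $\theta=(1/2,1/2)$ with $\lambda=-1$, which SciPy recovers:

```python
from scipy.optimize import minimize

# Hypothetical toy problem: minimize J(theta) = theta1^2 + theta2^2
# subject to D(theta) = theta1 + theta2 - 1 = 0.
# Stationarity of L = J + lambda*D gives theta = (1/2, 1/2), lambda = -1.
def J(q):
    return q[0]**2 + q[1]**2

def D(q):
    return q[0] + q[1] - 1.0

res = minimize(J, x0=[0.0, 0.0], constraints=[{"type": "eq", "fun": D}])
print(res.x)  # close to [0.5, 0.5]
```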
We will do the same for DE-constrained problems.
To handle the DE-constraints, we first extend the concept of the Lagrange multiplier to a continuous version, $\lambda(x)$
Define Lagrangian ($x$ omitted) $$ \cL[u,\lambda;\theta] = \cJ[u;\theta] + \int_{\Omega} \lambda (\cD[u;\theta]-f(\theta)) dx + \int_{\partial\Omega} \lambda (\cB[u;\theta]-g(\theta)) dx $$
We still want $\nabla_\theta\cL$, but the complication is that $u$ depends on $\theta$, which requires a chain-rule-like operation.
Then, for example, for the $\cJ$ part, formally we need: $$ \ddf{}{\theta}\cJ[u;\theta] = \cJ_u \dudq + \cJ_\theta $$
But $u$ is a DE solution that can be expensive to obtain, so $\dudq$ is non-trivial to compute.
$\cJ_u$ is not an "innocent" term either, since differentiating a functional with respect to a function requires care. For now, "pretend" it is well-defined.
By stationarity condition, we need $$ \cL_u = 0 $$ which translates to $$ \cJ_u[u;\theta] + \int_{\Omega} \lambda \cD_u[u;\theta] dx + \int_{\partial\Omega} \lambda \cB_u[u;\theta] dx = 0 $$
This would produce a DE for $\lambda$. This is the so-called adjoint equation.
We will see a specific example soon.
Now let's compute $\dLdq$. Its three terms are
$$ \begin{align*} \ddf{}{\theta}\cJ[u;\theta] &= \cJ_u[u;\theta] \dudq + \cJ_\theta[u;\theta] \\ \ddf{}{\theta}\int_{\Omega} \lambda (\cD[u;\theta]-f(\theta)) dx &= \int_{\Omega} \lambda \left( \cD_u[u;\theta]\dudq + \cD_\theta[u;\theta] - f_\theta(\theta) \right) dx \\ \ddf{}{\theta}\int_{\partial\Omega} \lambda (\cB[u;\theta]-g(\theta)) dx &= \int_{\partial\Omega} \lambda \left( \cB_u[u;\theta]\dudq - g_\theta(\theta) \right) dx \end{align*} $$Here we assume $\cB$ has no explicit dependence on $\theta$ (as in our example), so no $\cB_\theta$ term appears. The terms involving $\dudq$ sum to zero, given the adjoint solution $\lambda$.
Then the gradient simplifies to $$ \dLdq = \cJ_\theta[u;\theta] + \int_\Omega \lambda \left( \cD_\theta[u;\theta] - f_\theta(\theta) \right) dx - \int_{\partial\Omega} \lambda g_\theta(\theta) dx $$
Back to our example. The Lagrangian looks like the following
$$ \cL[u,\lambda;\theta] = \underbrace{\int_0^1 u^2 dx}_{\mbox{Objective}} + \underbrace{\int_\Omega \lambda \left(p u' - u'' - (x^2+\theta_1 x+\theta_2)\right) dx}_{\mbox{DE}} + \underbrace{\lambda(0)(u(0)-u_0) + \lambda(1)(u'(1)-n_1)}_{\mbox{BCs}} $$
First we derive the adjoint equation, i.e., $\cL_u = 0$.
But the devil is in the details: What do terms like $\partial u'/\partial u$ even mean?
Here we need a bit of Calculus of Variations.
We consider some small arbitrary perturbation $\delta u$ to $u$, called variation.
Three simple rules:
$\delta u$ results in changes in function(al)s of $u$, e.g., $$ \delta (u^2) = (u+\delta u)^2 - u^2 = 2u\,\delta u + (\delta u)^2 \approx 2u\,\delta u $$ as $\delta u\rightarrow 0$.
$\delta$ interchanges with derivative, i.e., $\delta (u') = (\delta u)'$.
$u+\delta u$ must still satisfy the BCs of the DE, so the BCs for $\delta u$ are homogeneous. For the current problem $$ \bd{u(0)}=0,\quad \gd{u'(1)}=0 $$
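Rule 1 can be checked numerically with arbitrary placeholder choices of $u$ and $\delta u$: the remainder $(\delta u)^2$ is second order, so $\delta(u^2)\approx 2u\,\delta u$.

```python
import numpy as np

# Numerical check of rule 1 with placeholder choices of u and delta_u:
# delta(u^2) = 2*u*delta_u up to a remainder of order (delta_u)^2.
x = np.linspace(0.0, 1.0, 101)
u = np.sin(np.pi * x)     # an arbitrary function u(x)
eps = 1e-5
du = eps * x * (1.0 - x)  # a small, arbitrary variation delta_u

exact = (u + du)**2 - u**2
first_order = 2.0 * u * du
remainder = np.max(np.abs(exact - first_order))
print(remainder)  # ~ eps^2, negligible relative to the first-order term
```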
Then for the Lagrangian $$ \begin{align*} \rd{}\cL[u,\lambda;\theta] &= \int_0^1 \rd{}(u^2) dx + \int_0^1 \lambda \rd{}\left(p u' - u'' - (x^2+\theta_1 x+\theta_2)\right) dx \\ &\quad + \lambda(0)\rd{}(u(0)-u_0) + \lambda(1)\rd{}(u'(1)-n_1) \\ &= \int_0^1 2u\rd{u} dx + \int_0^1 \lambda \underbrace{\left(p \rd{}u' - \rd{}u''\right)}_{\mbox{to tackle next}} dx + \underbrace{\lambda(0)\bd{u(0)} + \lambda(1)\gd{u'(1)}}_{=0} \end{align*} $$
For the second integral, using integration by parts, $$ \begin{align*} \int_0^1 \lambda \rd{} u' dx &= \lambda \bd{u} |_0^1 - \int_0^1 \lambda' \rd{u} dx \\ &= \lambda(1) \bd{u}(1) - \int_0^1 \lambda' \rd{u} dx \\ \int_0^1 \lambda \rd{} u'' dx &= \lambda \gd{u'} |_0^1 - \lambda' \bd{u} |_0^1 + \int_0^1 \lambda'' \rd{u} dx \\ &= -\lambda(0) \gd{u'}(0) - \lambda'(1) \bd{u}(1) + \int_0^1 \lambda'' \rd{u} dx \end{align*} $$
Then the variation of Lagrangian becomes $$ \begin{align*} \delta\cL[u,\lambda;\theta] &= \int_0^1 2u\rd{u} dx + p\left[ \lambda(1) \bd{u}(1) - \int_0^1 \lambda' \rd{u} dx \right] \\ &\quad - \left[ -\lambda(0) \gd{u'}(0) - \lambda'(1) \bd{u}(1) + \int_0^1 \lambda'' \rd{u} dx \right] \\ &= \int_0^1 \left(2u - p\lambda' - \lambda'' \right)\rd{u} dx + (p\lambda(1) + \lambda'(1)) \bd{u(1)} + \lambda(0) \gd{u'(0)} \end{align*} $$
The variation $\delta u$ is arbitrary, so for $\delta\cL$ to be zero, we need $$ \begin{align*} \rd{u} &\Rightarrow 2u - p\lambda' - \lambda''=0 \\ \bd{u(1)} &\Rightarrow p\lambda(1) + \lambda'(1) = 0 \\ \gd{u'(0)} &\Rightarrow \lambda(0) = 0 \end{align*} $$ This is our adjoint equation!
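As a quick sketch (not the Galerkin solver used later), the adjoint BVP $2u - p\lambda' - \lambda'' = 0$, $\lambda(0)=0$, $p\lambda(1)+\lambda'(1)=0$ can be solved with SciPy's `solve_bvp`. The forward solution $u(x)$ and the value of $p$ below are placeholders.

```python
import numpy as np
from scipy.integrate import solve_bvp

# Sketch: solve the adjoint BVP lambda'' = 2u - p*lambda' with
# lambda(0) = 0 and p*lambda(1) + lambda'(1) = 0.
# u(x) and p are placeholder assumptions, not values from the text.
p = 1.0
u = lambda x: x * (1.0 - x)  # placeholder for the forward solution u(x)

def rhs(x, y):
    # y[0] = lambda, y[1] = lambda'
    lam, dlam = y
    return np.vstack([dlam, 2.0 * u(x) - p * dlam])

def bc(ya, yb):
    return np.array([ya[0],                # lambda(0) = 0
                     p * yb[0] + yb[1]])   # p*lambda(1) + lambda'(1) = 0

x = np.linspace(0.0, 1.0, 41)
sol = solve_bvp(rhs, bc, x, np.zeros((2, x.size)))
print(sol.status)  # 0 means the solver converged
```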
With $\lambda$ known, we can find the gradients for our example: $$ \begin{align*} \dLdq &= \cJ_\theta[u;\theta] + \int_\Omega \lambda \left( \cD_\theta[u;\theta] - f_\theta(\theta) \right) dx - \int_{\partial\Omega} \lambda g_\theta(\theta) dx \\ &= -\int_\Omega \lambda f_\theta(\theta) dx \end{align*} $$
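Given $\lambda$ on a grid, this gradient reduces to a quadrature of $-\lambda f_\theta$ with $f_\theta(x) = (x, 1)$ for $f = x^2 + \theta_1 x + \theta_2$. A sketch, using a placeholder $\lambda$ in place of an actual adjoint solve:

```python
import numpy as np
from scipy.integrate import trapezoid

# Sketch: dL/dtheta = -int_0^1 lambda(x) * f_theta(x) dx, with f_theta = (x, 1).
# lam below is a placeholder stand-in for the adjoint solution lambda(x).
x = np.linspace(0.0, 1.0, 41)
lam = np.sin(np.pi * x)  # placeholder; in practice, output of the adjoint solve

grad = np.array([
    -trapezoid(lam * x, x),  # component for theta1 (f_theta1 = x)
    -trapezoid(lam, x),      # component for theta2 (f_theta2 = 1)
])
print(grad)  # approximately [-1/pi, -2/pi] for this placeholder lambda
```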
Now we solve the optimization numerically.
A minor modification: Suppose we know a solution $u^*$ found at the true parameters $\theta^*$, and want to infer $\theta^*$ from $u^*$, $$ \begin{align*} \min_{\theta} &\quad J(u;\theta) \equiv \int_0^1 (u-u^*)^2 dx \\ \mathrm{s.t.} &\quad p u' - u'' = f(x;\theta),\quad u(0)=u_0,\ u'(1)=n_1 \end{align*} $$
In Neural Network - Model Embedding, we solved the discrete version of this problem.
xx = np.linspace(0,1,41)
f = plt.figure(figsize=(6,4))
plt.plot(xx, fu(xx)); # Reference solution, theta=[0,0]
First, we solve $u^{(0)}$ at $\theta^{(0)}$. Clearly this is totally off.
theta0 = [-10.0, -10.0]
sol, mats = galerkin(M, N, p, ub, nb, theta0)
print(obj(sol)[0])
compare_sol(sol, fu, 41);
0.6717312803314148
Next, we solve the adjoint equation using the Galerkin method; see Math Differential Equation for more details.
# Adjoint
_, gca = galerkin_ca(sol, mats, obj) # Adjoint/Backward solve
Sanity check to compare with finite difference.
# Finite difference, using a SciPy function
def Jfd(qs):
    sol, _ = galerkin(M, N, p, ub, nb, qs)
    d, _ = obj(sol)
    return d
gfd = approx_fprime(theta0, Jfd)
print(f"Adjoint: {gca}, FD: {gfd}")
Adjoint: [-0.04238407 -0.09196278], FD: [-0.0423841 -0.09196281]
Now let's use a vanilla optimizer from SciPy to solve the problem.
def Jca(qs):
    sol, mats = galerkin(M, N, p, ub, nb, qs)
    fun, grd = galerkin_ca(sol, mats, obj)
    return fun, grd
res = minimize(Jca, theta0, method="l-bfgs-b", jac=True)
print(res)
      fun: 7.84007340724003e-09
 hess_inv: <2x2 LbfgsInvHessProduct with dtype=float64>
      jac: array([2.01907111e-08, 4.43649201e-08])
  message: 'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
     nfev: 8
      nit: 4
     njev: 8
   status: 0
  success: True
        x: array([ 0.00033895, -0.0002136 ])
# Check the converged solutions
sol, mats = galerkin(M, N, p, ub, nb, res.x)
compare_sol(sol, fu, 41);