DS_Adj slides

Machine Learning for Engineering¶

Dynamical System: Adjoint¶

Instructor: Daning Huang¶

$$ \newcommand\norm[1]{{\left\lVert #1 \right\rVert}} \newcommand\ddf[2]{\frac{\mathrm{d} #1}{\mathrm{d} #2}} \newcommand{\bR}{\mathbb{R}} \newcommand{\vf}{\mathbf{f}} \newcommand{\vg}{\mathbf{g}} \newcommand{\vv}{\mathbf{v}} \newcommand{\vw}{\mathbf{w}} \newcommand{\vx}{\mathbf{x}} \newcommand{\vy}{\mathbf{y}} \newcommand{\vA}{\mathbf{A}} \newcommand{\vB}{\mathbf{B}} \newcommand{\vU}{\mathbf{U}} \newcommand{\vV}{\mathbf{V}} \newcommand{\vW}{\mathbf{W}} \newcommand{\vX}{\mathbf{X}} \newcommand{\vtS}{\boldsymbol{\Sigma}} \newcommand{\vtL}{\boldsymbol{\Lambda}} \newcommand{\Im}{\mathrm{Im}} \newcommand{\cL}{\mathcal{L}} \newcommand{\xt}{{\color{red}{x_\theta}}} $$

TODAY: Dynamical System¶

Adjoint formulations of ODEs
Neural ODEs

References:¶

Problem Statement¶

Consider nonlinear dynamics with observation $$ \begin{align} \dot{x} &= \eta(x,u;\theta) \\ y &= \phi(x;\theta) \\ x(0) &= x_0(\theta) \end{align} $$

Given input-output data $\{t_i, u(t_i),\hat{y}(t_i)\}_{i=0}^T$, we want to estimate $\theta$.

$u(t)$ can be ignored if we consider autonomous dynamics.

Difference with SINDy:

We do not directly observe $x$ and hence no $\dot{x}$
$\eta$ and $\phi$ can be nonlinear in $\theta$, e.g., a neural network, so no way to apply least squares.

Here we introduce the Neural ODE approach.

Effectively, we need to solve $$ \begin{align} \min_\theta &\ \int_0^T f(x, t; \theta) dt \\ \mathrm{s.t.} &\ \dot{x} = \eta(x,u;\theta),\quad x(0) = x_0(\theta) \end{align} $$

... where in the objective

The integrand can be $$ f(x, t; \theta) = \norm{\hat{y}-\phi(x;\theta)}^2 $$
The integral can be computed by numerical integration schemes, e.g., trapezoidal rule
Regularization terms, e.g., on $\theta$, can be added.

Options of Algorithms¶

We have already seen something similar before

In the Model Embedding lecture, we solved a DE-constrained optimization
- The PDE was converted to algebraic equations
- Then we used lagrange multiplier method to find the gradients
- This is called discretize-then-optimize

Following this approach, we would use an ODE solver, say Euler method (for simplicity) $$ \begin{align} \min_\theta &\ \sum_{i=1}^T f(x_i, t_i; \theta) \Delta t \\ \mathrm{s.t.} &\ \frac{x_i-x_{i-1}}{\Delta t} = \eta(x_i,u_i;\theta),\ i=1,\dots,T,\quad x(0) = x_0(\theta) \end{align} $$

Potential issues:

Numerous time steps, esp. for long term prediction
- Expensive to solve the (algebraic) adjoint
Solving the adjoint corresponds to a "backward" version of the original ODE
- The backward is often unstable, and vanilla ODE solvers can fail

We take an alternative path

Sticking with the original DE
- We will use a continuous version of lagrange multiplier
- Find a new ODE that we solve to find the gradients
- This is called optimize-then-discretize

Adjoint Formulation¶

We first formalize the optimization problem in a general way $$ \begin{align*} \min_\theta &\ F(x;\theta) = \int_0^T f(x,t;\theta) dt \\ \mathrm{s.t.} &\ h(x,\dot{x},t;\theta)=0,\quad g(x(0),\theta)=0 \end{align*} $$

The form of $h$ allows for the treatment of differential-algebraic equations (DAEs)

Technically, we need $$ \ddf{F}{\theta} = \int_0^T f_x \xt + f_\theta dt $$ where subscripts indicate partial derivatives.

But $\xt$ is a nasty term, e.g.,

$\textrm{dim}(x)=O(10^2)$
$\textrm{dim}(\theta)=O(10^4)$ or more
Intractable for both computation and memory!

To eliminate $\xt$ ...

We take the usual drill of Lagrange multiplier $$ \cL = \int_0^T \left( f(x,t;\theta) + \lambda^T h(x,\dot{x},t;\theta) \right) dt + \mu^T g(x(0),\theta) $$

Taking gradients w.r.t. $\theta$ $$ \ddf{\cL}{\theta} = \int_0^T \left( \underbrace{f_x \xt + f_\theta}_{\text{From } f} + \underbrace{\lambda^T (h_x \xt + h_{\dot{x}} \dot{\xt} + h_\theta)}_{\text{From } \lambda^Th} \right) dt + \underbrace{\mu^T (g_{x} \xt |_0 + g_\theta)}_{\text{From } \mu^Tg} $$

Integrate by parts for $\dot{\xt}$, $$ \int_0^T \lambda^T h_{\dot{x}} \dot{\xt} dt = \lambda^T h_{\dot{x}} \xt |_0^T - \int_0^T \left(\dot{\lambda}^T h_{\dot{x}} + \lambda^T \dot{h_{\dot{x}}}\right) \xt dt $$

Note $$ \dot{h_{\dot{x}}}\equiv\ddf{}{t}\left( h_{\dot{x}} \right) $$

Rearranging $$ \ddf{\cL}{\theta} = \int_0^T \left[f_\theta + \lambda^T h_\theta + \left(f_x + \lambda^T h_x - \dot{\lambda}^T h_{\dot{x}} - \lambda^T \dot{h_{\dot{x}}}\right) \xt \right] dt + \lambda^T h_{\dot{x}} \xt |_T + \left( \mu^T g_{x} - \lambda^T h_{\dot{x}} \right) \xt |_0 + \mu^T g_\theta $$

Choose $\lambda$ such that all terms having $\xt$ are eliminated.

The adjoint equation: $$ f_x + \lambda^T h_x - \dot{\lambda}^T h_{\dot{x}} - \lambda^T \dot{h_{\dot{x}}} = 0 $$

BC at $t=T$: $$ \lambda(T) = 0 $$

BC at $t=0$: $$ \mu^T g_{x} - \lambda(0)^T h_{\dot{x}} = 0 $$

Define $\tau=T-t$, we can solve an initial value problem, $$ h_{\dot{x}}^T \ddf{\lambda}{\tau} + (h_x-\dot{h_{\dot{x}}})^T\lambda + f_x^T = 0,\quad \lambda(\tau=0)=0 $$ and compute $$ \mu = g_x^{-T}h_{\dot{x}}^T\lambda \quad\text{ at }\quad \tau=T $$

Then gradients $$ \ddf{\cL}{\theta} = \int_0^T \left( f_\theta + \lambda^T h_\theta \right) dt + \mu^T g_\theta $$

Forward Sensitivity (Optional)¶

When $\mathrm{dim}(\theta)$ is manageable (e.g., $\ll\mathrm{dim}(x)$), it is actually better to work with $\xt$.

Note that the nonlinear dynamics can be rewritten as $$ x(t) = x_0 + \int_0^t \eta(x,u) ds $$

Taking derivative $$ \xt(t) = {\xt}_{,0} + \int_0^t \eta_x(s) \xt(s) ds $$

Define $\psi(t) = \xt(t)$ and $A(t)=\eta_x(t)$ $$ \psi(t) = \psi_0 + \int_0^t A(s) \psi(s) ds $$

This corresponds to the forward sensitivity equation $$ \dot{\psi} = A(t)\psi,\quad \psi(0)=\psi_0 $$

Then the gradient can be computed as $$ \ddf{F}{\theta} = \int_0^T f_x \psi + f_\theta dt $$

In state estimation and adaptive control, the forward sensitivity is used more often

$\psi$ evolves forward in time, as the original dynamics
One can solve the two dynamics in parallel and update sensitivity in real time
... and update states/parameters in real time when needed

Example: A Linear Case - Derivations¶

Consider the special case for a linear time invariant (LTI) system with $D=0$ $$ \begin{align} \dot{x} &= A(\theta)x+B(\theta)u \\ y &= C(\theta)x \end{align} $$ with $$ x(0) = x_0(\theta) $$

Then, if we consider a given input $u(t)$, using general notation $$ h(x,\dot{x},t;\theta) = \dot{x} - A(\theta)x - B(\theta)u(t) $$ and $$ g(x(0),\theta)=x(0)-x_0(\theta) $$

Furthermore, suppose we measure the output $\hat{y}(t)$, then a typical objective functions is $$ F(x;\theta) = \frac{1}{2}\int_0^T \norm{\hat{y}(t) - C(\theta)x}^2 dt $$ so, in general notation $$ f(x,t;\theta) = \frac{1}{2}\norm{\hat{y}(t) - C(\theta)x}^2 $$

Then we can write down our adjoint equation.

Knowing $$ h_{\dot{x}} = I,\quad h_x=-A,\quad \dot{h_{\dot{x}}} = 0,\quad f_x = -(\hat{y}(t) - Cx)^TC \equiv -e^TC $$

We get $$ \ddf{\lambda}{\tau} = A^T\lambda + C^Te,\quad \lambda(\tau=0)=0 $$ and $$ \mu = \lambda(T) $$

If you have seen the dual of a LTI system, here is one interpretation of the dual.

Lastly, knowing $$ f_\theta = -e^T(Cx)_\theta,\quad h_\theta = - (Ax)_\theta - (Bu)_\theta,\quad g_\theta = -x_{0,\theta} $$

We can compute the gradients by $$ \ddf{\cL}{\theta} = - \int_0^T \left( e^T(Cx)_\theta + \lambda^T (Ax)_\theta + \lambda^T (Bu)_\theta \right) dt - \mu^T x_{0,\theta} $$

Consider a special case, where we want to infer $x_0$ from inputs and outputs for a known LTI system.

Then $\theta=x_0$.

The adjoint solution is $$ \lambda(T) = \int_0^T \exp(A^T (T-s))C^T e(s) ds = \mu $$

The gradient is $$ \ddf{\cL}{\theta} = -\mu^T $$

Example: A Linear Case - Computations¶

We implement everything using only the SciPy packages.

Consider a parametrized LTI dynamics $$ \begin{align} \begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \end{bmatrix} &= \begin{bmatrix} -1 & \theta_1 \\ -2 & \theta_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \end{bmatrix} u(t) \\ y &= \begin{bmatrix} -2 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \end{align} $$

We generate observations $\hat{y}(t)$ from $u(t)=\sin(4t)/(t+1)$ and $x(0)=[1,0]^T$.

We implement the dynamics in a class. Key methods:

    def _prime(self, t, x):   # Original/prime dynamics
        return self._A.dot(x) + self._B.dot(self._u(t))

    def _observe(self, t, x): # Observation function
        return self._C.dot(x)

    def _adjoint(self, t, l): # Adjoint dynamics
        # Here the gradients are computed by hand
        # For NN's, we would use torch, TF, etc.
        return self._A.T.dot(l) + self._C.dot(self._e(t))

and the sensitivity $h_\theta$

    def _sens_f(self, t, x):
        # Assuming x is a time sequence
        df = np.zeros((self._nx,2,len(x[0])))
        df[0,0] = x[1]   # Here again the gradients are computed by hand
        df[1,1] = x[1]   # For NN's, we would use torch, TF, etc.
        return df

Then we will implement a neural ODE class, that is agnostic of the particular dynamics.

The forward calculation is simple,

    def forward(self, us, ys):   # The main forward function
        # Note here we store the intermediate results, x
        # which are needed for backward calculation
        fu = spp.interp1d(self._ts, us, kind='cubic', fill_value='extrapolate')
        _, self._xp, self._yp = self._forward_dyn(fu)  # Solve the dynamics
        _e = self._forward_obj(ys, self._yp)     # Evaluate the error
        return _e

    def _forward_dyn(self, fu):  # Forward Step 1
        self._mdl.set_u(fu)
        _s = spi.solve_ivp(self._mdl._prime, self._rng, self._mdl._x0, t_eval=self._ts)
        xs = _s.y
        ys = self._mdl._observe(self._ts, xs)
        us = fu(self._ts)  # Mainly for sanity checks
        return us, xs, ys

    def _forward_obj(self, data, pred):  # Forward Step 2
        # Note here we store the intermediate results, x
        self._es = data-pred
        return spi.simpson(self._es**2, x=self._ts)/2

Then the backward:

    def backward(self):   # The main backward function
        _fx = self._backward_obj()
        self._mdl.set_e(_fx)
        grad = self._backward_dyn()
        return grad

    def _backward_obj(self):   # Backward Step 1
        fe = spp.interp1d(self._ts, self._es[::-1], kind='cubic', fill_value='extrapolate')
        return fe

    def _backward_dyn(self):   # Backward Step 2
        # Note here we solve the adjoint dynamics
        l0 = np.zeros(self._mdl._nx,)
        _s = spi.solve_ivp(self._mdl._adjoint, self._rng, l0, t_eval=self._ts)
        ls = _s.y[:,::-1]
        # Then we compute (\lambda^T f_\theta)
        df = self._mdl._sens_f(self._ts, self._xp)
        tmp = np.einsum('ik,ijk->jk', ls, df)
        grad = -spi.simpson(tmp, x=self._ts).squeeze()
        return grad

In [5]:

f, ax = plt.subplots(nrows=3, sharex=True, figsize=(12,5))
ax[0].plot(TS, us); ax[0].set_title('Data - Input')
ax[1].plot(TS, xs[0])
ax[1].plot(TS, xs[1]); ax[1].set_title('States')
ax[2].plot(TS, ys); ax[2].set_title('Data - Observation'); ax[2].set_xlabel('Time');

Example: A Linear Case - Results¶

In [6]:

theta1 = np.array([-1.5, -1.5])
node.set_theta(theta1)
err = node.forward(us, ys)
grad_ad = node.backward()     # Adjoint-based gradients

In [8]:

tmp = lambda theta: node(fu, ys, theta)[1]
eps = 1e-3
grad_fd = central_diff(tmp, theta1, eps)
node.set_theta(theta1)

print(grad_ad)     # Gradients by adjoint
print(grad_fd)     # Gradients by finite difference

[-91.43291426  64.77020998]
[-91.44088163  64.75830491]

In [10]:

f = plt.figure(figsize=(8,5))  # Optimization history
for _t in hst[::2]:
    _y, _e = node(fu, ys, _t)
    plt.plot(TS, _y, label=f"Error = {_e:5.4e}")
plt.plot(TS, ys, 'k--', label="Truth")
plt.legend(); plt.xlabel('t'); plt.ylabel('y');

In [14]:

f, ax = plt.subplots(ncols=len(ss), sharey=True, figsize=(12,4))
for _i, _t in enumerate(hst):   # Effect of noise
    _y, _e = node(fu, ys, _t)   # Works even in the last case
    ax[_i].plot(TS, _y, 'b-')
    ax[_i].plot(TS, ys+noi[_i], 'k--', linewidth=0.5)
    ax[_i].set_xlabel('t'); ax[_i].set_title(f"Noise: {ss[_i]:4.3f}")
ax[0].set_ylabel('y');