Consider nonlinear dynamics with observation $$ \begin{align} \dot{x} &= \eta(x,u;\theta) \\ y &= \phi(x;\theta) \\ x(0) &= x_0(\theta) \end{align} $$
Given input-output data $\{t_i, u(t_i),\hat{y}(t_i)\}_{i=0}^T$, we want to estimate $\theta$.
$u(t)$ can be ignored if we consider autonomous dynamics.
Difference with SINDy:
Here we introduce the Neural ODE approach.
Effectively, we need to solve $$ \begin{align} \min_\theta &\ \int_0^T f(x, t; \theta) dt \\ \mathrm{s.t.} &\ \dot{x} = \eta(x,u;\theta),\quad x(0) = x_0(\theta) \end{align} $$
... where in the objective
We have already seen something similar before
Model Embedding
lecture, we solved a DE-constrained optimizationFollowing this approach, we would use an ODE solver, say Euler method (for simplicity) $$ \begin{align} \min_\theta &\ \sum_{i=1}^T f(x_i, t_i; \theta) \Delta t \\ \mathrm{s.t.} &\ \frac{x_i-x_{i-1}}{\Delta t} = \eta(x_i,u_i;\theta),\ i=1,\dots,T,\quad x(0) = x_0(\theta) \end{align} $$
Potential issues:
We take an alternative path
We first formalize the optimization problem in a general way $$ \begin{align*} \min_\theta &\ F(x;\theta) = \int_0^T f(x,t;\theta) dt \\ \mathrm{s.t.} &\ h(x,\dot{x},t;\theta)=0,\quad g(x(0),\theta)=0 \end{align*} $$
The form of $h$ allows for the treatment of differential-algebraic equations (DAEs)
Technically, we need $$ \ddf{F}{\theta} = \int_0^T f_x \xt + f_\theta dt $$ where subscripts indicate partial derivatives.
But $\xt$ is a nasty term, e.g.,
To eliminate $\xt$ ...
We take the usual drill of Lagrange multiplier $$ \cL = \int_0^T \left( f(x,t;\theta) + \lambda^T h(x,\dot{x},t;\theta) \right) dt + \mu^T g(x(0),\theta) $$
Taking gradients w.r.t. $\theta$ $$ \ddf{\cL}{\theta} = \int_0^T \left( \underbrace{f_x \xt + f_\theta}_{\text{From } f} + \underbrace{\lambda^T (h_x \xt + h_{\dot{x}} \dot{\xt} + h_\theta)}_{\text{From } \lambda^Th} \right) dt + \underbrace{\mu^T (g_{x} \xt |_0 + g_\theta)}_{\text{From } \mu^Tg} $$
Integrate by parts for $\dot{\xt}$, $$ \int_0^T \lambda^T h_{\dot{x}} \dot{\xt} dt = \lambda^T h_{\dot{x}} \xt |_0^T - \int_0^T \left(\dot{\lambda}^T h_{\dot{x}} + \lambda^T \dot{h_{\dot{x}}}\right) \xt dt $$
Note $$ \dot{h_{\dot{x}}}\equiv\ddf{}{t}\left( h_{\dot{x}} \right) $$
Rearranging $$ \ddf{\cL}{\theta} = \int_0^T \left[f_\theta + \lambda^T h_\theta + \left(f_x + \lambda^T h_x - \dot{\lambda}^T h_{\dot{x}} - \lambda^T \dot{h_{\dot{x}}}\right) \xt \right] dt + \lambda^T h_{\dot{x}} \xt |_T + \left( \mu^T g_{x} - \lambda^T h_{\dot{x}} \right) \xt |_0 + \mu^T g_\theta $$
Choose $\lambda$ such that all terms having $\xt$ are eliminated.
The adjoint equation: $$ f_x + \lambda^T h_x - \dot{\lambda}^T h_{\dot{x}} - \lambda^T \dot{h_{\dot{x}}} = 0 $$
BC at $t=T$: $$ \lambda(T) = 0 $$
BC at $t=0$: $$ \mu^T g_{x} - \lambda(0)^T h_{\dot{x}} = 0 $$
Define $\tau=T-t$, we can solve an initial value problem, $$ h_{\dot{x}}^T \ddf{\lambda}{\tau} + (h_x-\dot{h_{\dot{x}}})^T\lambda + f_x^T = 0,\quad \lambda(\tau=0)=0 $$ and compute $$ \mu = g_x^{-T}h_{\dot{x}}^T\lambda \quad\text{ at }\quad \tau=T $$
Then gradients $$ \ddf{\cL}{\theta} = \int_0^T \left( f_\theta + \lambda^T h_\theta \right) dt + \mu^T g_\theta $$
When $\mathrm{dim}(\theta)$ is manageable (e.g., $\ll\mathrm{dim}(x)$), it is actually better to work with $\xt$.
Note that the nonlinear dynamics can be rewritten as $$ x(t) = x_0 + \int_0^t \eta(x,u) ds $$
Taking derivative $$ \xt(t) = {\xt}_{,0} + \int_0^t \eta_x(s) \xt(s) ds $$
Define $\psi(t) = \xt(t)$ and $A(t)=\eta_x(t)$ $$ \psi(t) = \psi_0 + \int_0^t A(s) \psi(s) ds $$
This corresponds to the forward sensitivity equation $$ \dot{\psi} = A(t)\psi,\quad \psi(0)=\psi_0 $$
Then the gradient can be computed as $$ \ddf{F}{\theta} = \int_0^T f_x \psi + f_\theta dt $$
In state estimation and adaptive control, the forward sensitivity is used more often
Consider the special case for a linear time invariant (LTI) system with $D=0$ $$ \begin{align} \dot{x} &= A(\theta)x+B(\theta)u \\ y &= C(\theta)x \end{align} $$ with $$ x(0) = x_0(\theta) $$
Then, if we consider a given input $u(t)$, using general notation $$ h(x,\dot{x},t;\theta) = \dot{x} - A(\theta)x - B(\theta)u(t) $$ and $$ g(x(0),\theta)=x(0)-x_0(\theta) $$
Furthermore, suppose we measure the output $\hat{y}(t)$, then a typical objective functions is $$ F(x;\theta) = \frac{1}{2}\int_0^T \norm{\hat{y}(t) - C(\theta)x}^2 dt $$ so, in general notation $$ f(x,t;\theta) = \frac{1}{2}\norm{\hat{y}(t) - C(\theta)x}^2 $$
Then we can write down our adjoint equation.
Knowing $$ h_{\dot{x}} = I,\quad h_x=-A,\quad \dot{h_{\dot{x}}} = 0,\quad f_x = -(\hat{y}(t) - Cx)^TC \equiv -e^TC $$
We get $$ \ddf{\lambda}{\tau} = A^T\lambda + C^Te,\quad \lambda(\tau=0)=0 $$ and $$ \mu = \lambda(T) $$
If you have seen the dual of a LTI system, here is one interpretation of the dual.
Lastly, knowing $$ f_\theta = -e^T(Cx)_\theta,\quad h_\theta = - (Ax)_\theta - (Bu)_\theta,\quad g_\theta = -x_{0,\theta} $$
We can compute the gradients by $$ \ddf{\cL}{\theta} = - \int_0^T \left( e^T(Cx)_\theta + \lambda^T (Ax)_\theta + \lambda^T (Bu)_\theta \right) dt - \mu^T x_{0,\theta} $$
Consider a special case, where we want to infer $x_0$ from inputs and outputs for a known LTI system.
Then $\theta=x_0$.
The adjoint solution is $$ \lambda(T) = \int_0^T \exp(A^T (T-s))C^T e(s) ds = \mu $$
The gradient is $$ \ddf{\cL}{\theta} = -\mu^T $$
We implement everything using only the SciPy packages.
Consider a parametrized LTI dynamics $$ \begin{align} \begin{bmatrix} \dot{x}_1 \\ \dot{x}_2 \end{bmatrix} &= \begin{bmatrix} -1 & \theta_1 \\ -2 & \theta_2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \end{bmatrix} u(t) \\ y &= \begin{bmatrix} -2 & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \end{align} $$
We generate observations $\hat{y}(t)$ from $u(t)=\sin(4t)/(t+1)$ and $x(0)=[1,0]^T$.
We implement the dynamics in a class. Key methods:
def _prime(self, t, x): # Original/prime dynamics
return self._A.dot(x) + self._B.dot(self._u(t))
def _observe(self, t, x): # Observation function
return self._C.dot(x)
def _adjoint(self, t, l): # Adjoint dynamics
# Here the gradients are computed by hand
# For NN's, we would use torch, TF, etc.
return self._A.T.dot(l) + self._C.dot(self._e(t))
and the sensitivity $h_\theta$
def _sens_f(self, t, x):
# Assuming x is a time sequence
df = np.zeros((self._nx,2,len(x[0])))
df[0,0] = x[1] # Here again the gradients are computed by hand
df[1,1] = x[1] # For NN's, we would use torch, TF, etc.
return df
Then we will implement a neural ODE class, that is agnostic of the particular dynamics.
The forward calculation is simple,
def forward(self, us, ys): # The main forward function
# Note here we store the intermediate results, x
# which are needed for backward calculation
fu = spp.interp1d(self._ts, us, kind='cubic', fill_value='extrapolate')
_, self._xp, self._yp = self._forward_dyn(fu) # Solve the dynamics
_e = self._forward_obj(ys, self._yp) # Evaluate the error
return _e
def _forward_dyn(self, fu): # Forward Step 1
self._mdl.set_u(fu)
_s = spi.solve_ivp(self._mdl._prime, self._rng, self._mdl._x0, t_eval=self._ts)
xs = _s.y
ys = self._mdl._observe(self._ts, xs)
us = fu(self._ts) # Mainly for sanity checks
return us, xs, ys
def _forward_obj(self, data, pred): # Forward Step 2
# Note here we store the intermediate results, x
self._es = data-pred
return spi.simpson(self._es**2, x=self._ts)/2
Then the backward:
def backward(self): # The main backward function
_fx = self._backward_obj()
self._mdl.set_e(_fx)
grad = self._backward_dyn()
return grad
def _backward_obj(self): # Backward Step 1
fe = spp.interp1d(self._ts, self._es[::-1], kind='cubic', fill_value='extrapolate')
return fe
def _backward_dyn(self): # Backward Step 2
# Note here we solve the adjoint dynamics
l0 = np.zeros(self._mdl._nx,)
_s = spi.solve_ivp(self._mdl._adjoint, self._rng, l0, t_eval=self._ts)
ls = _s.y[:,::-1]
# Then we compute (\lambda^T f_\theta)
df = self._mdl._sens_f(self._ts, self._xp)
tmp = np.einsum('ik,ijk->jk', ls, df)
grad = -spi.simpson(tmp, x=self._ts).squeeze()
return grad
f, ax = plt.subplots(nrows=3, sharex=True, figsize=(12,5))
ax[0].plot(TS, us); ax[0].set_title('Data - Input')
ax[1].plot(TS, xs[0])
ax[1].plot(TS, xs[1]); ax[1].set_title('States')
ax[2].plot(TS, ys); ax[2].set_title('Data - Observation'); ax[2].set_xlabel('Time');
theta1 = np.array([-1.5, -1.5])
node.set_theta(theta1)
err = node.forward(us, ys)
grad_ad = node.backward() # Adjoint-based gradients
tmp = lambda theta: node(fu, ys, theta)[1]
eps = 1e-3
grad_fd = central_diff(tmp, theta1, eps)
node.set_theta(theta1)
print(grad_ad) # Gradients by adjoint
print(grad_fd) # Gradients by finite difference
[-91.43291426 64.77020998] [-91.44088163 64.75830491]
f = plt.figure(figsize=(8,5)) # Optimization history
for _t in hst[::2]:
_y, _e = node(fu, ys, _t)
plt.plot(TS, _y, label=f"Error = {_e:5.4e}")
plt.plot(TS, ys, 'k--', label="Truth")
plt.legend(); plt.xlabel('t'); plt.ylabel('y');
f, ax = plt.subplots(ncols=len(ss), sharey=True, figsize=(12,4))
for _i, _t in enumerate(hst): # Effect of noise
_y, _e = node(fu, ys, _t) # Works even in the last case
ax[_i].plot(TS, _y, 'b-')
ax[_i].plot(TS, ys+noi[_i], 'k--', linewidth=0.5)
ax[_i].set_xlabel('t'); ax[_i].set_title(f"Noise: {ss[_i]:4.3f}")
ax[0].set_ylabel('y');