A typical snippet of code for training a model:
for sample, target in zip(data, targets):  # Loop over the samples
    optimizer.zero_grad()                  # [1] ???
    output = model(sample)                 # Model prediction given current parameters
    loss = (output - target) ** 2          # Compute the loss
    loss.backward()                        # [2] ???
    optimizer.step()                       # Given the gradients, update the parameters
Apparently [1] and [2] have something to do with the gradient computation. But how?
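For concreteness, here is a minimal runnable version of the snippet above; PyTorch is assumed, and the model, data, and targets are made-up placeholders:

import torch

model = torch.nn.Linear(1, 1)                              # a tiny toy model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data    = [torch.tensor([float(x)]) for x in range(3)]     # made-up samples
targets = [torch.tensor([2.0*x + 1.0]) for x in range(3)]  # made-up targets

for sample, target in zip(data, targets):   # Loop over the samples
    optimizer.zero_grad()                   # [1] ???
    output = model(sample)                  # Model prediction given current parameters
    loss = ((output - target) ** 2).sum()   # Compute the (scalar) loss
    loss.backward()                         # [2] ???
    optimizer.step()                        # Given the gradients, update the parameters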
Define a function $z_{20}(x)$ through the recursion $$ z_{n+1} = \sin(z_n x) + x,\quad z_0=0 $$
import numpy as np

def func(x):
    _z = 0.0
    for _i in range(20):
        _z = np.sin(x*_z) + x
    return _z
import matplotlib.pyplot as plt

f = plt.figure(figsize=(6,4))
xx = np.linspace(0.0, 1.5, 151)
_=plt.plot(xx, func(xx))
Even though the definition looks innocent, writing out its derivative symbolically gives a very long expression. What about more complex models?
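To see the blow-up, here is a quick sympy check, not part of the original notes, on a shortened version of the recursion; the symbolic derivative is already much larger than the function itself after only a few iterations:

import sympy as sp

x = sp.symbols('x')
z = sp.Integer(0)
for _ in range(5):                           # only 5 of the 20 iterations
    z = sp.sin(x*z) + x
dz = sp.diff(z, x)                           # symbolic derivative of z_5
print(sp.count_ops(z), sp.count_ops(dz))     # the derivative has far more operations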
One could instead differentiate numerically, for example with the central difference: $$ f'(x) = \frac{f(x+h)-f(x-h)}{2h} + O(h^2) $$
Issues:
- The truncation error is $O(h^2)$, which favors a small $h$...
- ...but subtracting nearly equal numbers causes round-off (cancellation) error that grows as $h\to 0$, so the step size must be tuned.
If a real-valued function $f(x)$ is analytic, we can extend it by analytic continuation to the complex plane, $f(x+iy)$.
Then, to get the derivative at $x=x_0$, expand with a purely imaginary step: $$ f(x_0+ih) = f(x_0) + ih f'(x_0) - \frac{h^2}{2}f''(x_0) - i\frac{h^3}{6}f'''(x_0) $$
This gives us the complex step method: $$ f'(x_0) = \frac{\mathrm{imag}(f(x_0+ih))}{h} + O(h^2) $$
Pros:
- No subtraction of nearly equal numbers, hence no cancellation: $h$ can be taken extremely small (e.g. $10^{-20}$) and the result is accurate to machine precision.
- Only one (complex) function evaluation per derivative.
Cons:
- Requires $f$ to be analytic and the code to accept complex inputs (operations like abs or comparisons can break it).
- Gives only first derivatives of real-valued functions.
Ref: Cleve Moler's blog
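The plot below uses arrays ds (step sizes) and es (errors) computed in a cell not shown here; a minimal sketch of how they could be produced for func at a single point (the reference derivative is itself taken from a complex step far below the plotted range):

x0 = 1.0
ref = np.imag(func(x0 + 1e-100j)) / 1e-100     # reference derivative
ds = np.logspace(-1, -12, 45)                  # step sizes
es = np.empty((2, ds.size))
for k, h in enumerate(ds):
    es[0, k] = abs((func(x0 + h) - func(x0 - h)) / (2*h) - ref)   # central difference
    es[1, k] = abs(np.imag(func(x0 + 1j*h)) / h - ref)            # complex step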
f = plt.figure(figsize=(6,4))
plt.loglog(ds, es[0], 'b-', label='Central Diff.')
plt.loglog(ds, es[1], 'r--', label='Complex Step')
_=plt.legend()
_=plt.xlabel('Step size')
_=plt.ylabel('Error')
Or "Tangent" mode, think about the chain rule in differential form: $$ dy = \frac{dy}{dw_n}\left(\cdots\left(\frac{dw_2}{dw_1}\left(\frac{dw_1}{dx}dx\right)\right)\right) $$ for $y=w_n(\cdots (w_2(w_1(x))))$
Write out the first iteration step by step: $$ \begin{array}{cclcccl} x &=& ? \\ a &=& z_0 x \\ b &=& \sin(a) \\ z_1 &=& b+x \end{array} $$ Also try drawing it as a computational graph.
Then, starting from the input, differentiate each step: $$ \begin{array}{cclcccl} x &=& ? &\Rightarrow& dx &=& ? \\ a &=& z_0 x &\quad& da &=& z_0 dx \\ b &=& \sin(a) &\quad& db &=& \cos(a)da \\ z_1 &=& b+x &\quad& dz_1 &=& db+dx \end{array} $$
Let the seed $dx=1$, $$ dz_1 = \frac{dz_1}{dx} = \frac{db}{dx}+1 = z_0\cos(z_0x)+1 $$
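A quick numeric spot check of this one-step result, using a central difference and arbitrary made-up values for $z_0$ and $x$:

# Check dz1/dx = z0*cos(z0*x) + 1 for one iteration z1 = sin(z0*x) + x
z0, x0, h = 0.7, 1.1, 1e-6                         # arbitrary values
step = lambda z, x: np.sin(z*x) + x
print(z0*np.cos(z0*x0) + 1,
      (step(z0, x0+h) - step(z0, x0-h)) / (2*h))   # should agree to ~1e-9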
For multiple iterations: $$ \begin{array}{cclcccl} x &=& ? &\Rightarrow& dx &=& ? \\ z_0 &=& ? &\ & dz_0 &=& 0 \\ z_1 &=& \sin(z_0 x) + x &\ & dz_1 &=& x\cos(z_0 x)dz_0 + (z_0\cos(z_0 x) + 1)dx = dx \\ z_2 &=& \sin(z_1 x) + x &\ & dz_2 &=& x\cos(z_1 x)dz_1 + (z_1\cos(z_1 x) + 1)dx \\ \vdots &=& \vdots &\ & \vdots &=& \vdots \\ z_{n+1} &=& \sin(z_n x) + x &\ & dz_{n+1} &=& x\cos(z_n x)dz_n + (z_n\cos(z_n x) + 1)dx \end{array} $$
Let the seed $dx=1$ again and note that $dz_0=0$; the derivative can then be accumulated alongside the forward pass, i.e. during the function evaluation.
def df_fw(x):
    _dz, _dx = 0.0, 1.0          ##### NEW: Initialize the seed
    _z = 0.0                     # OLD code
    for _i in range(20):         # OLD code
        ##### NEW: Evaluate the differentials (this uses the z from the previous
        #####      iteration, so it must come before the update of _z)
        _dz = x*np.cos(x*_z)*_dz + (_z*np.cos(x*_z) + 1) * _dx
        _z = np.sin(x*_z) + x    # OLD code
    return _dz
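The comparison plots below also use df_ag (an Autograd reference) and df_cs (complex step), which are defined in cells not shown here. A minimal sketch of what they might look like; the autograd package and the name func_ag are assumptions:

import autograd.numpy as anp
from autograd import elementwise_grad

def func_ag(x):                    # same recursion, written with autograd's numpy
    _z = 0.0
    for _i in range(20):
        _z = anp.sin(x*_z) + x
    return _z

df_ag = elementwise_grad(func_ag)  # reverse-mode AD reference

def df_cs(x, h=1e-20):             # complex-step derivative of func
    return np.imag(func(x + 1j*h)) / h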
f = plt.figure(figsize=(6,4))
xx = np.linspace(0.0, 1.5, 201)
plt.plot(xx, df_ag(xx), 'r--', linewidth=4.0, label='Autograd')
plt.plot(xx, df_fw(xx), 'b-', label='Forward AD')
plt.plot(xx, df_cs(xx), 'k-.', label='Complex Step')
_=plt.legend()
Or "Adjoint" mode, think about the chain rule in operator form: $$ \frac{d}{dx} = \frac{dw_1}{dx}\left(\frac{dw_2}{dw_1}\left(\cdots\left(\frac{dy}{dw_{n}}\frac{d}{dy}\right)\right)\right) $$ for $y=w_n(\cdots (w_2(w_1(x))))$
Again, write out one iteration step by step: $$ \begin{array}{cclcccl} x &=& ? \\ a &=& z_0 x \\ b &=& \sin(a) \\ z_1 &=& b+x \end{array} $$ Recall the computational graph too.
This time we start from the output, i.e. the last step, for one iteration: $$ \begin{array}{cclcccl} x &=& ? &\quad& \dfrac{d}{dx} &=& \dfrac{da}{dx}\dfrac{d}{da} + \dfrac{dz_1}{dx}\dfrac{d}{dz_1} = z_0\dfrac{d}{da} + \dfrac{d}{dz_1} \\ a &=& z_0 x &\quad& \dfrac{d}{da} &=& \dfrac{db}{da}\dfrac{d}{db} = \cos(a)\dfrac{d}{db} \\ b &=& \sin(a) &\quad& \dfrac{d}{db} &=& \dfrac{dz_1}{db}\dfrac{d}{dz_1} = \dfrac{d}{dz_1} \\ z_1 &=& b+x &\Rightarrow& \dfrac{d}{dz_1} &=& ? \end{array} $$
Now we know the operator, $$ \dfrac{d}{dx} = z_0\dfrac{d}{da} + \dfrac{d}{dz_1} = z_0\cos(a)\dfrac{d}{dz_1} + \dfrac{d}{dz_1} $$
And we can apply $\dfrac{d}{dx}$ to $z_1$, and note that $\dfrac{dz_1}{dz_1}=1$ $$ \frac{dz_1}{dx} = z_0\cos(z_0x)+1 $$ This is as if we set a seed $dz_1=1$.
For multiple iterations: $$ \begin{array}{cclcccl} x &=& ? &\quad& \dfrac{d}{dx} &=& \dfrac{dz_1}{dx}\dfrac{d}{dz_1} + \cdots + \dfrac{dz_{n+1}}{dx}\dfrac{d}{dz_{n+1}} \\ z_0 &=& ? &\quad& \dfrac{d}{dz_0} &=& \dfrac{dz_1}{dz_0}\dfrac{d}{dz_1} = x\cos(z_0 x)\dfrac{d}{dz_1}\ \text{(not needed, since } z_0 \text{ is a constant)} \\ z_1 &=& \sin(z_0 x) + x &\quad& \dfrac{d}{dz_1} &=& \dfrac{dz_2}{dz_1}\dfrac{d}{dz_2} \\ \vdots &=& \vdots &\quad& \vdots &=& \vdots \\ z_n &=& \sin(z_{n-1} x) + x &\quad& \dfrac{d}{dz_n} &=& \dfrac{dz_{n+1}}{dz_n}\dfrac{d}{dz_{n+1}} = x\cos(z_n x)\dfrac{d}{dz_{n+1}} \\ z_{n+1} &=& \sin(z_n x) + x &\Rightarrow& \dfrac{d}{dz_{n+1}} &=& ? \end{array} $$
Let the seed $dz_{n+1}=1$; the gradient is then found by traversing the steps in the reverse order of the forward pass.
def df_rv(x):
    _z = 0.0                         # OLD code
    _s = [_z]                        ##### NEW: A list to store the intermediate
                                     #####      z's in the iteration
    for _i in range(20):             # OLD code
        _z = np.sin(x*_z) + x        # OLD code
        _s.append(_z)                ##### NEW: store the intermediate z's
    ##### NEW: Below is an entirely new "reverse" pass
    _dz = 1.0                        # Initialize the seed
    _dx = 0.0
    for _i in range(-1, -21, -1):    # Go through the loop in reverse
        _z = _s[_i-1]                # the z entering this iteration
        _dx = _dx + (_z*np.cos(_z*x) + 1) * _dz
        _dz = (x*np.cos(_z*x)) * _dz
    return _dx
f = plt.figure(figsize=(6,4))
xx = np.linspace(0.0, 1.5, 201)
plt.plot(xx, df_ag(xx), 'r--', linewidth=4.0, label='Autograd')
plt.plot(xx, df_rv(xx), 'b-', label='Reverse AD')
plt.plot(xx, df_cs(xx), 'k-.', label='Complex Step')
_=plt.legend()
Given a chain of functions, $\vx\in\bR^N$, $\vy\in\bR^M$ $$ \vy = \vF_n(\vF_{n-1}(\cdots \vF_2(\vF_1(\vx)))) $$
The Jacobian is $$ \vJ = \ppf{\vy}{\vx} = \ppf{\vF_n}{\vF_{n-1}}\ppf{\vF_{n-1}}{\vF_{n-2}} \cdots \ppf{\vF_2}{\vF_{1}} \ppf{\vF_1}{\vx} $$
Next we consider products of the Jacobian with vectors; we want to avoid multiplying the Jacobian matrices themselves together, because that is expensive.
Consider a Jacobian-Vector Product (JVP) $$ \vJ d\vx = \ppf{\vy}{\vx}d\vx = \left(\ppf{\vF_n}{\vF_{n-1}} \cdots \left(\ppf{\vF_2}{\vF_{1}} \left(\ppf{\vF_1}{\vx}d\vx \right)\right)\right) $$ Replace $d\vx=\ve_i$, where $\ve_i\in\bR^N$, the $i$th column of an identity matrix, $$ \vJ\ve_i = \left(\ppf{\vF_n}{\vF_{n-1}} \cdots \left(\ppf{\vF_2}{\vF_{1}} \left(\ppf{\vF_1}{\vx}\ve_i \right)\right)\right) $$ Then $\ve_i$ is like the seed, and one gets the $i$th column of $\vJ$, $$ \vJ\ve_i = \begin{bmatrix} \ppf{y_1}{x_i} \\ \ppf{y_2}{x_i} \\ \vdots \\ \ppf{y_M}{x_i} \end{bmatrix} $$ This is the Forward mode, good for fewer inputs than outputs.
Consider a Vector-Jacobian Product (VJP) $$ d\vy^T\vJ = d\vy^T\ppf{\vy}{\vx} = \left(\left(\left( d\vy^T\ppf{\vF_n}{\vF_{n-1}}\right) \ppf{\vF_{n-1}}{\vF_{n-2}}\right) \cdots \ppf{\vF_1}{\vx} \right) $$ Replace $d\vy=\ve_j$, where $\ve_j\in\bR^M$, the $j$th column of an identity matrix, $$ \ve_j^T\vJ = \left(\left(\left( \ve_j^T\ppf{\vF_n}{\vF_{n-1}}\right) \ppf{\vF_{n-1}}{\vF_{n-2}}\right) \cdots \ppf{\vF_1}{\vx} \right) $$ Then $\ve_j$ is like the seed, and one gets the $j$th row of $\vJ$, $$ \ve_j^T\vJ = \begin{bmatrix} \ppf{y_j}{x_1} & \ppf{y_j}{x_2} & \cdots & \ppf{y_j}{x_N} \end{bmatrix} $$ This is the Reverse mode, good for more inputs than outputs.
Forward mode: the derivative is carried along with the function evaluation, one sweep per input direction, so the full Jacobian costs $N$ forward passes; no intermediate values need to be stored.
Reverse mode: one backward sweep per output component, so the gradient of a scalar output (e.g. a loss) costs a single pass regardless of $N$; the intermediates of the forward pass must be stored (cf. the list _s above).
There is also a Mixed mode, which combines forward and reverse sweeps.
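To make the two products concrete, here is a small numpy sketch on a made-up two-step chain $\vF_2(\vF_1(\vx))$ with hand-coded Jacobians: pushing a basis vector through from the input side yields a column of $\vJ$, pulling one through from the output side yields a row.

# A made-up two-step chain: F1: R^3 -> R^4, F2: R^4 -> R^2
A = np.arange(12.0).reshape(4, 3) / 10.0     # F1(x) = tanh(A @ x)
B = np.arange(8.0).reshape(2, 4) / 10.0      # F2(u) = B @ u**2

x = np.array([0.1, -0.2, 0.3])
u = np.tanh(A @ x)                           # intermediate value from the forward pass

J1 = (1 - u**2)[:, None] * A                 # dF1/dx, shape (4, 3)
J2 = B * (2*u)[None, :]                      # dF2/du, shape (2, 4)
J  = J2 @ J1                                 # full Jacobian, formed only for checking

# Forward mode (JVP): seed e_i on the input side -> column i of J
e0 = np.array([1.0, 0.0, 0.0])
print(np.allclose(J2 @ (J1 @ e0), J[:, 0]))  # True, using only matrix-vector products

# Reverse mode (VJP): seed e_j on the output side -> row j of J
e1 = np.array([0.0, 1.0])
print(np.allclose((e1 @ J2) @ J1, J[1, :]))  # True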
det, log, exp, etc.
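If a building block like this has no rule in the AD system, its known derivative can be registered as a custom primitive. A minimal sketch using autograd's extension API (autograd is an assumption here, as in the comparison plots; the example uses the standard gradient formula $\partial\log\det A/\partial A = A^{-T}$):

from autograd import grad
from autograd.extend import primitive, defvjp

@primitive
def logdet(A):
    return np.linalg.slogdet(A)[1]       # log|det A|, computed outside the tracer

# Register the reverse-mode (VJP) rule:  d logdet(A) / dA = A^{-T}
defvjp(logdet, lambda ans, A: lambda g: g * np.linalg.inv(A).T)

A = np.array([[2.0, 1.0], [1.0, 3.0]])
print(grad(logdet)(A))                   # equals inv(A).T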
Implicitly defined steps are not unusual - e.g. a Newton solve for $x$ in $f(x,b)=x^TAx-b=0$, given $b$:
x = minimize(f, x0=init_guess)   # x comes out of an iterative solver, treated as a black box
Use the implicit function theorem: $$ \frac{\partial x}{\partial b} = -\left(\frac{\partial f}{\partial x}\right)^{-1} \frac{\partial f}{\partial b} $$
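A scalar sketch of this idea (all values are made up, and brentq merely stands in for a black-box Newton/iterative solve), checking the implicit-function-theorem derivative against a finite difference:

from scipy.optimize import brentq

a, b = 2.0, 3.0
f = lambda x, b: a*x**2 - b                              # f(x, b) = 0 implicitly defines x(b)
solve = lambda b: brentq(lambda x: f(x, b), 0.0, 10.0)   # black-box solver for x
x = solve(b)

dfdx = 2*a*x                                    # partial f / partial x
dfdb = -1.0                                     # partial f / partial b
dxdb_ift = -dfdb / dfdx                         # implicit function theorem
dxdb_fd  = (solve(b + 1e-6) - x) / 1e-6         # finite-difference check
print(dxdb_ift, dxdb_fd)                        # agree to ~1e-6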
Why Reverse instead of Forward?