The FAQ basically says that we can design a parametrized model with any architecture we like.
But we would want the family of models to be compatible with the problem at hand.
For example, the model should incorporate whatever prior knowledge, or bias, we have about the problem.
There are (at least) four types of biases: observational, inductive, learning, and physics.
Refs: Battaglia2018, Karniadakis2021
To make it more tangible, suppose we want to fit a simple model for a nonlinear spring $$ F = k_1 x + k_2 x^3 $$ given data $(x_i, F_i)$; a fitting sketch follows below.
Cons: higher training cost for a larger dataset; only an approximation
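As a minimal sketch of this spring-fitting example (assuming we hard-code the model form $F = k_1 x + k_2 x^3$ as an inductive bias and fit only $k_1,k_2$; the data below are synthetic), ordinary least squares already does the job:

```python
import numpy as np

# Synthetic data for the nonlinear spring F = k1*x + k2*x^3 (true k1 = 2.0, k2 = 0.5).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
F = 2.0 * x + 0.5 * x**3 + 0.01 * rng.standard_normal(x.size)

# Hard-coding the model form is an inductive bias: the fit reduces to
# linear least squares in the parameters (k1, k2).
A = np.column_stack([x, x**3])                    # design matrix [x, x^3]
(k1, k2), *_ = np.linalg.lstsq(A, F, rcond=None)
print(f"k1 ~ {k1:.3f}, k2 ~ {k2:.3f}")
```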
One more example: "physics-informed" neural network
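A toy sketch of the physics-informed idea applied to the spring (only a caricature of full PINNs, which enforce differential equations via automatic differentiation; the network size, collocation points, penalty weight, and training schedule below are arbitrary assumptions): a surrogate $F_\theta(x)$ and the unknown coefficients $k_1,k_2$ are trained jointly, with a penalty that keeps the surrogate consistent with the assumed physical form.

```python
import torch
import torch.nn as nn

# Surrogate F_theta(x) for the spring force; the architecture is an illustrative choice.
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
# Treat the physical parameters as trainable, to be identified from data.
k1 = nn.Parameter(torch.tensor(0.0))
k2 = nn.Parameter(torch.tensor(0.0))
opt = torch.optim.Adam(list(net.parameters()) + [k1, k2], lr=1e-3)

# Synthetic measurements generated from "true" values k1 = 2.0, k2 = 0.5.
x_data = torch.linspace(-1, 1, 20).unsqueeze(1)
F_data = 2.0 * x_data + 0.5 * x_data**3 + 0.01 * torch.randn_like(x_data)
x_col = torch.linspace(-1.5, 1.5, 100).unsqueeze(1)   # collocation points for the physics term

for step in range(3000):
    opt.zero_grad()
    loss_data = ((net(x_data) - F_data) ** 2).mean()          # fit the observations
    residual = net(x_col) - (k1 * x_col + k2 * x_col**3)      # deviation from the assumed form
    loss = loss_data + 1.0 * residual.pow(2).mean()           # penalty weight 1.0 is arbitrary
    loss.backward()
    opt.step()

print(k1.item(), k2.item())   # should move toward the true values 2.0 and 0.5
```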
More on these models in this module.
The architectures discussed next are mainly examples of inductive bias.
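As one concrete illustration of such an inductive bias: a convolution layer assumes locality and translation-invariant weight sharing, so its parameter count is independent of the image size. A rough PyTorch comparison (the layer sizes below are arbitrary):

```python
import torch.nn as nn

# A 3x3 convolution applies the same small kernel everywhere in the image
# (locality + weight sharing), so its parameter count does not grow with image size.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
print(sum(p.numel() for p in conv.parameters()))   # 16*3*3*3 + 16 = 448

# A fully connected layer on the same 32x32 RGB image has no such bias
# and needs orders of magnitude more parameters.
fc = nn.Linear(3 * 32 * 32, 16 * 32 * 32)
print(sum(p.numel() for p in fc.parameters()))     # ~50 million
```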
An online interactive CNN visualization example:
But not all data reside in a regular array-like space. In particular, networks (graphs) have irregular, non-grid connectivity.
Examples of networks, or graph-based data structures:
Generalized convolution
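A minimal sketch of one common form of generalized convolution on a graph, the GCN update $H' = \sigma(\hat D^{-1/2}\hat A\hat D^{-1/2} H W)$ of Kipf & Welling, where $\hat A = A + I$ (the tiny graph and feature sizes below are made up for illustration):

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN-style graph convolution: relu(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])                 # add self-loops
    d = A_hat.sum(axis=1)                          # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Tiny example graph: 4 nodes, 3 features per node (all values illustrative).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
H = np.random.randn(4, 3)        # node features
W = np.random.randn(3, 8)        # learnable weights, shared across all nodes
print(gcn_layer(A, H, W).shape)  # (4, 8)
```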
(Figure from Alex Graves)
Mathematical details: Let $\norm{\ppf{\vh_{\tau+1}}{\vh_\tau}}\approx \alpha$ for all $\tau$. Then
$$ \begin{align*} \norm{\ppf{\cL}{\vh_t}} &\propto \norm{\ppf{\vh_{t+1}}{\vh_t}\ppf{\cL}{\vh_{t+1}}} \propto \norm{ \left(\prod_{\tau=t}^{T-1}\ppf{\vh_{\tau+1}}{\vh_\tau}\right) \ppf{\cL}{\vh_T}} \leq \prod_{\tau=t}^{T-1}\norm{\ppf{\vh_{\tau+1}}{\vh_\tau}} \norm{\ppf{\cL}{\vh_T}} \\ &\Rightarrow \norm{\ppf{\cL}{\vh_t}} \approx \alpha^{T-t} \norm{\ppf{\cL}{\vh_T}} \end{align*} $$
At very old steps, i.e. when $T\gg t$, the factor $\alpha^{T-t}$ makes the gradient vanish (if $\alpha<1$) or explode (if $\alpha>1$), so distant time steps barely influence the update.
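A quick numerical check of this scaling, using random orthogonal matrices rescaled by $\alpha$ as stand-ins for the Jacobians $\ppf{\vh_{\tau+1}}{\vh_\tau}$:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 16
for alpha in (0.9, 1.1):
    g = rng.standard_normal(d)                      # plays the role of dL/dh_T
    for _ in range(T):
        Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        J = alpha * Q                               # stand-in Jacobian with ||J h|| = alpha ||h||
        g = J.T @ g                                 # back-propagate one step
    print(alpha, np.linalg.norm(g))                 # scales like alpha^T: vanishes or explodes
```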
The vanishing gradient problem turns out to be universal in deep learning, e.g. in CNN architectures.
This leads to the skip connection technique (which is now standard) and the Residual Network (ResNet).
Ref: He et al., Deep Residual Learning for Image Recognition, arXiv:1512.03385 (200k+ citations as of Mar. 2024...)
The ResNet essentially makes the following change: $$ \vx^{(j+1)} = \vF(\vx^{(j)}) \quad\Rightarrow\quad \vx^{(j+1)} = \vx^{(j)} + \vF(\vx^{(j)}) $$ to provide a "bypass" for the back-propagation.
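A minimal sketch of such a residual block, $\vx^{(j+1)} = \vx^{(j)} + \vF(\vx^{(j)})$ (the inner network $\vF$ here is an arbitrary small MLP; real ResNets use convolutional blocks):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """x_{j+1} = x_j + F(x_j): the identity bypass lets gradients flow through unchanged."""
    def __init__(self, dim):
        super().__init__()
        self.F = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.F(x)    # skip connection

x = torch.randn(4, 32)
block = ResidualBlock(32)
print(block(x).shape)           # torch.Size([4, 32])
```

Because $\ppf{\vx^{(j+1)}}{\vx^{(j)}} = I + \ppf{\vF}{\vx^{(j)}}$, back-propagation always has an identity path through each block, which is exactly the "bypass" mentioned above.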
Recall for a first-order ordinary differential equation (ODE) $$ \dot\vx = \vf(\vx),\quad \vx(0)=\vx_0,\quad t\in[0,T] $$ the forward Euler method with step size $\Delta t$ and $\vx^{(j)}\approx\vx(j\Delta t)$ reads $$ \vx^{(j+1)} = \vx^{(j)} + \Delta t\,\vf(\vx^{(j)}) $$ Then $\Delta t\,\vf(\vx^{(j)})$ plays exactly the role of the ResNet block $\vF(\vx^{(j)})$ on the previous slide!
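A small sketch to make the analogy explicit (the right-hand side here is an arbitrary example, $\dot x = -x$):

```python
import numpy as np

def f(x):
    return -x          # example right-hand side of dx/dt = f(x)

T, dt = 1.0, 0.01
x = np.array([1.0])    # x(0) = x_0
for _ in range(int(T / dt)):
    x = x + dt * f(x)  # forward Euler: same form as the ResNet update x + F(x)
print(x, np.exp(-T))   # numerical vs exact solution of dx/dt = -x
```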
But ...
There are of course many ... For example:
And in the bigger picture: