Machine Learning for Engineering¶

Neural Networks: Basic concepts¶

Instructor: Daning Huang¶

TODAY: Neural Networks - I¶

  • Feature engineering
  • Forward propagation

References¶

  • PRML Chp. 5
  • Referenced papers

Feature Engineering¶

So far...¶

The regression models we have seen so far require the determination and selection of features, e.g.

  • Linear regression: $f(x) = \phi(x)^Tw$
    • One can "handcraft" arbitrarily many candidate features
    • And introduce regularization to "select" among them
  • Kernel regression: $f(x) = \kappa(x,x')^Ta$
    • Features are implicitly defined in the kernels
    • Non-parametric: heavy model
  • Gaussian process regression: $f(x) = m(x) + z(x,x')$
    • Matching all data samples by "brute force"
    • Non-parametric: heavy model (again)

Necessity for "clever" features¶

Canonical example: XOR function

  • The positive/negative examples are not linearly separable.
  • Need to map the input ($x_1,x_2$) to a feature space where examples are linearly separable.

(Figure from Raquel Urtasun & Rich Zemel)

Possible choice: $\phi(x_1, x_2) = \cos(\pi(x_1+x_2))$

so that $\phi(0,0)=\phi(1,1)=1$ and $\phi(0,1)=\phi(1,0)=-1$
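
A quick numerical check of this feature map (a minimal sketch; thresholding $\phi$ at zero is an assumed decision rule):

In [ ]:
import numpy as np

# XOR inputs and labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

# Feature map phi(x1, x2) = cos(pi * (x1 + x2))
phi = np.cos(np.pi * X.sum(axis=1))
print(phi)                    # [ 1. -1. -1.  1.]
print((phi < 0).astype(int))  # recovers the XOR labels [0 1 1 0]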

One more example from Aeroacoustics¶

(E. Greenwood, et al. JAHS 60, 022007 (2015))


Drawbacks of handcrafting features¶

  • Requires expert knowledge
  • Requires time-consuming hand-tuning

This is why people are interested in neural networks, which can learn useful features automatically from raw data.

Overview of Neural Networks¶

  • Input Layer: provides input
  • Hidden Layers: features extracted from input - there can be many
  • Output Layer: output of the network
  • Parameters (or weights) for each layer, $\theta^{(i)}$

Overview of Neural Networks¶

  • A loss function is defined over the output units and desired outputs (i.e., labels) $$ \mathcal{L}\left( \textbf{y}, \hat{\textbf{y}}; \theta \right) \mbox{ where } \hat{\textbf{y}}=f(\textbf{x};\theta) $$
  • The parameters of the network are trained to minimize the loss function using gradient-descent methods (a minimal sketch follows this list) $$ \min_{\theta} \left[ \mathcal{L}\left( \textbf{y}, \hat{\textbf{y}}; \theta \right) \right] $$
  • Forward Propagation (inference): Compute $\hat{\textbf{y}}=f(\textbf{x};\theta)$ (output given input)
  • Backward Propagation (learning): Compute $\nabla_{\theta}\mathcal{L}$ (gradient of loss w.r.t. parameters)
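
A minimal sketch of training by gradient descent, assuming a hypothetical one-parameter model $f(x;\theta)=\theta x$; the finite-difference gradient here is a stand-in for backpropagation, which is covered later:

In [ ]:
import numpy as np

# Toy data and a one-parameter model y_hat = f(x; theta) = theta * x
x, y = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])
loss = lambda th: np.mean((y - th * x)**2)  # squared-error loss L(y, y_hat; theta)

theta, lr, eps = 0.0, 0.05, 1e-6
for _ in range(100):
    grad = (loss(theta + eps) - loss(theta - eps)) / (2 * eps)  # finite-difference gradient
    theta -= lr * grad                                          # gradient-descent step
print(theta)  # approaches 2.0, the minimizer of the loss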

Focus of Today: Forward Propagation¶

Forward Propagation¶

  • The activation of each unit is computed from the previous layer and the parameters (or weights) associated with the edges (see the code sketch below) $$\underbrace{\textbf{h}^{(l)}}_{l\mbox{-th layer}}=f^{(l)}(\underbrace{\textbf{h}^{(l-1)}}_{(l-1)\mbox{-th layer}}; \underbrace{\theta^{(l)}}_{\mbox{weights}}) \mbox{ where } \textbf{h}^{(0)} \equiv \textbf{x}, \textbf{h}^{(L)} \equiv \hat{\textbf{y}}$$ $$\hat{\textbf{y}}=f(\textbf{x};\theta)=f^{(L)} \circ f^{(L-1)} \circ \cdots \circ f^{(2)} \circ f^{(1)}\left(\textbf{x} ; \theta^{(1)} \right) $$
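
A minimal numpy version of this composition (the layer sizes and the tanh activation are arbitrary, illustrative choices):

In [ ]:
import numpy as np

rng = np.random.default_rng(0)
sizes = [3, 4, 4, 1]  # x in R^3, two hidden layers, scalar output

# theta^(l) = (W, b) for each layer
theta = [(rng.standard_normal((n, m)), np.zeros(n))
         for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x, theta):
    h = x                            # h^(0) = x
    for l, (W, b) in enumerate(theta):
        h = W @ h + b                # linear part of f^(l)
        if l < len(theta) - 1:       # activation on hidden layers only
            h = np.tanh(h)
    return h                         # h^(L) = y_hat

y_hat = forward(rng.standard_normal(3), theta)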

Types of Layers: Linear¶

$$ h_i=\sum_{j}w_{ij}x_j + b_i $$

$$ \textbf{h} = \textbf{W}\textbf{x} + \textbf{b} $$

  • $\textbf{x} \in \mathbb{R}^m $ : Input, $\textbf{h} \in \mathbb{R}^n $ : Output
  • $\textbf{W} \in \mathbb{R}^{n \times m}$ : Weight, $\textbf{b} \in \mathbb{R}^{n}$ : Bias $\rightarrow$ parameter
  • Often called a "fully-connected layer"; a shape check in code follows
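
A sketch with arbitrary sizes $m=3$, $n=2$ and placeholder weight values:

In [ ]:
import numpy as np

m, n = 3, 2                         # input and output dimensions
x = np.ones(m)                      # x in R^m
W = np.arange(n * m).reshape(n, m)  # W in R^{n x m} (placeholder values)
b = np.zeros(n)                     # b in R^n

h = W @ x + b
print(h.shape)  # (2,)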

Types of Layers: Non-linear Activation Function¶

  • Applies a non-linear function to individual units.
  • There is no weight.
  • Allows neural networks to learn non-linear features.
  • e.g., Sigmoid, Hyperbolic Tangent (tanh), Rectified Linear Unit (ReLU)

Non-linear Activation: Sigmoid¶

$$ h_i=\sigma (x_i) = \frac{1}{1+\exp\left(-x_i\right)} $$

$$ \textbf{h} = \sigma \left(\textbf{x} \right) $$

In [2]:
import numpy as np
import matplotlib.pyplot as plt

xx = np.linspace(-5, 5, 100)
_ = plt.plot(xx, 1/(1 + np.exp(-xx)), '-g')  # sigmoid

Non-linear Activation: Hyperbolic Tangent (Tanh)¶

$$ h_i= \mbox{tanh}(x_i)=\frac{\exp(x_i)-\exp(-x_i)}{\exp(x_i)+\exp(-x_i)} $$

$$ \textbf{h} = \mbox{tanh} \left(\textbf{x} \right) $$

In [3]:
xx = np.linspace(-5, 5, 100)
_ = plt.plot(xx, (np.exp(xx) - np.exp(-xx))/(np.exp(xx) + np.exp(-xx)), '-g')  # equivalent to np.tanh(xx)

Non-linear Activation: Rectified Linear (ReLU)¶

$$ h_i= \mbox{ReLU}(x_i)=\max\left(x_i, 0 \right) $$

$$ \textbf{h} = \mbox{ReLU} \left(\textbf{x} \right) $$

Easier to optimize: the gradient is either 0 or 1, and it does not saturate for $x>0$.

In [14]:
xx = np.linspace(-5, 5, 101)
_ = plt.plot(xx, np.maximum(xx, 0), '-g')  # ReLU; np.maximum replaces the removed np.int cast

The ReLU family¶

  • Smoother gradients at $x=0$
  • Adding a contribution from the negative side
  • Balancing accuracy against computational cost (a few variants are sketched below)
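
The slide's figure is not reproduced here; a minimal plotting sketch, assuming Leaky ReLU, ELU, and Softplus as the example variants:

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

xx = np.linspace(-5, 5, 200)
plt.plot(xx, np.maximum(xx, 0), label='ReLU')
plt.plot(xx, np.where(xx > 0, xx, 0.1 * xx), label='Leaky ReLU (negative-side contribution)')
plt.plot(xx, np.where(xx > 0, xx, np.exp(xx) - 1), label='ELU (smoother at x=0)')
plt.plot(xx, np.log1p(np.exp(xx)), label='Softplus (smooth everywhere)')
_ = plt.legend()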

Flow chart for selecting ReLU-type activations¶

There are certainly more activation functions...¶

For example

  • $\exp(-x^2)$ makes the NN similar to GPR's, and belongs to the "Radial Basis Function NN" (RBFNN)
  • $\sin(x)$ can work amazingly well in the reconstruction of complex signals (and their derivatives) [1]; example below

[1] Sitzmann, Vincent, et al. Implicit neural representations with periodic activation functions, arXiv:2006.09661
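
A quick sketch of these two activation shapes (the signal-reconstruction example itself is not reproduced here):

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

xx = np.linspace(-5, 5, 200)
plt.plot(xx, np.exp(-xx**2), label='Gaussian (RBFNN-style)')
plt.plot(xx, np.sin(xx), label='Sine (SIREN-style)')
_ = plt.legend()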

Types of Layers: Softmax¶

$$ h_i = \frac{\exp(x_i)}{\sum_{j}\exp(x_j)} $$

$$ \textbf{h} = \mbox{Softmax}(\textbf{x}) $$
  • Note: $h_i \geq 0$ and $\sum_{i}h_i=1$
  • Useful for generating a multinomial distribution (classification); a minimal implementation follows
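
Subtracting the maximum before exponentiating is a standard numerical-stability trick; it cancels in the ratio, leaving the formula unchanged:

In [ ]:
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift to avoid overflow; cancels in the ratio
    return e / e.sum()

h = softmax(np.array([1.0, 2.0, 3.0]))
print(h, h.sum())  # non-negative entries that sum to 1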

Multi-layer Neural Network¶

  • Consists of multiple (linear + non-linear activation) layers.
  • Each layer learns non-linear features from its previous layer.
  • Often called a Multi-Layer Perceptron (MLP); a sketch combining the layer types above follows
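
A sketch of such a stack for classification, combining the linear, ReLU, and softmax layers from earlier (arbitrary sizes; the weights are random placeholders):

In [ ]:
import numpy as np

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((4, 2)), np.zeros(4)  # hidden linear layer
W2, b2 = rng.standard_normal((3, 4)), np.zeros(3)  # output linear layer

def mlp(x):
    h = np.maximum(W1 @ x + b1, 0)  # linear + ReLU
    z = W2 @ h + b2                 # output layer (no activation)
    e = np.exp(z - z.max())         # softmax over the outputs
    return e / e.sum()

p = mlp(rng.standard_normal(2))     # a multinomial distribution over 3 classes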

Multi-layer Neural Network¶

  • Simplified illustration that only shows edges with weights.
  • We assume that each layer is followed by a non-linear activation function (except for the output layer).

Multi-layer Neural Network¶

  • (Figures omitted: the same network drawn in progressively more compact notation.)

MLP as universal functional approximation¶

A 2-layer MLP with a sufficiently large number of hidden units can approximate any continuous function to arbitrary accuracy.

  • Classical result for shallow NNs: to approximate a $C^n$ function (i.e. continuous up to the $n$-th order derivative) on a $d$-dimensional set to error $\epsilon$, one needs a network of size about $O(\epsilon^{-d/n})$, assuming a smooth activation function [1]

[1] A. Pinkus, Approximation theory of the MLP model in neural networks, Acta numerica, 8(1999), 143-195.

  • Deep NN - still under active research
    • A ReLU-based NN represents continuous piecewise-linear functions, making it equivalent to the linear finite element method [2]
    • $d$-dimensional $C^1$ functions (i.e. $n=1$) can be represented with at most $\log_2(d + 1)$ hidden layers and size $O(\epsilon^{-d})$ [2]
    • If the NN depth does not depend on $\epsilon$, the size is $O(\mathrm{poly}(1/\epsilon))$; if the depth is allowed to grow as $O(\log(1/\epsilon))$, the size is only $O(\mathrm{polylog}(1/\epsilon))$ [3]

[2] ReLU deep neural networks and linear finite elements, arXiv:1807.03973

[3] Why Deep Neural Networks for Function Approximation?, arXiv:1610.04161

Manual example: Piecewise linear curve fitting¶

Another (easier) exercise would be using sigmoid activation functions to do the XOR problem.
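
The cells below use `xx` and `yy` (the target), which were defined in earlier, omitted cells; a minimal stand-in, assuming a simple piecewise-linear target:

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# Assumed piecewise-linear target: flat, then slope +2, then slope -1
xx = np.linspace(0, 3, 13)
yy = np.piecewise(xx, [xx < 1, (xx >= 1) & (xx < 2), xx >= 2],
                  [lambda x: 0*x, lambda x: 2*(x - 1), lambda x: 2 - (x - 2)])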

In [11]:
plt.plot(xx, yy, 'rs-', label='Target Function')
_=plt.legend()
  • Linear+ReLU: $h(x;k,b) = \mathrm{ReLU}(kx+b)$
  • Divide and conquer: fit one ReLU unit to the first segment, then fit the residual (the first unit is set up below)
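Under the stand-in target above, the first unit can be fit by hand; reading the slope and kink off the target's first rising segment gives $k=2$, $b=-2$:

In [ ]:
y2 = np.maximum(2*xx - 2, 0)  # h(x; k=2, b=-2): matches the target's first kink and slope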
In [13]:
plt.plot(xx, yy, 'rs-', label='Target Function')
plt.plot(xx, y2, 'bo-', label='Layer 1')
plt.plot(xx, yy-y2, 'g^--', label='Residual')
_=plt.legend()