Machine Learning for Engineering¶

Linear Regression: Formulation by optimization¶

Instructor: Daning Huang¶

TODAY: Linear Regression - I¶

  • Least squares
  • Accuracy assessment

References:¶

  • [PRML] Chp. 3.1

Supervised Learning¶

  • Goal
    • Given data $X$ in feature space and the labels $Y$
    • Learn to predict $Y$ from $X$
  • Labels could be discrete or continuous
    • Continuous-valued labels: Regression
    • Discrete-valued labels: Classification - not our focus here

Notation¶

In this lecture, we will use

  • Data $x \in \mathbb{R}^D$ (scalar- or vector-valued)
  • Features $\phi(x) \in \mathbb{R}^M$ for data $x$
  • Continuous-valued labels $t \in \mathbb{R}$ (target values)

We will use

  • $x^{(n)}$ to denote the $n^\text{th}$ training example.
  • $t^{(n)}$ to denote the $n^\text{th}$ target value.

Linear Regression (1d inputs)¶

  • Consider the 1d case (i.e., $D=1$)
    • Given a set of observations $x^{(1)}, \dots, x^{(N)}$
    • and corresponding target values $t^{(1)}, \dots, t^{(N)}$
  • We want to learn a function $y(x, {w}) \approx t$ to predict future values.

$$ y(x, {w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{k=0}^M w_k x^k $$

Regression: Noisy Data¶

In [3]:
xx, yy = generateData(100, False)  # Truth, without noise
x, y = generateData(13, True)      # Noisy observation, training samples
plt.plot(xx, yy, 'g--')
plt.plot(x, y, 'ro');
[Figure: true function (green dashed) and noisy training samples (red circles)]
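The helper generateData is defined earlier in the notebook and is not shown here; a minimal stand-in, assuming the usual noisy-sinusoid toy problem (the ground-truth function, noise level, and seed are assumptions), would be:

import numpy as np
import matplotlib.pyplot as plt

def generateData(n, noisy, seed=0):
    # Stand-in for the notebook's helper (assumed form): n samples of a
    # sinusoid on [0, 1], optionally corrupted by Gaussian noise.
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x)                 # assumed ground-truth function
    if noisy:
        y = y + 0.2 * rng.standard_normal(n)  # assumed noise level
    return x, y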

Regression: 0th Order Polynomial¶

In [4]:
coeffs = np.polyfit(x, y, 0)  # Leveraging NumPy's polyfit as short-cut; 0 is the degree of the poly
poly = np.poly1d(coeffs) # Polynomial with "learned" parameters
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
plt.plot(xx, poly(xx), "b-");
[Figure: 0th-order polynomial fit (blue) with the true function (green dashed) and training data (red circles)]

Regression: 1st Order Polynomial¶

In [5]:
coeffs = np.polyfit(x, y, 1) # Now let's try degree = 1
poly = np.poly1d(coeffs)
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
plt.plot(xx, poly(xx), "b-");
[Figure: 1st-order polynomial fit (blue) with the true function and training data]

Regression: 3rd Order Polynomial¶

In [6]:
coeffs = np.polyfit(x, y, 3) # Now degree = 3
poly = np.poly1d(coeffs)
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
plt.plot(xx, poly(xx), "b-");
[Figure: 3rd-order polynomial fit (blue) with the true function and training data]

Linear Regression (General Case)¶

The function $y({x}, {w})$ is linear in parameters ${w}$.

  • Goal: Find the best value for the weights, ${w}$.
  • For simplicity, add a bias term $\phi_0({x}) = 1$.

$$ y({x}, {w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j({x}) = \sum_{j=0}^{M-1} w_j \phi_j({x}) = {w}^T \phi({x}) $$
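As a concrete sketch of $y({x}, {w}) = {w}^T \phi({x})$, here is a prediction with an assumed polynomial feature map and hypothetical weights:

import numpy as np

def phi(x, M=4):
    # Polynomial feature map [1, x, x**2, ..., x**(M-1)]; phi_0(x) = 1 is the bias term
    return np.array([x**j for j in range(M)])

w = np.array([0.5, -1.0, 2.0, 0.3])   # hypothetical weights
y_pred = w @ phi(0.7)                 # y(x, w) = w^T phi(x) evaluated at x = 0.7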

Basis Functions¶

The basis functions $\phi_j({x})$ need not be linear.

In [8]:
r = np.linspace(-1,1,100)
f, axs = plt.subplots(1, 3, figsize=(12,4))
for j in range(8):
    axs[0].plot(r, np.power(r,j))                                # polynomial basis: r**j
    axs[1].plot(r, np.exp( - (r - j/7.0 + 0.5)**2 / 2*5**2 ))    # Gaussian bump centered at j/7 - 0.5
    axs[2].plot(r, 1 / (1 + np.exp( - (r - j/5.0 + 0.5) * 5)) )  # sigmoid centered at j/5 - 0.5

set_nice_plot_labels(axs)
[Figure: three panels of basis functions: polynomial, Gaussian, and sigmoidal]

Least Squares: Objective Function¶

Minimize the residual error over the training data.

$$ E({w}) = \frac12 \sum_{n=1}^N (y(x^{(n)}, {w}) - t^{(n)})^2 = \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j({x}^{(n)}) - t^{(n)} \right)^2 $$


Least Squares: Gradient Calculation¶

Closed Form Solution: Derivation¶

$$ \begin{aligned} \text{Scalar Form: } E({w}) &= \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j \phi_j({x}^{(n)}) - t^{(n)} \right)^2 \\ &= \frac12 \sum_{n=1}^N \left( {w}^T \phi({x}^{(n)}) - t^{(n)} \right)^2 \\ &= \frac12 \sum_{n=1}^N ({w}^T \phi( {x}^{(n)} ) )^2 - \sum_{n=1}^N t^{(n)} {w}^T \phi( {x}^{(n)} ) + \frac12 \sum_{n=1}^N (t^{(n)})^2 \\ \text{Vector Form: } E({w}) &= \frac12 \|\Phi {w} - {t}\|^2 \\ &= \frac12 {w}^T \Phi^T \Phi {w} - {w}^T \Phi^T {t} + \frac12 {t}^T {t} \end{aligned} $$
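A quick numerical sanity check that the scalar, vector, and expanded forms agree (the toy $\Phi$, ${w}$, ${t}$ below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
Phi = rng.standard_normal((5, 3))   # arbitrary design matrix, N = 5, M = 3
w = rng.standard_normal(3)
t = rng.standard_normal(5)

E_scalar = 0.5 * sum((w @ Phi[n] - t[n])**2 for n in range(5))
E_vector = 0.5 * np.linalg.norm(Phi @ w - t)**2
E_expanded = 0.5 * w @ Phi.T @ Phi @ w - w @ Phi.T @ t + 0.5 * t @ t
assert np.allclose([E_scalar, E_vector], E_expanded)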

Closed Form Solution: Data Matrix¶

The design matrix (also called the data matrix) is a matrix $\Phi \in \mathbb{R}^{N \times M}$, applying

  • the $M$ basis functions (columns)
  • to $N$ data points (rows)

$$ \Phi = \begin{bmatrix} \phi_0({x}^{(1)}) & \phi_1({x}^{(1)}) & \cdots & \phi_{M-1}({x}^{(1)}) \\ \phi_0({x}^{(2)}) & \phi_1({x}^{(2)}) & \cdots & \phi_{M-1}({x}^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0({x}^{(N)}) & \phi_1({x}^{(N)}) & \cdots & \phi_{M-1}({x}^{(N)}) \\ \end{bmatrix} $$

Goal: $\Phi {w} \approx {t}$
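A sketch of constructing $\Phi$ for the polynomial basis used above; for this basis it is equivalent to np.vander(x, M, increasing=True):

import numpy as np

def design_matrix(x, M):
    # N x M design matrix: row n holds phi_0(x_n), ..., phi_{M-1}(x_n) with phi_j(x) = x**j
    x = np.asarray(x, dtype=float)
    return np.stack([x**j for j in range(M)], axis=1)

# Example: design_matrix([0.0, 0.5, 1.0], 3) -> [[1, 0, 0], [1, 0.5, 0.25], [1, 1, 1]]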

Least Squares: Gradient via Matrix Calculus¶

  • Compute the gradient and set to zero

$$ \begin{aligned} \nabla_{w} E({w}) &= \nabla_{w} \left[ \frac12 {w}^T \Phi^T \Phi {w} - {w}^T \Phi^T {t} + \frac12 {t}^T {t} \right] \\ &= \Phi^T \Phi {w} - \Phi^T {t} = 0 \end{aligned} $$

  • Solve the resulting normal equation: $$ \Phi^T \Phi {w} = \Phi^T {t},\quad {w} = (\Phi^T \Phi)^{-1} \Phi^T {t} $$

This is the Moore-Penrose pseudoinverse, $\Phi^\dagger = (\Phi^T \Phi)^{-1} \Phi^T$ applied to solve the linear system $\Phi {w} \approx {t}$.
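In code one would rarely form $(\Phi^T \Phi)^{-1}$ explicitly; a sketch on assumed toy data, comparing the normal-equation solve with NumPy's SVD-based least-squares solver:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 13)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)   # assumed toy data
Phi = np.vander(x, 4, increasing=True)                          # polynomial design matrix, M = 4

# Solve the normal equations Phi^T Phi w = Phi^T t (fine for small, well-conditioned problems)
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Numerically preferable alternative
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)
assert np.allclose(w_normal, w_lstsq)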

Digression: Moore-Penrose Pseudoinverse¶

  • When a matrix $A$ is non-invertible, or not even square, we may still want an inverse-like operator
  • For these situations we use $A^\dagger$, the Moore-Penrose Pseudoinverse of $A$
  • When $A$ has lin. indep. columns then $A^\dagger = (A^\top A)^{-1} A^\top$
  • In general, we can get $A^\dagger$ by SVD: if we write $A = U \Sigma V^\top$ then $A^\dagger = V \Sigma^\dagger U^\top$, where $\Sigma^\dagger$ is obtained by taking reciprocals of the non-zero entries of $\Sigma$ (checked numerically below).
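A quick numerical check of the SVD route against np.linalg.pinv (the example matrix is arbitrary, with linearly independent columns):

import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                     # tall matrix with lin. indep. columns

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_dagger = Vt.T @ np.diag(1.0 / s) @ U.T       # V Sigma^dagger U^T (all singular values non-zero here)

assert np.allclose(A_dagger, np.linalg.pinv(A))
assert np.allclose(A_dagger, np.linalg.inv(A.T @ A) @ A.T)   # matches the lin. indep. columns formula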

Alternative View: Projection onto subspace¶

  • The columns of $\Phi$ span a subspace; solving $\Phi {w} \approx {t}$ amounts to projecting ${t}$ onto that column space
  • The SVD $\Phi = U \Sigma V^T$ provides an orthonormal basis $U$ for this subspace and a transformation matrix $T = \Sigma V^T$
  • Recall from the linear algebra review: let ${y} = T {w}$; then $U {y} \approx {t}$ and the projection gives ${y} = U^T {t}$
  • Then ${w} = T^{-1} {y} = V \Sigma^\dagger U^T {t} = \Phi^\dagger {t}$ (a numerical comparison follows below)
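A numerical comparison of the projection/SVD route against np.linalg.lstsq, on the same assumed toy data as before:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 13)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)   # assumed toy data
Phi = np.vander(x, 4, increasing=True)

U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
w_svd = Vt.T @ ((U.T @ t) / s)          # w = V Sigma^dagger U^T t
w_ref, *_ = np.linalg.lstsq(Phi, t, rcond=None)
assert np.allclose(w_svd, w_ref)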

Detailed derivation - Scalar version - Skip in class¶

Least Squares: Gradient Calculation¶

To minimize the error, take the partial derivative w.r.t. each weight $w_k$:

$$ \frac{\partial E({w})}{\partial w_k}= \frac{\partial}{\partial w_k} \left[ \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j({x}^{(n)}) - t^{(n)} \right)^2 \right] $$

Apply the chain rule:

$$ = \sum_{n=1}^N \left[ \left( \sum_{j=0}^{M-1} w_j\phi_j({x}^{(n)}) - t^{(n)} \right) \frac{\partial}{\partial w_k} \left[ \sum_{j=0}^{M-1} w_j\phi_j({x}^{(n)}) - t^{(n)} \right] \right] $$

$$ = \sum_{n=1}^N \left[ \left( \sum_{j=0}^{M-1} w_j\phi_j({x}^{(n)}) - t^{(n)} \right) \phi_k({x}^{(n)}) \right] $$

Least Squares: Vectorized Gradient¶

$$ \begin{aligned} \nabla_w E({w}) &= \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j \phi_j({x}^{(n)}) - t^{(n)} \right) \phi({x}^{(n)}) \\ &= \sum_{n=1}^N \left( {w}^T \phi({x}^{(n)}) - t^{(n)} \right) \phi({x}^{(n)}) \end{aligned} $$
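A finite-difference check of this gradient expression (the toy data and basis below are assumptions):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x)
Phi = np.vander(x, 3, increasing=True)
w = rng.standard_normal(3)

E = lambda w: 0.5 * np.sum((Phi @ w - t)**2)
grad_analytic = Phi.T @ (Phi @ w - t)        # sum_n (w^T phi(x_n) - t_n) phi(x_n)

eps = 1e-6
grad_fd = np.array([(E(w + eps*e) - E(w - eps*e)) / (2*eps) for e in np.eye(3)])
assert np.allclose(grad_analytic, grad_fd)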

Closed Form Solution¶

Main Idea: Compute the gradient, set it to zero, and solve in closed form.

  • Objective function: $$E({w}) = \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j({x}^{(n)})- t^{(n)}\right)^2$$
  • We will derive the gradient using matrix calculus.

Detailed derivation - Matrix Calculus - Skip in class¶

Matrix Calculus: The Gradient¶

Suppose that $f : \mathbb{R}^{m \times n} \mapsto \mathbb{R}$, that is, the function $f$

  • takes as input a matrix $A \in \mathbb{R}^{m\times n} = [a_{ij}]$
  • returns a real value

Then, the gradient of $f$ with respect to $A$ is:

$$ \nabla_A f(A) \in \mathbb{R}^{m \times n} = \begin{bmatrix} \frac{\partial f}{\partial a_{11}} & \frac{\partial f}{\partial a_{12}} & \cdots & \frac{\partial f}{\partial a_{1n}} \\ \frac{\partial f}{\partial a_{21}} & \frac{\partial f}{\partial a_{22}} & \cdots & \frac{\partial f}{\partial a_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial a_{m1}} & \frac{\partial f}{\partial a_{m2}} & \cdots & \frac{\partial f}{\partial a_{mn}} \end{bmatrix},\quad [\nabla_A f(A)]_{ij} = \frac{\partial f}{\partial a_{ij}} $$

Matrix Calculus: The Gradient¶

Note that the size of $\nabla_A f(A)$ is always the same as the size of $A$. In particular, for vectors $x \in \mathbb{R}^n$,

$$ \nabla_x f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}^T $$

The gradient is a linear operator from $\mathbb{R}^n \mapsto \mathbb{R}^n$:

  • $ \nabla_x (f + g) = \nabla_x f + \nabla_x g $
  • $ \forall\, c \in R, \nabla_x (c f) = c \nabla_x f$

Gradient: Linear Functions¶

The gradient of the linear function $f(x) = \sum_{k=1}^n b_k x_k = b^T x$ is

$$ \frac{\partial f}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^n b_i x_i = \sum_{i=1}^n \frac{\partial}{\partial x_k} (b_i x_i) = b_k $$

In a more compact form,

$$ \nabla_x b^T x = b $$

Gradient: Quadratic Forms¶

  • Every symmetric $A \in \mathbb{R}^{n \times n}$ corresponds to a quadratic form: $$ f(x) = \sum_{i=1}^n \sum_{j=1}^n x_i A_{ij} x_j = x^T A x $$
  • The partial derivatives are $$ \frac{\partial f}{\partial x_k} = 2 \sum_{j=1}^n A_{kj} x_j = 2 [Ax]_k $$
  • Compact form: $\nabla_x f(x) = 2 Ax$ (a quick numerical check follows below)
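A numerical check of both identities, $\nabla_x b^T x = b$ and $\nabla_x x^T A x = 2Ax$, using central finite differences (the random $b$, $A$, $x$ are arbitrary):

import numpy as np

rng = np.random.default_rng(3)
n = 4
b = rng.standard_normal(n)
A = rng.standard_normal((n, n))
A = 0.5 * (A + A.T)                 # symmetrize, as required for the 2Ax formula
x = rng.standard_normal(n)

eps = 1e-6
def fd_grad(f):
    # central finite-difference gradient of f at x
    return np.array([(f(x + eps*e) - f(x - eps*e)) / (2*eps) for e in np.eye(n)])

assert np.allclose(fd_grad(lambda z: b @ z), b)              # gradient of b^T x
assert np.allclose(fd_grad(lambda z: z @ A @ z), 2 * A @ x)  # gradient of x^T A x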

Accuracy Assessment¶

Back to the polynomial example¶

In [10]:
stys = ['k-', 'k-.', 'm-', 'm-.']       # line styles for the four fits
plt.plot(xx, yy, "g--")                 # true function
plt.plot(x, y, "ro", label="Training data")
for _i in range(4):
    plt.plot(xx, tsts[_i], stys[_i], label='Order {0}'.format(_i))  # precomputed order-_i fit on xx
plt.legend();
[Figure: polynomial fits of order 0-3 overlaid on the true function and training data]

Root Mean Squared Error (RMSE)¶

$$ E_{RMS} = \sqrt{2E(w^*) / N} = \sqrt{\frac{1}{N}\sum_{i=1}^N[f(x^{(i)})-t^{(i)}]^2} $$

  • Easy to compute
  • Potential issue: not scale-invariant, since its magnitude depends on the scale of the targets (a minimal computation is sketched below)
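A minimal RMSE computation matching the formula above (predictions and targets assumed to be 1-D arrays):

import numpy as np

def rmse(pred, t):
    # Root mean squared error: sqrt(mean((pred - t)^2))
    pred, t = np.asarray(pred, dtype=float), np.asarray(t, dtype=float)
    return np.sqrt(np.mean((pred - t)**2))

# e.g., rmse(poly(x), y) gives the training error of one of the fits above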
In [11]:
f, ax = plt.subplots(ncols=4, sharey=True, figsize=(16,4))
for _i in range(4):
    ax[_i].plot(xx, yy, "g--")
    ax[_i].plot(x, y, "ro")
    ax[_i].plot(xx, tsts[_i])
    ax[_i].set_title('Order {0}, RMSE={1:4.3f}'.format(_i, es[_i]))
[Figure: four panels, one per polynomial order 0-3, each showing the fit with its RMSE in the title]

Coefficient of determination¶

$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$

  • Let $\bar{t}$ be the average of the targets $\{t^{(i)}\}$
  • Residual sum of squares: $$SS_{res}=\sum_i[f(x^{(i)})-t^{(i)}]^2$$
  • Total sum of squares: $$SS_{tot}=\sum_i[t^{(i)}-\bar{t}]^2$$
  • Explained sum of squares: $$SS_{reg}=\sum_i[f(x^{(i)})-\bar{t}]^2$$
  • In simple least squares (with a bias term): $$SS_{tot}=SS_{reg}+SS_{res}$$ (a short computation of $R^2$ follows below)
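A minimal $R^2$ computation matching the definitions above (predictions and targets assumed to be 1-D arrays):

import numpy as np

def r_squared(pred, t):
    # Coefficient of determination: 1 - SS_res / SS_tot
    pred, t = np.asarray(pred, dtype=float), np.asarray(t, dtype=float)
    ss_res = np.sum((pred - t)**2)
    ss_tot = np.sum((t - t.mean())**2)
    return 1.0 - ss_res / ss_tot

# e.g., r_squared(poly(x), y) gives the training R^2 of one of the fits above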
In [12]:
f, ax = plt.subplots(ncols=4, sharey=True, figsize=(16,4))
for _i in range(4):
    ax[_i].plot(y, y, "r-")             # reference line: perfect prediction
    ax[_i].plot(y, trns[_i], "bo")      # predicted vs. actual targets on the training data
    ax[_i].set_title('Order {0}, R2={1:4.3f}'.format(_i, rs[_i]))
[Figure: four panels of predicted vs. actual targets for orders 0-3, with $R^2$ in the titles]

Geometrical interpretation of $R^2$¶

  • Blue: Squared residuals w.r.t. the linear regression.
  • Red: Squared residuals w.r.t. the average value.

$$ R^2 = 1 - \frac{\color{blue}{SS_{res}}}{\color{red}{SS_{tot}}} $$

[Figure: squared residuals w.r.t. the regression fit (blue) vs. w.r.t. the mean of the targets (red)]