Supervised Learning¶
- Goal
- Given data $X$ in feature space and the labels $Y$
- Learn to predict $Y$ from $X$
- Labels could be discrete or continuous
- Continuous-valued labels: Regression
- Discrete-valued labels: Classification (not the focus of this lecture)
Notation¶
In this lecture, we will use
- Data $x \in \mathbb{R}^D$ (scalar- or vector-valued)
- Features $\phi(x) \in \mathbb{R}^M$ for data $x$
- Continuous-valued labels $t \in \mathbb{R}$ (target values)
We will use
- $x^{(n)}$ to denote the $n^\text{th}$ training example.
- $t^{(n)}$ to denote the $n^\text{th}$ target value.
Linear Regression (1d inputs)¶
- Consider the 1d case (i.e., $D = 1$)
- Given a set of observations $x^{(1)}, \dots, x^{(N)}$
- and corresponding target values $t^{(1)}, \dots, t^{(N)}$
- We want to learn a function $y(x, {w}) \approx t$ to predict future values.
$$ y(x, {w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{k=0}^M w_k x^k $$
Regression: Noisy Data¶
import numpy as np
import matplotlib.pyplot as plt

xx, yy = generateData(100, False)  # Ground truth, without noise (generateData is a notebook helper defined outside this section)
x, y = generateData(13, True)      # Noisy observations, used as training samples
plt.plot(xx, yy, 'g--')
plt.plot(x, y, 'ro');
Regression: 0th Order Polynomial¶
coeffs = np.polyfit(x, y, 0) # Leveraging NumPy's polyfit as short-cut; 0 is the degree of the poly
poly = np.poly1d(coeffs) # Polynomial with "learned" parameters
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
plt.plot(xx, poly(xx), "b-");
Regression: 1st Order Polynomial¶
coeffs = np.polyfit(x, y, 1) # Now let's try degree = 1
poly = np.poly1d(coeffs)
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
plt.plot(xx, poly(xx), "b-");
Regression: 3rd Order Polynomial¶
coeffs = np.polyfit(x, y, 3) # Now degree = 3
poly = np.poly1d(coeffs)
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
plt.plot(xx, poly(xx), "b-");
Linear Regression (General Case)¶
The function $y({x}, {w})$ is linear in the parameters ${w}$.
- Goal: Find the best values for the weights ${w}$.
- For simplicity, define a constant basis function $\phi_0({x}) = 1$, so the bias $w_0$ is absorbed into the sum (see the sketch below).
$$ y({x}, {w}) = w_0 + \sum_{j=1}^{M-1} w_j \phi_j({x}) = \sum_{j=0}^{M-1} w_j \phi_j({x}) = {w}^T \phi({x}) $$
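A minimal sketch of this model, assuming a polynomial basis; `phi` and the example weights below are illustrative names, not part of the lecture code.
def phi(x, M=4):
    # feature vector [phi_0(x), ..., phi_{M-1}(x)] with phi_j(x) = x**j; phi_0(x) = 1 absorbs the bias
    return np.array([x**j for j in range(M)])

w_example = np.array([0.5, -1.0, 2.0, 0.3])   # arbitrary example weights
print(phi(2.0))                               # [1. 2. 4. 8.]
print(w_example @ phi(2.0))                   # y(x, w) = w^T phi(x)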
Basis Functions¶
The basis functions $\phi_j({x})$ need not be linear.
r = np.linspace(-1, 1, 100)
f, axs = plt.subplots(1, 3, figsize=(12, 4))
for j in range(8):
    axs[0].plot(r, np.power(r, j))                                   # polynomial basis: r^j
    axs[1].plot(r, np.exp(-(r - j/7.0 + 0.5)**2 / (2 * 0.2**2)))     # Gaussian basis, width 0.2
    axs[2].plot(r, 1 / (1 + np.exp(-(r - j/5.0 + 0.5) * 5)))         # sigmoidal basis
set_nice_plot_labels(axs)  # notebook helper defined outside this section
Least Squares: Objective Function¶
Minimize the residual error over the training data.
$$ E({w}) = \frac12 \sum_{n=1}^N (y(x^{(n)}, {w}) - t^{(n)})^2 = \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j({x}^{(n)}) - t^{(n)} \right)^2 $$
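As a quick sketch, this objective can be evaluated using the design matrix $\Phi$ introduced on the next slides (row $n$ holds $\phi({x}^{(n)})^T$); the function name is illustrative.
def squared_error(Phi, w, t):
    # E(w) = 1/2 * sum_n (w^T phi(x^(n)) - t^(n))^2 = 1/2 * ||Phi w - t||^2
    return 0.5 * np.sum((Phi @ w - t) ** 2)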
Least Squares: Gradient Calculation¶
Closed Form Solution: Derivation¶
$$ \begin{aligned} \text{Scalar Form: } E({w}) &= \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j \phi_j({x}^{(n)}) - t^{(n)} \right)^2 \\ &= \frac12 \sum_{n=1}^N \left( {w}^T \phi({x}^{(n)}) - t^{(n)} \right)^2 \\ &= \frac12 \sum_{n=1}^N ({w}^T \phi( {x}^{(n)} ) )^2 - \sum_{n=1}^N t^{(n)} {w}^T \phi( {x}^{(n)} ) + \frac12 \sum_{n=1}^N (t^{(n)})^2 \\ \text{Vector Form: } E({w}) &= \frac12 \|\Phi {w} - {t}\|^2 \\ &= \frac12 {w}^T \Phi^T \Phi {w} - {w}^T \Phi^T {t} + \frac12 {t}^T {t} \end{aligned} $$
Closed Form Solution: Data Matrix¶
The design matrix is a matrix $\Phi \in \mathbb{R}^{N \times M}$, applying
- the $M$ basis functions (columns)
- to $N$ data points (rows)
$$ \Phi = \begin{bmatrix} \phi_0({x}^{(1)}) & \phi_1({x}^{(1)}) & \cdots & \phi_{M-1}({x}^{(1)}) \\ \phi_0({x}^{(2)}) & \phi_1({x}^{(2)}) & \cdots & \phi_{M-1}({x}^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0({x}^{(N)}) & \phi_1({x}^{(N)}) & \cdots & \phi_{M-1}({x}^{(N)}) \\ \end{bmatrix} $$
Goal: $\Phi {w} \approx {t}$
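A minimal sketch of $\Phi$ for a polynomial basis, reusing the 1d training inputs `x` from the cells above; `design_matrix` is an illustrative name.
def design_matrix(x, M):
    # row n is [phi_0(x^(n)), ..., phi_{M-1}(x^(n))] with phi_j(x) = x**j
    return np.stack([np.power(x, j) for j in range(M)], axis=1)

Phi = design_matrix(x, M=4)
print(Phi.shape)   # (N, M): one row per data point, one column per basis function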
Least Squares: Gradient via Matrix Calculus¶
- Compute the gradient and set it to zero
$$ \begin{aligned} \nabla_{w} E({w}) &= \nabla_{w} \left[ \frac12 {w}^T \Phi^T \Phi {w} - {w}^T \Phi^T {t} + \frac12 {t}^T {t} \right] \\ &= \Phi^T \Phi {w} - \Phi^T {t} = 0 \end{aligned} $$
- Solve the resulting normal equation: $$ \Phi^T \Phi {w} = \Phi^T {t},\quad {w} = (\Phi^T \Phi)^{-1} \Phi^T {t} $$
This is the Moore-Penrose pseudoinverse, $\Phi^\dagger = (\Phi^T \Phi)^{-1} \Phi^T$ applied to solve the linear system $\Phi {w} \approx {t}$.
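A sketch of three equivalent ways to solve $\Phi {w} \approx {t}$ in NumPy, assuming `Phi` from the sketch above and the training targets stored in `y`; in practice `np.linalg.lstsq` is preferred over forming $(\Phi^T \Phi)^{-1}$ explicitly.
w_normal = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)   # normal equations
w_lstsq, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # numerically stable least squares
w_pinv = np.linalg.pinv(Phi) @ y                     # via the Moore-Penrose pseudoinverse
print(np.allclose(w_normal, w_lstsq), np.allclose(w_lstsq, w_pinv))   # should print True True for well-conditioned Phi
Up to coefficient ordering, these weights match `np.polyfit(x, y, 3)` from the earlier cells, which solves the same least-squares problem.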
Digression: Moore-Penrose Pseudoinverse¶
- When a matrix $A$ is non-invertible, or not even square, we may still want something that behaves like an inverse
- For these situations we use $A^\dagger$, the Moore-Penrose pseudoinverse of $A$
- When $A$ has linearly independent columns, $A^\dagger = (A^\top A)^{-1} A^\top$
- In general, we can get $A^\dagger$ from the SVD: if $A = U \Sigma V^\top$, then $A^\dagger = V \Sigma^\dagger U^\top$, where $\Sigma^\dagger$ is obtained by taking reciprocals of the non-zero entries of $\Sigma$ and transposing the result (see the sketch below)
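A minimal check of the SVD construction against NumPy's built-in pseudoinverse; the matrix is random and full rank, so every singular value is non-zero.
A = np.random.default_rng(2).normal(size=(5, 3))
U, s, Vt = np.linalg.svd(A, full_matrices=False)   # thin SVD: A = U @ np.diag(s) @ Vt
A_dagger = Vt.T @ np.diag(1.0 / s) @ U.T           # reciprocals of the non-zero singular values
print(np.allclose(A_dagger, np.linalg.pinv(A)))    # True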
Alternative View: Projection onto subspace¶
- The columns of $\Phi$ span a subspace, and solving $\Phi {w} \approx {t}$ essentially projects ${t}$ onto that subspace
- The SVD $\Phi = U\Sigma V^T$ gives an orthonormal basis $U$ for the column space and a transformation matrix $T = \Sigma V^T$
- Recall from the linear algebra review: writing ${y} = T{w}$, we need $U{y} \approx {t}$, and the least-squares choice is ${y} = U^T{t}$ (the coordinates of the projection of ${t}$ onto the column space)
- Then ${w} = T^{-1}{y} = V\Sigma^\dagger U^T {t} = \Phi^\dagger {t}$ (see the sketch below)
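A quick numerical check of the projection view, assuming `Phi` and the targets `y` from the cells above: the least-squares residual is orthogonal to every column of $\Phi$.
w_hat = np.linalg.pinv(Phi) @ y            # least-squares weights
residual = Phi @ w_hat - y
print(np.allclose(Phi.T @ residual, 0, atol=1e-6))   # True: residual is orthogonal to col(Phi), up to numerical tolerance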
Detailed derivation - Scalar version - Skip in class¶
Least Squares: Gradient Calculation¶
To minimize the error, take partial derivatives w.r.t. each weight $w_k$:
$$ \frac{\partial E({w})}{\partial w_k}= \frac{\partial}{\partial w_k} \left[ \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j({x}^{(n)}) - t^{(n)} \right)^2 \right] $$
Apply the chain rule:
$$ = \sum_{n=1}^N \left[ \left( \sum_{j=0}^{M-1} w_j\phi_j({x}^{(n)}) - t^{(n)} \right) \frac{\partial}{\partial w_k} \left[ \sum_{j=0}^{M-1} w_j\phi_j({x}^{(n)}) - t^{(n)} \right] \right] $$
$$ = \sum_{n=1}^N \left[ \left( \sum_{j=0}^{M-1} w_j\phi_j({x}^{(n)}) - t^{(n)} \right) \phi_k({x}^{(n)}) \right] $$
Least Squares: Vectorized Gradient¶
$$ \begin{aligned} \nabla_w E({w}) &= \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j \phi_j({x}^{(n)}) - t^{(n)} \right) \phi({x}^{(n)}) \\ &= \sum_{n=1}^N \left( {w}^T \phi({x}^{(n)}) - t^{(n)} \right) \phi({x}^{(n)}) \end{aligned} $$
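This gradient also enables iterative optimization. Below is a gradient-descent sketch (not part of the lecture's closed-form derivation), using the matrix form $\nabla_w E({w}) = \Phi^T(\Phi {w} - {t})$ from above; the synthetic data, step size, and iteration count are arbitrary illustrative choices.
rng = np.random.default_rng(0)
Phi_toy = rng.normal(size=(50, 3))                                  # toy design matrix
t_toy = Phi_toy @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
w_gd = np.zeros(3)
for _ in range(2000):
    w_gd -= 0.005 * Phi_toy.T @ (Phi_toy @ w_gd - t_toy)            # w <- w - lr * grad E(w)
print(np.allclose(w_gd, np.linalg.lstsq(Phi_toy, t_toy, rcond=None)[0], atol=1e-6))   # matches the closed-form solution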
Closed Form Solution¶
Main Idea: Compute the gradient, set it to zero, and solve in closed form.
- Objective function: $$E({w}) = \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j({x}^{(n)})- t^{(n)}\right)^2$$
- We will derive the gradient using matrix calculus.
Detailed derivation - Matrix Calculus - Skip in class¶
Matrix Calculus: The Gradient¶
Suppose that $f : \mathbb{R}^{m \times n} \mapsto \mathbb{R}$, that is, the function $f$
- takes as input a matrix $A \in \mathbb{R}^{m\times n} = [a_{ij}]$
- returns a real value
Then, the gradient of $f$ with respect to $A$ is:
$$ \nabla_A f(A) \in \mathbb{R}^{m \times n} = \begin{bmatrix} \frac{\partial f}{\partial a_{11}} & \frac{\partial f}{\partial a_{12}} & \cdots & \frac{\partial f}{\partial a_{1n}} \\ \frac{\partial f}{\partial a_{21}} & \frac{\partial f}{\partial a_{22}} & \cdots & \frac{\partial f}{\partial a_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial a_{m1}} & \frac{\partial f}{\partial a_{m2}} & \cdots & \frac{\partial f}{\partial a_{mn}} \end{bmatrix},\quad [\nabla_A f(A)]_{ij} = \frac{\partial f}{\partial a_{ij}} $$
Matrix Calculus: The Gradient¶
Note that the size of $\nabla_A f(A)$ is always the same as the size of $A$. In particular, for vectors $x \in \mathbb{R}^n$,
$$ \nabla_x f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}^T $$
The gradient is a linear operator: for functions $f, g : \mathbb{R}^n \mapsto \mathbb{R}$ and any $c \in \mathbb{R}$,
- $ \nabla_x (f + g) = \nabla_x f + \nabla_x g $
- $ \nabla_x (c f) = c \nabla_x f $
Gradient: Linear Functions¶
The gradient of the linear function $f(x) = \sum_{i=1}^n b_i x_i = b^T x$ is
$$ \frac{\partial f}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^n b_i x_i = \sum_{i=1}^n \frac{\partial}{\partial x_k} b_i x_i = b_k, $$
since only the $i = k$ term has a non-zero derivative.
In a more compact form,
$$ \nabla_x b^T x = b $$
Gradient: Quadratic Forms¶
- Every symmetric $A \in \mathbb{R}^{n \times n}$ corresponds to a quadratic form: $$ f(x) = \sum_{i=1}^n \sum_{j=1}^n x_i A_{ij} x_j = x^T A x $$
- The partial derivatives (using the symmetry of $A$) are $$ \frac{\partial f}{\partial x_k} = 2 \sum_{j=1}^n A_{kj} x_j = 2 [Ax]_k $$
- Compact form: $\nabla_x f(x) = 2 Ax$ (checked numerically below)
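A quick finite-difference check of the two identities $\nabla_x b^T x = b$ and $\nabla_x x^T A x = 2Ax$; all names below are illustrative.
rng = np.random.default_rng(1)
n = 4
b, x0 = rng.normal(size=n), rng.normal(size=n)
A = rng.normal(size=(n, n)); A = (A + A.T) / 2         # symmetrize A
def num_grad(func, x, eps=1e-6):
    # central finite differences, one coordinate at a time
    g = np.zeros_like(x)
    for k in range(len(x)):
        e = np.zeros_like(x); e[k] = eps
        g[k] = (func(x + e) - func(x - e)) / (2 * eps)
    return g
print(np.allclose(num_grad(lambda v: b @ v, x0), b, atol=1e-5))               # gradient of b^T x
print(np.allclose(num_grad(lambda v: v @ A @ v, x0), 2 * A @ x0, atol=1e-5))  # gradient of x^T A x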
stys = ['k-', 'k-.', 'm-', 'm-.']   # line styles for the four fits
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro", label="Training data")
for _i in range(4):
    # tsts: predictions of the order-0..3 fits on xx (computed in earlier notebook cells)
    plt.plot(xx, tsts[_i], stys[_i], label='Order {0}'.format(_i))
plt.legend();
Root-Mean-Square Error (RMSE)¶
$$ E_{RMS} = \sqrt{2E(w^*) / N} = \sqrt{\frac{1}{N}\sum_{i=1}^N[f(x^{(i)})-t^{(i)}]^2} $$
- Easy to compute
- Potential issue: its scale depends on the units of the targets, making values hard to compare across datasets (see the sketch below)
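A minimal sketch of the RMSE computation, reusing `np.polyfit` on the training data from the earlier cells.
poly3 = np.poly1d(np.polyfit(x, y, 3))            # order-3 fit
rmse = np.sqrt(np.mean((poly3(x) - y)**2))        # sqrt(2 E(w*) / N)
print(rmse)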
f, ax = plt.subplots(ncols=4, sharey=True, figsize=(16, 4))
for _i in range(4):
    ax[_i].plot(xx, yy, "g--")
    ax[_i].plot(x, y, "ro")
    ax[_i].plot(xx, tsts[_i])
    # es: training RMSE of each fit (computed in earlier notebook cells)
    ax[_i].set_title('Order {0}, RMSE={1:4.3f}'.format(_i, es[_i]))
Coefficient of determination¶
$$ R^2 = 1 - \frac{SS_{res}}{SS_{tot}} $$
- Let $\bar{t}$ be the average of the targets $\{t^{(i)}\}$
- Residual sum of squares: $$SS_{res}=\sum_i[f(x^{(i)})-t^{(i)}]^2$$
- Total sum of squares: $$SS_{tot}=\sum_i[t^{(i)}-\bar{t}]^2$$
- Explained sum of squares: $$SS_{reg}=\sum_i[f(x^{(i)})-\bar{t}]^2$$
- For least-squares fits that include a bias term: $$SS_{tot}=SS_{reg}+SS_{res}$$ (these quantities are computed in the sketch below)
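A minimal sketch computing $R^2$ and checking the decomposition for the same order-3 fit used in the RMSE sketch above (which includes a bias term).
poly3 = np.poly1d(np.polyfit(x, y, 3))
ss_res = np.sum((poly3(x) - y)**2)                # residual sum of squares
ss_reg = np.sum((poly3(x) - np.mean(y))**2)       # explained sum of squares
ss_tot = np.sum((y - np.mean(y))**2)              # total sum of squares
print(np.isclose(ss_tot, ss_reg + ss_res))        # decomposition holds for fits with a bias term
print(1 - ss_res / ss_tot)                        # R^2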
f, ax = plt.subplots(ncols=4, sharey=True, figsize=(16, 4))
for _i in range(4):
    ax[_i].plot(y, y, "r-")                 # perfect predictions would lie on this line
    # trns: each fit's predictions on the training inputs; rs: the R^2 values (earlier notebook cells)
    ax[_i].plot(y, trns[_i], "bo")
    ax[_i].set_title('Order {0}, R2={1:4.3f}'.format(_i, rs[_i]))
Geometric interpretation of $R^2$¶
- Blue: Squared residuals w.r.t. the linear regression.
- Red: Squared residuals w.r.t. the average value.
$$ R^2 = 1 - \frac{\color{blue}{SS_{res}}}{\color{red}{SS_{tot}}} $$