Wolpert and Macready (1997)
We have dubbed the associated results NFL theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.
In this lecture, we will use
We will interchangeably use
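The helper generateData used in the next cell is defined elsewhere in the notebook. A purely illustrative sketch of what it might do (the underlying curve, noise level, and seed below are assumptions, not the notebook's actual code):

import numpy as np
import matplotlib.pyplot as plt

def generateData(n, noisy, seed=0):
    # Hypothetical stand-in for the hidden generateData helper:
    # n samples of an assumed smooth truth curve on [0, 1], with optional Gaussian noise.
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, n)
    y = np.sin(2 * np.pi * x)                      # assumed ground-truth curve
    if noisy:
        y = y + rng.normal(scale=0.25, size=n)     # assumed noise level
    return x, y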
xx, yy = generateData(100, False) # Truth, without noise
x, y = generateData(13, True) # Noisy observation, training samples
plt.plot(xx, yy, 'g--')   # true function (green dashed)
plt.plot(x, y, 'ro')      # noisy training samples (red dots)
coeffs = np.polyfit(x, y, 0) # Leveraging NumPy's polyfit as short-cut; 0 is the degree of the poly
poly = np.poly1d(coeffs) # Polynomial with "learned" parameters
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
plt.plot(xx, poly(xx), "b-")
coeffs = np.polyfit(x, y, 1) # Now let's try degree = 1
poly = np.poly1d(coeffs)
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
plt.plot(xx, poly(xx), "b-")
coeffs = np.polyfit(x, y, 3) # Now degree = 3
poly = np.poly1d(coeffs)
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
plt.plot(xx, poly(xx), "b-")
The function $y(\vec{x}, \vec{w})$ is linear in parameters $\vec{w}$.
The basis functions $\phi_j(\vec{x})$ need not be linear.
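Concretely, this is the linear basis function model assumed by the error function below:

$$ y(\vec{x}, \vec{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\vec{x}) = \vec{w}^T \vec{\phi}(\vec{x}), \qquad \vec{\phi}(\vec{x}) = \big(\phi_0(\vec{x}), \ldots, \phi_{M-1}(\vec{x})\big)^T $$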
r = np.linspace(-1, 1, 100)
f, axs = plt.subplots(1, 3, figsize=(12, 4))
for j in range(8):
    axs[0].plot(r, np.power(r, j))                                   # polynomial basis: r**j
    axs[1].plot(r, np.exp(-(r - j/7.0 + 0.5)**2 / (2 * 0.2**2)))     # Gaussian basis, width s = 0.2
    axs[2].plot(r, 1 / (1 + np.exp(-(r - j/5.0 + 0.5) * 5)))         # sigmoidal basis, slope 1/s = 5
set_nice_plot_labels(axs)  # I'm hiding some helper code that adds labels
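For reference, the three families plotted above (with the centres $\mu_j$ and scale $s = 0.2$ set as in the code) are:

$$ \phi_j(x) = x^j, \qquad \phi_j(x) = \exp\!\left( -\frac{(x - \mu_j)^2}{2 s^2} \right), \qquad \phi_j(x) = \frac{1}{1 + e^{-(x - \mu_j)/s}} $$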
Minimize the residual error over the training data.
$$ E(\vec{w}) = \frac12 \sum_{n=1}^N \left( y(\vec{x}^{(n)}, \vec{w}) - t^{(n)} \right)^2 = \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j(\vec{x}^{(n)}) - t^{(n)} \right)^2 $$

The design matrix is a matrix $\Phi \in R^{N \times M}$ obtained by applying each basis function to each training input: $\Phi_{nj} = \phi_j(\vec{x}^{(n)})$.
Goal: $\Phi \vec{w} \approx \vec{t}$
The least-squares solution applies the Moore-Penrose pseudoinverse, $\Phi^\dagger = (\Phi^T \Phi)^{-1} \Phi^T$ (assuming $\Phi^T \Phi$ is invertible), to solve the linear system $\Phi \vec{w} \approx \vec{t}$: $\vec{w} = \Phi^\dagger \vec{t}$.
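As a quick sketch in NumPy (not part of the original notebook; a degree-3 polynomial basis on the training data above is assumed):

M = 4                                          # number of basis functions phi_0 .. phi_3
Phi = np.vander(x, M, increasing=True)         # design matrix: Phi[n, j] = x_n ** j
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # least-squares solution of Phi w ~= t
w_pinv = np.linalg.pinv(Phi) @ y               # same solution via the pseudoinverse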
To minimize the error, take partial derivatives w.r.t. each weight $w_j$:
$$ \frac{\partial E(\vec{w})}{\partial w_k} = \frac{\partial}{\partial w_k} \left[ \frac12 \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j(\vec{x}^{(n)}) - t^{(n)} \right)^2 \right] $$

Apply the chain rule:

$$ = \sum_{n=1}^N \left[ \left( \sum_{j=0}^{M-1} w_j\phi_j(\vec{x}^{(n)}) - t^{(n)} \right) \frac{\partial}{\partial w_k} \left[ \sum_{j=0}^{M-1} w_j\phi_j(\vec{x}^{(n)}) - t^{(n)} \right] \right] $$

Main Idea: compute the gradient, set it to zero, and solve in closed form.
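Carrying the remaining derivative through (only the $j = k$ term depends on $w_k$) gives

$$ \frac{\partial E(\vec{w})}{\partial w_k} = \sum_{n=1}^N \left( \sum_{j=0}^{M-1} w_j\phi_j(\vec{x}^{(n)}) - t^{(n)} \right) \phi_k(\vec{x}^{(n)}), $$

and setting all $M$ partial derivatives to zero yields the normal equations $\Phi^T \Phi \vec{w} = \Phi^T \vec{t}$, i.e. $\vec{w} = (\Phi^T \Phi)^{-1} \Phi^T \vec{t} = \Phi^\dagger \vec{t}$.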
Suppose that $f : R^{m \times n} \mapsto R$, that is, the function $f$ takes as input a matrix $A \in R^{m \times n}$ and returns a real value.
Then, the gradient of $f$ with respect to $A$ is:
$$ \nabla_A f(A) = \begin{bmatrix} \frac{\partial f}{\partial a_{11}} & \frac{\partial f}{\partial a_{12}} & \cdots & \frac{\partial f}{\partial a_{1n}} \\ \frac{\partial f}{\partial a_{21}} & \frac{\partial f}{\partial a_{22}} & \cdots & \frac{\partial f}{\partial a_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f}{\partial a_{m1}} & \frac{\partial f}{\partial a_{m2}} & \cdots & \frac{\partial f}{\partial a_{mn}} \end{bmatrix} \in R^{m \times n} $$

$$ [\nabla_A f(A)]_{ij} = \frac{\partial f}{\partial a_{ij}} $$

Note that the size of $\nabla_A f(A)$ is always the same as the size of $A$. In particular, for vectors $x \in R^n$,
$$ \nabla_x f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \cdots & \frac{\partial f}{\partial x_n} \end{bmatrix}^T $$

The gradient is a linear operator: it takes a function $f : R^n \mapsto R$ to the vector-valued function $\nabla_x f : R^n \mapsto R^n$.
The gradient of the linear function $f(x) = \sum_{i=1}^n b_i x_i = b^T x$ is

$$ \frac{\partial f}{\partial x_k} = \frac{\partial}{\partial x_k} \sum_{i=1}^n b_i x_i = \sum_{i=1}^n \frac{\partial}{\partial x_k} b_i x_i = b_k $$

In a more compact form,
$$ \nabla_x b^T x = b $$
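The arrays used in the remaining plots, tsts (fitted curves on the grid xx), trns (predictions at the training inputs), es (RMSE values), and rs (R² values), come from a fitting cell that isn't shown. A minimal sketch of how they might be produced (assuming both error measures are computed on the training data):

tsts, trns, es, rs = [], [], [], []
for order in range(4):
    poly = np.poly1d(np.polyfit(x, y, order))            # fit a polynomial of this order
    tsts.append(poly(xx))                                 # fitted curve on the dense grid
    trns.append(poly(x))                                  # predictions at the training inputs
    es.append(np.sqrt(np.mean((poly(x) - y) ** 2)))       # RMSE on the training data (assumed)
    rs.append(1 - np.sum((y - poly(x)) ** 2) / np.sum((y - np.mean(y)) ** 2))  # R^2 (assumed)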
stys = ['k-', 'k-.', 'm-', 'm-.']     # line styles for the four model orders
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro", label="Training data")
for _i in range(4):
    plt.plot(xx, tsts[_i], stys[_i], label='Order {0}'.format(_i))
plt.legend()
f, ax = plt.subplots(ncols=4, sharey=True, figsize=(16, 4))
for _i in range(4):
    ax[_i].plot(xx, yy, "g--")
    ax[_i].plot(x, y, "ro")
    ax[_i].plot(xx, tsts[_i])
    ax[_i].set_title('Order {0}, RMSE={1:4.3f}'.format(_i, es[_i]))
f, ax = plt.subplots(ncols=4, sharey=True, figsize=(16, 4))
for _i in range(4):
    ax[_i].plot(y, y, "r-")            # perfect-prediction reference line
    ax[_i].plot(y, trns[_i], "bo")     # predicted vs. actual targets on the training data
    ax[_i].set_title('Order {0}, R2={1:4.3f}'.format(_i, rs[_i]))