# Degree-0 (constant) polynomial: underfits
coeffs = np.polyfit(x, y, 0)
poly = np.poly1d(coeffs)
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
_=plt.plot(xx, poly(xx), "b-")
# Degree-3 polynomial
coeffs = np.polyfit(x, y, 3)
poly = np.poly1d(coeffs)
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
_=plt.plot(xx, poly(xx), "b-")
# Degree-12 polynomial: high capacity, prone to overfitting
coeffs = np.polyfit(x, y, 12)
poly = np.poly1d(coeffs)
plt.plot(xx, yy, "g--")
plt.plot(x, y, "ro")
_=plt.plot(xx, poly(xx), "b-")
Root-Mean-Square (RMS) Error: $E_{RMS} = \sqrt{2E(w^*) / N} = \sqrt{\frac{1}{N}\sum_{i=1}^N[f(x_i)-t_i]^2}$
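The $E_{RMS}$ values plotted below could be computed with a small helper like the following (a sketch only; it assumes the fitted np.poly1d objects and the numpy import from the earlier cells):
# Sketch: RMS error of a fitted polynomial on inputs x with targets t
def rms_error(poly, x, t):
    return np.sqrt(np.mean((poly(x) - t) ** 2))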
f = plt.figure()
plt.plot(_ps, _trns, 'bo-', label='Training error')
plt.plot(_ps, _tsts, 'rs-', label='Test error')
plt.legend()
plt.xlabel("Order of polynomial")
_=plt.ylabel("RMSE")
for _s in _d:
    print(_s)
       M=1       M=3       M=6        M=12
w0    0.72054  -0.12625  -0.03885    -0.08472
w1   -0.20345   1.80812   1.37196   -14.13425
w2             -0.82882  -0.62727   101.81045
w3              0.08765   0.20524  -261.91109
w4                       -0.08839   354.91284
w5                        0.01775  -290.25054
w6                       -0.00115   153.36883
w7                                  -54.19458
w8                                   12.94331
w9                                   -2.06431
w10                                   0.21074
w11                                  -0.01246
w12                                   0.00032
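A hypothetical sketch of how a table like the one above could be printed with np.polyfit; the orders 1, 3, 6, 12 are read off the columns of the table:
# Print the fitted coefficients for several polynomial orders (x, y as above)
for m in [1, 3, 6, 12]:
    w = np.polyfit(x, y, m)[::-1]  # reverse so w[0] is the constant term
    print(f"M={m}: " + "  ".join(f"w{j}={wj:.5f}" for j, wj in enumerate(w)))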
(12th order polynomial, 15 data points)
x1, y1 = generateData(15, True)
coeffs = np.polyfit(x1, y1, 12)
poly = np.poly1d(coeffs)
_=plt.plot(xx, yy, "g--", x1, y1, "ro", xx, poly(xx), "b-")
(12th order polynomial, 100 data points)
x2, y2 = generateData(100, True)
coeffs = np.polyfit(x2, y2, 12)
poly = np.poly1d(coeffs)
_=plt.plot(xx, yy, "g--", x2, y2, "ro", xx, poly(xx), "b-")
The order of the polynomial is also the number of features (basis functions) in the model.
Numerically, the issue traces back to the "conditioning" of the $\Phi^T\Phi$ matrix.
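To see this directly, here is a small sketch that checks the condition number of $\Phi^T\Phi$ when $\Phi$ is a raw polynomial (Vandermonde) design matrix built from the x used above; the orders shown are illustrative:
# Condition number of Phi^T Phi grows rapidly with the polynomial order
for order in [1, 3, 6, 12]:
    Phi = np.vander(x, order + 1)
    print(order, np.linalg.cond(Phi.T @ Phi))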
Example: solve $x=A^{-1}b$ with $$ A=\begin{bmatrix}2\alpha & 1 \\ 1 & \frac{2}{\alpha}\end{bmatrix},\quad b=\begin{bmatrix}1 \\ 10^{10}\end{bmatrix}, $$ and then verify whether we recover $b$ by computing $Ax = AA^{-1}b$.
# Condition number kappa(A) as a function of alpha
kA = lambda a: np.linalg.cond([[2*a,1.],[1.,2/a]])
a = 10**np.linspace(1,11,21)
k = [kA(_) for _ in a]
plt.loglog(a, k)
plt.xlabel(r'$\alpha$')
_=plt.ylabel(r'$\kappa(A)$')
# Build A(alpha), solve Ax = b with b = [1, 1e10], and return the first element of Ax
fA = lambda a: np.array([[2*a,1.],[1.,2./a]])
sA = lambda a: fA(a).dot(np.linalg.solve(fA(a), [1,1e10]))[0]
a = 10**np.linspace(1,11,21)
s = [sA(_) for _ in a]
plt.semilogx(a, np.ones_like(a), 'r--', label='Expected')
plt.semilogx(a, s, 'b-', label='Numerical')
plt.xlabel(r'$\alpha$')
plt.ylabel(r'First element of $AA^{-1}b$')
_=plt.legend()
Or, instead of adding more data, we can regularize; adding $\lambda I$ also improves the conditioning of the system we solve.
Recall that our objective function is
\begin{align} E(w) &= \frac12 \sum_{n=1}^N (w^T \phi(x^{(n)}) - t^{(n)})^2 + \frac{\lambda}{2} w^T w \\ &= \frac12 w^T \Phi^T \Phi w - w^T \Phi^T t + \frac12 t^T t + \frac{\lambda}{2} w^T w \end{align}
Compute the gradient and set it to zero:
$$ \begin{align} \nabla_w E(w) &= \nabla_w \left[ \frac12 w^T \Phi^T \Phi w - w^T \Phi^T t + \frac12 t^T t + \frac{\lambda}{2} w^T w \right] \\ &= \Phi^T \Phi w - \Phi^T t + \lambda w \\ &= (\Phi^T \Phi + \lambda I) w - \Phi^T t = 0 \end{align} $$
Therefore, we get $\boxed{w_{\text{ridge}} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t}$
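In code, the boxed solution is a single linear solve. A minimal sketch, assuming a design matrix Phi, target vector t, and regularization strength lam (none of which are defined in this cell):
# Closed-form ridge solution; solve() is preferred over an explicit inverse
def ridge_closed_form(Phi, t, lam):
    M = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M), Phi.T @ t)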
We can make use of the various $L_p$ norms for different regularizers:
$$ \widetilde{E}(w) = \frac12 \sum_{n=1}^N (t_n - w^T \phi(x_n))^2 + \frac{\lambda}{2} \sum_{j=1}^M |w_j|^q $$
With $q=2$ we recover the quadratic penalty above; L2 regularization is also called Ridge Regression. With $q=1$ we get L1 regularization, known as the Lasso.
Another explanation for L2 regularization, using Fourier analysis: https://arxiv.org/abs/1806.08734v1
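The fitModel helper used in the next cells is defined elsewhere in the notebook; the sketch below is only a plausible reconstruction, assuming degree-12 polynomial features, a train/test split (x, y) / (x_test, y_test), and a return value of (test $R^2$ score, coefficients):
# Hypothetical fitModel: fit a regularized linear model on polynomial features
def fitModel(alpha, Model, degree=12):
    Phi, Phi_test = np.vander(x, degree + 1), np.vander(x_test, degree + 1)
    model = Model(alpha=alpha).fit(Phi, y)
    return model.score(Phi_test, y_test), model.coef_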
rdg_sc, rdg_coef = fitModel(1.0, lm.Ridge)
las_sc, las_coef = fitModel(0.01, lm.Lasso)
ii = np.arange(13)
f = plt.figure(figsize=(6,4))
plt.semilogy(ii, np.abs(_cs[11]), 'r--', label='LR')
plt.semilogy(ii, np.abs(rdg_coef), 'b-', label='Ridge')
plt.semilogy(ii, np.abs(las_coef), 'g-', label='Lasso')
plt.legend()
_=plt.xlabel(r'$w_i$')
_, _ = fitModel(1e-8, lm.Ridge)   # nearly unregularized
_, _ = fitModel(1e8, lm.Ridge)    # heavily regularized
f = plt.figure()
plt.plot(np.log(lams), errs[0], 'b-', label='Training error')
plt.plot(np.log(lams), errs[1], 'r-', label='Test error')
plt.xlabel(r'$\log(\lambda)$')
_=plt.ylabel('1-R2')
NOTE: For simplicity of presentation, we divided the data into a training set and a test set. However, it is not legitimate to choose the optimal hyperparameter based on the test set. We will discuss legitimate ways of doing this when we cover model selection and cross-validation.
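As a preview, one legitimate approach is to pick $\lambda$ by K-fold cross-validation on the training data alone; a minimal sketch with scikit-learn, where Phi, y, and the grid of candidate values are illustrative (lm is sklearn.linear_model, as in the cells above):
# Choose the regularization strength by 5-fold cross-validation on training data
from sklearn.model_selection import cross_val_score
lam_grid = np.logspace(-8, 2, 21)
cv_scores = [cross_val_score(lm.Ridge(alpha=lam), Phi, y, cv=5).mean() for lam in lam_grid]
best_lam = lam_grid[int(np.argmax(cv_scores))]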
The bias of the Lasso estimate for a truly nonzero variable is about $\lambda$ when the regression coefficient is large.
To reduce this bias: the Adaptive Lasso, MCP, or SCAD penalties.
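For example, the Adaptive Lasso reweights the L1 penalty by the inverse of an initial coefficient estimate; a hypothetical sketch via column rescaling (Phi, y, and the alpha values are assumptions, with lm as above):
# Adaptive Lasso sketch: weight the L1 penalty by 1/|w_init| from a ridge fit
w_init = lm.Ridge(alpha=1.0, fit_intercept=False).fit(Phi, y).coef_
weights = 1.0 / (np.abs(w_init) + 1e-6)          # epsilon guards against division by zero
lasso = lm.Lasso(alpha=0.01, fit_intercept=False).fit(Phi / weights, y)
w_adaptive = lasso.coef_ / weights               # undo the column rescaling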
We will see the double-descent phenomenon: as the number of features keeps growing past the point where the model can fit the training data exactly, the test error can start decreasing again alongside the training error.
To be discussed more in L7.