We are going to cover backward propagation in the next lecture.
To minimize $F(w)$, find roots of $F'(w)$ via Newton's Method.
Repeat until convergence: $$ \boxed{w \leftarrow w - \frac{F'(w)}{F''(w)}} $$
In multiple dimensions, replace the second derivative with the Hessian matrix, $$ H_{ij}(w) = \frac{\partial^2 F}{\partial w_i \partial w_j} $$
Newton update becomes: $$ \boxed{ w \leftarrow w - H^{-1} \nabla_w F} $$
Let $p=-H^{-1} \nabla_w F$ and choose the step size $\alpha$ by a line search, $$ \min_\alpha F(w+\alpha p) $$ Then $$ \boxed{ w \leftarrow w - \alpha H^{-1} \nabla_w F} $$
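To make the update concrete, here is a minimal NumPy sketch of the damped Newton iteration above with a simple backtracking line search, applied to the 2-D Rosenbrock function. The names here (rosenbrock, grad, hess, newton) are illustrative and are not the notebook's FGH/gradDesc helpers.

import numpy as np

def rosenbrock(w):
    # Test objective F(w) with a curved valley and minimizer at (1, 1)
    x, y = w
    return (1 - x)**2 + 100*(y - x**2)**2

def grad(w):
    x, y = w
    return np.array([-2*(1 - x) - 400*x*(y - x**2), 200*(y - x**2)])

def hess(w):
    x, y = w
    return np.array([[2 - 400*(y - 3*x**2), -400*x], [-400*x, 200.0]])

def newton(w0, iters=50, tol=1e-8):
    w = np.array(w0, dtype=float)
    for _ in range(iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:            # stop once the gradient is (nearly) zero
            break
        p = -np.linalg.solve(hess(w), g)       # Newton direction p = -H^{-1} grad F
        alpha = 1.0                            # backtrack until F(w + alpha p) decreases
        while rosenbrock(w + alpha*p) > rosenbrock(w) and alpha > 1e-8:
            alpha *= 0.5
        w = w + alpha*p
    return w

print(newton([-1.5, 0.5]))   # should approach the minimizer [1, 1]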
The $\epsilon-\alpha$ curve [1]
[1] Y. Dauphin et al. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. NeurIPS 2014.
Intuitively, in high dimensions, the chance that all the directions around a critical point lead upward (positive curvature) is exponentially small w.r.t. the number of dimensions.
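A back-of-the-envelope illustration (a crude independence assumption, not the random-matrix argument in [1]): if the sign of the curvature along each of $d$ directions were an independent fair coin flip, then $$ P(\text{all } d \text{ curvatures} > 0) = 2^{-d}, \qquad \text{e.g. } 2^{-100} \approx 10^{-30}, $$ so a critical point of a high-dimensional loss is overwhelmingly likely to be a saddle point rather than a local minimum or maximum.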
Implications: in high dimensions, most critical points of the loss are saddle points rather than poor local minima, so the practical difficulty for gradient-based training is escaping saddle points, not getting trapped in bad local minima.
Given a dataset $\{ (\textbf{x}^{(n)}, y^{(n)}) \}_{n=1}^N$ and an initial guess $\theta_0$, repeat until convergence: $$ \boxed{ \theta \leftarrow \theta - \eta \nabla_\theta \mathcal{L}(\theta)} $$
where $\eta$ is the step size, or learning rate. Assuming quadratic loss,
$$ \nabla_\theta \mathcal{L}(\theta) = \sum_{n=1}^N \left[ f\left( \textbf{x}^{(n)} ; \theta \right) - y^{(n)} \right] \frac{\partial f\left( \textbf{x}^{(n)} ; \theta \right)}{\partial \theta} $$
Main idea of stochastic gradient descent (SGD): instead of computing the batch gradient (over the entire training set), compute the gradient for an individual example and update immediately.
Repeat until convergence: for $n=1,\dots,N$
$$ \theta = \theta - \eta \nabla_\theta \mathcal{L}(\textbf{x}^{(n)}; \theta) $$ where
$$ \nabla_\theta \mathcal{L}(\textbf{x}^{(n)}; \theta) = \left[ f\left( \textbf{x}^{(n)} ; \theta \right) - y^{(n)} \right] \frac{\partial f\left( \textbf{x}^{(n)} ; \theta \right)}{\partial \theta} $$
For the cost comparison between batch and stochastic updates in [1], let $n$ be the batch size, $\epsilon\sim 10^{-3}$ the final "loss" (the target optimization accuracy), and $T$ the allowable training time.
[1] L. Bottou, et al. Stochastic Gradient Methods For Large-Scale Machine Learning. arXiv:1606.04838; also available as slides: http://users.iems.northwestern.edu/~nocedal/ICML
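For concreteness, here is a minimal NumPy sketch comparing the two updates on a toy least-squares problem with a linear model $f(\textbf{x};\theta)=\theta^\top \textbf{x}$ (all names, sizes, and learning rates here are illustrative choices, not part of the lecture code):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                      # inputs x^(n)
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.01*rng.normal(size=1000)     # targets y^(n)

def f(x, theta):
    return x @ theta                                # linear model f(x; theta) = theta^T x

# Batch gradient descent: each update sums the gradient over all N examples
theta = np.zeros(5)
eta = 0.001
for _ in range(200):
    grad = (f(X, theta) - y) @ X                    # sum_n [f(x^(n)) - y^(n)] df/dtheta
    theta -= eta*grad

# Stochastic gradient descent: one update per individual example
theta_sgd = np.zeros(5)
for epoch in range(5):
    for n in rng.permutation(len(X)):               # shuffle the data each pass
        g = (f(X[n], theta_sgd) - y[n])*X[n]
        theta_sgd -= 0.01*g

print(np.linalg.norm(theta - theta_true), np.linalg.norm(theta_sgd - theta_true))

Both runs recover $\theta$ close to the true parameters; each SGD step touches only one example, so it is far cheaper per update but noisier.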
SGD with momentum: let the optimizer remember the previous gradients, $$ v_k = \rho v_{k-1} + (1-\rho)\nabla_\theta F(\theta_k) $$ Then $$ \boxed{ \theta_{k+1} = \theta_k - \eta v_k} $$
AdaGrad: let the optimizer tune the learning rate automatically, $$ g_k = g_{k-1} + [\nabla_\theta F(\theta_k)]^2 $$ Then $$ \boxed{ \theta_{k+1} = \theta_k - \frac{\eta}{\sqrt{g_k}+\epsilon} \nabla_\theta F(\theta_k)} $$
RMSProp: similar to AdaGrad, but allow the optimizer to "forget", $$ g_k = \alpha g_{k-1} + (1-\alpha)[\nabla_\theta F(\theta_k)]^2 $$ Then $$ \boxed{ \theta_{k+1} = \theta_k - \frac{\eta}{\sqrt{g_k}+\epsilon} \nabla_\theta F(\theta_k)} $$
Adam: combine the best of SGD with momentum and RMSProp, $$v_k = \beta_1 v_{k-1} + (1-\beta_1)\nabla_\theta F(\theta_k)$$ $$g_k = \beta_2 g_{k-1} + (1-\beta_2)[\nabla_\theta F(\theta_k)]^2$$ Then $$ \boxed{ \theta_{k+1} = \theta_k - \frac{\eta}{\sqrt{g_k}+\epsilon} v_k} $$
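A minimal NumPy sketch of the four boxed update rules above, written exactly as stated (state initialized to zero and, for simplicity, no bias correction in Adam, unlike the published algorithm); the function names and default hyperparameters are illustrative:

import numpy as np

def sgd_momentum(theta, grad_fn, eta=0.01, rho=0.9, steps=500):
    v = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        v = rho*v + (1 - rho)*g                  # exponential average of past gradients
        theta = theta - eta*v
    return theta

def adagrad(theta, grad_fn, eta=0.1, eps=1e-8, steps=500):
    s = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        s = s + g**2                             # accumulated squared gradients (never forgets)
        theta = theta - eta/(np.sqrt(s) + eps)*g
    return theta

def rmsprop(theta, grad_fn, eta=0.01, alpha=0.9, eps=1e-8, steps=500):
    s = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        s = alpha*s + (1 - alpha)*g**2           # exponentially forgets old squared gradients
        theta = theta - eta/(np.sqrt(s) + eps)*g
    return theta

def adam(theta, grad_fn, eta=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=500):
    v = np.zeros_like(theta)                     # momentum term
    s = np.zeros_like(theta)                     # RMSProp-style squared-gradient term
    for _ in range(steps):
        g = grad_fn(theta)
        v = beta1*v + (1 - beta1)*g
        s = beta2*s + (1 - beta2)*g**2
        theta = theta - eta/(np.sqrt(s) + eps)*v
    return theta

# Example: minimize F(theta) = ||theta||^2 / 2, whose gradient is theta itself
print(adam(np.array([1.0, -2.0]), lambda th: th))   # approaches the minimum at the origin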
People have devised many, many other gradient-descent variants. Some perform best in specific applications, but there is no "one-size-fits-all" solution.
You will implement one of them in HW3.
There are also attempts to combine the advantages of different methods, e.g. [1]
[1] N. Keskar, R. Socher, Improving Generalization Performance by Switching from Adam to SGD, arXiv:1712.07628
x0 = [-1.5,0.5]        # initial guess
# x0 = [1.0, 0.0]      # alternative initial guess
# xs, fs, gs, hs = gradDesc(FGH, x0, BBLR)              # alternative: BB step-size rule
# xs, fs, gs, hs = gradDesc(FGH, x0, BFGS, ifLS=True)   # alternative: quasi-Newton (BFGS) with line search
xs, fs, gs, hs = gradDesc(FGH, x0, NR, ifLS=True)       # Newton-Raphson with line search
f1, ax = pltFunc()             # plot the objective function
ax[0] = pltTraj(ax[0], xs)     # overlay the iterate trajectory
ax[1] = pltHist(ax[1], xs)     # plot the convergence history
x0 = [-1.5,0.5]        # initial guess
ss = 0.5               # noise level in the gradient (simulates stochastic gradients)
# xs, fs, gs, hs = gradDesc(FGHS(ss), x0, fixLR(0.0001))      # SGD with a fixed learning rate
# xs, fs, gs, hs = gradDesc(FGHS(ss), x0, GDM(0.0001,0.9))    # SGD with momentum
# xs, fs, gs, hs = gradDesc(FGHS(ss), x0, AdaGrad(2.0))       # AdaGrad
# xs, fs, gs, hs = gradDesc(FGHS(ss), x0, RMSProp(0.1,0.9))   # RMSProp
xs, fs, gs, hs = gradDesc(FGHS(ss), x0, Adam(0.1,0.9,0.999))  # Adam
f1, ax = pltFunc()             # plot the objective function
ax[0] = pltTraj(ax[0], xs)     # overlay the iterate trajectory
ax[1] = pltHist(ax[1], xs)     # plot the convergence history
One more awesome interactive visualization: https://github.com/lilipads/gradient_descent_viz
Additional conclusion from [1]: a small learning rate or a large batch size makes it much harder to escape from sharp minima.
[1] Z. Xie. A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima. arXiv:2002.03495
Note: "dev" (development set) = validation set.
[1] A. Ng, Nuts and Bolts of Building Applications using Deep Learning, NeurIPS 2016
Always normalize and shuffle your data!
Pick the right learning rate: e.g., grid search [1] or (cyclic) scheduling [2,3]; a minimal schedule sketch follows this list of tips
[1] S. Gugger, How Do You Find A Good Learning Rate
[2] I. Loshchilov, SGDR: Stochastic gradient descent with warm restarts, arXiv:1608.03983
[3] L. Smith, Cyclical Learning Rates for Training Neural Networks, 2017
Pick the right optimizer: Typically start with Adam, fine-tune with SGD
Start with ReLU or tanh for activation; avoid sigmoid unless for "gate"-like layers
Try overfitting on a small dataset first, as a sanity check that your model and training loop work
Record your training experiments - learn from history
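As referenced in the learning-rate tip above, here is a minimal sketch of a cosine schedule with warm restarts in the spirit of SGDR [2]; the period, the learning-rate bounds, and the function name are illustrative choices, not the method's canonical settings.

import math

def cosine_warm_restart_lr(step, eta_min=1e-4, eta_max=1e-1, period=1000):
    # Cosine-annealed learning rate that restarts every `period` steps
    t = (step % period) / period                 # position within the current cycle, in [0, 1)
    return eta_min + 0.5*(eta_max - eta_min)*(1 + math.cos(math.pi*t))

print(cosine_warm_restart_lr(0), cosine_warm_restart_lr(999))   # eta_max at a restart, ~eta_min at the end of a cycle

In a training loop, the returned value is simply used as the $\eta$ in whichever update rule above is being run.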
[1] R. Sun. Optimization for deep learning: theory and algorithms. arXiv:1912.08957. A nice comprehensive summary.
[1] X. Glorot, et al. Understanding the difficulty of training deep feedforward neural networks. AISTATS 2010
[2] K. He, et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv:1502.01852
[1] G. Hinton, et al. Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580
[1] S. Ioffe, et al. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.
[2] T. Salimans, et al. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. NeurIPS 2016
By the way, look at that deep double descent!
[1] T. Ishida, et al. Do We Need Zero Training Loss After Achieving Zero Training Error? arXiv:2002.08709