Maximum Likelihood: pick the parameters under which the data are most probable according to our model.
$$ w_{ML} = \arg\max_w P(\mathcal{X} | w) $$
Maximum a Posteriori: pick the parameters under which the data are most probable, weighted by our prior beliefs.
$$ w_{MAP} = \arg\max_w P(\mathcal{X} | w) P(w) $$
Assume a stochastic model $$ \begin{gather} t = y(x,w) + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gather} $$ where $\beta$ is the noise precision (inverse variance).
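Since $\ln$ is monotone, both estimates are unchanged by taking logs, which makes the relationship between them explicit (a standard identity, independent of the model):
$$ \begin{align} w_{ML} &= \arg\max_w \ln P(\mathcal{X}|w) \\ w_{MAP} &= \arg\max_w \left[ \ln P(\mathcal{X}|w) + \ln P(w) \right] \end{align} $$
so MAP estimation is ML estimation plus a log-prior term, which acts as a regularizer.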
This gives the following likelihood function: $$ p(t|x,w,\beta) = \mathcal{N}(t|y(x,w),\beta^{-1}) $$
We will now show that the log likelihood is $$ \ln p(t|X,w,\beta) = \frac{N}{2} \ln\beta - \frac{N}{2} \ln{2\pi}- \beta E_D(w) $$ where $$ E_D(w) = \frac{1}{2} \sum_{n=1}^N \left[ t_n-w^T\phi(x_n) \right]^2 = \frac{1}{2} ||t-\Phi w||^2 $$
Recall the design matrix $\Phi$: $$ \Phi = \begin{bmatrix} \phi_0(\vec{x}_1) & \phi_1(\vec{x}_1) & \cdots & \phi_{M-1}(\vec{x}_1) \\ \phi_0(\vec{x}_2) & \phi_1(\vec{x}_2) & \cdots & \phi_{M-1}(\vec{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\vec{x}_N) & \phi_1(\vec{x}_N) & \cdots & \phi_{M-1}(\vec{x}_N) \\ \end{bmatrix} $$
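As a concrete illustration, here is a minimal sketch of building $\Phi$ for a polynomial basis $\phi_j(x) = x^j$; the basis choice and function name are assumptions for this example:

```python
import numpy as np

def design_matrix(x, M):
    """N x M design matrix: row n is [phi_0(x_n), ..., phi_{M-1}(x_n)]."""
    x = np.asarray(x, dtype=float)
    # polynomial basis phi_j(x) = x**j; phi_0 is the constant 1
    return np.column_stack([x**j for j in range(M)])

print(design_matrix([0.1, 0.4, 0.7], M=3))
# each row: [1, x_n, x_n**2]
```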
From $P(t| x,w) = \sqrt\frac{\beta}{2\pi} \exp\left(-\frac{\beta}{2} \left[ t-w^T \phi(x) \right]^2\right)$ we have $$ \begin{align} &\; \ln P(t_1, \dots, t_N | x,w) \\ =&\; \ln \prod_{k=1}^N \mathcal{N}(t_k| w^T \phi(x^{(k)}), \beta^{-1}) \\ =&\; \sum_{k=1}^N \ln \left[ \sqrt\frac{\beta}{2\pi} \exp \left( -\frac{\beta}{2} \left[ t_k-w^T \phi(x^{(k)}) \right]^2 \right) \right] \\ =&\; \frac{N}{2} \ln\beta - \frac{N}{2} \ln{2\pi} - \beta E_D(w) \end{align} $$
Maximizing the likelihood is therefore equivalent to minimizing the sum-of-squares error $E_D(w)$, since the first two terms do not depend on $w$.
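As a quick numerical sanity check of the closed form above (all names and data here are made up for illustration):

```python
import numpy as np

# Compare the closed-form log likelihood with a direct per-point sum.
rng = np.random.default_rng(0)
N, beta = 20, 10.0
x = rng.uniform(0, 1, N)
Phi = np.column_stack([np.ones(N), x])           # basis: phi(x) = [1, x]
t = Phi @ np.array([0.5, 2.0]) + rng.normal(0, beta**-0.5, N)

w = np.array([0.4, 1.9])                         # any candidate weights
E_D = 0.5 * np.sum((t - Phi @ w) ** 2)           # sum-of-squares error
closed = N/2 * np.log(beta) - N/2 * np.log(2*np.pi) - beta * E_D
direct = sum(0.5 * np.log(beta/(2*np.pi)) - beta/2 * (tn - w @ ph)**2
             for tn, ph in zip(t, Phi))
assert np.isclose(closed, direct)
```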
Set the gradient of the log-likelihood to zero: $$ \begin{align} \nabla_w \ln p(t | w, \beta) &= \nabla_w \left[ \frac{N}{2}\ln\beta - \frac{N}{2}\ln{2\pi} - \beta E_D(w) \right] = 0 \\ &\Rightarrow \nabla_w \left[ \frac{1}{2} ||t-\Phi w||^2 \right] = -\Phi^T (t - \Phi w) = 0 \\ &\Rightarrow \boxed{(\Phi^T \Phi) w = \Phi^T t} \end{align} $$ These are the normal equations of least squares.
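A minimal sketch of solving the boxed normal equations on synthetic data; the explicit solve mirrors the equation, while `np.linalg.lstsq` is the numerically preferred route in practice:

```python
import numpy as np

# Synthetic linear data (all values here are made up for illustration)
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 20)
t = 0.5 + 2.0 * x + rng.normal(0, 0.1, 20)
Phi = np.column_stack([np.ones_like(x), x])   # basis [1, x]

# Solve (Phi^T Phi) w = Phi^T t directly, mirroring the boxed equation
w_ne = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)
# Preferred in practice: least squares on Phi w = t (more stable)
w_ls, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w_ne, w_ls)   # both recover roughly [0.5, 2.0]
```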
Again we assume a stochastic model $$ \begin{gather} t = y(x,w) + \epsilon \\ \epsilon \sim \mathcal{N}(0, \beta^{-1}) \end{gather} $$ where $\beta$ again represents the noise precision.
With inputs $X=(x_1, \dots, x_N)$ and target values $t=(t_1,\dots,t_N)$, the data likelihood is $$ p(t|w,\beta) = \prod_{n=1}^N \mathcal{N}(t_n|w^T\phi(x_n),\beta^{-1}) $$ (the conditioning on $X$ is omitted for brevity)
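With a zero-mean isotropic Gaussian prior $p(w) = \mathcal{N}(0, \alpha^{-1} I)$, the posterior over $w$ is Gaussian with mean $m_N$ and covariance $S_N$ (PRML eqs. 3.53-3.54). Here is a minimal sketch of that update; presumably the `pwD` helper used below computes something like this, but the name `posterior` and its interface are ours:

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the prior N(0, alpha^{-1} I).

    S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta * S_N Phi^T t
    (PRML eqs. 3.53-3.54).
    """
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```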
There are many knobs in this example:
alpha = 1e-6   # prior precision on w (a nearly flat prior)
beta = 1e1     # noise precision
N = 20         # number of training points used
x, t = xSmpl[:N], tSmpl[:N]                   # first N sampled inputs and targets
mN, SN = pwD(x, t, alpha, beta)               # posterior mean and covariance of w
print(mN)
f, ax = plt.subplots(ncols=2, figsize=(16, 8))
pltMvN(ax[0], mN, SN)                         # left: posterior density over w
pltSmp(ax[1], x, t, mN, SN, beta, bnd=True)   # right: data, fit, and predictive bounds
[0.5881034 1.93609618]
Another example from PRML:
PRML also gives an iterative procedure to determine $m_N$, $S_N$, $\alpha$, and $\beta$: starting from initial guesses for $\alpha$ and $\beta$, compute the posterior $(m_N, S_N)$, re-estimate $\alpha$ and $\beta$ from it, and repeat until convergence.
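A minimal sketch of that loop, following PRML section 3.5.2; the initial values, fixed iteration count, and function name are assumptions:

```python
import numpy as np

def evidence_iterate(Phi, t, alpha=1.0, beta=1.0, iters=100):
    """Alternate posterior updates with re-estimation of alpha and beta."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)
    for _ in range(iters):
        lam = beta * eig0                    # eigenvalues of beta*Phi^T Phi
        S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m_N = beta * S_N @ Phi.T @ t
        gamma = np.sum(lam / (alpha + lam))  # effective number of parameters
        alpha = gamma / (m_N @ m_N)          # PRML eq. 3.92
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)   # PRML eq. 3.95
    return alpha, beta, m_N, S_N
```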
You will try this out in the homework.