# Fitting the parameters of a Normal using the maximum likelihood principle

As before, we have \(N\) independent measurements.
Assume that each measurement \(X_i\) follows a Normal distribution with parameters \(\mu\) and \(\sigma^2\):

\[
p(x_i|\mu,\sigma^2) = f(x_i;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}.
\]

So, in this case:

\[
\theta = (\mu, \sigma^2),
\]

and

\[
f(x;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}.
\]

The likelihood of the data is:

\[\begin{split}
\begin{split}
p(x_{1:N}|\theta) &= \prod_{i=1}^Nf(x_i;\theta)\\
&= \prod_{i=1}^N\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}\\
&= (2\pi\sigma^2)^{-\frac{N}{2}}\prod_{i=1}^N\exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}\\
&= (2\pi\sigma^2)^{-\frac{N}{2}}\exp\left\{-\sum_{i=1}^N\frac{(x_i-\mu)^2}{2\sigma^2}\right\}.
\end{split}
\end{split}\]

According to the maximum likelihood principle, we must pick the \(\mu\) and \(\sigma^2\) that maximize this expression. Since the logarithm is strictly increasing, it is equivalent, and more convenient, to maximize the logarithm of the likelihood.
Let’s find the logarithm first.
I am going to call it \(J(\mu,\sigma^2)\):

\[
J(\mu,\sigma^2) = \log p(x_{1:N}|\theta) = -\frac{N}{2}\log (2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2.
\]
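
To make this concrete, here is a minimal sketch of \(J\) in code, assuming NumPy and SciPy are available; the data are synthetic and the check simply confirms that \(J\) equals the sum of the individual Normal log-densities:

```python
import numpy as np
from scipy.stats import norm

def J(mu, sigma2, x):
    """Log-likelihood of N i.i.d. Normal samples x under (mu, sigma2)."""
    N = x.shape[0]
    return (
        -0.5 * N * np.log(2.0 * np.pi)
        - 0.5 * N * np.log(sigma2)
        - 0.5 * np.sum((x - mu) ** 2) / sigma2
    )

# Synthetic data just for the check:
x = np.random.default_rng(0).normal(loc=1.0, scale=2.0, size=100)

# J must agree with summing the Normal log-pdf over the data:
print(J(1.0, 4.0, x))
print(norm.logpdf(x, loc=1.0, scale=2.0).sum())
```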

So, model training has now become a calculus problem.
You need to maximize the two-variable function \(J(\mu,\sigma^2)\) with respect to
\(\mu\) and \(\sigma^2\).
How do you proceed?
We could either employ a numerical optimization algorithm or do it analytically.
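
If we went the numerical route, a minimal sketch could look like this (assuming SciPy's `scipy.optimize.minimize`; the data and starting point are made up for illustration, and we optimize over \(\log\sigma^2\) so that the variance stays positive):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=1000)
N = x.shape[0]

def negative_J(theta):
    """Negative log-likelihood; theta = (mu, log sigma^2)."""
    mu, log_sigma2 = theta
    sigma2 = np.exp(log_sigma2)
    return 0.5 * N * np.log(2.0 * np.pi * sigma2) + 0.5 * np.sum((x - mu) ** 2) / sigma2

res = minimize(negative_J, x0=np.array([0.0, 0.0]))
mu_hat, sigma2_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma2_hat)  # should be close to the true values 1.0 and 4.0
```
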
Let’s do it analytically in this simple case.
A necessary condition for a maximum is that the partial derivatives of \(J\) with respect to the parameters are zero.
Let’s find the derivative of \(J\) with respect to \(\mu\).
It is:

\[\begin{split}
\begin{split}
\frac{\partial J}{\partial \mu} &=
\frac{\partial}{\partial \mu}\left[-\frac{N}{2}\log (2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2\right]\\
&= \frac{\partial}{\partial \mu}\left[-\frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2\right]\\
&= -\frac{1}{2\sigma^2}\sum_{i=1}^N\frac{\partial}{\partial \mu}(x_i-\mu)^2\\
&= -\frac{1}{2\sigma^2}\sum_{i=1}^N2(x_i-\mu)\frac{\partial}{\partial \mu}(x_i-\mu)\\
&= -\frac{1}{2\sigma^2}\sum_{i=1}^N2(x_i-\mu)(-1)\\
&= \frac{1}{\sigma^2}\sum_{i=1}^N(x_i-\mu)\\
&= \frac{1}{\sigma^2}\left[\sum_{i=1}^Nx_i - \sum_{i=1}^N\mu\right]\\
&= \frac{1}{\sigma^2}\left[\sum_{i=1}^Nx_i - N\mu\right].
\end{split}
\end{split}\]
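
As a quick sanity check of the last line, a central finite difference of \(J\) in \(\mu\) should match it numerically (a sketch assuming NumPy; the data and evaluation point are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
N = x.shape[0]
mu, sigma2, h = 0.3, 1.5, 1e-6

def J(mu, sigma2):
    return -0.5 * N * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum((x - mu) ** 2) / sigma2

fd = (J(mu + h, sigma2) - J(mu - h, sigma2)) / (2.0 * h)  # central difference
analytic = (np.sum(x) - N * mu) / sigma2                  # the formula above
print(fd, analytic)  # should agree to several digits
```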

Alright, that wasn’t too bad!
Now set this derivative equal to zero and solve for \(\mu\).
You get:

\[
\hat{\mu} = \frac{1}{N}\sum_{i=1}^Nx_i.
\]

This is nice! It is exactly what we would expect: the sample mean. It also matches what we got using the method of moments.
Let’s proceed to \(\sigma^2\).
We need to find the derivative of \(J\) with respect to it.

\[\begin{split}
\begin{split}
\frac{\partial J}{\partial \sigma^2} &=
\frac{\partial}{\partial \sigma^2}\left[-\frac{N}{2}\log (2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2\right]\\
&= \frac{\partial}{\partial \sigma^2}\left[-\frac{N}{2}\log (2\pi) - \frac{N}{2}\log\sigma^2 - \left(\sigma^2\right)^{-1}\frac{1}{2}\sum_{i=1}^N(x_i-\mu)^2\right]\\
&= -\frac{N}{2}\frac{\partial}{\partial \sigma^2}\log\sigma^2
- \frac{\sum_{i=1}^N(x_i-\mu)^2}{2}\frac{\partial}{\partial \sigma^2}\left(\sigma^2\right)^{-1}\\
&=-\frac{N}{2}\frac{1}{\sigma^2} - \frac{\sum_{i=1}^N(x_i-\mu)^2}{2}(-1)\left(\sigma^2\right)^{-2}\\
&= -\frac{N}{2\sigma^2} + \frac{\sum_{i=1}^N(x_i-\mu)^2}{2\sigma^4}\\
&= \frac{-N\sigma^2 + \sum_{i=1}^N(x_i-\mu)^2}{2\sigma^4}.
\end{split}
\end{split}\]
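
The same finite-difference check as before applies here, now differencing in \(\sigma^2\) (again a sketch with arbitrary synthetic data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=50)
N = x.shape[0]
mu, sigma2, h = 0.3, 1.5, 1e-6

def J(mu, sigma2):
    return -0.5 * N * np.log(2.0 * np.pi * sigma2) - 0.5 * np.sum((x - mu) ** 2) / sigma2

fd = (J(mu, sigma2 + h) - J(mu, sigma2 - h)) / (2.0 * h)                # central difference
analytic = (-N * sigma2 + np.sum((x - mu) ** 2)) / (2.0 * sigma2 ** 2)  # the formula above
print(fd, analytic)  # should agree to several digits
```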

Setting this equal to zero and solving for \(\sigma^2\) yields:

\[
\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N(x_i-\hat{\mu})^2,
\]

where I have substituted \(\hat{\mu}\) for \(\mu\), since both optimality conditions must hold simultaneously.
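
In code, both maximum likelihood estimates are one-liners (a sketch assuming NumPy; note that `np.var` divides by \(N\) by default, so it computes exactly this estimate):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=1000)

mu_hat = x.mean()                        # maximum likelihood estimate of mu
sigma2_hat = np.mean((x - mu_hat) ** 2)  # maximum likelihood estimate of sigma^2

print(mu_hat, sigma2_hat)
print(np.var(x))  # identical to sigma2_hat: np.var uses ddof=0, i.e., the 1/N convention
```
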
With a little bit of algebra you can show that this is exactly the same as the result obtained with the method of moments, i.e., it can be rewritten as:

\[
\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^Nx_i^2-\hat{\mu}^2.
\]
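
Indeed, expanding the square and using \(\hat{\mu}=\frac{1}{N}\sum_{i=1}^Nx_i\):

\[\begin{split}
\begin{split}
\frac{1}{N}\sum_{i=1}^N(x_i-\hat{\mu})^2 &= \frac{1}{N}\sum_{i=1}^N\left(x_i^2 - 2\hat{\mu}x_i + \hat{\mu}^2\right)\\
&= \frac{1}{N}\sum_{i=1}^Nx_i^2 - 2\hat{\mu}\cdot\frac{1}{N}\sum_{i=1}^Nx_i + \hat{\mu}^2\\
&= \frac{1}{N}\sum_{i=1}^Nx_i^2 - 2\hat{\mu}^2 + \hat{\mu}^2\\
&= \frac{1}{N}\sum_{i=1}^Nx_i^2 - \hat{\mu}^2.
\end{split}
\end{split}\]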

We are not going to give a new example here, as the estimate is exactly the same as what we saw in Fitting Normal distributions to data.