Fitting the parameters of a Normal using the maximum likelihood principle

As before, we have \(N\) independent measurements. Assume that each measurement \(X_i\) follows a Normal distribution with mean \(\mu\) and variance \(\sigma^2\):

\[ p(x_i|\mu,\sigma^2) = f(x_i;\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}. \]

So, in this case:

\[ \theta = (\mu, \sigma^2), \]

and

\[ f(x;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}. \]
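Before moving on, here is a minimal sketch of this density in Python (assuming NumPy and SciPy are available; the numerical values are purely illustrative), cross-checked against scipy.stats.norm:

```python
import numpy as np
from scipy.stats import norm

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) evaluated at x, following the formula above."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Cross-check against scipy; note that scale is the standard deviation, not the variance.
print(normal_pdf(0.5, mu=1.0, sigma2=4.0))   # direct formula
print(norm.pdf(0.5, loc=1.0, scale=2.0))     # same value via scipy
```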

The likelihood of the data is:

\[\begin{split} p(x_{1:N}|\theta) &= \prod_{i=1}^Nf(x_i;\theta)\\ &= \prod_{i=1}^N\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}\\ &= (2\pi\sigma^2)^{-\frac{N}{2}}\prod_{i=1}^N\exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}\\ &= (2\pi\sigma^2)^{-\frac{N}{2}}\exp\left\{-\sum_{i=1}^N\frac{(x_i-\mu)^2}{2\sigma^2}\right\}. \end{split}\]

According to the maximum likelihood principle, we must pick the \(\mu\) and \(\sigma^2\) that maximize this expression. Since the logarithm is monotone, it is equivalent (and more convenient) to maximize the logarithm instead. Let’s find the logarithm first. I am going to call it \(J(\mu,\sigma^2)\):

\[ J(\mu,\sigma^2) = \log p(x_{1:N}|\theta) = -\frac{N}{2}\log (2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2. \]
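Here is a minimal sketch of \(J\) in code (reusing the imports above; the synthetic dataset and its parameter values are made up for illustration):

```python
def log_likelihood(mu, sigma2, x):
    """J(mu, sigma^2): log-likelihood of the i.i.d. Normal measurements x."""
    N = x.shape[0]
    return (-0.5 * N * np.log(2.0 * np.pi)
            - 0.5 * N * np.log(sigma2)
            - np.sum((x - mu) ** 2) / (2.0 * sigma2))

# Synthetic measurements for illustration (true mu = 1.0, sigma^2 = 4.0).
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100)
print(log_likelihood(1.0, 4.0, x))                 # direct formula
print(np.sum(norm.logpdf(x, loc=1.0, scale=2.0)))  # same value via scipy
```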

So, now model training has become a calculus problem. You need to maximize the two-variable function \(J(\mu,\sigma^2)\) with respect to \(\mu\) and \(\sigma^2\). How do you proceed? We could either employ an optimization algorithm (we demonstrate this at the end of the section) or do it analytically. Let’s do it analytically in this simple case. A necessary condition for a maximum is that the partial derivatives of \(J\) with respect to the parameters are zero. Let’s find the derivative of \(J\) with respect to \(\mu\). It is:

\[\begin{split} \frac{\partial J}{\partial \mu} &= \frac{\partial}{\partial \mu}\left[-\frac{N}{2}\log (2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2\right]\\ &= \frac{\partial}{\partial \mu}\left[-\frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2\right]\\ &= -\frac{1}{2\sigma^2}\sum_{i=1}^N\frac{\partial}{\partial \mu}(x_i-\mu)^2\\ &= -\frac{1}{2\sigma^2}\sum_{i=1}^N2(x_i-\mu)\frac{\partial}{\partial \mu}(x_i-\mu)\\ &= -\frac{1}{2\sigma^2}\sum_{i=1}^N2(x_i-\mu)(-1)\\ &= \frac{1}{\sigma^2}\sum_{i=1}^N(x_i-\mu)\\ &= \frac{1}{\sigma^2}\left[\sum_{i=1}^Nx_i - \sum_{i=1}^N\mu\right]\\ &= \frac{1}{\sigma^2}\left[\sum_{i=1}^Nx_i - N\mu\right]. \end{split}\]

Alright, that wasn’t too bad! Now set this derivative equal to zero: since \(\sigma^2 > 0\), this requires \(\sum_{i=1}^N x_i - N\mu = 0\), and solving for \(\mu\) gives:

\[ \hat{\mu} = \frac{1}{N}\sum_{i=1}^Nx_i. \]

This is nice! It is exactly the sample mean, which is what we would expect, and it matches what we got using the method of moments.
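As a quick numerical sanity check (a sketch reusing log_likelihood and x from above), the log-likelihood indeed peaks at the sample mean, and this holds for any fixed value of \(\sigma^2\):

```python
mu_hat = np.mean(x)   # the maximum likelihood estimate of mu
for mu in (mu_hat - 0.1, mu_hat, mu_hat + 0.1):
    print(mu, log_likelihood(mu, sigma2=4.0, x=x))
# J is largest at mu_hat; the same holds for any other sigma^2 > 0.
```

Let’s proceed to \(\sigma^2\). We need to find the derivative of \(J\) with respect to it.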

\[\begin{split} \frac{\partial J}{\partial \sigma^2} &= \frac{\partial}{\partial \sigma^2}\left[-\frac{N}{2}\log (2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2\right]\\ &= \frac{\partial}{\partial \sigma^2}\left[-\frac{N}{2}\log (2\pi) - \frac{N}{2}\log\sigma^2 - \left(\sigma^2\right)^{-1}\frac{1}{2}\sum_{i=1}^N(x_i-\mu)^2\right]\\ &= -\frac{N}{2}\frac{\partial}{\partial \sigma^2}\log\sigma^2 - \frac{\sum_{i=1}^N(x_i-\mu)^2}{2}\frac{\partial}{\partial \sigma^2}\left(\sigma^2\right)^{-1}\\ &=-\frac{N}{2}\frac{1}{\sigma^2} - \frac{\sum_{i=1}^N(x_i-\mu)^2}{2}(-1)\left(\sigma^2\right)^{-2}\\ &= -\frac{N}{2\sigma^2} + \frac{\sum_{i=1}^N(x_i-\mu)^2}{2\sigma^4}\\ &= \frac{-N\sigma^2 + \sum_{i=1}^N(x_i-\mu)^2}{2\sigma^4}. \end{split}\]

Setting this equal to zero (the numerator must vanish, since \(\sigma^2 > 0\)) and solving for \(\sigma^2\) yields:

\[ \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N(x_i-\hat{\mu})^2, \]

where I substituted \(\hat{\mu}\) for \(\mu\). With a little bit of algebra (expand the square and use \(\sum_{i=1}^N x_i = N\hat{\mu}\)) you can show that this is exactly the same as the result obtained with the method of moments, i.e., it can be rewritten as:

\[ \hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^Nx_i^2-\hat{\mu}^2. \]
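Numerically, the two expressions agree (a sketch reusing x and mu_hat from above):

```python
sigma2_hat = np.mean((x - mu_hat) ** 2)           # (1/N) sum_i (x_i - mu_hat)^2
sigma2_hat_alt = np.mean(x ** 2) - mu_hat ** 2    # (1/N) sum_i x_i^2 - mu_hat^2
print(sigma2_hat, sigma2_hat_alt)                 # equal up to floating-point error
print(np.isclose(sigma2_hat, sigma2_hat_alt))     # True
```

Note that this is the quantity np.var(x) computes by default (ddof=0), i.e., the maximum likelihood estimate rather than the unbiased estimate with \(N-1\) in the denominator.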

We are not going to give a new example here, as the estimates are exactly the same as those we obtained in Fitting Normal distributions to data.
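For completeness, here is a sketch of the alternative route mentioned earlier, confirming numerically that a generic optimization algorithm recovers the same estimates (assuming SciPy; it reuses log_likelihood, x, mu_hat, and sigma2_hat from above):

```python
from scipy.optimize import minimize

# Minimize -J over (mu, log sigma^2); optimizing log sigma^2 keeps sigma^2 positive.
def neg_log_likelihood(theta):
    return -log_likelihood(theta[0], np.exp(theta[1]), x)

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
print(res.x[0], np.exp(res.x[1]))   # close to mu_hat and sigma2_hat
print(mu_hat, sigma2_hat)           # the analytical estimates
```

scipy.stats.norm.fit(x) should return the same values as loc and scale (with scale \(=\hat{\sigma}\)), since it also maximizes the likelihood by default.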