# Fitting the parameters of a Normal using the maximum likelihood principle¶

As before we have $$N$$ independent measurements. Assume that the measurement $$X_i$$ follows a Normal distribution with parameters $$\mu$$ and $$\sigma^2$$:

$p(x_i|\mu,\sigma^2) = f(x_i;\mu,\sigma)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}.$

So, in this case:

$\theta = (\mu, \sigma^2),$

and

$f(x;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x-\mu)^2}{2\sigma^2}\right\}.$

The likelihood of the data is:

$\begin{split} p(x_{1:N}|\theta) &= \prod_{i=1}^Nf(x_i;\theta)\\ &= \prod_{i=1}^N\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}\\ &= (2\pi\sigma^2)^{-\frac{N}{2}}\prod_{i=1}^N\exp\left\{-\frac{(x_i-\mu)^2}{2\sigma^2}\right\}\\ &= (2\pi\sigma^2)^{-\frac{N}{2}}\exp\left\{-\sum_{i=1}^N\frac{(x_i-\mu)^2}{2\sigma^2}\right\}. \end{split}$

According to the maximum likelihood principle, we must pick the $$\mu$$ and $$\sigma^2$$ that maximize the likelihood or, equivalently, its logarithm. Let’s find the logarithm first. I am going to call it $$J(\mu,\sigma^2)$$:

$J(\mu,\sigma^2) = \log p(x_{1:N}|\theta) = -\frac{N}{2}\log (2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2.$
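As a quick sanity check, the log-likelihood above can be coded directly and compared against summing `scipy.stats.norm.logpdf` over the data. This is just a sketch on synthetic data; the function name `log_likelihood` and the particular numbers are my own choices:

```python
import numpy as np
from scipy import stats

def log_likelihood(x, mu, sigma2):
    """Log-likelihood J(mu, sigma^2) of i.i.d. Normal data x."""
    N = x.shape[0]
    return (-0.5 * N * np.log(2.0 * np.pi)
            - 0.5 * N * np.log(sigma2)
            - 0.5 * np.sum((x - mu) ** 2) / sigma2)

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=100)  # synthetic data
J = log_likelihood(x, mu=1.0, sigma2=4.0)
# Must agree with summing the Normal log-pdf term by term
# (note scipy's scale is sigma, not sigma^2):
J_scipy = stats.norm.logpdf(x, loc=1.0, scale=2.0).sum()
print(np.isclose(J, J_scipy))  # prints: True
```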

So, now model training has become a calculus problem. You need to maximize the two-variable function $$J(\mu,\sigma^2)$$ with respect to $$\mu$$ and $$\sigma^2$$. How do you proceed? You could either employ an optimization algorithm or do it analytically. Let’s do it analytically in this simple case. A necessary condition is that the partial derivatives of $$J$$ with respect to the parameters are zero. Let’s find the derivative of $$J$$ with respect to $$\mu$$. It is:
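Before doing the analytical derivation, here is a sketch of the optimization-algorithm route: minimize $-J$ with `scipy.optimize.minimize`, parameterizing the variance as $\log\sigma^2$ so it stays positive. The synthetic data and the starting point are arbitrary choices of mine:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
x = rng.normal(2.0, 1.5, size=500)  # synthetic data
N = x.shape[0]

def neg_J(theta):
    """Negative log-likelihood; theta = (mu, log sigma^2) for positivity."""
    mu, log_s2 = theta
    s2 = np.exp(log_s2)
    return 0.5 * N * np.log(2.0 * np.pi * s2) + 0.5 * np.sum((x - mu) ** 2) / s2

res = minimize(neg_J, x0=np.array([0.0, 0.0]))
mu_opt, s2_opt = res.x[0], np.exp(res.x[1])
# The optimizer lands on the sample mean and the (biased) sample variance,
# matching the analytical solution derived below.
print(mu_opt, s2_opt)
```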

$\begin{split} \frac{\partial J}{\partial \mu} &= \frac{\partial}{\partial \mu}\left[-\frac{N}{2}\log (2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2\right]\\ &= \frac{\partial}{\partial \mu}\left[-\frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2\right]\\ &= -\frac{1}{2\sigma^2}\sum_{i=1}^N\frac{\partial}{\partial \mu}(x_i-\mu)^2\\ &= -\frac{1}{2\sigma^2}\sum_{i=1}^N2(x_i-\mu)\frac{\partial}{\partial \mu}(x_i-\mu)\\ &= -\frac{1}{2\sigma^2}\sum_{i=1}^N2(x_i-\mu)(-1)\\ &= \frac{1}{\sigma^2}\sum_{i=1}^N(x_i-\mu)\\ &= \frac{1}{\sigma^2}\left[\sum_{i=1}^Nx_i - \sum_{i=1}^N\mu\right]\\ &= \frac{1}{\sigma^2}\left[\sum_{i=1}^Nx_i - N\mu\right]. \end{split}$

Alright, that wasn’t too bad! Now set this derivative equal to zero and you can solve for $$\mu$$. You get:

$\hat{\mu} = \frac{1}{N}\sum_{i=1}^Nx_i.$
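A quick numerical check of this result: since only the quadratic term of $J$ depends on $\mu$, evaluating that term at the sample mean and at nearby perturbed values should confirm the maximum. A minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(0.5, 1.0, size=200)  # synthetic data
mu_hat = x.mean()                    # the MLE derived above
s2 = 1.0                             # any fixed sigma^2 > 0

def J_mu(mu):
    # Only the mu-dependent part of J; the other terms are constants.
    return -0.5 * np.sum((x - mu) ** 2) / s2

# J is largest at mu_hat and strictly smaller at perturbed values:
print(J_mu(mu_hat) > J_mu(mu_hat + 0.1), J_mu(mu_hat) > J_mu(mu_hat - 0.1))
# prints: True True
```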

This is nice! It is exactly what we would expect! It also matches what we got using the method of moments. Let’s proceed to $$\sigma^2$$. We need to find the derivative of $$J$$ with respect to it.

$\begin{split} \frac{\partial J}{\partial \sigma^2} &= \frac{\partial}{\partial \sigma^2}\left[-\frac{N}{2}\log (2\pi) - \frac{N}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^N(x_i-\mu)^2\right]\\ &= \frac{\partial}{\partial \sigma^2}\left[-\frac{N}{2}\log (2\pi) - \frac{N}{2}\log\sigma^2 - \left(\sigma^2\right)^{-1}\frac{1}{2}\sum_{i=1}^N(x_i-\mu)^2\right]\\ &= -\frac{N}{2}\frac{\partial}{\partial \sigma^2}\log\sigma^2 - \frac{\sum_{i=1}^N(x_i-\mu)^2}{2}\frac{\partial}{\partial \sigma^2}\left(\sigma^2\right)^{-1}\\ &=-\frac{N}{2}\frac{1}{\sigma^2} - \frac{\sum_{i=1}^N(x_i-\mu)^2}{2}(-1)\left(\sigma^2\right)^{-2}\\ &= -\frac{N}{2\sigma^2} + \frac{\sum_{i=1}^N(x_i-\mu)^2}{2\sigma^4}\\ &= \frac{-N\sigma^2 + \sum_{i=1}^N(x_i-\mu)^2}{2\sigma^4}. \end{split}$

Setting this equal to zero and solving for $$\sigma^2$$ yields:

$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^N(x_i-\hat{\mu})^2,$

where I substituted $$\hat{\mu}$$ for $$\mu$$. With a little bit of algebra you can show that this is exactly the same as the result obtained with the method of moments, i.e., it can be rewritten as:

$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^Nx_i^2-\hat{\mu}^2.$
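The algebraic identity between the two expressions is easy to verify numerically on synthetic data (a sketch; the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(-1.0, 3.0, size=1000)  # synthetic data
mu_hat = x.mean()

s2_centered = np.mean((x - mu_hat) ** 2)    # (1/N) sum (x_i - mu_hat)^2
s2_moments = np.mean(x ** 2) - mu_hat ** 2  # (1/N) sum x_i^2 - mu_hat^2
print(np.isclose(s2_centered, s2_moments))  # prints: True
```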

We are not going to give a new example here, as the estimates are exactly the same as what we saw in Fitting Normal distributions to data.