Regression with one variable revisited

Let’s say that we have \(N\) observations consisting of inputs (features):

\[ x_{1:N} = (x_1,\dots,x_N), \]

and outputs (targets):

\[ y_{1:N} = (y_1,\dots,y_N). \]

We want to learn the map (function) that connects the inputs to the outputs. We say that we have a regression problem when the outputs are continuous quantities, e.g., mass, height, price. When the outputs are discrete, e.g., colors, numbers, then we say that we have a classification problem. In this lecture, the focus will be on regression.

To proceed, you need to make a model that connects the inputs to the outputs. The simplest such model is:

\[ y = w_0 + w_1 x + \text{measurement noise}. \]

This is the linear model we saw in the previous lecture with \(w_0 = b\) and \(w_1 = a\). The parameters \(w_0\) and \(w_1\) are called the regression weights and we need to find them using the observations \(x_{1:N}\) and \(y_{1:N}\).

In the previous lecture, we fitted the model by minimizing the sum of square errors:

\[ L(w_0, w_1) = \sum_{i=1}^N(y_i - w_0 - w_1 x_i)^2. \]

Now we are going to express this equation using linear algebra. We do this for two reasons:

  • It is a lot of fun!

  • It is essential for formulating the fitting problem for more complicated models.

We will need the design matrix:

\[\begin{split} \mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}. \end{split}\]

The design matrix \(\mathbf{X}\) is an \(N\times 2\) matrix whose first column is all ones and whose second column holds the observed inputs. We will also need the vector of observed outputs:

\[ \mathbf{y} = y_{1:N} = (y_1, \dots, y_N), \]

and the vector of weights:

\[ \mathbf{w} = (w_0, w_1). \]
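Here is a minimal NumPy sketch of these three objects. The numbers in `x`, `y`, and `w` are made up for illustration (they are not from the lecture), and stacking the columns with `np.column_stack` is just one convenient way to build the design matrix.

```python
import numpy as np

# Hypothetical observations, invented for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])   # inputs x_{1:N}
y = np.array([1.1, 2.9, 5.2, 6.8])   # outputs y_{1:N}

# Design matrix: a column of ones (multiplies w_0) next to the inputs (multiplies w_1).
X = np.column_stack([np.ones_like(x), x])

# An arbitrary trial weight vector (w_0, w_1); not yet the best fit.
w = np.array([1.0, 2.0])

print(X.shape)  # (4, 2), i.e., N x 2
```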

I hope that you remember how to do matrix-vector multiplication. Notice what we get when we multiply the design matrix \(\mathbf{X}\) with the weight vector \(\mathbf{w}\):

\[\begin{split} \begin{split} \mathbf{X}\mathbf{w} &= \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}\cdot \begin{bmatrix} w_0\\ w_1 \end{bmatrix}\\ &= \begin{bmatrix} w_0 + w_1 x_1 \\ w_0 + w_1 x_2 \\ \vdots\\ w_0 + w_1 x_N \end{bmatrix}. \end{split} \end{split}\]

Wow! So, \(\mathbf{X}\mathbf{w}\) is an \(N\)-dimensional vector that contains the predictions of our linear model at the observed inputs. If we subtract this vector from the vector of observed outputs \(\mathbf{y}\), we get the prediction errors:

\[\begin{split} \mathbf{y} - \mathbf{X}\mathbf{w} = \begin{bmatrix} y_1 - w_0 - w_1 x_1\\ y_2 - w_0 - w_1 x_2\\ \vdots\\ y_N - w_0 - w_1 x_N \end{bmatrix}. \end{split}\]
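The two products above can be sketched in NumPy as follows. The data and the trial weights are hypothetical values chosen for illustration; the `@` operator performs the matrix-vector multiplication.

```python
import numpy as np

# Hypothetical data and trial weights, invented for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
w = np.array([1.0, 2.0])  # (w_0, w_1)

X = np.column_stack([np.ones_like(x), x])

# Row i of X @ w is w_0 + w_1 * x_i: the model prediction at x_i.
predictions = X @ w

# Prediction errors: observed outputs minus predictions.
errors = y - predictions

print(predictions)  # [1. 3. 5. 7.]
```

Each entry of `predictions` matches the scalar formula \(w_0 + w_1 x_i\), so no explicit loop over the observations is needed.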

Okay. Now recall that the Euclidean norm \(\parallel\mathbf{v}\parallel\) of a vector \(\mathbf{v}\) is the square root of the sum of the squares of its components. Hmm. Let’s take the square of the Euclidean norm of the error vector. It is:

\[ \parallel \mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2 = \sum_{i=1}^N(y_i - w_0 - w_1x_i)^2. \]

But this is just the sum of square errors, i.e., we have shown that:

\[ L(w_0, w_1) = L(\mathbf{w}) = \parallel \mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2. \]
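As a quick numerical sanity check of this identity, the sketch below computes the loss both ways on hypothetical data (the values are invented for illustration) and compares the results.

```python
import numpy as np

# Hypothetical data and trial weights, invented for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
w = np.array([1.0, 2.0])  # (w_0, w_1)

X = np.column_stack([np.ones_like(x), x])

# Sum-of-squares form: one term per observation.
loss_sum = sum((y_i - w[0] - w[1] * x_i) ** 2 for x_i, y_i in zip(x, y))

# Linear-algebra form: squared Euclidean norm of the error vector.
loss_norm = np.linalg.norm(y - X @ w) ** 2

print(np.isclose(loss_sum, loss_norm))  # True
```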

We have managed to express the loss function using linear algebra. The mathematical problem of finding the best weight vector is now:

\[ \min_{\mathbf{w}} L(\mathbf{w}) = \min_{\mathbf{w}} \parallel \mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2. \]

This form is much more convenient mathematically. Remember that to solve the minimization problem, we need to take the gradient of \(L(\mathbf{w})\) with respect to \(\mathbf{w}\) and set the result equal to zero. Writing the loss in terms of vectors and matrices makes these derivatives much easier to take. But there is one more thing that we could do before we take the gradient. Notice that the Euclidean norm of a vector \(\mathbf{v}\) satisfies:

\[ \parallel \mathbf{v}\parallel^2 = \mathbf{v}^T\mathbf{v}, \]

where we are thinking of \(\mathbf{v}\) as a column matrix and \(\mathbf{v}^T\) is the transpose of \(\mathbf{v}\), i.e., a row matrix. To prove the equality, we start from the right-hand side:

\[\begin{split} \begin{split} \mathbf{v}^T\mathbf{v} &= \begin{bmatrix} v_1 & v_2 & \cdots & v_N \end{bmatrix}\cdot \begin{bmatrix} v_1\\ v_2\\ \vdots\\ v_N \end{bmatrix}\\ &= \sum_{i=1}^N v_i^2\\ &= \parallel \mathbf{v}\parallel^2. \end{split} \end{split}\]

Interesting! So we can rewrite the sum of square errors as:

\[\begin{split} \begin{split} L(\mathbf{w}) &= \parallel \mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2\\ &= \left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)^T\left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)\\ &= \left[\mathbf{y}^T - \left(\mathbf{X}\mathbf{w}\right)^T\right]\left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)\\ &= \left(\mathbf{y}^T - \mathbf{w}^T\mathbf{X}^T\right)\left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)\\ &= \mathbf{y}^T\mathbf{y} - \mathbf{w}^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\mathbf{w} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}. \end{split} \end{split}\]

Now, because \(\mathbf{w}^T\mathbf{X}^T\mathbf{y}\) is just a number (think about the dimensions \((1\times 2)\times (2\times N)\times (N \times 1) = 1\times 1\)), it is the same as its transpose, i.e.,

\[ \mathbf{w}^T\mathbf{X}^T\mathbf{y} = \left(\mathbf{w}^T\mathbf{X}^T\mathbf{y}\right)^T = \mathbf{y}^T\mathbf{X}\mathbf{w}. \]

Using this fact, we can write:

\[ L(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}. \]
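If you want to convince yourself that no term was lost in the expansion, the following sketch checks it numerically. The data are synthetic (drawn from a seeded random generator, with a made-up intercept and slope) and the trial weights are arbitrary.

```python
import numpy as np

# Synthetic data, generated only for this check (not from the lecture).
rng = np.random.default_rng(0)
x = rng.normal(size=20)
y = 1.5 - 0.7 * x + 0.1 * rng.normal(size=20)

X = np.column_stack([np.ones_like(x), x])
w = np.array([0.3, -1.2])  # arbitrary trial weights

# The two cross terms are scalars, so they must be equal.
cross_a = w @ X.T @ y
cross_b = y @ X @ w

# Compact form versus the expanded quadratic form.
e = y - X @ w
compact = e @ e
expanded = y @ y - 2 * w @ X.T @ y + w @ X.T @ X @ w

print(np.isclose(cross_a, cross_b), np.isclose(compact, expanded))
```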

Now we can take the gradient with respect to \(\mathbf{w}\).

\[\begin{split} \begin{split} \nabla_{\mathbf{w}}L(\mathbf{w}) &= \nabla_{\mathbf{w}}\left[\mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}\right]\\ &= -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w}. \end{split} \end{split}\]

Okay, I do admit that I did some derivative magic there. But the result is correct. If you really want to understand it, work out the gradients of \(\mathbf{w}^T\mathbf{u}\) and \(\mathbf{w}^T\mathbf{A}\mathbf{w}\) with respect to \(\mathbf{w}\), where \(\mathbf{u}\) is a 2-dimensional vector and \(\mathbf{A}\) is a \(2\times 2\) matrix.
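For reference, the standard results are \(\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{u}) = \mathbf{u}\) and \(\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{A}\mathbf{w}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{w}\), which reduces to \(2\mathbf{A}\mathbf{w}\) when \(\mathbf{A}\) is symmetric, as \(\mathbf{X}^T\mathbf{X}\) is. If you would rather check the gradient formula numerically than derive it, here is a minimal sketch that compares it against central finite differences. The helper names `loss`, `grad`, and `numerical_grad` are hypothetical, and the data are synthetic.

```python
import numpy as np

def loss(w, X, y):
    """Sum of square errors written as e^T e."""
    e = y - X @ w
    return e @ e

def grad(w, X, y):
    """Analytical gradient: -2 X^T y + 2 X^T X w."""
    return -2 * X.T @ y + 2 * X.T @ X @ w

def numerical_grad(f, w, h=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    g = np.zeros_like(w)
    for j in range(w.size):
        step = np.zeros_like(w)
        step[j] = h
        g[j] = (f(w + step) - f(w - step)) / (2 * h)
    return g

# Synthetic data, generated only for this check (not from the lecture).
rng = np.random.default_rng(1)
x = rng.normal(size=30)
y = 2.0 + 0.5 * x + 0.1 * rng.normal(size=30)
X = np.column_stack([np.ones_like(x), x])
w = np.array([0.1, -0.4])  # arbitrary point at which to check the gradient

g_exact = grad(w, X, y)
g_approx = numerical_grad(lambda v: loss(v, X, y), w)

print(np.allclose(g_exact, g_approx, atol=1e-4))
```

Because the loss is quadratic in \(\mathbf{w}\), central differences agree with the analytical gradient up to rounding error.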

Setting the gradient to zero yields the following linear system:

\[ \mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}. \]

The bottom line: to find the best \(\mathbf{w}\) you must solve this linear system! As I will show you in a while, for more complex models that remain linear in the parameters you basically have to do exactly the same thing but with a different design matrix. By the way, if you work out the analytical solution for a linear model with 2 parameters you will get exactly the expression with the correlation between \(X\) and \(Y\) we derived in the previous lecture.
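A minimal sketch of solving this linear system with NumPy follows. The data are synthetic, with a made-up intercept and slope. For comparison, the sketch also calls `np.linalg.lstsq`, which solves the same least-squares problem directly from \(\mathbf{X}\) and \(\mathbf{y}\), and evaluates the familiar closed form slope \(= \operatorname{cov}(x, y) / \operatorname{var}(x)\), intercept \(= \bar{y} - \text{slope}\cdot\bar{x}\). I am assuming this closed form is equivalent to the correlation-based expression from the previous lecture.

```python
import numpy as np

# Synthetic data, generated only for illustration (not from the lecture).
rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=50)
y = 1.0 + 3.0 * x + 0.2 * rng.normal(size=50)

X = np.column_stack([np.ones_like(x), x])

# Solve the linear system X^T X w = X^T y.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Same least-squares problem, solved directly from X and y.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# Closed form for the two-parameter model (assumed to match the earlier lecture).
slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)
intercept = y.mean() - slope * x.mean()

print(w_normal)
print(np.allclose(w_normal, w_lstsq), np.allclose(w_normal, [intercept, slope]))
```

Using `np.linalg.solve` avoids forming the inverse of \(\mathbf{X}^T\mathbf{X}\) explicitly. For poorly conditioned design matrices, a routine such as `np.linalg.lstsq` that works on \(\mathbf{X}\) directly is usually the safer choice.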