# Regression with one variable revisited¶

Let’s say that we have $$N$$ observations consisting of inputs (features):

$x_{1:N} = (x_1,\dots,x_N),$

and outputs (targets):

$y_{1:N} = (y_1,\dots,y_N).$

We want to learn the map (function) that connects the inputs to the outputs. We say that we have a regression problem when the outputs are continuous quantities, e.g., mass, height, price. When the outputs are discrete quantities, e.g., colors or category labels, we say that we have a classification problem. In this lecture, the focus will be on regression.

To proceed, you need to make a model that connects the inputs to the outputs. The simplest such model is:

$y = w_0 + w_1 x + \text{measurement noise}.$

This is the linear model we saw in the previous lecture with $$w_0 = b$$ and $$w_1 = a$$. The parameters $$w_0$$ and $$w_1$$ are called the regression weights and we need to find them using the observations $$x_{1:N}$$ and $$y_{1:N}$$.

In the previous lecture, we fitted the model by minimizing the sum of square errors:

$L(w_0, w_1) = \sum_{i=1}^N(y_i - w_0 - w_1 x_i)^2.$

Now we are going to express this equation using linear algebra. We do this for two reasons:

• It is a lot of fun!

• It is essential for formulating the fitting problem for more complicated models.

We will need the design matrix:

$\begin{split} \mathbf{X} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}. \end{split}$

The design matrix $$\mathbf{X}$$ is an $$N\times 2$$ matrix with the first column being just ones and the second column being the observed inputs. We will also need the vector of observed outputs:
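In code, the design matrix is just a column of ones stacked next to the inputs. A minimal sketch with numpy, using a made-up toy array `x` for the observed inputs:

```python
import numpy as np

# Hypothetical toy inputs; any 1-D array of observed x's works here.
x = np.array([0.0, 0.5, 1.0, 1.5])
N = x.shape[0]

# Design matrix: a column of ones next to the observed inputs.
X = np.column_stack([np.ones(N), x])

print(X.shape)  # (4, 2)
```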

$\mathbf{y} = y_{1:N} = (y_1, \dots, y_N),$

and the vector of weights:

$\mathbf{w} = (w_0, w_1).$

I hope that you remember how to do matrix-vector multiplication. Notice what we get when we multiply the design matrix $$\mathbf{X}$$ with the weight vector $$\mathbf{w}$$:

$\begin{split} \begin{split} \mathbf{X}\mathbf{w} &= \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix}\cdot \begin{bmatrix} w_0\\ w_1 \end{bmatrix}\\ &= \begin{bmatrix} w_0 + w_1 x_1 \\ w_0 + w_1 x_2 \\ \vdots\\ w_0 + w_1 x_N \end{bmatrix}. \end{split} \end{split}$

Wow! So, $$\mathbf{X}\mathbf{w}$$ is an $$N$$-dimensional vector that contains the predictions of our linear model at the observed inputs. If we subtract this vector from the vector of observed outputs $$\mathbf{y}$$, we get the prediction errors:
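You can check this with a single matrix-vector product. A small sketch, with hypothetical weights $$w_0 = 1$$ and $$w_1 = 2$$:

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5])
X = np.column_stack([np.ones_like(x), x])
w = np.array([1.0, 2.0])  # hypothetical weights: w0 = 1, w1 = 2

# X @ w computes w0 + w1 * x_i for every observation at once.
preds = X @ w
print(preds)  # [1. 2. 3. 4.]
```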

$\begin{split} \mathbf{y} - \mathbf{X}\mathbf{w} = \begin{bmatrix} y_1 - w_0 - w_1 x_1\\ y_2 - w_0 - w_1 x_2\\ \vdots\\ y_N - w_0 - w_1 x_N \end{bmatrix}. \end{split}$

Okay. Now recall that the [Euclidean norm](https://en.wikipedia.org/wiki/Norm_(mathematics)) $$\parallel\mathbf{v}\parallel$$ of a vector is the square root of the sum of the squares of its components. Hmm. Let’s take the square of the Euclidean norm of the error vector. It is:

$\parallel \mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2 = \sum_{i=1}^N(y_i - w_0 - w_1x_i)^2.$

But this is just the sum of square errors, i.e., we have shown that:

$L(w_0, w_1) = L(\mathbf{w}) = \parallel \mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2.$
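We can verify this identity numerically: the squared norm of the residual vector agrees with the explicit sum of square errors. A quick sketch with made-up data:

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0, 1.5])
y = np.array([1.1, 1.9, 3.2, 3.8])  # hypothetical observed outputs
X = np.column_stack([np.ones_like(x), x])
w = np.array([1.0, 2.0])

# Loss written with linear algebra...
loss_norm = np.linalg.norm(y - X @ w) ** 2
# ...and as the explicit sum of square errors.
loss_sum = np.sum((y - w[0] - w[1] * x) ** 2)

assert np.isclose(loss_norm, loss_sum)
```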

We have managed to express the loss function using linear algebra. The mathematical problem of finding the best weight vector is now:

$\min_{\mathbf{w}} L(\mathbf{w}) = \min_{\mathbf{w}} \parallel \mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2.$

This form is much more convenient mathematically. Remember that to solve the minimization problem, we need to take the gradient of $$L(\mathbf{w})$$ with respect to $$\mathbf{w}$$ and set the result equal to zero. This form allows us to take derivatives in a much easier way. But there is one more thing that we could do before we take the gradient. Notice that the Euclidean norm of a vector $$\mathbf{v}$$ satisfies:

$\parallel \mathbf{v}\parallel^2 = \mathbf{v}^T\mathbf{v},$

where we are thinking of $$\mathbf{v}$$ as a column matrix and $$\mathbf{v}^T$$ is the transpose of $$\mathbf{v}$$, i.e., a row matrix. To prove the equality, we start from the right hand side:

$\begin{split} \begin{split} \mathbf{v}^T\mathbf{v} &= \begin{bmatrix} v_1 & v_2 & \cdots & v_N \end{bmatrix}\cdot \begin{bmatrix} v_1\\ v_2\\ \vdots\\ v_N \end{bmatrix}\\ &= \sum_{i=1}^N v_i^2\\ &= \parallel \mathbf{v}\parallel^2. \end{split} \end{split}$
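Numerically, the inner product of a vector with itself is indeed the squared norm. A one-line check with an arbitrary vector:

```python
import numpy as np

v = np.array([3.0, -4.0, 12.0])
# v @ v is the dot product v^T v; both sides equal 9 + 16 + 144 = 169.
assert np.isclose(v @ v, np.linalg.norm(v) ** 2)
```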

Interesting! So we can rewrite the sum of square errors as:

$\begin{split} \begin{split} L(\mathbf{w}) &= \parallel \mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2\\ &= \left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)^T\left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)\\ &= \left[\mathbf{y}^T - \left(\mathbf{X}\mathbf{w}\right)^T\right]\left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)\\ &= \left(\mathbf{y}^T - \mathbf{w}^T\mathbf{X}^T\right)\left(\mathbf{y} - \mathbf{X}\mathbf{w}\right)\\ &= \mathbf{y}^T\mathbf{y} - \mathbf{w}^T\mathbf{X}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\mathbf{w} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}. \end{split} \end{split}$

Now, because $$\mathbf{w}^T\mathbf{X}^T\mathbf{y}$$ is just a number (think about the dimensions $$(1\times 2)\times (2\times N)\times (N \times 1) = 1\times 1$$), it is the same as its transpose, i.e.,

$\mathbf{w}^T\mathbf{X}^T\mathbf{y} = \left(\mathbf{w}^T\mathbf{X}^T\mathbf{y}\right)^T = \mathbf{y}^T\mathbf{X}\mathbf{w}.$

Using this fact, we can write:

$L(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}.$

Now we can take the gradient with respect to $$\mathbf{w}$$.

$\begin{split} \begin{split} \nabla_{\mathbf{w}}L(\mathbf{w}) &= \nabla_{\mathbf{w}}\left[\mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}\right]\\ &= -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w}. \end{split} \end{split}$
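If you don’t want to take the derivative magic on faith, you can check the analytical gradient against central finite differences. A small sketch with randomly generated data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10)
y = rng.normal(size=10)
X = np.column_stack([np.ones_like(x), x])
w = np.array([0.3, -0.7])  # arbitrary weight vector to test at

def L(w):
    """Sum of square errors as a function of the weights."""
    r = y - X @ w
    return r @ r

# Analytical gradient from the derivation above.
grad_analytic = -2 * X.T @ y + 2 * X.T @ X @ w

# Central finite differences, one component at a time.
eps = 1e-6
grad_fd = np.zeros(2)
for j in range(2):
    e = np.zeros(2)
    e[j] = eps
    grad_fd[j] = (L(w + e) - L(w - e)) / (2 * eps)

assert np.allclose(grad_analytic, grad_fd, atol=1e-4)
```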

Okay, I do admit that I did some derivative magic there. But the result is correct. If you really want to understand it, you would have to work out the gradients of $$\mathbf{w}^T\mathbf{u}$$ and $$\mathbf{w}^T\mathbf{A}\mathbf{w}$$, where $$\mathbf{u}$$ is a 2-dimensional vector and $$\mathbf{A}$$ is a $$2\times 2$$ matrix.

Setting the gradient to zero yields the following linear system:

$\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}.$

The bottom line: to find the best $$\mathbf{w}$$ you must solve this linear system! As I will show you in a while, for more complex models that remain linear in the parameters you basically have to do exactly the same thing but with a different design matrix. By the way, if you work out the analytical solution for a linear model with 2 parameters, you will get exactly the expression involving the correlation between $$X$$ and $$Y$$ that we derived in the previous lecture.
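Solving the linear system is a one-liner. A sketch on made-up noisy data generated around a known line $$y = 1 + 2x$$, which also cross-checks against numpy’s built-in least-squares solver:

```python
import numpy as np

# Hypothetical noisy data around y = 1 + 2x.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=x.shape)

X = np.column_stack([np.ones_like(x), x])

# Solve the normal equations X^T X w = X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq solves the same least-squares problem directly.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(w, w_lstsq)
```

In practice `np.linalg.lstsq` (or a QR factorization) is preferred for numerical stability, but for this small problem the normal equations work fine.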