# Regression with one variable revisited


Let’s say that we have \(N\) observations consisting of inputs (features):

\[
x_{1:N} = (x_1, x_2, \dots, x_N),
\]

and outputs (targets):

\[
y_{1:N} = (y_1, y_2, \dots, y_N).
\]
We want to learn the map (function) that connects the inputs to the outputs.
We say that we have a *regression* problem when the outputs are continuous quantities, e.g., mass, height, price.
When the outputs are discrete, e.g., colors, numbers, then we say that we have a *classification* problem.
In this lecture, the focus will be on regression.

To proceed, you need to make a model that connects the inputs to the outputs. The simplest such model is:

\[
y = w_0 + w_1 x.
\]
This is the linear model we saw in the previous lecture with \(w_0 = b\) and \(w_1 = a\).
The parameters \(w_0\) and \(w_1\) are called the regression *weights* and we need to find them using the observations \(x_{1:N}\) and \(y_{1:N}\).

In the previous lecture, we fitted the model by minimizing the sum of square errors:

\[
L(w_0, w_1) = \sum_{i=1}^N \left[y_i - (w_0 + w_1 x_i)\right]^2.
\]
Now we are going to express this equation using linear algebra. We do this for two reasons:

- It is a lot of fun!
- It is essential for formulating the fitting problem for more complicated models.

We will need the *design matrix*:

\[
\mathbf{X} = \begin{pmatrix}
1 & x_1 \\
1 & x_2 \\
\vdots & \vdots \\
1 & x_N
\end{pmatrix}.
\]

The design matrix \(\mathbf{X}\) is an \(N\times 2\) matrix whose first column is all ones and whose second column contains the observed inputs.
We will also need the *vector of observed outputs*:

\[
\mathbf{y} = \begin{pmatrix}
y_1 \\
\vdots \\
y_N
\end{pmatrix},
\]

and the *vector of weights*:

\[
\mathbf{w} = \begin{pmatrix}
w_0 \\
w_1
\end{pmatrix}.
\]
I hope that you remember how to do matrix-vector multiplication. Notice what we get when we multiply the design matrix \(\mathbf{X}\) with the weight vector \(\mathbf{w}\):

\[
\mathbf{X}\mathbf{w} = \begin{pmatrix}
w_0 + w_1 x_1 \\
\vdots \\
w_0 + w_1 x_N
\end{pmatrix}.
\]
Wow! So, \(\mathbf{X}\mathbf{w}\) is an \(N\)-dimensional vector that contains the predictions of our linear model at the observed inputs. If we subtract this vector from the vector of observed outputs \(\mathbf{y}\), we get the prediction errors:

\[
\mathbf{y} - \mathbf{X}\mathbf{w} = \begin{pmatrix}
y_1 - (w_0 + w_1 x_1) \\
\vdots \\
y_N - (w_0 + w_1 x_N)
\end{pmatrix}.
\]
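The construction above is easy to try out in NumPy. Here is a short sketch (the data values are made up purely for illustration) that builds the design matrix and computes the prediction vector \(\mathbf{X}\mathbf{w}\) and the error vector:

```python
import numpy as np

# Made-up observations, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Design matrix: first column all ones, second column the inputs.
X = np.column_stack([np.ones_like(x), x])

# Some weight vector w = (w_0, w_1).
w = np.array([1.0, 2.0])

predictions = X @ w       # w_0 + w_1 * x_i for each observation
errors = y - predictions  # the prediction errors
print(predictions)
print(errors)
```

Note that `X @ w` computes every prediction at once; no loop over the observations is needed.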
Okay. Now recall that the [Euclidean norm](https://en.wikipedia.org/wiki/Norm_(mathematics)) \(\parallel\mathbf{v}\parallel\) of a vector is the square root of the sum of the squares of its components. Hmm. Let’s take the square of the Euclidean norm of the error vector. It is:

\[
\parallel\mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2 = \sum_{i=1}^N \left[y_i - (w_0 + w_1 x_i)\right]^2.
\]
But this is just the sum of square errors, i.e., we have shown that:

\[
L(\mathbf{w}) = \parallel\mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2.
\]
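You can verify this identity numerically. The following sketch (again with made-up data) computes the loss both ways, as an explicit sum of squares and as the squared norm of the error vector:

```python
import numpy as np

# Made-up observations, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])
w = np.array([1.0, 2.0])

# Sum of square errors, written out as a sum...
sse = np.sum((y - (w[0] + w[1] * x)) ** 2)
# ...and as the squared Euclidean norm of the error vector.
norm_sq = np.linalg.norm(y - X @ w) ** 2
print(sse, norm_sq)
```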
We have managed to express the loss function using linear algebra. The mathematical problem of finding the best weight vector is now:

\[
\min_{\mathbf{w}} \parallel\mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2.
\]
This form is much more convenient mathematically. Remember that to solve the minimization problem, we need to take the gradient of \(L(\mathbf{w})\) with respect to \(\mathbf{w}\) and set the result equal to zero. This form allows us to take derivatives in a much easier way. But there is one more thing that we could do before we take the gradient. Notice that the Euclidean norm of a vector \(\mathbf{v}\) satisfies:

\[
\parallel\mathbf{v}\parallel^2 = \mathbf{v}^T\mathbf{v},
\]
where we are thinking of \(\mathbf{v}\) as a column matrix and \(\mathbf{v}^T\) is the transpose of \(\mathbf{v}\), i.e., a row matrix. To prove the equality, we start from the right-hand side:

\[
\mathbf{v}^T\mathbf{v} = \sum_{i=1}^N v_i^2 = \parallel\mathbf{v}\parallel^2.
\]
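A one-line numerical check of this identity, using an arbitrary vector:

```python
import numpy as np

v = np.array([3.0, 4.0])

# Left-hand side: the dot product of v with itself.
vTv = v @ v
# Right-hand side: the square of the Euclidean norm.
norm_sq = np.linalg.norm(v) ** 2
print(vTv, norm_sq)
```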
Interesting! So we can rewrite the sum of square errors as:

\[
L(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})
= \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\mathbf{w} - \mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}.
\]
Now, because \(\mathbf{w}^T\mathbf{X}^T\mathbf{y}\) is just a number (think about the dimensions: \((1\times 2)\times (2\times N)\times (N \times 1) = 1\times 1\)), it is the same as its transpose, i.e.,

\[
\mathbf{w}^T\mathbf{X}^T\mathbf{y} = \left(\mathbf{w}^T\mathbf{X}^T\mathbf{y}\right)^T = \mathbf{y}^T\mathbf{X}\mathbf{w}.
\]
Using this fact, we can write:

\[
L(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}.
\]
Now we can take the gradient with respect to \(\mathbf{w}\):

\[
\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w}.
\]
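If the "derivative magic" below makes you suspicious, you can check the gradient formula against a finite-difference approximation. This sketch uses arbitrary illustrative data and weights:

```python
import numpy as np

# Made-up observations, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])
w = np.array([0.5, 1.5])

def L(w):
    """Sum of square errors as the squared norm of the error vector."""
    e = y - X @ w
    return e @ e

# Analytical gradient: -2 X^T y + 2 X^T X w.
grad = -2 * X.T @ y + 2 * X.T @ X @ w

# Central finite-difference approximation, one component at a time.
eps = 1e-6
fd = np.array([
    (L(w + eps * np.eye(2)[i]) - L(w - eps * np.eye(2)[i])) / (2 * eps)
    for i in range(2)
])
print(grad, fd)
```

The two results should agree to several decimal places, which is good evidence that the analytical gradient is correct.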
Okay, I do admit that I did some derivative magic there. But the result is correct. If you really want to understand it, you would have to work out the gradients of \(\mathbf{w}^T\mathbf{u}\) and \(\mathbf{w}^T\mathbf{A}\mathbf{w}\), where \(\mathbf{u}\) is a 2-dimensional vector and \(\mathbf{A}\) is a \(2\times 2\) matrix. The results are \(\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{u}) = \mathbf{u}\) and, for symmetric \(\mathbf{A}\) (which \(\mathbf{X}^T\mathbf{X}\) is), \(\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{A}\mathbf{w}) = 2\mathbf{A}\mathbf{w}\).

Setting the gradient to zero yields the following linear system:

\[
\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}.
\]
The bottom line: to find the best \(\mathbf{w}\), you must solve this linear system! As I will show you in a while, for more complex models that remain linear in the parameters, you basically have to do exactly the same thing but with a different design matrix. By the way, if you work out the analytical solution of this system for the linear model with two parameters, you will get exactly the expression involving the correlation between \(X\) and \(Y\) that we derived in the previous lecture.
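Here is a sketch of solving the system in NumPy, using the same made-up data as before. As a sanity check, it also fits the model with `np.linalg.lstsq`, which minimizes \(\parallel\mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2\) directly (and is the numerically safer choice in practice); the two answers should agree:

```python
import numpy as np

# Made-up observations, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])

# Solve the linear system X^T X w = X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares fit of ||y - X w||^2; should give the same weights.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w, w_lstsq)
```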