# Regression with one variable revisited


Let’s say that we have \(N\) observations consisting of inputs (features):

\[
x_{1:N} = (x_1, x_2, \dots, x_N),
\]

and outputs (targets):

\[
y_{1:N} = (y_1, y_2, \dots, y_N).
\]
We want to learn the map (function) that connects the inputs to the outputs.
We say that we have a *regression* problem when the outputs are continuous quantities, e.g., mass, height, price.
When the outputs are discrete, e.g., colors, numbers, then we say that we have a *classification* problem.
In this lecture, the focus will be on regression.

To proceed, you need to make a model that connects the inputs to the outputs. The simplest such model is:

\[
y = w_0 + w_1 x.
\]
This is the linear model we saw in the previous lecture with \(w_0 = b\) and \(w_1 = a\).
The parameters \(w_0\) and \(w_1\) are called the regression *weights* and we need to find them using the observations \(x_{1:N}\) and \(y_{1:N}\).

In the previous lecture, we fitted the model by minimizing the sum of square errors:

\[
L(w_0, w_1) = \sum_{i=1}^N \left[y_i - (w_0 + w_1 x_i)\right]^2.
\]
Now we are going to express this equation using linear algebra. We do this for two reasons:

- It is a lot of fun!
- It is essential for formulating the fitting problem for more complicated models.

We will need the *design matrix*:

\[
\mathbf{X} = \begin{pmatrix}
1 & x_1 \\
1 & x_2 \\
\vdots & \vdots \\
1 & x_N
\end{pmatrix}.
\]

The design matrix \(\mathbf{X}\) is an \(N\times 2\) matrix whose first column is all ones and whose second column contains the observed inputs.
We will also need the *vector of observed outputs*:

\[
\mathbf{y} = \begin{pmatrix}
y_1 \\
\vdots \\
y_N
\end{pmatrix},
\]

and the *vector of weights*:

\[
\mathbf{w} = \begin{pmatrix}
w_0 \\
w_1
\end{pmatrix}.
\]
I hope that you remember how to do matrix-vector multiplication. Notice what we get when we multiply the design matrix \(\mathbf{X}\) with the weight vector \(\mathbf{w}\):

\[
\mathbf{X}\mathbf{w} = \begin{pmatrix}
w_0 + w_1 x_1 \\
\vdots \\
w_0 + w_1 x_N
\end{pmatrix}.
\]
Wow! So, \(\mathbf{X}\mathbf{w}\) is an \(N\)-dimensional vector that contains the predictions of our linear model at the observed inputs. If we subtract this vector from the vector of observed outputs \(\mathbf{y}\), we get the prediction errors:

\[
\mathbf{y} - \mathbf{X}\mathbf{w} = \begin{pmatrix}
y_1 - (w_0 + w_1 x_1) \\
\vdots \\
y_N - (w_0 + w_1 x_N)
\end{pmatrix}.
\]
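The construction above is easy to try out in NumPy. Here is a short sketch (the data values are made up purely for illustration) that builds the design matrix and computes the prediction vector \(\mathbf{X}\mathbf{w}\) and the error vector:

```python
import numpy as np

# Made-up observations, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Design matrix: first column all ones, second column the inputs.
X = np.column_stack([np.ones_like(x), x])

# Some weight vector w = (w_0, w_1).
w = np.array([1.0, 2.0])

predictions = X @ w       # w_0 + w_1 * x_i for each observation
errors = y - predictions  # the prediction errors
print(predictions)
print(errors)
```

Note that `X @ w` computes every prediction at once; no loop over the observations is needed.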
Okay. Now recall that the [Euclidean norm](https://en.wikipedia.org/wiki/Norm_(mathematics)) \(\parallel\mathbf{v}\parallel\) of a vector is the square root of the sum of the squares of its components. Hmm. Let’s take the square of the Euclidean norm of the error vector. It is:

\[
\parallel\mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2 = \sum_{i=1}^N \left[y_i - (w_0 + w_1 x_i)\right]^2.
\]
But this is just the sum of square errors, i.e., we have shown that:

\[
L(\mathbf{w}) = \parallel\mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2.
\]
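You can verify this identity numerically. The following sketch (again with made-up data) computes the loss both ways, as an explicit sum of squares and as the squared norm of the error vector:

```python
import numpy as np

# Made-up observations, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])
w = np.array([1.0, 2.0])

# Sum of square errors, written out as a sum...
sse = np.sum((y - (w[0] + w[1] * x)) ** 2)
# ...and as the squared Euclidean norm of the error vector.
norm_sq = np.linalg.norm(y - X @ w) ** 2
print(sse, norm_sq)
```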
We have managed to express the loss function using linear algebra. The mathematical problem of finding the best weight vector is now:

\[
\min_{\mathbf{w}} \parallel\mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2.
\]
This form is much more convenient mathematically. Remember that to solve the minimization problem, we need to take the gradient of \(L(\mathbf{w})\) with respect to \(\mathbf{w}\) and set the result equal to zero. This form allows us to take derivatives in a much easier way. But there is one more thing that we could do before we take the gradient. Notice that the Euclidean norm of a vector \(\mathbf{v}\) satisfies:

\[
\parallel\mathbf{v}\parallel^2 = \mathbf{v}^T\mathbf{v},
\]
where we are thinking of \(\mathbf{v}\) as a column matrix and \(\mathbf{v}^T\) is the transpose of \(\mathbf{v}\), i.e., a row matrix. To prove the equality, we start from the right-hand side:

\[
\mathbf{v}^T\mathbf{v} = \sum_{i=1}^N v_i^2 = \parallel\mathbf{v}\parallel^2.
\]
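A one-line numerical check of this identity, using an arbitrary vector:

```python
import numpy as np

v = np.array([3.0, 4.0])

# Left-hand side: the dot product of v with itself.
vTv = v @ v
# Right-hand side: the square of the Euclidean norm.
norm_sq = np.linalg.norm(v) ** 2
print(vTv, norm_sq)
```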
Interesting! So we can rewrite the sum of square errors as:

\[
L(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})
= \mathbf{y}^T\mathbf{y} - \mathbf{y}^T\mathbf{X}\mathbf{w} - \mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}.
\]
Now, because \(\mathbf{w}^T\mathbf{X}^T\mathbf{y}\) is just a number (think about the dimensions: \((1\times 2)\times (2\times N)\times (N \times 1) = 1\times 1\)), it is the same as its transpose, i.e.,

\[
\mathbf{w}^T\mathbf{X}^T\mathbf{y} = \left(\mathbf{w}^T\mathbf{X}^T\mathbf{y}\right)^T = \mathbf{y}^T\mathbf{X}\mathbf{w}.
\]
Using this fact, we can write:

\[
L(\mathbf{w}) = \mathbf{y}^T\mathbf{y} - 2\mathbf{w}^T\mathbf{X}^T\mathbf{y} + \mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w}.
\]
Now we can take the gradient with respect to \(\mathbf{w}\):

\[
\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\mathbf{w}.
\]
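If the "derivative magic" below makes you suspicious, you can check the gradient formula against a finite-difference approximation. This sketch uses arbitrary illustrative data and weights:

```python
import numpy as np

# Made-up observations, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])
w = np.array([0.5, 1.5])

def L(w):
    """Sum of square errors as the squared norm of the error vector."""
    e = y - X @ w
    return e @ e

# Analytical gradient: -2 X^T y + 2 X^T X w.
grad = -2 * X.T @ y + 2 * X.T @ X @ w

# Central finite-difference approximation, one component at a time.
eps = 1e-6
fd = np.array([
    (L(w + eps * np.eye(2)[i]) - L(w - eps * np.eye(2)[i])) / (2 * eps)
    for i in range(2)
])
print(grad, fd)
```

The two results should agree to several decimal places, which is good evidence that the analytical gradient is correct.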
Okay, I do admit that I did some derivative magic there. But the result is correct. If you really want to understand it, you would have to work out the gradients of \(\mathbf{w}^T\mathbf{u}\) and \(\mathbf{w}^T\mathbf{A}\mathbf{w}\), where \(\mathbf{u}\) is a 2-dimensional vector and \(\mathbf{A}\) is a \(2\times 2\) matrix. The results are \(\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{u}) = \mathbf{u}\) and, for symmetric \(\mathbf{A}\) (which \(\mathbf{X}^T\mathbf{X}\) is), \(\nabla_{\mathbf{w}}(\mathbf{w}^T\mathbf{A}\mathbf{w}) = 2\mathbf{A}\mathbf{w}\).

Setting the gradient to zero yields the following linear system:

\[
\mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}.
\]
The bottom line: to find the best \(\mathbf{w}\), you must solve this linear system! As I will show you in a while, for more complex models that remain linear in the parameters, you basically have to do exactly the same thing but with a different design matrix. By the way, if you work out the analytical solution of this system for the linear model with two parameters, you will get exactly the expression involving the correlation between \(X\) and \(Y\) that we derived in the previous lecture.
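Here is a sketch of solving the system in NumPy, using the same made-up data as before. As a sanity check, it also fits the model with `np.linalg.lstsq`, which minimizes \(\parallel\mathbf{y} - \mathbf{X}\mathbf{w}\parallel^2\) directly (and is the numerically safer choice in practice); the two answers should agree:

```python
import numpy as np

# Made-up observations, for illustration only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])
X = np.column_stack([np.ones_like(x), x])

# Solve the linear system X^T X w = X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares fit of ||y - X w||^2; should give the same weights.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w, w_lstsq)
```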