# Linear regression with one variable

Assume that you have \(N\) pairs of observations \((x_i, y_i)\) from two random variables \(X\) and \(Y\). Now, you suspect that there is a linear relationship between these two random variables. Namely, you suspect that there exist coefficients \(a\) and \(b\) such that:

\[ Y = a X + b + U, \]

where \(U\) is some unobserved factor that we are going to treat as noise. How can you find \(a\) and \(b\) from the \(N\) observations we have? This is a so-called regression problem.

Let’s develop the simplest possible approach for solving the regression problem. It is called the least squares approach. The approach basically says that you should pick \(a\) and \(b\) so as to minimize the sum of square errors:

\[ L(a,b) = \sum_{i=1}^N(ax_i + b - y_i)^2. \]

Here, \(ax_i + b\) is the prediction our model makes for \(y_i\) given \(a\) and \(b\), so each term in the sum is the squared error of that prediction.
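
Before minimizing anything, it helps to have this loss in code. A minimal sketch (the function name is my own, not from any library):

import numpy as np

def sse(a, b, x, y):
    """Sum of squared errors L(a, b) for the line y = a * x + b."""
    return np.sum((a * x + b - y) ** 2)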

Okay, we proceed as usual. We take derivatives of \(L\) with respect to \(a\) and \(b\), set them equal to zero and solve for \(a\) and \(b\). Let’s do it:

\[\begin{split} \begin{split} \frac{\partial L(a,b)}{\partial a} &= \frac{\partial}{\partial a}\sum_{i=1}^N(ax_i + b - y_i)^2\\ &= \sum_{i=1}^N\frac{\partial}{\partial a}(ax_i + b - y_i)^2\\ &= \sum_{i=1}^N2(ax_i+b-y_i)x_i\\ &= 2\sum_{i=1}^N(ax_i^2 + bx_i - y_ix_i)\\ &= 2\left[a\sum_{i=1}^Nx_i^2 + b\sum_{i=1}^Nx_i - \sum_{i=1}^Nx_iy_i\right]. \end{split} \end{split}\]

Here is the other one:

\[\begin{split} \begin{split} \frac{\partial L(a,b)}{\partial b} &= \frac{\partial}{\partial b}\sum_{i=1}^N(ax_i + b - y_i)^2\\ &= \sum_{i=1}^N\frac{\partial}{\partial b}(ax_i + b - y_i)^2\\ &= \sum_{i=1}^N2(ax_i+b-y_i)\\ &= 2\sum_{i=1}^N(ax_i+b-y_i)\\ &= 2\left[a\sum_{i=1}^Nx_i + bN - \sum_{i=1}^Ny_i\right]. \end{split} \end{split}\]

Setting these two derivatives to zero yields the following linear system:

\[\begin{split} \begin{split} a\sum_{i=1}^Nx_i^2 + b\sum_{i=1}^Nx_i &= \sum_{i=1}^Nx_iy_i,\\ a\sum_{i=1}^Nx_i + bN &= \sum_{i=1}^Ny_i. \end{split} \end{split}\]
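
In practice, you could hand this system straight to a linear solver instead of doing the algebra. A minimal numpy sketch on synthetic data from a known line (the data and all names here are for illustration only):

import numpy as np

# Synthetic data from the line y = 2 x + 1 plus noise, for illustration
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=100)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, size=100)
N = x.shape[0]

# Assemble the 2x2 system above and solve for (a, b)
A = np.array([[np.sum(x ** 2), np.sum(x)],
              [np.sum(x), N]])
rhs = np.array([np.sum(x * y), np.sum(y)])
a, b = np.linalg.solve(A, rhs)
print(f'a = {a:1.2f}, b = {b:1.2f}')  # should be close to 2 and 1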

But the \(2\times 2\) system is also easy to solve by hand using substitution. Take the second equation and solve for \(b\):

\[\begin{split} \begin{split} b &= \frac{1}{N}\left[\sum_{i=1}^Ny_i - a\sum_{i=1}^Nx_i\right]\\ &= \frac{1}{N}\sum_{i=1}^Ny_i - a \frac{1}{N}\sum_{i=1}^Nx_i\\ &= \hat{\mu}_Y - a \hat{\mu}_X, \end{split} \end{split}\]

where \(\hat{\mu}_X\) and \(\hat{\mu}_Y\) are the empirical estimates of the mean of \(X\) and \(Y\), respectively. Now let’s take this and substitute in the first equation of the linear system. We have:

\[\begin{split} \begin{split} &a\sum_{i=1}^Nx_i^2 + (\hat{\mu}_Y - a \hat{\mu}_X)\sum_{i=1}^Nx_i = \sum_{i=1}^Nx_iy_i\\ &\Rightarrow a\left[\sum_{i=1}^Nx_i^2 - \hat{\mu}_X\sum_{i=1}^Nx_i\right] = \sum_{i=1}^Nx_iy_i - \hat{\mu}_Y\sum_{i=1}^Nx_i\\ &\Rightarrow a\frac{1}{N}\left[\sum_{i=1}^Nx_i^2 - \hat{\mu}_X\sum_{i=1}^Nx_i\right] = \frac{1}{N}\sum_{i=1}^Nx_iy_i - \frac{1}{N}\hat{\mu}_Y\sum_{i=1}^Nx_i\\ &\Rightarrow a\left[\frac{1}{N}\sum_{i=1}^Nx_i^2 - \hat{\mu}_X\frac{1}{N}\sum_{i=1}^Nx_i\right] = \frac{1}{N}\sum_{i=1}^Nx_iy_i - \hat{\mu}_Y\frac{1}{N}\sum_{i=1}^Nx_i\\ &\Rightarrow a\left[\frac{1}{N}\sum_{i=1}^Nx_i^2 - \hat{\mu}_X\cdot \hat{\mu}_X\right] = \frac{1}{N}\sum_{i=1}^Nx_iy_i - \hat{\mu}_Y\cdot \hat{\mu}_X\\ &\Rightarrow a\left[\frac{1}{N}\sum_{i=1}^Nx_i^2 - \hat{\mu}_X^2\right] = \frac{1}{N}\sum_{i=1}^Nx_iy_i - \hat{\mu}_Y\cdot \hat{\mu}_X \end{split} \end{split}\]

Now the magic starts… Notice that the coefficient of \(a\) is actually the empirical estimate of the variance of \(X\):

\[ \hat{\sigma}_X^2 = \frac{1}{N}\sum_{i=1}^N(x_i-\hat{\mu}_X)^2 = \dots = \frac{1}{N}\sum_{i=1}^Nx_i^2 - \hat{\mu}_X^2. \]

Similarly, the right-hand-side is the empirical estimate of the covariance of \(X\) and \(Y\):

\[ \hat{\sigma}_{X,Y} = \frac{1}{N}\sum_{i=1}^N(x_i-\hat{\mu}_X)(y_i-\hat{\mu}_Y) = \dots = \frac{1}{N}\sum_{i=1}^Nx_iy_i - \hat{\mu}_X\cdot \hat{\mu}_Y. \]
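
Both shortcut formulas are easy to check numerically. A quick sketch, reusing the synthetic x and y from the solver example above:

# Verify the variance and covariance shortcut formulas numerically
mu_X, mu_Y = x.mean(), y.mean()
print(np.allclose(np.mean(x ** 2) - mu_X ** 2,
                  np.mean((x - mu_X) ** 2)))              # True
print(np.allclose(np.mean(x * y) - mu_X * mu_Y,
                  np.mean((x - mu_X) * (y - mu_Y))))      # True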

Putting everything together gives us:

\[ a\hat{\sigma}_X^2 = \hat{\sigma}_{X,Y}. \]

Solving for \(a\):

\[ a = \frac{\hat{\sigma}_{X,Y}}{\hat{\sigma}_X^2} = \frac{\hat{\sigma}_{X,Y}}{\hat{\sigma}_X\hat{\sigma}_Y}\frac{\hat{\sigma}_Y}{\hat{\sigma}_X}, \]

which gives a relationship between \(a\) and the correlation coefficient between \(X\) and \(Y\):

\[ \hat{a} = \hat{\rho}_{X,Y}\frac{\hat{\sigma}_Y}{\hat{\sigma}_X}. \]

This is very interesting. Intuitively, dividing by \(\hat{\sigma}_X\) removes the units of \(X\), multiplying by \(\hat{\sigma}_Y\) introduces the units of \(Y\), and the correlation coefficient sets the sign and the strength of the slope. Nice!
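
The formula is also a one-liner to check against the linear-system solution from the earlier sketch:

# Slope via the correlation coefficient; matches np.linalg.solve above
rho = np.corrcoef(x, y)[0, 1]
a_hat = rho * y.std() / x.std()
print(np.allclose(a_hat, a))  # True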

What about \(b\)? Let’s substitute back into the equation:

\[ \hat{b} = \hat{\mu}_Y - \hat{a} \hat{\mu}_X. \]

So, the intercept is the mean of the observed \(y_i\)'s minus the mean of the slope term \(\hat{a}x_i\); equivalently, it shifts the line so that it passes through the point of means \((\hat{\mu}_X, \hat{\mu}_Y)\).
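
As a sanity check, a generic least-squares routine such as np.polyfit should recover the same \(\hat{a}\) and \(\hat{b}\) on the synthetic data from the earlier sketch:

# Compare the closed-form estimates to numpy's least-squares line fit
b_hat = y.mean() - a_hat * x.mean()
a_np, b_np = np.polyfit(x, y, 1)  # degree-1 fit returns (slope, intercept)
print(np.allclose([a_hat, b_hat], [a_np, b_np]))  # True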

Let’s end the theory by looking at the final squared error. We just plug \(\hat{a}\) and \(\hat{b}\) into \(L\) and see what we get. This is the result:

\[ L(\hat{a}, \hat{b}) = N(1- \hat{\rho}_{X,Y}^2)\hat{\sigma}_Y^2. \]

What does this say? The following:

  • The error grows with the number of points. This makes sense: the loss is a sum over all \(N\) observations, so more points accumulate more error.

  • If the correlation coefficient is zero, then the error is maximized, and it equals \(N\) times the variance of \(Y\).

  • The error becomes zero if the correlation coefficient is exactly one or minus one.

And here is the proof, if you are interested:

\[\begin{split} \begin{split} L(\hat{a}, \hat{b}) &= \sum_{i=1}^N(\hat{a}x_i + \hat{b} - y_i)^2\\ &= \sum_{i=1}^N(\hat{a}x_i + \hat{\mu}_Y - \hat{a} \hat{\mu}_X - y_i)^2\\ &= \sum_{i=1}^N\left[\hat{a}(x_i-\hat{\mu}_X) - (y_i - \hat{\mu}_Y)\right]^2\\ &= \sum_{i=1}^N\left[\hat{a}^2(x_i-\hat{\mu}_X)^2 + (y_i - \hat{\mu}_Y)^2 - 2\hat{a}(x_i-\hat{\mu}_X)(y_i - \hat{\mu}_Y)\right]\\ &= \hat{a}^2\sum_{i=1}^N(x_i-\hat{\mu}_X)^2 + \sum_{i=1}^N(y_i - \hat{\mu}_Y)^2 - 2\hat{a}\sum_{i=1}^N(x_i-\hat{\mu}_X)(y_i - \hat{\mu}_Y)\\ &= \hat{a}^2N\hat{\sigma}^2_X + N\hat{\sigma}_Y^2 - 2\hat{a}N\hat{\sigma}_{X,Y}\\ &= N\left[\hat{\sigma}_Y^2 + \hat{\rho}_{X,Y}^2\frac{\hat{\sigma}_Y^2}{\hat{\sigma}_X^2}\hat{\sigma}^2_X - 2\hat{\rho}_{X,Y}\frac{\hat{\sigma}_Y}{\hat{\sigma}_X}\hat{\sigma}_{X,Y}\right]\\ &= N\left[\hat{\sigma}_Y^2 + \hat{\rho}_{X,Y}^2\hat{\sigma}_Y^2 - 2\hat{\rho}_{X,Y}^2\hat{\sigma}_Y^2\right]\\ &= N\left[\hat{\sigma}_Y^2 - \hat{\rho}_{X,Y}^2\hat{\sigma}_Y^2\right]\\ &= N(1- \hat{\rho}_{X,Y}^2)\hat{\sigma}_Y^2. \end{split} \end{split}\]
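
The identity is also easy to verify numerically with the sse sketch and the synthetic data from before:

# Check L(a_hat, b_hat) = N * (1 - rho^2) * sigma_Y^2
print(np.allclose(sse(a_hat, b_hat, x, y),
                  x.shape[0] * (1.0 - rho ** 2) * y.var()))  # True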

## Example: Modeling hvac as a function of t_out

Let’s now apply the estimates we just developed to the smart buildings dataset. In particular, we are going to regress \(Y=\)hvac on \(X=\)t_out, separately for the heating and the cooling regimes.

MAKE_BOOK_FIGURES=False

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks")

def set_book_style():
    plt.style.use('seaborn-v0_8-white') 
    sns.set_style("ticks")
    sns.set_palette("deep")

    mpl.rcParams.update({
        # Font settings
        'font.family': 'serif',  # For academic publishing
        'font.size': 8,  # Base font size in points
        'axes.labelsize': 8,
        'axes.titlesize': 8,
        'xtick.labelsize': 7,  # Slightly smaller for better readability
        'ytick.labelsize': 7,
        'legend.fontsize': 7,
        
        # Line and marker settings for consistency
        'axes.linewidth': 0.5,
        'grid.linewidth': 0.5,
        'lines.linewidth': 1.0,
        'lines.markersize': 4,
        
        # Layout to prevent clipped labels
        'figure.constrained_layout.use': True,
        
        # Default DPI (will override when saving)
        'figure.dpi': 600,
        'savefig.dpi': 600,
        
        # Despine - remove top and right spines
        'axes.spines.top': False,
        'axes.spines.right': False,
        
        # Remove legend frame
        'legend.frameon': False,
        
        # Additional trim settings
        'figure.autolayout': True,  # Alternative to constrained_layout
        'savefig.bbox': 'tight',    # Trim when saving
        'savefig.pad_inches': 0.1   # Small padding to ensure nothing gets cut off
    })

def save_for_book(fig, filename, is_vector=True, **kwargs):
    """
    Save a figure with book-optimized settings.
    
    Parameters:
    -----------
    fig : matplotlib figure
        The figure to save
    filename : str
        Filename without extension
    is_vector : bool
        If True, saves as vector at 1000 dpi. If False, saves as raster at 600 dpi.
    **kwargs : dict
        Additional kwargs to pass to savefig
    """    
    # Set appropriate DPI and format based on figure type
    if is_vector:
        dpi = 1000
        ext = '.pdf'
    else:
        dpi = 600
        ext = '.tif'
    
    # Save the figure with book settings
    fig.savefig(f"{filename}{ext}", dpi=dpi, **kwargs)


def make_full_width_fig():
    return plt.subplots(figsize=(4.7, 2.9), constrained_layout=True)

def make_half_width_fig():
    return plt.subplots(figsize=(2.35, 1.45), constrained_layout=True)

if MAKE_BOOK_FIGURES:
    set_book_style()
make_full_width_fig = make_full_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()
make_half_width_fig = make_half_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()

import numpy as np
import scipy.stats as st
   
# The url of the file we want to download
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/temperature_raw.xlsx'
!curl -O $url

import pandas as pd

df = pd.read_excel('temperature_raw.xlsx')
df = df.dropna(axis=0)
df.head()
df_heating = df[df['t_out'] < 60]
df_cooling = df[df['t_out'] > 70]
df_off = df[(df['t_out'] >= 60) & (df['t_out'] <= 70)]

First, we do the cooling:

xdata = df_cooling['t_out']
ydata = df_cooling['hvac']

# Estimate the statistics
mu_X = xdata.mean()
mu_Y = ydata.mean()
sigma_X = xdata.std()
sigma_Y = ydata.std()
rho_XY = np.corrcoef(xdata, ydata)[0, 1]

# The a coefficient
a = rho_XY * sigma_Y / sigma_X 
# The b coefficient
b = mu_Y - a * mu_X

# Print the results
print('Model coefficients for cooling: ')
print(f'a = {a:1.2f}')
print(f'b = {b:1.2f}')
Model coefficients for cooling: 
a = 3.78
b = -229.95

Here is how you can evaluate the model at arbitrary points:

xs = np.linspace(xdata.min(), xdata.max())
ys = a * xs + b
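
As a quick sanity check, a generic routine like np.polyfit should recover essentially the same coefficients from the cooling data:

# Optional check against numpy's least-squares line fit
a_check, b_check = np.polyfit(xdata, ydata, 1)
print(f'a = {a_check:1.2f}, b = {b_check:1.2f}')  # should match a and b above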

Let’s visualize the model we just built:

fig, ax = make_full_width_fig()
ax.scatter(df_cooling['t_out'], df_cooling['hvac'], label='Cooling')
ax.plot(xs, ys, 'r--', lw=2, label='Linear cooling model')
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')
plt.legend(loc='best')
save_for_book(fig, 'ch14.fig4')
[Figure: hvac (kWh) versus t_out (F) for the cooling data, with the fitted linear model as a red dashed line.]

Now for heating:

xdata = df_heating['t_out']
ydata = df_heating['hvac']

# Estimate the statistics
mu_X = xdata.mean()
mu_Y = ydata.mean()
sigma_X = xdata.std()
sigma_Y = ydata.std()
rho_XY = np.corrcoef(xdata, ydata)[0, 1]

# The a coefficient
a = rho_XY * sigma_Y / sigma_X 
# The b coefficient
b = mu_Y - a * mu_X

# Print the results
print('Model coefficients for heating: ')
print(f'a = {a:1.2f}')
print(f'b = {b:1.2f}')

# Visualize
xs = np.linspace(xdata.min(), xdata.max())
ys = a * xs + b
fig, ax = make_full_width_fig()
ax.scatter(df_heating['t_out'], df_heating['hvac'], label='Heating')
ax.plot(xs, ys, 'r--', lw=2, label='Linear heating model')
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')
plt.legend(loc='best')
save_for_book(fig, 'ch14.fig5')
Model coefficients for heating: 
a = -4.99
b = 296.26
[Figure: hvac (kWh) versus t_out (F) for the heating data, with the fitted linear model as a red dashed line.]