Correlation between two random variables

The covariance between two random variables \(X\) and \(Y\), \(\mathbf{C}[X,Y]\), is not an absolute measure. As a matter of fact, the covariance has the units of \(X\) times the units of \(Y\). So, if you change the units of \(X\) and \(Y\), the covariance will change. Changing the units of \(X\) is like defining a new random variable:

\[ X' = \lambda X. \]

The covariance between \(X'\) and \(Y\) would be (using the fact that \(\mu_{X'} = \mathbf{E}[\lambda X] = \lambda\mu_X\)):

\[ \mathbf{C}[X',Y] = \mathbf{E}[(X'-\mu_{X'})(Y-\mu_Y)] = \mathbf{E}[(\lambda X - \lambda \mu_X)(Y-\mu_Y)] = \lambda \mathbf{E}[(X-\mu_X)(Y-\mu_Y)] = \lambda \mathbf{C}[X,Y]. \]

As an example, imagine that \(X\) was measured in meters and you wanted to change its units to centimeters. Then \(\lambda = 100\) and \(X'=100X\). The covariance between the new variable \(X'\) and \(Y\) would be 100 times bigger! As I said, the covariance is not an absolute measure.
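By the way, you can see this scaling numerically. Here is a quick sketch with synthetic data (the numbers are made up just for this demonstration):

import numpy as np
np.random.seed(0)
# Synthetic measurements: think of x as lengths in meters
x = np.random.rand(1000)
y = 2.0 * x + 0.1 * np.random.randn(1000)
# Covariance in the original units
c = np.cov(x, y)[0, 1]
# Change the units of x from meters to centimeters
c_prime = np.cov(100.0 * x, y)[0, 1]
print("C[X, Y]  = {0:1.4f}".format(c))
print("C[X', Y] = {0:1.4f}".format(c_prime))

You should see that the second number is 100 times the first.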

How can we fix this? Well, we fix it using the concept of correlation. The correlation between two random variables \(X\) and \(Y\) is defined by:

Mathematical definition of the correlation coefficient

The correlation coefficient between two random variables \(X\) and \(Y\) is defined to be the covariance between the two random variables divided by the product of their standard deviations, i.e.,

\[ \rho(X,Y) = \frac{\mathbf{C}[X,Y]}{\sigma_X\sigma_Y}. \]

where

\[ \sigma_X = \sqrt{\mathbf{V}[X]}, \]

and

\[ \sigma_Y = \sqrt{\mathbf{V}[Y]}. \]

Alright, let’s see why this is a good measure by looking closely at some of its properties.

Property 1: The correlation coefficient remains unchanged when you change the units of the random variables

Okay, this sounds good. Take the example we gave above:

\[ X' = \lambda X. \]

We have already shown that:

\[ \mathbf{C}[X', Y] = \mathbf{C}[\lambda X, Y] = \lambda\mathbf{C}[X,Y]. \]

Notice that:

\[ \sigma_{X'} = \sqrt{\mathbf{V}[X']} = \sqrt{\mathbf{V}[\lambda X]} = \sqrt{\lambda^2\mathbf{V}[X]} = \lambda\sqrt{\mathbf{V}[X]} = \lambda\sigma_X, \]

where we used that \(\lambda > 0\) for a change of units (otherwise we would have to write \(|\lambda|\)).

So, we have:

\[\begin{split} \rho(X', Y) &= \frac{\mathbf{C}[X',Y]}{\sigma_{X'}\sigma_Y}\\ &= \frac{\lambda \mathbf{C}[X,Y]}{\lambda\sigma_{X}\sigma_Y}\\ &= \frac{\mathbf{C}[X,Y]}{\sigma_{X}\sigma_Y}\\ &= \rho(X,Y). \end{split}\]

That’s it. The correlation coefficient has absolute meaning.
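If you do not trust the algebra, here is a numerical sanity check (again a sketch with made-up synthetic data):

import numpy as np
np.random.seed(0)
x = np.random.randn(1000)
y = 0.5 * x + np.random.randn(1000)
# The correlation coefficient before and after scaling x by 100
print("rho(X, Y)  = {0:1.4f}".format(np.corrcoef(x, y)[0, 1]))
print("rho(X', Y) = {0:1.4f}".format(np.corrcoef(100.0 * x, y)[0, 1]))

The two numbers should be identical.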

Property 2: Two independent random variables have zero correlation

This follows directly from the fact that the covariance between two independent random variables is zero.
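For example, sample two independent standard normals and look at the empirical correlation (a sketch; the estimate will be close to zero but not exactly zero because of the finite sample size):

import numpy as np
np.random.seed(0)
x = np.random.randn(10000)
y = np.random.randn(10000)  # independent of x
print("rho(X, Y) = {0:1.3f}".format(np.corrcoef(x, y)[0, 1]))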

Property 3: The maximum possible correlation between two random variables is one

This is nice! Why does it hold? Take a random variable \(X\). What is the random variable \(Y\) that is the most correlated with \(X\)? Well, can you think of something more correlated than \(Y=X\)? I don’t think so… Let’s see what correlation coefficient we get when we plug in \(Y=X\):

\[\begin{split} \rho(X,X) &= \frac{\mathbf{C}[X,X]}{\sigma_X\sigma_X}\\ &= \frac{\mathbf{E}[(X-\mu_X)(X-\mu_X)]}{\sigma_X^2}\\ &= \frac{\mathbf{E}[(X-\mu_X)^2]}{\sigma_X^2}\\ &= \frac{\mathbf{V}[X]}{\sigma_X^2}\\ &= \frac{\sigma_X^2}{\sigma_X^2}\\ &= 1. \end{split}\]

Great! By the way, notice that we also showed that the covariance of a random variable with itself is the variance…

Property 4: The minimum possible correlation between two random variables is minus one

This is proved in a similar manner as the previous property. What is the random variable \(Y\) that is most negatively correlated with \(X\)? It is \(Y=-X\). Notice that \(\mu_{-X} = -\mu_X\), so \(\mathbf{C}[X,-X] = \mathbf{E}[(X-\mu_X)(-X+\mu_X)] = -\mathbf{V}[X]\), and that \(\sigma_{-X} = \sigma_X\). If you plug these in the correlation formula, you will get:

\[ \rho(X,-X) = -1. \]
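You can confirm both extremes numerically (a quick sketch with synthetic data):

import numpy as np
np.random.seed(0)
x = np.random.randn(1000)
# Y = X gives the maximum, Y = -X gives the minimum
print("rho(X, X)  = {0:1.2f}".format(np.corrcoef(x, x)[0, 1]))
print("rho(X, -X) = {0:1.2f}".format(np.corrcoef(x, -x)[0, 1]))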

Summarizing the properties of the correlation

So, the correlation is a much better measure than the covariance when you want to assess how two random variables vary together. You have the following possibilities.

  • If the correlation is zero, then the two are uncorrelated. This doesn’t mean that they are independent though. It just means that they may be independent. We will elaborate on this later.

  • The closer the correlation coefficient is to plus one, the more positively correlated the random variables are.

  • The closer the correlation coefficient is to minus one, the more negatively correlated the random variables are.

Empirical estimation of the correlation coefficient

I have already told you how you can estimate the covariance. We can similarly estimate the standard deviations with averages:

\[ \hat{\sigma}_X = \sqrt{\frac{1}{N}\sum_{i=1}^N(x_i-\hat{\mu}_X)^2}, \]

and

\[ \hat{\sigma}_Y = \sqrt{\frac{1}{N}\sum_{i=1}^N(y_i-\hat{\mu}_Y)^2}. \]

So, our estimate for the correlation coefficient is:

\[ \hat{\rho}_{X,Y} = \frac{\hat{\sigma}_{X,Y}}{\hat{\sigma}_X\hat{\sigma}_Y}. \]
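Here is what this recipe looks like in code, in case you want to see it spelled out (a sketch; corr_estimate is a hypothetical helper and xs, ys are made-up data arrays):

import numpy as np

def corr_estimate(xs, ys):
    """Estimate the correlation coefficient between two data arrays.
    A hand-rolled version of what np.corrcoef computes. (np.corrcoef
    uses 1/(N-1) instead of 1/N, but the factor cancels in the ratio.)"""
    mu_x = np.mean(xs)
    mu_y = np.mean(ys)
    # Empirical covariance
    c_xy = np.mean((xs - mu_x) * (ys - mu_y))
    # Empirical standard deviations
    sigma_x = np.sqrt(np.mean((xs - mu_x) ** 2))
    sigma_y = np.sqrt(np.mean((ys - mu_y) ** 2))
    return c_xy / (sigma_x * sigma_y)

# Sanity check against numpy on synthetic data
np.random.seed(0)
xs = np.random.randn(100)
ys = 0.3 * xs + np.random.randn(100)
print("by hand: {0:1.4f}".format(corr_estimate(xs, ys)))
print("numpy:   {0:1.4f}".format(np.corrcoef(xs, ys)[0, 1]))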

Example: Correlation between t_out and hvac during heating, cooling, and off

Let’s now apply the estimate we developed for the correlation coefficient to the smart buildings dataset. In particular, we are going to estimate the correlation coefficient between \(X=\)t_out and \(Y=\)hvac for the three regimes considered (heating, cooling, and off).

Again, we do not have to calculate it by hand. We can use the built-in NumPy function np.corrcoef. Here is how. We start with cooling.

import numpy as np
import scipy.stats as st
import requests
import os
import pandas as pd

def download(url, local_filename=None):
    """
    Downloads the file in the ``url`` and saves it in the current working directory.
    """
    data = requests.get(url)
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)

# The url of the file we want to download
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/temperature_raw.xlsx'
download(url)

# Load the dataset and drop the rows with missing values
df = pd.read_excel('temperature_raw.xlsx')
df = df.dropna(axis=0)
df.head()

# Split the data into the three regimes based on the outside temperature
df_heating = df[df['t_out'] < 60]
df_cooling = df[df['t_out'] > 70]
df_off = df[(df['t_out'] >= 60) & (df['t_out'] <= 70)]

# The correlation matrix between t_out and hvac in the cooling regime
rho = np.corrcoef(df_cooling['t_out'], df_cooling['hvac'])
rho
array([[1.        , 0.24617791],
       [0.24617791, 1.        ]])

Notice that this is a matrix as well. It has the same format as the matrix returned by np.cov:

  • rho[0, 0] is the correlation coefficient between the first input (0 = t_out) and the first input (0 = t_out). So it always has to be one.

  • rho[1, 1] is the correlation coefficient between the second input (1 = hvac) and the second input (1 = hvac). Again, it is always one.

  • rho[0, 1] is the correlation coefficient between the first input (0 = t_out) and the second input (1 = hvac).

  • rho[1, 0] is the correlation coefficient between the second input (1 = hvac) and the first input (0 = t_out). It is the same as rho[0, 1] because the correlation coefficient is symmetric.
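You can see all of this by printing the individual entries of the rho matrix we just computed for the cooling regime:

print(rho[0, 0], rho[1, 1])  # the diagonal: both equal to one
print(rho[0, 1], rho[1, 0])  # the off-diagonal: equal to each other by symmetry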

Okay. Here is what we were after:

print("rho['t_out', 'hvac'|cooling] = {0:1.2f}".format(rho[0, 1]))
rho['t_out', 'hvac'|cooling] = 0.25

Nice and positive. But not very close to one.

Now let’s do heating:

rho = np.corrcoef(df_heating['t_out'], df_heating['hvac'])
print("rho['t_out', 'hvac'|heating] = {0:1.2f}".format(rho[0, 1]))
rho['t_out', 'hvac'|heating] = -0.45

This is negative as expected. And it is much closer to the minimum possible value (-1) than the cooling correlation coefficient is to the maximum value (+1).

Finally, let’s do the off setting:

rho = np.corrcoef(df_off['t_out'], df_off['hvac'])
print("rho['t_out', 'hvac'|off] = {0:1.2f}".format(rho[0, 1]))
rho['t_out', 'hvac'|off] = -0.03

This is tiny, as expected when the HVAC system is off. And we are happy because the correlation coefficient is independent of the units we are using!
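If you want to check this last claim on the actual data, here is a sketch: I am assuming t_out is recorded in degrees Fahrenheit (the 60 and 70 thresholds suggest so) and converting it to Celsius. The additive shift in the conversion does not matter either, because the covariance and the standard deviations only depend on deviations from the mean. You should get the same number as before:

# Convert t_out from Fahrenheit to Celsius and recompute (off regime)
t_out_C = (df_off['t_out'] - 32.0) * 5.0 / 9.0
rho_C = np.corrcoef(t_out_C, df_off['hvac'])
print("rho['t_out (C)', 'hvac'|off] = {0:1.2f}".format(rho_C[0, 1]))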