Covariance between two random variables
The concept of covariance summarizes with a single number how two random variables \(X\) and \(Y\) vary together. And there are three possibilities:
if \(X\) is increased, then \(Y\) will likely increase,
if \(X\) is decreased, then \(Y\) will likely decrease, and
\(X\) and \(Y\) are not linked.
Before defining these concepts exactly, let’s load the smart buildings dataset, which will help us demonstrate them. Here we go:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(rc={"figure.dpi":100, 'savefig.dpi':300})
sns.set_context('notebook')
sns.set_style("ticks")
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina', 'svg')
import numpy as np
import scipy.stats as st
import requests
import os
def download(url, local_filename=None):
    """
    Downloads the file in the ``url`` and saves it in the current working directory.
    """
    data = requests.get(url)
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)
# The url of the file we want to download
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/temperature_raw.xlsx'
download(url)
import pandas as pd
df = pd.read_excel('temperature_raw.xlsx')
df = df.dropna(axis=0)
df.head()
   | household | date       | score | t_out    | t_unit    | hvac
---|-----------|------------|-------|----------|-----------|-----------
 0 | a1        | 2018-01-07 | 100.0 | 4.283373 | 66.693229 | 246.473231
 1 | a10       | 2018-01-07 | 100.0 | 4.283373 | 66.356134 | 5.492116
 2 | a11       | 2018-01-07 | 58.0  | 4.283373 | 71.549132 | 402.094327
 3 | a12       | 2018-01-07 | 64.0  | 4.283373 | 73.429514 | 211.692244
 4 | a13       | 2018-01-07 | 100.0 | 4.283373 | 63.923937 | 0.850536
Here is the scatter plot of hvac (consumed HVAC energy in kWh) and t_out (external temperature in degrees F):
fig, ax = plt.subplots()
ax.scatter(df['t_out'], df['hvac'])
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)');
We see three clear regions here: heating, cooling, and off. Let me separate the data into different dataframes corresponding to these three regions.
df_heating = df[df['t_out'] < 60]
df_cooling = df[df['t_out'] > 70]
df_off = df[(df['t_out'] >= 60) & (df['t_out'] <= 70)]
fig, ax = plt.subplots()
ax.scatter(df_heating['t_out'], df_heating['hvac'], label='Heating')
ax.scatter(df_cooling['t_out'], df_cooling['hvac'], label='Cooling')
ax.scatter(df_off['t_out'], df_off['hvac'], label='Off')
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')
plt.legend(loc='best');
The covariance and the correlation will allow us to characterize the relationship between \(X=\)t_out and \(Y=\)hvac in each one of these regions with a single number.
Depending on the sign of this number (positive, negative, or zero), we can tell in which direction the relationship between \(X\) and \(Y\) goes.
In these three regions we find:

Heating region (t_out < 60 F): In this regime, increasing \(X=\)t_out decreases energy consumption \(Y=\)hvac because you use less heating. In the mathematical jargon, we say that \(X\) and \(Y\) are negatively correlated.

Cooling region (t_out > 70 F): In this regime, increasing \(X=\)t_out increases energy consumption \(Y=\)hvac because you use more cooling. In the mathematical jargon, we say that \(X\) and \(Y\) are positively correlated.

Off region (60 F <= t_out <= 70 F): In this regime, \(X=\)t_out does not affect energy consumption because the HVAC is most likely off. In the mathematical jargon, we say that \(X\) and \(Y\) are uncorrelated.
Okay, this is good. We are going to do two things next. I will first give you the mathematical definition of covariance and correlation and second I will show you how to estimate them from the data we have. Let’s go.
Mathematical definition of covariance

Let \(p(x,y)\) be the joint PDF of the random variables \(X\) and \(Y\). We may or we may not know this, but it certainly exists. Now, let

\[\mu_X = \mathbb{E}[X]\]

be the mean of \(X\) and

\[\mu_Y = \mathbb{E}[Y]\]

be the mean of \(Y\). The covariance of \(X\) and \(Y\) is defined to be:

\[\mathbf{C}[X,Y] = \mathbb{E}\left[(X-\mu_X)(Y-\mu_Y)\right].\]
So, it is the expectation of the product \((X-\mu_X)(Y-\mu_Y)\). Why is this a good definition of how \(X\) and \(Y\) vary together? To develop your intuition about it, let’s look at what the covariance turns out to be in three specific cases:
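Before going through these cases, here is a quick numerical sanity check of the definition. This is a minimal sketch of mine (not part of the original notebook, and assuming numpy has already been imported as np above): I draw synthetic samples of \(X\) and \(Y\) from a bivariate normal with a known covariance and approximate the expectation \(\mathbb{E}\left[(X-\mu_X)(Y-\mu_Y)\right]\) by a sample average.

np.random.seed(0)  # for reproducibility
true_cov = 0.8     # covariance we build into the synthetic data
samples = np.random.multivariate_normal(
    mean=[0.0, 0.0],
    cov=[[1.0, true_cov], [true_cov, 1.0]],
    size=100000
)
x, y = samples[:, 0], samples[:, 1]
# Approximate E[(X - mu_X)(Y - mu_Y)] by the corresponding sample average
cov_estimate = np.mean((x - x.mean()) * (y - y.mean()))
print("True covariance:      {0:1.3f}".format(true_cov))
print("Monte Carlo estimate: {0:1.3f}".format(cov_estimate))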
Case 1: If \(X\) and \(Y\) are independent, then the covariance is zero

Let’s assume that \(X\) and \(Y\) are independent. Then, their joint PDF factorizes:

\[p(x,y) = p(x)p(y).\]

This can be exploited to show that \(\mathbf{C}[X,Y]\) is exactly zero. Here it is:

\[\mathbf{C}[X,Y] = \mathbb{E}\left[(X-\mu_X)(Y-\mu_Y)\right] = \mathbb{E}[X-\mu_X]\,\mathbb{E}[Y-\mu_Y] = 0\cdot 0 = 0,\]

where the second equality holds because, under independence, the expectation of a product factorizes into the product of the expectations.
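As a quick empirical check of this fact (again a sketch of mine, not part of the original notebook), we can sample two independent standard normal random variables and verify that the estimated covariance is very close to zero:

np.random.seed(1)
x = np.random.randn(100000)
y = np.random.randn(100000)  # generated independently of x
cov_estimate = np.mean((x - x.mean()) * (y - y.mean()))
print("Estimated C[X, Y] for independent X and Y: {0:1.4f}".format(cov_estimate))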
Case 2: If \(Y=aX+b\) for some positive constant \(a\), then the covariance is positive

Let’s assume that there is a very simple relationship between \(X\) and \(Y\):

\[Y = aX + b,\]

for some positive \(a\) and an arbitrary \(b\). This is the simplest way in which an increase in \(X\) would yield an increase in \(Y\). Let’s see what covariance we get in this case. Notice that the mean of \(Y\) is now:

\[\mu_Y = \mathbb{E}[Y] = \mathbb{E}[aX+b] = a\mathbb{E}[X] + b = a\mu_X + b.\]

So, the covariance is:

\[\mathbf{C}[X,Y] = \mathbb{E}\left[(X-\mu_X)(Y-\mu_Y)\right] = \mathbb{E}\left[(X-\mu_X)\,a(X-\mu_X)\right] = a\mathbb{E}\left[(X-\mu_X)^2\right] = a\mathbb{V}[X],\]

which is, of course, positive because both \(a\) and the variance of \(X\) are positive.
Case 3: If \(Y=-aX+b\) for some positive constant \(a\), then the covariance is negative

Let’s assume that there is a very simple relationship between \(X\) and \(Y\):

\[Y = -aX + b,\]

for some positive \(a\) and an arbitrary \(b\). This is the simplest way in which an increase in \(X\) would yield a decrease in \(Y\). In exactly the same way as before, we can show that:

\[\mathbf{C}[X,Y] = -a\mathbb{V}[X],\]

which is a negative number.
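Here is a small numerical check of Cases 2 and 3 (a sketch of mine with arbitrarily chosen values of \(a\) and \(b\), assuming numpy has been imported as above). We sample \(X\), construct \(Y=aX+b\) and \(Y=-aX+b\), and compare the sample covariance of each pair to \(\pm a\) times the sample variance of \(X\):

np.random.seed(2)
a, b = 2.0, 5.0          # arbitrary positive a and arbitrary b
x = np.random.randn(100000)
y_pos = a * x + b        # Case 2: positive linear relationship
y_neg = -a * x + b       # Case 3: negative linear relationship
var_x = np.mean((x - x.mean()) ** 2)
cov_pos = np.mean((x - x.mean()) * (y_pos - y_pos.mean()))
cov_neg = np.mean((x - x.mean()) * (y_neg - y_neg.mean()))
print("a * V[X]      = {0:1.3f}".format(a * var_x))
print("C[X, aX + b]  = {0:1.3f}".format(cov_pos))
print("C[X, -aX + b] = {0:1.3f}".format(cov_neg))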
Empirical estimation of the covariance

Alright, so the covariance does have the intuitive meaning that we want. But how can we find it if we do not know the joint PDF \(p(x,y)\)? We will show how you can estimate it from samples of \(X\) and \(Y\). So, let’s say that we have \(N\) measurements of \(X\) and \(Y\), say \((x_i, y_i)\) for \(i=1,\dots,N\). We need the means, which we already know how to estimate:

\[\mu_X \approx \bar{x} = \frac{1}{N}\sum_{i=1}^N x_i,\]

and

\[\mu_Y \approx \bar{y} = \frac{1}{N}\sum_{i=1}^N y_i.\]

Okay, we need to estimate one more expectation. Let’s do it:

\[\mathbb{E}\left[(X-\mu_X)(Y-\mu_Y)\right] \approx \frac{1}{N}\sum_{i=1}^N (x_i - \mu_X)(y_i - \mu_Y).\]

So, replacing the means with their estimates, here is our estimate of the covariance:

\[\mathbf{C}[X,Y] \approx \frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y}).\]
Note
The standard estimate of the covariance differs a bit from what I have above. It is usually estimated by:

\[\mathbf{C}[X,Y] \approx \frac{1}{N-1}\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y}).\]

This is a so-called unbiased estimator. However, if \(N\) is big enough the difference is negligible and we don’t have to worry about it.
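To make the note concrete, here is a minimal sketch (my own, assuming the imports and the df_cooling dataframe defined above) that computes both versions of the estimate by hand and compares them to np.cov, which uses the \(N-1\) normalization by default and the \(N\) normalization when you pass bias=True:

x = df_cooling['t_out'].values
y = df_cooling['hvac'].values
N = x.shape[0]
x_bar = x.mean()
y_bar = y.mean()
# Biased estimate (divide by N)
cov_biased = np.sum((x - x_bar) * (y - y_bar)) / N
# Unbiased estimate (divide by N - 1)
cov_unbiased = np.sum((x - x_bar) * (y - y_bar)) / (N - 1)
print("Divide by N:       {0:1.2f}".format(cov_biased))
print("Divide by N - 1:   {0:1.2f}".format(cov_unbiased))
print("np.cov default:    {0:1.2f}".format(np.cov(x, y)[0, 1]))
print("np.cov(bias=True): {0:1.2f}".format(np.cov(x, y, bias=True)[0, 1]))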
Example: Covariance between t_out and hvac during heating, cooling, and off
Let’s now calculate the estimate we developed for the covariance in the smart buildings dataset.
In particular, we are going to estimate the covariance between \(X=\)t_out and \(Y=\)hvac for each of the three regions considered.
Fortunately, we do not have to calculate it by hand.
We can use the built-in function np.cov.
Here is how.
Let’s do cooling first.
C = np.cov(df_cooling['t_out'], df_cooling['hvac'])
print(C)
[[ 9.74568849 36.85230047]
[ 36.85230047 2299.42112855]]
Let me explain what np.cov() returns in our case.
First, notice that it returns a 2 x 2 matrix \(C\).
The diagonal of that matrix contains the variances of the two inputs.
So, here C[0, 0] is the variance of df_cooling['t_out'].
Check this out:
print("Variance of df_cooling['t_out'] = {0:1.2f}".format(df_cooling['t_out'].var()))
print('Compare to C[0, 0] = {0:1.2f}'.format(C[0, 0]))
Variance of df_cooling['t_out'] = 9.75
Compare to C[0, 0] = 9.75
Similarly, C[1, 1] is the variance of df_cooling['hvac']:
print("Variance of df_cooling['hvac'] = {0:1.2f}".format(df_cooling['hvac'].var()))
print('Compare to C[1, 1] = {0:1.2f}'.format(C[1, 1]))
Variance of df_cooling['hvac'] = 2299.42
Compare to C[1, 1] = 2299.42
Okay.
Now C[0, 1] is the covariance between the first input (0 = t_out) and the second input (1 = hvac).
Here it is:
print("C['t_out', 'hvac'|cooling] = {0:1.2f}".format(C[0, 1]))
C['t_out', 'hvac'|cooling] = 36.85
This is positive for cooling, as we expected.
Increasing t_out results in increasing hvac.
But what is C[1, 0]?
Well, this is the covariance between the second input (1 = hvac) and the first input (0 = t_out):
print("C['hvac', 't_out'|cooling] = {0:1.2f}".format(C[1, 0]))
C['hvac', 't_out'|cooling] = 36.85
This is exactly the same as C[0, 1]. Of course, this is not an accident.
The covariance between two random variables is a symmetric operator, i.e.,

\[\mathbf{C}[X,Y] = \mathbf{C}[Y,X].\]

The proof is trivial. Just look at the definition of the covariance.
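We can trivially confirm this on the cooling covariance matrix we computed above (a one-line check of mine):

# A covariance matrix is symmetric, so it equals its transpose
print("Is C symmetric? {0}".format(np.allclose(C, C.T)))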
Alright, let’s now look at the heating covariance:
C = np.cov(df_heating['t_out'], df_heating['hvac'])
print(C)
[[ 105.27235776 -525.43752907]
[ -525.43752907 12967.67912181]]
print("C['hvac', 't_out'|heating] = {0:1.2f}".format(C[1, 0]))
C['hvac', 't_out'|heating] = -525.44
It is a nice negative number.
Again, this is compatible with our intuition.
Negative means that if t_out is increased, hvac decreases.
That’s exactly what should be happening during heating.
Let’s do the off regime:
C = np.cov(df_off['t_out'], df_off['hvac'])
print(C)
[[ 8.08975479 -2.65681839]
[ -2.65681839 1306.35076875]]
print("C['hvac', 't_out'|heating] = {0:1.2f}".format(C[1, 0]))
C['hvac', 't_out'|heating] = -2.66
This is smaller in absolute value than either of the other covariances. But it is still negative… Is this -2.66 negligible? Or is it big? How do we know?
Well, that is what the correlation coefficient is going to help us decide…
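As a quick preview of where this is going (a sketch of mine, ahead of the proper definition), np.corrcoef rescales the covariance by the standard deviations of the two variables so that the result always lies between -1 and 1, which makes the three regimes directly comparable:

# Correlation coefficient of t_out and hvac in each regime
for name, d in [('heating', df_heating), ('cooling', df_cooling), ('off', df_off)]:
    rho = np.corrcoef(d['t_out'], d['hvac'])[0, 1]
    print("corr(t_out, hvac | {0}) = {1:1.2f}".format(name, rho))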