Histograms

Contents

Histograms

Histograms offer a nice way to summarize the uncertainty/variability in scalar variables. I am assuming that you have seen histograms in the past. They work as follows:

  • you split the interval in which your variable takes values in bins.

  • you count how many times the variable falls inside its bin.

Let’s look at some examples. First, set up our environment as usual, download and clean the dataset we introduced in The Python data analysis library:

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(rc={"figure.dpi":100, 'savefig.dpi':300})
sns.set_context('notebook')
sns.set_style("ticks")
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina', 'svg')
import numpy as np
import requests
import os
def download(url, local_filename=None):
    """
    Downloads the file in the ``url`` and saves it in the current working directory.
    """
    data = requests.get(url)
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)
   
# The url of the file we want to download
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/temp_price.csv'
download(url)

import pandas as pd
temp_price = pd.read_csv('temp_price.csv')
clean_data = temp_price.dropna(axis=0).rename(columns={'Price per week': 'week_price',
                                                       'Price per day': 'daily_price'})
clean_data.head()
household date score t_out t_unit hvac price week_price daily_price
0 a1 2019-01-06 85 38.599231 71.580704 35.113758 0.17303 6.075734 0.867962
1 a10 2019-01-06 70 38.599231 73.286260 63.949057 0.17303 11.065105 1.580729
2 a11 2019-01-06 61 38.599231 74.252046 147.612108 0.17303 25.541323 3.648760
3 a12 2019-01-06 65 38.599231 73.708482 74.394518 0.17303 12.872483 1.838926
4 a13 2019-01-06 66 38.599231 73.549554 173.095836 0.17303 29.950772 4.278682

Let’s do the histogram of t_unit first:

fig, ax = plt.subplots()
ax.hist(clean_data['t_unit'])
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Counts');
../_images/python-histograms_3_0.svg

It is straightforward to read this. Each bar gives you the number of households with internal temperature that fall with a bin.

Some times, we want to normalize the height of the bars so that the total area covered by the histogram is one. To do this, you need to divide by the total number of observations and by the width of each bin. What we get is a density. We will see in later lectures that this is an approximation of a probability density of a random variable. To get the density you need to use the keyword density=True in hist. Here is how:

fig, ax = plt.subplots()
ax.hist(clean_data['t_unit'], density=True)
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Frequency');
../_images/python-histograms_5_0.svg

You can also change the bin number. The default is 10. Let’s make it 5.

fig, ax = plt.subplots()
ax.hist(clean_data['t_unit'], density=True, bins=5)
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Frequency');
../_images/python-histograms_7_0.svg

Alternatively, you can also specify the bins on your own. You just have to provide the bin edges. Let’s pick: \((65, 72, 76, 82)\). Here we go:

fig, ax = plt.subplots()
ax.hist(clean_data['t_unit'], density=True, bins=(65, 70, 73, 76, 82))
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Frequency');
../_images/python-histograms_9_0.svg

Let’s plot a few more things on our histogram. For example, let’s plot the raw data as points on the x-axis. Here is how we can do that:

# This is the histogram
fig, ax = plt.subplots()
ax.hist(clean_data['t_unit'], density=True, bins=(65, 70, 73, 76, 82))
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Frequency')
# Add a plot of points with x axis being the temperatures and the
# y axis being zeros
ax.plot(clean_data['t_unit'], np.zeros(clean_data.shape[0]), '.')
# Move the plotting range a bit to the negative so that we can see the points
ax.set_ylim(-0.005, 0.16);
../_images/python-histograms_11_0.svg

This is nice. Let’s add some more information here. What about using a big red cross for marking the avarege temperature? Let’s do it!

average_unit_T = clean_data['t_unit'].mean()

# Same us before
fig, ax = plt.subplots()
ax.hist(clean_data['t_unit'], density=True, bins=(65, 70, 73, 76, 82))
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Frequency')
ax.plot(clean_data['t_unit'], np.zeros(clean_data.shape[0]), '.')
ax.set_ylim(-0.005, 0.16)

# But now I am adding the red cross at the average
ax.plot(average_unit_T, 0, 'rx', markersize=10, markeredgewidth=2);
../_images/python-histograms_13_0.svg

There is no limit to what you can plot with matplotlib!

Questions

Write some code to draw the histogram of the score.

# your code here