Histograms
Histograms offer a nice way to summarize the uncertainty/variability in scalar variables. I am assuming that you have seen histograms in the past. They work as follows:
You split the interval in which your variable takes values into bins.
You count how many times the variable falls inside each bin.
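The two steps above can be sketched with NumPy. The sample values and the bin edges below are made up for illustration; they are not taken from the dataset.

```python
import numpy as np

# Made-up temperature readings (illustrative only, not from temp_price.csv)
values = np.array([66.0, 68.5, 70.2, 71.0, 71.9, 73.4, 74.8, 77.1])

# Step 1: split the interval [65, 80] into three bins of equal width
edges = np.linspace(65.0, 80.0, 4)  # edges at 65, 70, 75, 80

# Step 2: count how many values fall inside each bin
counts, _ = np.histogram(values, bins=edges)

print(edges)
print(counts)  # 2 values in [65, 70), 5 in [70, 75), 1 in [75, 80]
```

Note that each bin includes its left edge but not its right edge, except for the last bin, which includes both.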
Let’s look at some examples. First, we set up our environment as usual, and then we download and clean the dataset we introduced in The Python Data Analysis Library:
!curl -O 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-239-intro-to-data-science/master/data/temp_price.csv'
import pandas as pd
temp_price = pd.read_csv('temp_price.csv')
clean_data = temp_price.dropna(axis=0).rename(columns={'Price per week': 'week_price',
'Price per day': 'daily_price'})
clean_data.head().round(2)
| | household | date | score | t_out | t_unit | hvac | price | week_price | daily_price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | a1 | 2019-01-06 | 85 | 38.6 | 71.58 | 35.11 | 0.17 | 6.08 | 0.87 |
| 1 | a10 | 2019-01-06 | 70 | 38.6 | 73.29 | 63.95 | 0.17 | 11.07 | 1.58 |
| 2 | a11 | 2019-01-06 | 61 | 38.6 | 74.25 | 147.61 | 0.17 | 25.54 | 3.65 |
| 3 | a12 | 2019-01-06 | 65 | 38.6 | 73.71 | 74.39 | 0.17 | 12.87 | 1.84 |
| 4 | a13 | 2019-01-06 | 66 | 38.6 | 73.55 | 173.10 | 0.17 | 29.95 | 4.28 |
Let’s do the histogram of t_unit first:
fig, ax = make_full_width_fig()
ax.hist(clean_data['t_unit'])
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Counts')
save_for_book(fig, 'ch4.fig7')
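The helpers make_full_width_fig and save_for_book come from the book's environment setup and are not defined on this page. If you run the code elsewhere, plain matplotlib is a possible stand-in, assuming the helpers only create and save a figure. The figure size, the synthetic data, and the output file name below are all assumptions made for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for clean_data['t_unit'] (assumption: not the real data)
rng = np.random.default_rng(0)
t_unit = rng.normal(loc=73.0, scale=2.0, size=200)

fig, ax = plt.subplots(figsize=(8, 3))  # the figure size is an arbitrary choice
counts, edges, _ = ax.hist(t_unit)      # default: 10 equal-width bins
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Counts')
# fig.savefig('t_unit_hist.png')        # uncomment to save; hypothetical file name
plt.close(fig)
```

ax.hist returns the bin counts and the bin edges, which is handy when you want to check the numbers behind the bars.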
It is straightforward to read this. Each bar gives you the number of observations whose unit temperature falls within the corresponding bin.
Sometimes, we want to normalize the height of the bars so that the total area covered by the histogram is one.
To do this, you need to divide by the total number of observations and by the width of each bin.
What we get is a density.
We will see in later lectures that this is an approximation of a probability density of a random variable.
To get the density you need to use the keyword density=True in hist.
Here is how:
fig, ax = make_full_width_fig()
ax.hist(clean_data['t_unit'], density=True)
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Density')
save_for_book(fig, 'ch4.fig8')
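To see what density=True does, here is a small sketch that carries out the normalization by hand and compares it with NumPy's result. It uses np.histogram, which accepts the same density keyword and which matplotlib's hist uses for the binning. The sample values are made up.

```python
import numpy as np

# Made-up temperature readings (illustrative only)
values = np.array([66.0, 68.5, 70.2, 71.0, 71.9, 73.4, 74.8, 77.1])

counts, edges = np.histogram(values, bins=5)
widths = np.diff(edges)

# Divide by the total number of observations and by the width of each bin
density_by_hand = counts / (counts.sum() * widths)

# Let NumPy do the same normalization
density, _ = np.histogram(values, bins=5, density=True)

print(np.allclose(density_by_hand, density))
# The total area of the bars (height times width, summed) is one
print(np.sum(density * widths))
```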
You can also change the number of bins. The default is 10. Let’s make it 5.
fig, ax = make_full_width_fig()
ax.hist(clean_data['t_unit'], density=True, bins=5)
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Density')
save_for_book(fig, 'ch4.fig9')
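If you want to check where the bin edges end up for a given number of bins, np.histogram_bin_edges is one way to do it. When bins is an integer, the range from the smallest to the largest value is split into that many equal-width bins. The values below are made up for illustration.

```python
import numpy as np

# Made-up temperature readings (illustrative only)
values = np.array([66.0, 68.5, 70.2, 71.0, 71.9, 73.4, 74.8, 76.0])

edges_10 = np.histogram_bin_edges(values, bins=10)  # 11 edges for 10 bins
edges_5 = np.histogram_bin_edges(values, bins=5)    # 6 edges for 5 bins

print(edges_10)
print(edges_5)  # from the minimum (66.0) to the maximum (76.0) in steps of 2.0
```

Fewer bins give a coarser but smoother picture; more bins show more detail but also more noise.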
Alternatively, you can specify the bins on your own. You just have to provide the bin edges. Let’s pick: \((65, 70, 73, 76, 82)\). Here we go:
fig, ax = make_full_width_fig()
ax.hist(clean_data['t_unit'], density=True, bins=(65, 70, 73, 76, 82))
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Density')
save_for_book(fig, 'ch4.fig10')
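One thing to keep in mind with explicit bin edges is that values outside the first and last edge are not counted. The sketch below shows this on made-up values with np.histogram, which matplotlib's hist uses for the binning.

```python
import numpy as np

# Made-up readings; 63.0 and 84.0 lie outside the chosen edges
values = np.array([63.0, 66.0, 69.5, 71.0, 72.5, 74.0, 75.5, 80.0, 84.0])
edges = (65, 70, 73, 76, 82)

counts, _ = np.histogram(values, bins=edges)
print(counts)
print(counts.sum(), "of", len(values), "values fall inside the edges")

# With density=True the normalization uses only the counted values,
# so the bar areas still sum to one even though the bins have unequal widths
density, _ = np.histogram(values, bins=edges, density=True)
print(np.sum(density * np.diff(edges)))
```

So it is worth checking the minimum and maximum of your data before choosing the edges.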
Let’s plot a few more things on our histogram. For example, let’s plot the raw data as points on the x-axis. Here is how we can do that:
import numpy as np
fig, ax = make_full_width_fig()
ax.hist(clean_data['t_unit'], density=True, bins=(65, 70, 73, 76, 82))
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Density')
# Add a plot of points with x axis being the temperatures and the
# y axis being zeros
ax.plot(clean_data['t_unit'], np.zeros(clean_data.shape[0]), '.')
# Move the plotting range a bit to the negative so that we can see the points
ax.set_ylim(-0.005, 0.16)
This is nice. Let’s add some more information here. What about marking the average temperature with a dashed vertical line? Let’s do it!
average_unit_T = clean_data['t_unit'].mean()
# Same as before
fig, ax = make_full_width_fig()
ax.hist(clean_data['t_unit'], density=True, bins=(65, 70, 73, 76, 82))
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Density')
ax.plot(clean_data['t_unit'], np.zeros(clean_data.shape[0]), '.')
ax.set_ylim(-0.005, 0.16)
# But now I am adding a dashed black vertical line at the average
ax.axvline(average_unit_T, color='k', linestyle='--')
save_for_book(fig, 'ch4.fig11')
There is no limit to what you can plot with matplotlib!
Questions
Write some code to draw the histogram of the score.
# YOUR CODE HERE