Histograms

Histograms offer a nice way to summarize the uncertainty/variability in scalar variables. I am assuming that you have seen histograms in the past. They work as follows:

You split the interval in which your variable takes values into bins.
You count how many times the variable falls inside each bin.
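
The two steps above can be sketched in a few lines of NumPy. The temperatures below are made-up values used purely for illustration (they are not from the dataset), and the choice of four equal-width bins is also just an assumption:

```python
import numpy as np

# Hypothetical temperatures (F), made up for illustration only
values = np.array([66.0, 68.5, 70.2, 71.0, 71.3, 72.8, 73.1, 74.4, 75.0, 79.5])

# Step 1: split the interval [min, max] into 4 equal-width bins (5 edges)
edges = np.linspace(values.min(), values.max(), 5)

# Step 2: count how many values fall inside each bin.
# Bins are half-open [left, right), except the last one, which also
# includes its right edge (the same convention np.histogram uses).
counts = np.zeros(len(edges) - 1, dtype=int)
for i in range(len(counts)):
    left, right = edges[i], edges[i + 1]
    if i == len(counts) - 1:
        in_bin = (values >= left) & (values <= right)
    else:
        in_bin = (values >= left) & (values < right)
    counts[i] = in_bin.sum()

# np.histogram performs the same two steps in one call
np_counts, np_edges = np.histogram(values, bins=4)
print(counts, np_counts)
```

Every value lands in exactly one bin, so the counts add up to the number of observations.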

Let’s look at some examples. First, we set up our environment as usual, then download and clean the dataset we introduced in The Python Data Analysis Library:

MAKE_BOOK_FIGURES=False

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks")

def set_book_style():
    plt.style.use('seaborn-v0_8-white') 
    sns.set_style("ticks")
    sns.set_palette("deep")

    mpl.rcParams.update({
        # Font settings
        'font.family': 'serif',  # For academic publishing
        'font.size': 8,  # Base font size in points
        'axes.labelsize': 8,
        'axes.titlesize': 8,
        'xtick.labelsize': 7,  # Slightly smaller for better readability
        'ytick.labelsize': 7,
        'legend.fontsize': 7,
        
        # Line and marker settings for consistency
        'axes.linewidth': 0.5,
        'grid.linewidth': 0.5,
        'lines.linewidth': 1.0,
        'lines.markersize': 4,
        
        # Layout to prevent clipped labels
        'figure.constrained_layout.use': True,
        
        # Default DPI (will override when saving)
        'figure.dpi': 600,
        'savefig.dpi': 600,
        
        # Despine - remove top and right spines
        'axes.spines.top': False,
        'axes.spines.right': False,
        
        # Remove legend frame
        'legend.frameon': False,
        
        # Additional trim settings
        'figure.autolayout': True,  # Alternative to constrained_layout
        'savefig.bbox': 'tight',    # Trim when saving
        'savefig.pad_inches': 0.1   # Small padding to ensure nothing gets cut off
    })

def save_for_book(fig, filename, is_vector=True, **kwargs):
    """
    Save a figure with book-optimized settings.
    
    Parameters:
    -----------
    fig : matplotlib figure
        The figure to save
    filename : str
        Filename without extension
    is_vector : bool
        If True, saves as vector at 1000 dpi. If False, saves as raster at 600 dpi.
    **kwargs : dict
        Additional kwargs to pass to savefig
    """    
    # Set appropriate DPI and format based on figure type
    if is_vector:
        dpi = 1000
        ext = '.pdf'
    else:
        dpi = 600
        ext = '.tif'
    
    # Save the figure with book settings
    fig.savefig(f"{filename}{ext}", dpi=dpi, **kwargs)


def make_full_width_fig():
    return plt.subplots(figsize=(4.7, 2.9), constrained_layout=True)

def make_half_width_fig():
    return plt.subplots(figsize=(2.35, 1.45), constrained_layout=True)

if MAKE_BOOK_FIGURES:
    set_book_style()
make_full_width_fig = make_full_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()
make_half_width_fig = make_half_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()

!curl -O 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-239-intro-to-data-science/master/data/temp_price.csv'

import pandas as pd
temp_price = pd.read_csv('temp_price.csv')
clean_data = temp_price.dropna(axis=0).rename(columns={'Price per week': 'week_price',
                                                       'Price per day': 'daily_price'})
clean_data.head().round(2)

	household	date	score	t_out	t_unit	hvac	price	week_price	daily_price
0	a1	2019-01-06	85	38.6	71.58	35.11	0.17	6.08	0.87
1	a10	2019-01-06	70	38.6	73.29	63.95	0.17	11.07	1.58
2	a11	2019-01-06	61	38.6	74.25	147.61	0.17	25.54	3.65
3	a12	2019-01-06	65	38.6	73.71	74.39	0.17	12.87	1.84
4	a13	2019-01-06	66	38.6	73.55	173.10	0.17	29.95	4.28

Let’s plot the histogram of t_unit first:

fig, ax = make_full_width_fig()
ax.hist(clean_data['t_unit'])
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Counts')
save_for_book(fig, 'ch4.fig7')

../_images/18128269f2f2d4dc5e751136afb494b8e1b5b2bb5cdfc6b2bb4519dd87b40e97.svg

It is straightforward to read this. The height of each bar is the number of households whose internal temperature falls within the corresponding bin.

Sometimes, we want to normalize the height of the bars so that the total area covered by the histogram is one. To do this, you need to divide each count by the total number of observations and by the width of its bin. What we get is a density. We will see in later lectures that this is an approximation of the probability density of a random variable. To get the density, you need to use the keyword density=True in hist. Here is how:

fig, ax = make_full_width_fig()
ax.hist(clean_data['t_unit'], density=True)
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Density')
save_for_book(fig, 'ch4.fig8')

../_images/6d2edd5d76673bc5a75493e99d9138d24227cdf7ad76c15cf7cc8e1ab322125f.svg
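
To make the normalization concrete, here is a minimal sketch that computes the density by hand and compares it with what NumPy returns for density=True. The temperatures are hypothetical values made up for this illustration, not rows from the dataset:

```python
import numpy as np

# Hypothetical temperatures (F), made up for illustration only
values = np.array([66.0, 68.5, 70.2, 71.0, 71.3, 72.8, 73.1, 74.4, 75.0, 79.5])

# Raw counts and bin edges for 4 equal-width bins
counts, edges = np.histogram(values, bins=4)
widths = np.diff(edges)

# Density = count / (number of observations * bin width)
density = counts / (counts.sum() * widths)

# The bar areas (height * width) add up to one
total_area = np.sum(density * widths)

# np.histogram(..., density=True) applies the same normalization
np_density, _ = np.histogram(values, bins=4, density=True)
print(total_area, np.allclose(density, np_density))
```

Note that the bar heights themselves do not have to add up to one. It is the areas that do.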

You can also change the number of bins. The default is 10. Let’s make it 5.

fig, ax = make_full_width_fig()
ax.hist(clean_data['t_unit'], density=True, bins=5)
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Density')
save_for_book(fig, 'ch4.fig9')

../_images/19a2347ffb78345dd8c6c9074afa29d54d23d393f503101dfab9c2e84dd1f8e2.svg
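
If you are curious about what an integer number of bins translates to, the sketch below uses np.histogram_bin_edges to show the bin edges NumPy picks. The temperatures are again hypothetical values made up for illustration. I am assuming here that the bins should span the range from the smallest to the largest value, which is what NumPy does when you do not specify a range:

```python
import numpy as np

# Hypothetical temperatures (F), made up for illustration only
values = np.array([66.0, 68.5, 70.2, 71.0, 71.3, 72.8, 73.1, 74.4, 75.0, 79.5])

# Asking for 5 bins yields 6 equally spaced edges from the min to the max
edges_5 = np.histogram_bin_edges(values, bins=5)
# The default of 10 bins yields 11 edges over the same range
edges_10 = np.histogram_bin_edges(values, bins=10)

# Fewer bins means wider bins: width = (max - min) / number of bins
width_5 = np.diff(edges_5)
width_10 = np.diff(edges_10)
print(width_5[0], width_10[0])
```

Wider bins give a smoother but coarser picture. Narrower bins show more detail but each one holds fewer observations.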

Alternatively, you can specify the bins on your own. You just have to provide the bin edges. Let’s pick: \((65, 70, 73, 76, 82)\). Here we go:

fig, ax = make_full_width_fig()
ax.hist(clean_data['t_unit'], density=True, bins=(65, 70, 73, 76, 82))
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Density')
save_for_book(fig, 'ch4.fig10')

../_images/17966ce04bf0322aaaf0dc3671daf5e4f8fb2e265e389042cffc8ecbb22fd73d.svg
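
With hand-picked edges, the bins no longer have the same width, and values outside the outermost edges are not counted at all. Here is a small sketch of both effects using NumPy. The temperatures are hypothetical, and I deliberately added one value (85.0) that lies outside the edges:

```python
import numpy as np

# Hypothetical temperatures (F), made up for illustration only.
# The last value, 85.0, lies outside the bin edges chosen below.
values = np.array([66.0, 68.5, 70.2, 71.0, 71.3, 72.8, 73.1, 74.4, 75.0, 79.5, 85.0])
edges = np.array([65, 70, 73, 76, 82])

# Values outside [65, 82] are ignored, so one observation is dropped
counts, _ = np.histogram(values, bins=edges)

# With density=True, each count is divided by the number of counted
# observations and by the width of its own bin
density, _ = np.histogram(values, bins=edges, density=True)
widths = np.diff(edges)
total_area = np.sum(density * widths)
print(counts, widths, total_area)
```

Because each bar is divided by its own width, a wide bin with many observations can end up shorter than a narrow bin with fewer observations.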

Let’s plot a few more things on our histogram. For example, let’s plot the raw data as points on the x-axis. Here is how we can do that:

import numpy as np

fig, ax = make_full_width_fig()
ax.hist(clean_data['t_unit'], density=True, bins=(65, 70, 73, 76, 82))
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Density')
# Add a plot of points with x axis being the temperatures and the
# y axis being zeros
ax.plot(clean_data['t_unit'], np.zeros(clean_data.shape[0]), '.')
# Move the plotting range a bit to the negative so that we can see the points
ax.set_ylim(-0.005, 0.16)

(-0.005, 0.16)

../_images/965773e345fc4d984f9efd78d5d1318ff6c68bb6e5f5c1e61f8164abe62ccf19.svg

This is nice. Let’s add some more information here. What about marking the average temperature with a dashed vertical line? Let’s do it!

average_unit_T = clean_data['t_unit'].mean()

# Same as before
fig, ax = make_full_width_fig()
ax.hist(clean_data['t_unit'], density=True, bins=(65, 70, 73, 76, 82))
ax.set_xlabel('Unit temperature (F)')
ax.set_ylabel('Density')
ax.plot(clean_data['t_unit'], np.zeros(clean_data.shape[0]), '.')
ax.set_ylim(-0.005, 0.16)

# But now I am adding a dashed vertical line at the average
ax.axvline(average_unit_T, color='k', linestyle='--')

save_for_book(fig, 'ch4.fig11')

../_images/45a5169531072bc33b7d58ab102900e871c5bca38490795039a52f80660f897b.svg

There is no limit to what you can plot with matplotlib!

Questions

Write some code to draw the histogram of the score.
