Estimating probabilities from data - Bootstrapping#
You can use the same idea we used in simulations to estimate probabilities from experiments. So, if \(I\) is the background information and \(A\) is a logical proposition that is experimentally testable, then

\[
p(A|I) \approx \frac{\text{number of experiments in which } A \text{ is True}}{\text{number of experiments}}.
\]
There is a catch here: the experiments must be independent. This means that you should prepare any apparatus you are using in exactly the same way for every experiment, and that no experiment should affect any other in any way. Most of the experiments we run in a lab are independent. However, this assumption may be wrong for data collected in the wild.
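As a reminder of how this worked in simulations, here is a minimal sketch (the fair-die example is hypothetical and not part of the dataset below):

import numpy as np

# Simulate 10,000 independent rolls of a fair six-sided die
rolls = np.random.randint(1, 7, size=10_000)

# Estimate p(six|I) by the frequency with which a six appears
p_six = np.mean(rolls == 6)
print(f'p(six|I) ~= {p_six:1.3f}')  # should be close to 1/6 ~= 0.167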
Example - Estimating the probability of excessive energy use#
Let’s try this in practice using the high-performance building dataset. I’m importing the libraries and loading the data below.
MAKE_BOOK_FIGURES = False

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks")

def set_book_style():
    plt.style.use('seaborn-v0_8-white')
    sns.set_style("ticks")
    sns.set_palette("deep")
    mpl.rcParams.update({
        # Font settings
        'font.family': 'serif',  # For academic publishing
        'font.size': 8,  # 8pt base font
        'axes.labelsize': 8,
        'axes.titlesize': 8,
        'xtick.labelsize': 7,  # Slightly smaller for better readability
        'ytick.labelsize': 7,
        'legend.fontsize': 7,
        # Line and marker settings for consistency
        'axes.linewidth': 0.5,
        'grid.linewidth': 0.5,
        'lines.linewidth': 1.0,
        'lines.markersize': 4,
        # Layout to prevent clipped labels
        'figure.constrained_layout.use': True,
        # Default DPI (will override when saving)
        'figure.dpi': 600,
        'savefig.dpi': 600,
        # Despine - remove top and right spines
        'axes.spines.top': False,
        'axes.spines.right': False,
        # Remove legend frame
        'legend.frameon': False,
        # Additional trim settings
        'figure.autolayout': True,  # Alternative to constrained_layout
        'savefig.bbox': 'tight',  # Trim when saving
        'savefig.pad_inches': 0.1  # Small padding to ensure nothing gets cut off
    })

def save_for_book(fig, filename, is_vector=True, **kwargs):
    """
    Save a figure with book-optimized settings.

    Parameters:
    -----------
    fig : matplotlib figure
        The figure to save
    filename : str
        Filename without extension
    is_vector : bool
        If True, saves as vector at 1000 dpi. If False, saves as raster at 600 dpi.
    **kwargs : dict
        Additional kwargs to pass to savefig
    """
    # Set appropriate DPI and format based on figure type
    if is_vector:
        dpi = 1000
        ext = '.pdf'
    else:
        dpi = 600
        ext = '.tif'
    # Save the figure with book settings
    fig.savefig(f"{filename}{ext}", dpi=dpi, **kwargs)

def make_full_width_fig():
    return plt.subplots(figsize=(4.7, 2.9), constrained_layout=True)

def make_half_width_fig():
    return plt.subplots(figsize=(2.35, 1.45), constrained_layout=True)

if MAKE_BOOK_FIGURES:
    set_book_style()

make_full_width_fig = make_full_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()
make_half_width_fig = make_half_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()

# The url of the file we want to download
!curl -O 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-239-intro-to-data-science/master/data/temperature_raw.xlsx'

import numpy as np
import pandas as pd

# Load the data, drop rows with missing values, and parse the dates
df = pd.read_excel('temperature_raw.xlsx')
df = df.dropna(axis=0)
df.date = pd.to_datetime(df['date'], format='%Y-%m-%d')
The background information \(I\) is as follows:
A random household is picked on a random week during the heating season. The heating season is defined to be the time of the year during which the weekly average of the external temperature is less than 55 degrees F.
The logical proposition \(A\) is:
The weekly HVAC energy consumption of the household exceeds 300 kWh.
We start by selecting the subset of the data that pertains to the heating season.
df_heating = df[df['t_out'] < 55]
df_heating.round(2)
|      | household | date | score | t_out | t_unit | hvac |
|------|-----------|------|-------|-------|--------|------|
| 0    | a1  | 2018-01-07 | 100.0 | 4.28  | 66.69 | 246.47 |
| 1    | a10 | 2018-01-07 | 100.0 | 4.28  | 66.36 | 5.49   |
| 2    | a11 | 2018-01-07 | 58.0  | 4.28  | 71.55 | 402.09 |
| 3    | a12 | 2018-01-07 | 64.0  | 4.28  | 73.43 | 211.69 |
| 4    | a13 | 2018-01-07 | 100.0 | 4.28  | 63.92 | 0.85   |
| ...  | ... | ...        | ...   | ...   | ...   | ...    |
| 5643 | c44 | 2020-02-25 | 59.0  | 43.64 | 76.49 | 19.14  |
| 5644 | c45 | 2020-02-25 | 87.0  | 43.64 | 71.17 | 30.79  |
| 5646 | c47 | 2020-02-25 | 97.0  | 43.64 | 68.60 | 5.34   |
| 5647 | c48 | 2020-02-25 | 92.0  | 43.64 | 73.43 | 18.04  |
| 5649 | c50 | 2020-02-25 | 59.0  | 43.64 | 77.72 | 14.41  |

2741 rows × 6 columns
We have 2741 such measurements. Now we want to pick a random household on a random week. Unfortunately, in this dataset this is not exactly equivalent to picking a row at random, because there are some missing data. If we wanted to be very picky, we would have to restrict ourselves to the weeks during which we have data from all households. However, we won't be so picky. The result would not be far off from what we estimate by randomly picking rows.
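If you wanted to check how much the missing data matter, here is a minimal sketch of the picky approach (it assumes the df_heating data frame defined above):

# Count how many households report data on each week of the heating season
households_per_week = df_heating.groupby('date')['household'].nunique()
# Count the weeks during which every household reports data
n_households = df_heating['household'].nunique()
complete_weeks = (households_per_week == n_households).sum()
print(f'{complete_weeks} of {households_per_week.size} weeks have data from all households')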
Okay, so here is what we are going to do. We will pick \(N\) rows at random. For each one of these rows, we are going to test whether the logical proposition \(A\) is True. Finally, we are going to divide the number of times \(A\) is True by \(N\). Alright, let's do it.
# The number of rows to pick
N = 500
# Each row corresponds to an integer from 0 to 2740. Pick
# N such integers at random
rows = np.random.randint(0, df_heating.shape[0], size=N)
rows
array([2064, 1707, 2158, 2333, 364, 1753, 1175, 1040, 1799, 90, 64,
2542, 1198, 151, 2146, 2537, 1999, 2517, 2435, 2260, 158, 1368,
2687, 975, 1107, 240, 1830, 457, 2329, 2310, 1545, 1268, 1445,
362, 543, 429, 221, 1082, 594, 612, 2600, 1961, 2446, 2508,
254, 1148, 2590, 542, 28, 80, 1269, 1490, 573, 2029, 874,
2187, 115, 2084, 1988, 2157, 1173, 258, 2566, 479, 43, 1298,
1557, 1962, 2584, 2088, 221, 708, 160, 248, 1230, 1189, 2251,
1983, 2642, 921, 2224, 2327, 62, 1624, 2727, 1321, 1400, 622,
2706, 2614, 1587, 596, 2358, 300, 2233, 1770, 2483, 1724, 736,
1529, 2673, 2324, 529, 1096, 685, 1611, 1594, 1936, 1890, 2420,
1009, 358, 2567, 2683, 1362, 1695, 2632, 573, 1992, 1486, 806,
1834, 262, 805, 1368, 108, 2395, 1193, 33, 233, 942, 2090,
1022, 2164, 354, 929, 1400, 343, 1526, 1755, 1323, 389, 1507,
595, 2248, 63, 566, 746, 2600, 2499, 914, 854, 1834, 1247,
2332, 2308, 1171, 681, 1439, 2669, 1318, 1168, 364, 1326, 1740,
866, 1814, 1728, 2465, 1244, 2330, 1139, 1200, 2633, 1970, 2632,
649, 1680, 332, 2702, 308, 2602, 2612, 416, 1582, 1082, 2687,
327, 1738, 295, 754, 1499, 309, 47, 1496, 966, 2159, 1429,
1626, 318, 26, 2429, 1742, 2372, 2560, 127, 433, 1165, 2434,
1774, 2221, 957, 1105, 1411, 211, 454, 278, 1846, 304, 1300,
2001, 2008, 1359, 777, 2187, 1390, 974, 1443, 2149, 252, 1008,
1452, 2511, 2066, 1604, 276, 339, 1552, 355, 2490, 1894, 773,
493, 748, 2522, 1235, 2658, 2230, 1257, 1721, 1886, 939, 1198,
528, 1395, 1968, 753, 2441, 332, 401, 516, 63, 2119, 2460,
2449, 234, 969, 1716, 1993, 396, 1995, 2040, 2217, 1855, 223,
424, 114, 1799, 2271, 715, 1006, 1398, 985, 2337, 2095, 2215,
1115, 2067, 2714, 314, 145, 2177, 913, 1807, 1638, 1139, 474,
1651, 752, 942, 1137, 707, 868, 1745, 696, 1657, 439, 1440,
2317, 2304, 1890, 2193, 97, 56, 28, 688, 1480, 1176, 348,
1366, 2071, 742, 1363, 2083, 788, 1625, 2402, 2504, 1306, 652,
2672, 1223, 1170, 38, 2303, 2356, 2346, 1745, 1334, 2353, 2692,
2421, 666, 1523, 960, 138, 2120, 1971, 381, 437, 2005, 904,
333, 1217, 1569, 630, 454, 2133, 824, 197, 226, 2114, 1288,
1442, 1929, 2617, 353, 362, 2372, 2149, 1448, 79, 375, 1152,
1603, 1792, 2215, 2348, 1677, 2478, 2527, 748, 2192, 84, 1019,
704, 776, 2436, 1406, 2199, 490, 2703, 1595, 1823, 752, 2487,
1798, 2353, 2702, 1509, 1763, 2373, 93, 389, 1418, 722, 2286,
180, 1260, 1192, 2089, 247, 1249, 2166, 513, 2699, 2412, 2271,
1607, 2137, 1957, 619, 2118, 2260, 417, 554, 601, 414, 645,
690, 1284, 2245, 2541, 1106, 882, 12, 2156, 1263, 10, 2501,
1677, 2284, 1062, 725, 2035, 160, 1327, 1311, 531, 2616, 2647,
2258, 2712, 623, 1340, 105, 1094, 764, 2479, 1805, 2154, 111,
2587, 528, 1156, 424, 2527, 1957, 1707, 1716, 555, 1767, 1665,
2254, 2235, 410, 394, 2355, 1044, 462, 2669, 555, 1728, 1672,
2202, 1275, 634, 2156, 2229, 1403, 2225, 1620, 2659, 1464, 2342,
2420, 1376, 743, 2074, 2572])
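Note that np.random.randint samples with replacement, so the same row may appear more than once. This is exactly what bootstrapping calls for. If you prefer, an equivalent way to draw the sample is the following sketch (using the same N as above):

# Equivalent: draw N row positions uniformly at random, with replacement
rows = np.random.choice(df_heating.shape[0], size=N, replace=True)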
Now we need to pick the rows of df_heating that have those indices. We can do it like this:
df_heating_exp = df_heating.loc[df_heating.index[rows]]
df_heating_exp.round(2)
|      | household | date | score | t_out | t_unit | hvac |
|------|-----------|------|-------|-------|--------|------|
| 4805 | a14 | 2019-11-10 | 72.0 | 40.18 | 72.03 | 6.02   |
| 3109 | a3  | 2019-03-17 | 71.0 | 44.42 | 73.83 | 66.78  |
| 4917 | b18 | 2019-11-24 | 71.0 | 39.33 | 72.77 | 154.92 |
| 5134 | c35 | 2019-12-22 | 98.0 | 29.01 | 69.68 | 127.01 |
| 400  | a1  | 2018-03-04 | 90.0 | 45.47 | 71.80 | 24.55  |
| ...  | ... | ...        | ...  | ...   | ...   | ...    |
| 5244 | c45 | 2020-01-05 | 65.0 | 42.50 | 72.08 | 81.80  |
| 2742 | c43 | 2019-01-20 | 58.0 | 30.01 | 76.89 | 158.46 |
| 823  | b24 | 2018-04-29 | 96.0 | 53.95 | 74.44 | 51.82  |
| 4816 | b17 | 2019-11-10 | 86.0 | 40.18 | 69.42 | 185.45 |
| 5434 | c35 | 2020-02-02 | 97.0 | 33.66 | 69.86 | 89.17  |

500 rows × 6 columns
Now let's evaluate the logical proposition \(A\) for each one of these rows.
df_heating_exp_A = df_heating_exp['hvac'] > 300
df_heating_exp_A
4805 False
3109 False
4917 False
5134 False
400 False
...
5244 False
2742 False
823 False
4816 False
5434 False
Name: hvac, Length: 500, dtype: bool
Now we need to count the number of times the logical proposition was True. We can do this with pandas.Series.value_counts():
df_heating_exp_A_counts = df_heating_exp_A.value_counts()
df_heating_exp_A_counts
hvac
False 466
True 34
Name: count, dtype: int64
This returned both the True and the False counts. To get just the True count:
number_of_A_true = df_heating_exp_A_counts[True]
number_of_A_true
34
And now we can estimate the probability by dividing the number of times \(A\) was True by the number of randomly selected rows \(N\). We get this:
p_A_g_I = number_of_A_true / N
print(f'p(A|I) ~= {p_A_g_I:1.2f}')
p(A|I) ~= 0.07
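By the way, since True counts as 1 and False as 0 in arithmetic, you can get the same estimate in one step by averaging the boolean series. A minimal sketch, using the variables defined above:

# The mean of a boolean series is the fraction of True entries
p_A_g_I_alt = df_heating_exp_A.mean()
print(f'p(A|I) ~= {p_A_g_I_alt:1.2f}')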
Nice! This was easy. Now, you may ask: why didn't we use all the rows? We could have, and if we had a really big number of rows, we would get a very good estimate of the probability. But you never know if you have enough data. It is not like the simulated experiment, where we could do as many runs as we liked. Most of the time, we have a finite amount of data that is not good enough for an accurate estimate. In other words, there is a bit of epistemic uncertainty in our estimate of the probability. There is something we can do to estimate this epistemic uncertainty. Let me show you.
First, let's put everything we did above in a nice function:
def estimate_p_A_g_I(N, df_heating):
    """Estimate the probability of A given I by randomly picking N rows
    from the data frame df_heating.

    Arguments:
        N          - The number of rows to pick at random.
        df_heating - The data frame containing the heating data.

    Returns: The number of rows with A True divided by N.
    """
    rows = np.random.randint(0, df_heating.shape[0], size=N)
    df_heating_exp = df_heating.loc[df_heating.index[rows]]
    df_heating_exp_A = df_heating_exp['hvac'] > 300
    df_heating_exp_A_counts = df_heating_exp_A.value_counts()
    # Guard against the case in which A is never True in the sample
    number_of_A_true = df_heating_exp_A_counts.get(True, 0)
    p_A_g_I = number_of_A_true / N
    return p_A_g_I
Now we can call this function as many times as we want. Each time, we get an estimate of the probability of \(A\) given \(I\). The estimate is different every time because the rows are selected at random. Here it is 10 times:
for i in range(10):
    p_A_g_I = estimate_p_A_g_I(500, df_heating)
    print(f'{i+1:d} estimate of p(A|I): {p_A_g_I:1.3f}')
1 estimate of p(A|I): 0.092
2 estimate of p(A|I): 0.062
3 estimate of p(A|I): 0.058
4 estimate of p(A|I): 0.072
5 estimate of p(A|I): 0.062
6 estimate of p(A|I): 0.082
7 estimate of p(A|I): 0.056
8 estimate of p(A|I): 0.050
9 estimate of p(A|I): 0.064
10 estimate of p(A|I): 0.068
Alright, every time we get a different number. To get a sense of the epistemic uncertainty, we can repeat the estimate many times, say 1,000 times, and plot a histogram of our estimates. Like this:
# A place to store the estimates
p_A_g_Is = []
# The number of rows we sample every time
N = 500
# Put 1000 estimates in there
for i in range(1000):
    p_A_g_I = estimate_p_A_g_I(N, df_heating)
    p_A_g_Is.append(p_A_g_I)
# And now do the histogram
fig, ax = make_full_width_fig()
ax.hist(p_A_g_Is)
ax.set_xlabel('$p(A|I)$')
ax.set_ylabel('Count')
ax.set_title(f'Bootstrap estimate $N = {N:d}$');
save_for_book(fig, 'ch7.fig2')
Looking at this, we can say that \(p(A|I)\) is around 7%, but it could be as low as 5% or as high as 9%. This way of estimating the uncertainty of a statistic by resampling the data is called bootstrapping.
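If you prefer a numerical summary of this spread over eyeballing the histogram, here is a minimal sketch (it uses the p_A_g_Is list from above; the 95% interval is one common choice):

# Summarize the bootstrap estimates with the median and a 95% interval
low, median, high = np.quantile(p_A_g_Is, [0.025, 0.5, 0.975])
print(f'p(A|I) ~= {median:1.3f}, 95% interval: [{low:1.3f}, {high:1.3f}]')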
Questions#
Rerun the bootstrapping estimate of \(p(A|I)\) using \(N=100\) (very small). What happens to the range of possibilities?
Rerun the bootstrapping estimate of \(p(A|I)\) using \(N=1000\) (reasonable). Did the range of possibilities change compared to \(N=500\)?
Rerun the bootstrapping estimate of \(p(A|I)\) using \(N=2741\) (the maximum, and reasonable). What about now? How does the range look now? Why didn't the uncertainty collapse completely? Hint: Think about how we sample the rows of the data frame. Is there a chance that we miss some?
What if you go to \(N=5000\)?
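If you want to avoid copy-pasting code while answering these questions, here is a minimal helper sketch (bootstrap_and_plot is a hypothetical name; it assumes the estimate_p_A_g_I function and the df_heating data frame from above):

def bootstrap_and_plot(N, df_heating, num_resamples=1000):
    """Repeat the bootstrap estimate num_resamples times and plot a histogram."""
    estimates = [estimate_p_A_g_I(N, df_heating) for _ in range(num_resamples)]
    fig, ax = plt.subplots()
    ax.hist(estimates)
    ax.set_xlabel('$p(A|I)$')
    ax.set_ylabel('Count')
    ax.set_title(f'Bootstrap estimate $N = {N:d}$')
    return estimates

# For example, for the first question:
estimates_100 = bootstrap_and_plot(100, df_heating)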