Diagnostics for Classification#
Let’s repeat what we did for the HMX example, but after splitting the dataset into training and test subsets. We will be making predictions on the test subset.
MAKE_BOOK_FIGURES = False

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks")

def set_book_style():
    plt.style.use('seaborn-v0_8-white')
    sns.set_style("ticks")
    sns.set_palette("deep")
    mpl.rcParams.update({
        # Font settings
        'font.family': 'serif',  # For academic publishing
        'font.size': 8,          # Base font size (8pt)
        'axes.labelsize': 8,
        'axes.titlesize': 8,
        'xtick.labelsize': 7,    # Slightly smaller for better readability
        'ytick.labelsize': 7,
        'legend.fontsize': 7,
        # Line and marker settings for consistency
        'axes.linewidth': 0.5,
        'grid.linewidth': 0.5,
        'lines.linewidth': 1.0,
        'lines.markersize': 4,
        # Layout to prevent clipped labels
        'figure.constrained_layout.use': True,
        # Default DPI (will override when saving)
        'figure.dpi': 600,
        'savefig.dpi': 600,
        # Despine - remove top and right spines
        'axes.spines.top': False,
        'axes.spines.right': False,
        # Remove legend frame
        'legend.frameon': False,
        # Additional trim settings
        'figure.autolayout': True,  # Alternative to constrained_layout
        'savefig.bbox': 'tight',    # Trim when saving
        'savefig.pad_inches': 0.1   # Small padding to ensure nothing gets cut off
    })

def save_for_book(fig, filename, is_vector=True, **kwargs):
    """
    Save a figure with book-optimized settings.

    Parameters
    ----------
    fig : matplotlib figure
        The figure to save
    filename : str
        Filename without extension
    is_vector : bool
        If True, saves as vector at 1000 dpi. If False, saves as raster at 600 dpi.
    **kwargs : dict
        Additional kwargs to pass to savefig
    """
    # Set appropriate DPI and format based on figure type
    if is_vector:
        dpi = 1000
        ext = '.pdf'
    else:
        dpi = 600
        ext = '.tif'
    # Save the figure with book settings
    fig.savefig(f"{filename}{ext}", dpi=dpi, **kwargs)

def make_full_width_fig():
    return plt.subplots(figsize=(4.7, 2.9), constrained_layout=True)

def make_half_width_fig():
    return plt.subplots(figsize=(2.35, 1.45), constrained_layout=True)

if MAKE_BOOK_FIGURES:
    set_book_style()

make_full_width_fig = make_full_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()
make_half_width_fig = make_half_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()

import numpy as np
import scipy.stats as st
# Download the data file:
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-239-intro-to-data-science/master/data/hmx_data.csv'
!curl -O $url
# Load the data using pandas
import pandas as pd
data = pd.read_csv('hmx_data.csv')
# Extract data for regression
# Heights as a numpy array
x = data['Height'].values
# The labels must be 0 and 1
# We will use a dictionary to indicate our labeling
label_coding = {'E': 1, 'N': 0}
y = np.array([label_coding[r] for r in data['Result']])
data['y'] = y
data.head()
|   | Height | Result | y |
|---|--------|--------|---|
| 0 | 40.5   | E      | 1 |
| 1 | 40.5   | E      | 1 |
| 2 | 40.5   | E      | 1 |
| 3 | 40.5   | E      | 1 |
| 4 | 40.5   | E      | 1 |
Separate the data into training and test subsets:

from sklearn.model_selection import train_test_split

# Randomly split into training (75%) and test (25%) subsets (scikit-learn defaults)
x_train, x_test, y_train, y_test = train_test_split(x, y)
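Note that train_test_split shuffles randomly, so every run produces a different split. If you need a reproducible split, you can fix the seed; a minimal sketch (the arguments just make the scikit-learn defaults explicit, and the seed value is arbitrary):

# Same split sizes as above, but reproducible across runs
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)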
Let me visualize these as well:
fig, ax = plt.subplots()
ax.plot(x_train, y_train, 'x', label='Training data')
ax.plot(x_test, y_test, 'o', label='Test data')
ax.set_xlabel('Height')
ax.set_ylabel('y')
plt.legend(loc='best');
Let’s do the logistic regression as before:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
poly = PolynomialFeatures(2)
Phi_train = poly.fit_transform(x_train[:, None])
model = LogisticRegression().fit(Phi_train, y_train)
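If you are curious about the fitted weights \(w\), scikit-learn exposes them as attributes of the trained model. A quick peek (note that the first polynomial feature is the constant term, so the intercept is partly redundant here):

# Weights of the polynomial features and the intercept
print('w =', model.coef_)
print('b =', model.intercept_)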
Okay, the model is trained. Let’s now make predictions on the test set:
# Use transform (not fit_transform): the features were already fit on the training data
Phi_test = poly.transform(x_test[:, None])
predictions = model.predict_proba(Phi_test)
Notice that our predictions are probabilistic.
Specifically, predictions
has two columns.
The first column is the probability that \(y=0\) (no explosion) and the second column is the probability that \(y=1\) (explosion).
Here it is:
predictions
array([[0.97248367, 0.02751633],
[0.97248367, 0.02751633],
[0.9163132 , 0.0836868 ],
[0.12492673, 0.87507327],
[0.12492673, 0.87507327],
[0.76695308, 0.23304692],
[0.43793608, 0.56206392],
[0.97248367, 0.02751633],
[0.97248367, 0.02751633],
[0.43793608, 0.56206392],
[0.76695308, 0.23304692],
[0.01945273, 0.98054727],
[0.01945273, 0.98054727],
[0.43793608, 0.56206392],
[0.43793608, 0.56206392]])
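Each row is a probability distribution over the two labels, so the two columns of every row must sum to one. A quick sanity check:

# The two class probabilities in each row must sum to one
assert np.allclose(predictions.sum(axis=1), 1.0)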
One of the easiest visualizations you can do is the following. In the same plot, put a point at either 0 or 1 depending on the actual observation, and draw a bar for the probability that \(y=1\). Here it is:
fig, ax = make_full_width_fig()
ax.bar(np.arange(y_test.shape[0]), predictions[:, 1], color='g', alpha=0.5, label='$p(y=1|x,w)$')
ax.plot(y_test, 'o', label='Observed test point')
ax.set_xlabel('Test point id')
plt.legend(loc='best')
save_for_book(fig, 'ch16.fig4')
So, in this plot, the closer the bar is to the observation, the better the prediction.
Now, sometimes you may want to make pointwise predictions. That is, you look at the probability reported by the model and you have to pick either 0 or 1. There is a whole theory of how to do this properly: the theory of decision-making. If you want to learn more about it, you can take a course like ME 597 “Decision-making in Engineering Systems”. Here, we are going to do something very simple. We will just pick the label that has the maximum probability. Here is how:
y_pred = np.argmax(predictions, axis=1)
print(y_pred)
[0 0 0 1 1 0 1 0 0 1 0 1 1 1 1]
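For a two-class problem, taking the argmax is the same as thresholding the probability of explosion at 0.5, which is also what scikit-learn's predict method does. A quick check of the equivalence:

# predict applies the 0.5 threshold internally
assert np.all(model.predict(Phi_test) == y_pred)
# Equivalent manual thresholding of the second column
assert np.all((predictions[:, 1] > 0.5).astype(int) == y_pred)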
Let’s repeat the figure above, but now add this point prediction as a cross.
fig, ax = make_full_width_fig()
ax.bar(np.arange(y_test.shape[0]), predictions[:, 1], color='g', alpha=0.5, label='$p(y=1|x,w)$')
ax.plot(y_test, 'o', label='Observed test point')
ax.plot(y_pred, 'x', label='Point prediction')
ax.set_xlabel('Test point id')
plt.legend(loc='best')
save_for_book(fig, 'ch16.fig5')
Nice. So, the model is pretty good. Most of the time we make the correct prediction. You can summarize how good the predictions are using the accuracy score. The accuracy score is:

\[
\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}.
\]

Here is how we can calculate it:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
print(f'HMX Accuracy = {acc * 100:1.2f} %')
HMX Accuracy = 73.33 %
Not bad.
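Since accuracy is just the fraction of matching labels, you can also compute it directly with NumPy:

# Fraction of test points where the point prediction matches the observation
print(f'Manual accuracy = {np.mean(y_pred == y_test) * 100:1.2f} %')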
Note
The accuracy score can be very misleading. The problem arises when your dataset is not well-balanced. As an extreme example, imagine that we had a dataset in which explosions are very rare, say only 1%. Then a stupid model that predicts that there will never be an explosion would have 99% accuracy. This is clearly ridiculous… The situation is remedied by the concept of balanced accuracy.
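scikit-learn implements this as balanced_accuracy_score, which averages the recall obtained on each class so that rare classes count as much as common ones:

from sklearn.metrics import balanced_accuracy_score
bacc = balanced_accuracy_score(y_test, y_pred)
print(f'HMX Balanced accuracy = {bacc * 100:1.2f} %')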
Finally, a nice way to visualize the predictions of your model is the confusion matrix. What it shows will be self-evident once I plot it for you. Here you go:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
fig, ax = make_full_width_fig()
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(ax=ax)
save_for_book(fig, 'ch16.fig6')
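For a binary problem, the four entries of cm have standard names and can be unpacked directly; a small sketch (the variable names are mine):

# Rows of cm are true labels (0, 1); columns are predicted labels (0, 1)
tn, fp, fn, tp = cm.ravel()
print(f'True negatives: {tn}, False positives: {fp}')
print(f'False negatives: {fn}, True positives: {tp}')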
Questions#
Repeat the analysis above with a higher degree polynomial, say 5. Is the result better or worse? Why?