Diagnostics for Classification#
Let’s repeat what we did for the HMX example, but after splitting the dataset into training and test subsets. We will be making predictions on the test subset.
MAKE_BOOK_FIGURES = False

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks")

def set_book_style():
    plt.style.use('seaborn-v0_8-white')
    sns.set_style("ticks")
    sns.set_palette("deep")
    mpl.rcParams.update({
        # Font settings
        'font.family': 'serif',  # For academic publishing
        'font.size': 8,          # Base font size (8pt)
        'axes.labelsize': 8,
        'axes.titlesize': 8,
        'xtick.labelsize': 7,    # Slightly smaller for better readability
        'ytick.labelsize': 7,
        'legend.fontsize': 7,
        # Line and marker settings for consistency
        'axes.linewidth': 0.5,
        'grid.linewidth': 0.5,
        'lines.linewidth': 1.0,
        'lines.markersize': 4,
        # Layout to prevent clipped labels
        'figure.constrained_layout.use': True,
        # Default DPI (will override when saving)
        'figure.dpi': 600,
        'savefig.dpi': 600,
        # Despine - remove top and right spines
        'axes.spines.top': False,
        'axes.spines.right': False,
        # Remove legend frame
        'legend.frameon': False,
        # Additional trim settings
        'figure.autolayout': True,  # Alternative to constrained_layout
        'savefig.bbox': 'tight',    # Trim when saving
        'savefig.pad_inches': 0.1   # Small padding to ensure nothing gets cut off
    })

def save_for_book(fig, filename, is_vector=True, **kwargs):
    """
    Save a figure with book-optimized settings.

    Parameters
    ----------
    fig : matplotlib figure
        The figure to save
    filename : str
        Filename without extension
    is_vector : bool
        If True, saves as vector at 1000 dpi. If False, saves as raster at 600 dpi.
    **kwargs : dict
        Additional kwargs to pass to savefig
    """
    # Set appropriate DPI and format based on figure type
    if is_vector:
        dpi = 1000
        ext = '.pdf'
    else:
        dpi = 600
        ext = '.tif'
    # Save the figure with book settings
    fig.savefig(f"{filename}{ext}", dpi=dpi, **kwargs)

def make_full_width_fig():
    return plt.subplots(figsize=(4.7, 2.9), constrained_layout=True)

def make_half_width_fig():
    return plt.subplots(figsize=(2.35, 1.45), constrained_layout=True)

if MAKE_BOOK_FIGURES:
    set_book_style()

make_full_width_fig = make_full_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()
make_half_width_fig = make_half_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()

import numpy as np
import scipy.stats as st
# Download the data file:
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-239-intro-to-data-science/master/data/hmx_data.csv'
!curl -O $url
# Load the data using pandas
import pandas as pd
data = pd.read_csv('hmx_data.csv')
# Extract data for regression
# Heights as a numpy array
x = data['Height'].values
# The labels must be 0 and 1
# We will use a dictionary to indicate our labeling
label_coding = {'E': 1, 'N': 0}
y = np.array([label_coding[r] for r in data['Result']])
data['y'] = y
data.head()
|   | Height | Result | y |
|---|--------|--------|---|
| 0 | 40.5   | E      | 1 |
| 1 | 40.5   | E      | 1 |
| 2 | 40.5   | E      | 1 |
| 3 | 40.5   | E      | 1 |
| 4 | 40.5   | E      | 1 |
Separate the data into training and test subsets:

from sklearn.model_selection import train_test_split

# Randomly split into training (75%) and test (25%) subsets (scikit-learn defaults)
x_train, x_test, y_train, y_test = train_test_split(x, y)
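Note that train_test_split shuffles randomly, so every run produces a different split. If you need a reproducible split, you can fix the seed; a minimal sketch (the arguments just make the scikit-learn defaults explicit, and the seed value is arbitrary):

# Same split sizes as above, but reproducible across runs
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)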
Let me visualize these as well:
fig, ax = plt.subplots()
ax.plot(x_train, y_train, 'x', label='Training data')
ax.plot(x_test, y_test, 'o', label='Test data')
ax.set_xlabel('Height')
ax.set_ylabel('y')
plt.legend(loc='best');
Let’s do the logistic regression as before:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
poly = PolynomialFeatures(2)
Phi_train = poly.fit_transform(x_train[:, None])
model = LogisticRegression().fit(Phi_train, y_train)
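If you are curious about the fitted weights \(w\), scikit-learn exposes them as attributes of the trained model. A quick peek (note that the first polynomial feature is the constant term, so the intercept is partly redundant here):

# Weights of the polynomial features and the intercept
print('w =', model.coef_)
print('b =', model.intercept_)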
Okay, the model is trained. Let’s now make predictions on the test set:
# Use transform (not fit_transform): the features were already fit on the training data
Phi_test = poly.transform(x_test[:, None])
predictions = model.predict_proba(Phi_test)
Notice that our predictions are probabilistic.
Specifically, predictions
has two columns.
The first column is the probability that \(y=0\) (no explosion) and the second column is the probability that \(y=1\) (explosion).
Here it is:
predictions
array([[0.97248367, 0.02751633],
[0.97248367, 0.02751633],
[0.9163132 , 0.0836868 ],
[0.12492673, 0.87507327],
[0.12492673, 0.87507327],
[0.76695308, 0.23304692],
[0.43793608, 0.56206392],
[0.97248367, 0.02751633],
[0.97248367, 0.02751633],
[0.43793608, 0.56206392],
[0.76695308, 0.23304692],
[0.01945273, 0.98054727],
[0.01945273, 0.98054727],
[0.43793608, 0.56206392],
[0.43793608, 0.56206392]])
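Each row is a probability distribution over the two labels, so the two columns of every row must sum to one. A quick sanity check:

# The two class probabilities in each row must sum to one
assert np.allclose(predictions.sum(axis=1), 1.0)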
One of the easiest visualizations you can do is the following. In the same plot, put a point at either 0 or 1 depending on the actual observation, and draw a bar for the probability that \(y=1\). Here it is:
fig, ax = make_full_width_fig()
ax.bar(np.arange(y_test.shape[0]), predictions[:, 1], color='g', alpha=0.5, label='$p(y=1|x,w)$')
ax.plot(y_test, 'o', label='Observed test point')
ax.set_xlabel('Test point id')
plt.legend(loc='best')
save_for_book(fig, 'ch16.fig4')
So, in this plot, the closer the bar is to the observation, the better the prediction.
Now, sometimes you may want to make pointwise predictions. That is, you look at the probability reported by the model and you have to pick either 0 or 1. There is a whole theory of how to do this properly: the theory of decision-making. If you want to learn more about it, you can take a course like ME 597 “Decision-making in Engineering Systems”. Here, we are going to do something very simple. We will just pick the label that has the maximum probability. Here is how:
y_pred = np.argmax(predictions, axis=1)
print(y_pred)
[0 0 0 1 1 0 1 0 0 1 0 1 1 1 1]
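For a two-class problem, taking the argmax is the same as thresholding the probability of explosion at 0.5, which is also what scikit-learn's predict method does. A quick check of the equivalence:

# predict applies the 0.5 threshold internally
assert np.all(model.predict(Phi_test) == y_pred)
# Equivalent manual thresholding of the second column
assert np.all((predictions[:, 1] > 0.5).astype(int) == y_pred)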
Let’s repeat the figure above, but now add this point prediction as a cross.
fig, ax = make_full_width_fig()
ax.bar(np.arange(y_test.shape[0]), predictions[:, 1], color='g', alpha=0.5, label='$p(y=1|x,w)$')
ax.plot(y_test, 'o', label='Observed test point')
ax.plot(y_pred, 'x', label='Point prediction')
ax.set_xlabel('Test point id')
plt.legend(loc='best')
save_for_book(fig, 'ch16.fig5')
Nice. So, the model is pretty good. Most of the time we make the correct prediction. You can summarize how good the predictions are using the accuracy score. The accuracy score is:

\[
\text{accuracy} = \frac{\text{number of correct predictions}}{\text{total number of predictions}}.
\]

Here is how we can calculate it:
from sklearn.metrics import accuracy_score
acc = accuracy_score(y_test, y_pred)
print(f'HMX Accuracy = {acc * 100:1.2f} %')
HMX Accuracy = 73.33 %
Not bad.
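Since accuracy is just the fraction of matching labels, you can also compute it directly with NumPy:

# Fraction of test points where the point prediction matches the observation
print(f'Manual accuracy = {np.mean(y_pred == y_test) * 100:1.2f} %')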
Note
The accuracy score can be very misleading. The problem arises when your dataset is not well-balanced. As an extreme example, imagine that we had a dataset in which explosions are very rare, say only 1%. Then a stupid model that predicts that there will never be an explosion would have 99% accuracy. This is clearly ridiculous… The situation is remedied by the concept of balanced accuracy.
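scikit-learn implements this as balanced_accuracy_score, which averages the recall obtained on each class so that rare classes count as much as common ones:

from sklearn.metrics import balanced_accuracy_score
bacc = balanced_accuracy_score(y_test, y_pred)
print(f'HMX Balanced accuracy = {bacc * 100:1.2f} %')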
Finally, a nice way to visualize the predictions of your model is the confusion matrix. What it shows will be self-evident once I plot it for you. Here you go:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
fig, ax = make_full_width_fig()
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot(ax=ax)
save_for_book(fig, 'ch16.fig6')
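For a binary problem, the four entries of cm have standard names and can be unpacked directly; a small sketch (the variable names are mine):

# Rows of cm are true labels (0, 1); columns are predicted labels (0, 1)
tn, fp, fn, tp = cm.ravel()
print(f'True negatives: {tn}, False positives: {fp}')
print(f'False negatives: {fn}, True positives: {tp}')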
Questions#
Repeat the analysis above with a higher degree polynomial, say 5. Is the result better or worse? Why?