Predictive checking

We saw how to build models from data using the maximum likelihood principle. How do we know that these models are any good? This is a very tough question to answer and, to some extent, it is unanswerable. It is much easier to tell when your model is wrong. Predictive checking is a way to do exactly that.

Assume that we have built a model using some data, say \(x_{1:N}\). Now, assume that you do the experiment again under the same conditions. What data does your model tell you that you would observe? The idea of predictive checking is to:

  • use your fitted model to replicate the experimental data, and

  • compare their characteristics to the real data.

You may reject a model that performs very poorly under predictive checking. However, you cannot accept a model just because it performs well; there are more advanced methods for that. Predictive checking is good for identifying bugs in your code and for coming up with ideas for extending the model so that it better matches the data.
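
To make the workflow concrete, here is a minimal sketch of the replication step, assuming the fitted model is a frozen scipy.stats distribution. The function name is made up for illustration; we spell everything out for the coin-toss example below.

import scipy.stats as st

def replicate_experiment(fitted_model, n_obs, n_rep):
    """Draw n_rep replicated datasets from a fitted (frozen) scipy.stats
    distribution, each with n_obs observations (one replication per row)."""
    return fitted_model.rvs(size=(n_rep, n_obs))

# For example, nine replications of fifty tosses of a coin with
# heads-probability 0.6: replicate_experiment(st.bernoulli(0.6), 50, 9)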

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(rc={"figure.dpi":100, 'savefig.dpi':300})
sns.set_context('notebook')
sns.set_style("ticks")
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina', 'svg')
import numpy as np
import scipy.stats as st

Visual inspections of replicated data

The idea here is to simply sample \(x^{\text{rep}}_{1:N}\) from your fitted model and compare it visually to \(x_{1:N}\). Let’s see this on the coin toss example.

Consider \(N\) coin-tosses with probability of heads (\(1\)) \(\theta\), i.e.,

\[ X_{i}|\theta \sim \text{Bernoulli}(\theta). \]

The maximum likelihood estimate of \(\theta\) is:

\[ \hat{\theta} = \frac{1}{N}\sum_{i=1}^N x_i. \]
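
For completeness, here is the standard derivation of this estimate. The log-likelihood of the data is

\[ \log p(x_{1:N}|\theta) = \sum_{i=1}^N\left[x_i\log\theta + (1-x_i)\log(1-\theta)\right], \]

and setting its derivative with respect to \(\theta\) to zero,

\[ \frac{d}{d\theta}\log p(x_{1:N}|\theta) = \frac{\sum_{i=1}^N x_i}{\theta} - \frac{N-\sum_{i=1}^N x_i}{1-\theta} = 0, \]

yields \(\hat{\theta} = \frac{1}{N}\sum_{i=1}^N x_i\), i.e., the observed frequency of heads.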

Now, here is how we can replicate the experiment using our fitted model:

  • Sample \(x^{\text{rep}}_{1:N}\) independently from a Bernoulli with parameter \(\hat{\theta}\) (this is one replication of the experiment).

  • Repeat as many times as you want. Each time you get a new replication of the experimental data.

Okay, let’s start with a dataset that we know for certain is compatible with the model. Generate the ground-truth data from a fair coin:

theta_true = 0.5
X = st.bernoulli(theta_true)
N = 50
xdata = X.rvs(size=N)
xdata
array([1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0,
       0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 1, 0])

Now use MLE for \(\theta\):

hat_theta = xdata.mean()
print('hat theta = {0:1.2f}'.format(hat_theta))
hat theta = 0.42

Now that we have fitted the model, let’s draw many replicated data \(x^{\text{rep}}_{1:N}\) from it and compare them visually to the original dataset.

# How many replications you want to do:
N_rep = 9
# An array to store the replicated experiments;
# each row is one replication of the experimental data
x_rep = np.ndarray((N_rep, N))
for i in range(N_rep):
    x_rep[i, :] = st.bernoulli(hat_theta).rvs(size=N)
x_rep
array([[0., 1., 1., 0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 0.,
        0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 1., 1., 1.,
        1., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1., 1., 1., 1., 0.,
        0., 0.],
       [0., 1., 1., 0., 0., 1., 0., 0., 1., 0., 1., 0., 1., 1., 0., 1.,
        1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 1.,
        0., 0., 0., 1., 0., 0., 1., 1., 1., 1., 0., 0., 0., 0., 0., 1.,
        1., 0.],
       [0., 0., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0.,
        1., 0., 0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 1., 1.,
        0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 0., 0., 0., 1.,
        0., 1.],
       [1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 1.,
        0., 0., 1., 1., 1., 1., 1., 1., 0., 0., 1., 0., 0., 1., 0., 0.,
        0., 1., 1., 1., 1., 0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 1.,
        1., 1.],
       [0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 0., 1., 0., 0., 1., 0.,
        1., 0., 0., 0., 0., 0., 0., 1., 1., 0., 1., 1., 0., 1., 1., 0.,
        1., 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
        0., 0.],
       [1., 0., 0., 1., 0., 0., 0., 1., 1., 0., 0., 0., 0., 0., 0., 0.,
        1., 0., 1., 0., 1., 1., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0.,
        0., 1., 1., 0., 0., 1., 1., 0., 1., 1., 1., 1., 1., 0., 0., 0.,
        0., 1.],
       [0., 0., 0., 1., 0., 1., 0., 1., 0., 1., 0., 1., 1., 0., 1., 1.,
        1., 0., 1., 1., 1., 0., 1., 1., 0., 1., 1., 0., 1., 0., 0., 0.,
        0., 0., 0., 0., 0., 1., 1., 0., 0., 1., 1., 0., 1., 1., 0., 0.,
        0., 0.],
       [0., 1., 1., 0., 0., 1., 1., 0., 0., 0., 1., 0., 1., 1., 1., 0.,
        1., 0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 1.,
        1., 1., 0., 0., 0., 1., 0., 1., 0., 1., 1., 0., 0., 0., 1., 1.,
        0., 0.],
       [0., 1., 0., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 0., 1.,
        1., 0., 1., 1., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0., 1., 0.,
        0., 1., 0., 1., 1., 1., 0., 1., 1., 0., 0., 1., 1., 1., 1., 0.,
        1., 0.]])
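
As a side note, the loop above is not necessary: a single call to rvs can draw all the replications at once. A minimal equivalent alternative (differing only in the random draws):

# Equivalent vectorized replication: each row is one replicated dataset
x_rep_alt = st.bernoulli(hat_theta).rvs(size=(N_rep, N))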

We can visualize the data as images. The first row of pixels is the observed data; the remaining rows are the replications.

fig, ax = plt.subplots()
ax.imshow(np.vstack([xdata, x_rep]));
../_images/predictive-checking_9_0.svg

This visual inspection does not reveal anything strange. Let’s now repeat this exercise with a dataset that is completely artificial and does not match the model. As a matter of fact, I picked the dataset by hand, simply trying to keep the number of 0’s and 1’s roughly equal. Can predictive checking reveal that this dataset was not generated by actual random coin tosses? Let’s see…

Here is the data I came up with:

xdata_2 = np.array([0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

The maximum likelihood estimate for this new coin is:

hat_theta2 = xdata_2.mean()
print('hat theta 2 = {0:1.2f}'.format(hat_theta2))
hat theta 2 = 0.50

Let’s replicate the experiment 9 times:

x_rep_2 = np.ndarray((N_rep, N))
for i in range(N_rep):
    x_rep_2[i, :] = st.bernoulli(hat_theta2).rvs(size=N)
x_rep_2
array([[1., 1., 0., 0., 1., 0., 1., 1., 0., 1., 1., 1., 0., 0., 1., 1.,
        0., 1., 1., 1., 0., 1., 0., 0., 1., 0., 0., 1., 0., 1., 0., 0.,
        0., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0., 0., 0.,
        1., 0.],
       [1., 1., 1., 0., 1., 0., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1.,
        1., 0., 1., 0., 1., 1., 1., 1., 1., 1., 1., 1., 1., 0., 0., 1.,
        1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 1., 0., 1.,
        1., 1.],
       [0., 1., 0., 0., 1., 1., 0., 0., 1., 1., 0., 1., 0., 0., 1., 0.,
        0., 0., 1., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 1., 0., 1.,
        1., 0., 1., 1., 0., 1., 1., 1., 0., 1., 1., 0., 1., 1., 1., 1.,
        1., 1.],
       [0., 1., 0., 0., 1., 0., 0., 0., 1., 1., 1., 1., 1., 0., 1., 0.,
        1., 1., 1., 0., 0., 0., 1., 1., 1., 1., 1., 0., 1., 1., 1., 0.,
        0., 0., 1., 1., 1., 1., 0., 0., 1., 0., 1., 0., 0., 0., 0., 1.,
        0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 1., 0., 0.,
        1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0., 0.,
        0., 0., 1., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1., 1.,
        0., 1.],
       [0., 1., 0., 1., 1., 1., 0., 1., 0., 0., 0., 0., 1., 1., 1., 0.,
        0., 1., 0., 0., 1., 1., 0., 1., 1., 1., 0., 1., 1., 1., 1., 1.,
        0., 0., 0., 0., 0., 1., 0., 1., 1., 0., 1., 0., 1., 1., 0., 0.,
        1., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1.,
        0., 1., 1., 0., 1., 1., 0., 1., 0., 0., 1., 0., 0., 0., 0., 1.,
        0., 0., 1., 0., 0., 0., 1., 1., 1., 1., 0., 1., 0., 1., 1., 0.,
        1., 1.],
       [0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 0., 0., 1., 0., 1.,
        0., 1., 0., 1., 1., 1., 1., 0., 1., 0., 1., 0., 1., 0., 1., 1.,
        1., 1., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 1.,
        1., 0.],
       [0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 1., 1., 0., 0., 1.,
        0., 0., 0., 0., 0., 1., 1., 1., 0., 1., 0., 0., 1., 1., 0., 0.,
        1., 0., 0., 0., 0., 0., 0., 0., 1., 1., 1., 1., 1., 1., 1., 0.,
        0., 0.]])

And let’s visualize the results:

fig, ax = plt.subplots()
ax.imshow(np.vstack([xdata_2, x_rep_2]));
../_images/predictive-checking_17_0.svg

Does the first row (original data) look similar to the rest of the rows (replicated experiments)? If you pay close attention, you will notice that the data I picked by hand have more transitions between heads and tails than the replicated data. In other words, the replicated data tend to have longer consecutive runs of either heads or tails. How can we see this more clearly?

Test quantities and visual inspections

We use test quantities to characterize the discrepancy between the model and the data. In other words, test quantities help us zoom into the characteristics of the data that are of particular interest to us. Mathematically, a test quantity is a scalar function of the data (and, more generally, of the model parameters), \(T(x_{1:N})\). There are some general recipes for creating test quantities for regression and classification. In general, however, you must use common sense in selecting them: what are the important characteristics of the data that your model should be capturing? We will see specific examples below. Now, assume that you have selected one or more test quantities. What do you do with them? The easiest thing to do is to plot a histogram of the test quantity over the replicated data and compare it to the observed value \(T(x_{1:N})\). In such a plot you are trying to see how likely or unlikely the observed test quantity is under the assumption that your model is correct.
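
To make this recipe concrete, here is a minimal helper that performs the comparison for an arbitrary test quantity T. The function name is just for illustration; below we spell out the same steps inline for each example.

def predictive_check_plot(T, x_obs, x_rep, title=''):
    """Plot the histogram of the test quantity T over replicated datasets
    (one replication per row of x_rep) and mark the observed value."""
    T_obs = T(x_obs)
    T_rep = np.array([T(x_rep[i, :]) for i in range(x_rep.shape[0])])
    fig, ax = plt.subplots()
    ax.hist(T_rep, density=True, alpha=0.25, label='Replicated test quantity')
    ax.axvline(T_obs, color='k', label='Observed test quantity')
    ax.set_title(title)
    ax.legend(loc='best')
    return fig, ax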

What are some good test quantities for the coin-toss example? An obvious one is the number of heads, which is a function of the data only:

\[ T_{h}(x_{1:N}) = \sum_{i=1}^N x_i. \]

Let’s implement this as a Python function of the data:

def T_h(x):
    """
    This is an implementation of a 
    """
    return x.sum()

We are going to try two datasets:

  • one that was generated from the correct model, and

  • one I picked by hand.

We will see the results that we get from both. For the first dataset (the one generated by the model) we get:

# The observed test quantity
T_h_obs = T_h(xdata)
print('The observed test quantity is {0:d}'.format(T_h_obs))
# Draw replicated data
N_rep = 1000
x_rep = np.ndarray((N_rep, N))
for i in range(N_rep):
    x_rep[i, :] = st.bernoulli(hat_theta).rvs(size=N)
# Evaluate the test quantity
T_h_rep = np.ndarray(x_rep.shape[0])
for i in range(x_rep.shape[0]):
    T_h_rep[i] = T_h(x_rep[i, :])
# Do the plot
fig, ax = plt.subplots()
tmp = ax.hist(T_h_rep, density=True, alpha=0.25, label='Replicated test quantity')[0]
ax.plot(T_h_obs * np.ones((50,)), np.linspace(0, tmp.max(), 50), 'k', label='Observed test quantity')
plt.legend(loc='best')
plt.title('Real data, test quantity: number of heads');
The observed test quantity is 21
../_images/predictive-checking_22_1.svg

Now, let’s also look at the other dataset (the one I picked by hand):

# The observed test quantity
T_h_obs = T_h(xdata_2)
print('The observed test quantity is {0:d}'.format(T_h_obs))
# Draw replicated data
N_rep = 1000
x_rep = np.ndarray((N_rep, N))
for i in range(N_rep):
    x_rep[i, :] = st.bernoulli(hat_theta2).rvs(size=N)
# Evaluate the test quantity
T_h_rep = np.ndarray(x_rep.shape[0])
for i in range(x_rep.shape[0]):
    T_h_rep[i] = T_h(x_rep[i, :])
# Do the plot
fig, ax = plt.subplots()
tmp = ax.hist(T_h_rep, density=True, alpha=0.25, label='Replicated test quantity')[0]
ax.plot(T_h_obs * np.ones((50,)), np.linspace(0, tmp.max(), 50), 'k', label='Observed test quantity')
plt.legend(loc='best')
plt.title('Hand-picked data, test quantity: number of heads');
The observed test quantity is 25
../_images/predictive-checking_24_1.svg

It looks about the same. This just means that, when I picked the values by hand, I managed to match this particular test quantity. We cannot tell the two datasets apart with this statistic.

Can we find a better test quantity, one that can discriminate between the two data-generating processes? Remember our observation when we plotted the replicated data against the hand-picked data: the hand-picked data included more transitions between heads and tails, because I was trying too hard to make them look random. Let’s build a test quantity that captures this:

\[ T_s(x) = \text{number of switches between 0 and 1 in the sequence}\;x. \]

This is not easy to write in an analytical form, but we can program it:

def T_s(x):
    """Count the number of switches between 0 and 1 in the sequence x."""
    s = 0
    for i in range(1, x.shape[0]):
        if x[i] != x[i-1]:
            s += 1
    return s
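
As a side note, the same quantity can be computed without an explicit loop by comparing each element to the previous one. An equivalent vectorized version:

def T_s_vec(x):
    """Count switches between consecutive elements of x (vectorized)."""
    return int(np.sum(x[1:] != x[:-1]))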

Let’s look first at the original dataset:

# The observed test quantity
T_s_obs = T_s(xdata)
print('The observed test quantity is {0:d}'.format(T_s_obs))
# Draw replicated data
N_rep = 1000
x_rep = np.ndarray((N_rep, N))
for i in range(N_rep):
    x_rep[i, :] = st.bernoulli(hat_theta).rvs(size=N)
# Evaluate the test quantity
T_s_rep = np.ndarray(x_rep.shape[0])
for i in range(x_rep.shape[0]):
    T_s_rep[i] = T_s(x_rep[i, :])
# Do the plot
fig, ax = plt.subplots()
tmp = ax.hist(T_s_rep, density=True, alpha=0.25, label='Replicated test quantity')[0]
ax.plot(T_s_obs * np.ones((50,)), np.linspace(0, tmp.max(), 50), 'k', label='Observed test quantity')
plt.legend(loc='best')
plt.title('Real data, test quantity: number of switches between heads and tails');
The observed test quantity is 25
../_images/predictive-checking_28_1.svg

This looks okay.

Let’s now look at the one I picked by hand:

# The observed test quantity
T_s_obs = T_s(xdata_2)
print('The observed test quantity is {0:d}'.format(T_s_obs))
# Draw replicated data
N_rep = 1000
x_rep = np.ndarray((N_rep, N))
for i in range(N_rep):
    x_rep[i, :] = st.bernoulli(hat_theta2).rvs(size=N)
# Evaluate the test quantity
T_s_rep = np.ndarray(x_rep.shape[0])
for i in range(x_rep.shape[0]):
    T_s_rep[i] = T_s(x_rep[i, :])
# Do the plot
fig, ax = plt.subplots()
tmp = ax.hist(T_s_rep, density=True, alpha=0.25, label='Replicated test quantity')[0]
ax.plot(T_s_obs * np.ones((50,)), np.linspace(0, tmp.max(), 50), 'k', label='Observed test quantity')
plt.legend(loc='best')
plt.title('Hand-picked data, test quantity: number of switches between heads and tails');
The observed test quantity is 32
../_images/predictive-checking_30_1.svg

Oops… The observed test quantity is highly unlikely under the assumptions of this model. You see how predictive checking can reveal that there is something fishy about your data.

Note

What we did here has connections to the concept of p-values. However, p-values are easily misused. The problem is that scientists and engineers tend to use them to automatically reject or accept hypotheses, while things are a bit more nuanced than that. To do a good job teaching p-values, I would have to devote at least one lecture to them, and that would come at the expense of another topic. Furthermore, you can do excellent data science without ever using p-values! These are the reasons I have chosen not to cover them. If you want to learn about p-values, I suggest you take a core statistics course.