Homework 16

• Type your name and email in the “Student details” section below.

• Develop the code and generate the figures you need to solve the problems using this notebook.

• For answers that require a mathematical proof or derivation, you can either:

    • Type the answer using the built-in LaTeX capabilities. In this case, simply export the notebook as a PDF and upload it to Gradescope; or

    • Print the notebook (after you are done with all the code), write your answers by hand, scan them, combine everything into a single PDF, and upload it to Gradescope.

• The total homework points are 100. Please note that the problems are not weighted equally.

Note

• This is due before the beginning of the next lecture.

• Please match all the pages corresponding to each of the questions when you submit on Gradescope.

Student details

• First Name:

• Last Name:

• Email:

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(rc={"figure.dpi":100, 'savefig.dpi':300})
sns.set_context('notebook')
sns.set_style("ticks")
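# set_matplotlib_formats is deprecated in IPython.display; use matplotlib_inline instead
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('retina', 'svg')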
import numpy as np
import scipy.stats as st
import pandas as pd
import requests
import os
# The function header was missing here; the name `download` is an assumption.
def download(url, local_filename=None):
    """
    Downloads the file in the url and saves it in the current working directory.
    """
    data = requests.get(url)
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)


Problem 1 - Explaining the Challenger disaster

On January 28, 1986, the Space Shuttle Challenger disintegrated 73 seconds after launch. The failure can be traced to the rubber O-rings used to seal the joints of the solid rocket boosters (the seals are required to force the hot, high-pressure gases generated by the burning solid propellant through the nozzles, thus producing thrust).

It turns out that the performance of the O-ring material was particularly sensitive to the external temperature during launch. This dataset contains records of different experiments with O-rings recorded at various times between 1981 and 1986. Download the data the usual way (either put it on Google Drive or run the code cell below).

url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/challenger_data.csv'


Even though this is a csv file, you should load it with pandas because it contains some special characters.

raw_data = pd.read_csv('challenger_data.csv')
raw_data

          Date  Temperature      Damage Incident
0   04/12/1981           66                    0
1   11/12/1981           70                    1
2      3/22/82           69                    0
3      6/27/82           80                  NaN
4   01/11/1982           68                    0
5   04/04/1983           67                    0
6      6/18/83           72                    0
7      8/30/83           73                    0
8     11/28/83           70                    0
9   02/03/1984           57                    1
10  04/06/1984           63                    1
11     8/30/84           70                    1
12  10/05/1984           78                    0
13  11/08/1984           67                    0
14     1/24/85           53                    1
15  04/12/1985           67                    0
16     4/29/85           75                    0
17     6/17/85           70                    0
18     7/29/85           81                    0
19     8/27/85           76                    0
20  10/03/1985           79                    0
21    10/30/85           75                    1
22    11/26/85           76                    0
23  01/12/1986           58                    1
24     1/28/86           31  Challenger Accident

The first column is the date of the record. The second column is the external temperature on that day in degrees F. The third column, labeled Damage Incident, has a binary coding (0 = no damage, 1 = damage). The very last row is the day of the Challenger accident.

We are going to use the first 23 rows to solve a binary classification problem that will give us the probability of an accident conditioned on the observed external temperature in degrees F. Before we proceed to the analysis of the data, let’s clean the data up.

First, we drop all the bad records:

# Your code here


We also don’t need the last record. Just remember that the temperature on the day of the Challenger accident was 31 degrees F. Remove the last record from the dataframe.

# Your code here
clean_data =
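
As a hedged sketch of the two cleaning steps (dropping incomplete records, then dropping the final accident row), the snippet below works on a small hand-typed frame with a few rows copied from the table above, not on the downloaded file. The names `toy`, `no_missing`, and `toy_clean` are hypothetical and exist only for this illustration; the damage labels are typed as strings on the assumption that pandas reads this mixed column as text.

```python
import numpy as np
import pandas as pd

# A few rows copied from the table above; `toy` is an illustrative stand-in
# for raw_data, not the full dataset.
toy = pd.DataFrame(
    {
        "Date": ["04/12/1981", "11/12/1981", "6/27/82", "1/28/86"],
        "Temperature": [66, 70, 80, 31],
        "Damage Incident": ["0", "1", np.nan, "Challenger Accident"],
    }
)

# Step 1: drop every row that has a missing value in any column.
no_missing = toy.dropna()

# Step 2: drop the last remaining row by position (the accident record).
toy_clean = no_missing.iloc[:-1]

print(toy_clean)
```

Because `iloc[:-1]` removes a row by position, this relies on the accident record still being the last row after the incomplete rows are gone, which is the case for this table.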


Let’s extract the features and the labels:

x = clean_data['Temperature'].values
x

# np.int was removed in NumPy 1.24; the built-in int works in all versions
y = clean_data['Damage Incident'].values.astype(int)
y


Part A - Perform logistic regression

Perform logistic regression between the temperature ($x$) and the damage label ($y$). Do not bother with validation because there is very little data. Just use a very simple model so that you don’t overfit.

Answer: This is one of the cases where we don’t have a lot of data, so we are going to use everything for training. To avoid overfitting, we will use the simplest possible model: $$p(y=1|x,w) = \operatorname{sigm}(w_0 + w_1 x),$$ where $w_0$ and $w_1$ are parameters to be determined from the data.

# Your code here
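
One possible way to fit this model is sketched below, assuming plain NumPy and Newton's method on the log-likelihood rather than whatever library routine the course may intend. The temperatures and labels are typed in from the 23 usable rows of the table above; `sigm`, `fit_logistic`, `x_demo`, `y_demo`, and `w_demo` are hypothetical names introduced for this sketch.

```python
import numpy as np

# Temperatures (deg F) and damage labels for the 23 usable records, typed in
# from the table above (the row with the missing label and the accident row
# are left out).
x_demo = np.array([66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
                   67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58], dtype=float)
y_demo = np.array([0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
                   0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1], dtype=float)


def sigm(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(x, y, num_iter=25):
    """Maximum-likelihood fit of p(y=1|x) = sigm(w0 + w1 * x) by Newton's method."""
    X = np.column_stack([np.ones_like(x), x])  # design matrix with columns [1, x]
    w = np.zeros(2)
    for _ in range(num_iter):
        p = sigm(X @ w)
        grad = X.T @ (y - p)                         # gradient of the log-likelihood
        hess = X.T @ (X * (p * (1.0 - p))[:, None])  # negative Hessian of the log-likelihood
        w = w + np.linalg.solve(hess, grad)          # Newton update
    return w


w_demo = fit_logistic(x_demo, y_demo)
print(w_demo)  # [w0, w1]
```

A library estimator such as scikit-learn's `LogisticRegression` is an alternative; note that it applies L2 regularization by default, so its coefficients will generally differ from an unregularized fit like this one.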


This is it. Let’s take a look at the parameters that were found:

# Your code here


We observe a negative correlation between temperature and damage. Damage becomes more probable as temperature decreases.

Part B - Plot the probability of damage as a function of temperature

Plot the probability of damage as a function of temperature.

# Your code here
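
A rough sketch of such a plot follows, assuming the two-parameter sigmoid model from Part A. The weights `w0_demo` and `w1_demo` are placeholders chosen only so the curve has a negative slope; they are not results from this notebook and should be replaced with the values from your own fit. The temperature grid of 30 to 85 degrees F is likewise an arbitrary choice.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder weights for illustration only; substitute your fitted values.
w0_demo, w1_demo = 15.0, -0.23

# Evaluate the model on a dense grid of temperatures (degrees F).
temps = np.linspace(30.0, 85.0, 200)
prob = 1.0 / (1.0 + np.exp(-(w0_demo + w1_demo * temps)))

fig, ax = plt.subplots()
ax.plot(temps, prob)
ax.set_xlabel("Temperature (degrees F)")
ax.set_ylabel("Probability of damage")
ax.set_ylim(-0.05, 1.05)
```

Overlaying the observed labels with `ax.scatter(x, y)` on the same axes is a quick way to check the curve against the data.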


Part C - Decide whether or not to launch

The temperature on the day of the Challenger accident was 31 degrees F. Would you go ahead with the launch or not? Hint: Start by calculating the probability of damage at 31 degrees F.

Your answer here. As many code and text blocks as you need.
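
As a starting point for the hint, the sketch below evaluates the sigmoid model at 31 degrees F and checks whether that temperature lies inside the range covered by the 23 usable records (53 to 81 degrees F in the table above). The weights are placeholders for illustration, not fitted results, and all variable names are hypothetical.

```python
import numpy as np

# Placeholder weights for illustration only; substitute your fitted values.
w0_demo, w1_demo = 15.0, -0.23

launch_temp = 31.0                        # degrees F, from the problem statement
observed_min, observed_max = 53.0, 81.0   # temperature range of the 23 usable records

# Probability of damage under p(y=1|x,w) = sigm(w0 + w1 * x).
p_damage = 1.0 / (1.0 + np.exp(-(w0_demo + w1_demo * launch_temp)))

# True when the prediction is made outside the range the model was trained on.
is_extrapolation = not (observed_min <= launch_temp <= observed_max)

print(f"p(damage | 31 F) = {p_damage:.4f}, extrapolating: {is_extrapolation}")
```

The extrapolation flag matters for the written answer: 31 degrees F is well below the coldest recorded launch, so the predicted probability rests on the assumed shape of the model rather than on nearby observations.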