Homework 12
Type your name and email in the “Student details” section below.
Develop the code and generate the figures you need to solve the problems using this notebook.
For the answers that require a mathematical proof or derivation you can either:
Type the answer using the built-in LaTeX capabilities. In this case, simply export the notebook as a PDF and upload it on Gradescope; or
Print the notebook (after you are done with all the code), write your answers by hand, scan them, combine your response into a single PDF, and upload it on Gradescope.
The homework is worth 100 points in total. Please note that the problems are not weighted equally.
Note
This is due before the beginning of the next lecture.
Please match all the pages corresponding to each of the questions when you submit on Gradescope.
Student details
First Name:
Last Name:
Email:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(rc={"figure.dpi":100, 'savefig.dpi':300})
sns.set_context('notebook')
sns.set_style("ticks")
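# IPython.display.set_matplotlib_formats is deprecated; the same function lives in matplotlib_inline
from matplotlib_inline.backend_inline import set_matplotlib_formats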
set_matplotlib_formats('retina', 'svg')
import numpy as np
import scipy.stats as st
Problem 1 - Comparing performance of robotic systems
You are considering purchasing a robotic system for manufacturing masks. There are two possibilities, say \(A\) and \(B\). They both produce the same number of masks per day, they cost the same to purchase, and they have the same power and supply costs. However, they are not identical: they have different faulty mask rates. Let \(X_A\) and \(X_B\) be the number of faulty masks you get from each system, respectively, in a given day. For each of the cases below:
Use scipy.stats to make two Normal random variables \(X_A\) and \(X_B\) with the right mean and variance.
Plot the PDF of the random variables in the same figure.
Find a 95% central credible interval for each random variable.
Indicate which robotic system you would buy and why (three choices: \(A\), \(B\), and “I cannot choose”).
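As a reminder of the mechanics, here is a minimal sketch of how a Normal random variable can be built from a mean and a variance and how a 95% central credible interval can be read off. The values of `mean` and `variance` are hypothetical and are not one of the cases below. Note that `st.norm` expects the standard deviation in `scale`, not the variance.

```python
import numpy as np
import scipy.stats as st

# Hypothetical values for illustration only (not one of the homework cases)
mean, variance = 3.0, 0.5

# st.norm takes the standard deviation as `scale`, so convert the variance
X = st.norm(loc=mean, scale=np.sqrt(variance))

# 95% central credible interval: leave 2.5% of probability in each tail
lower, upper = X.ppf(0.025), X.ppf(0.975)
print(f"95% central credible interval: [{lower:.3f}, {upper:.3f}]")

# The interval() method returns the same pair in one call
print(X.interval(0.95))
```

The interval is symmetric around the mean here only because the Normal is symmetric; the same two `ppf` calls work for any `scipy.stats` random variable.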
Case 1: \(\mathbf{E}[X_A] = 1, \mathbf{V}[X_A] = 0.1\) and \(\mathbf{E}[X_B] = 1, \mathbf{V}[X_B] = 0.2\).
Answer:
# Make the random variables here
XA = st.norm(loc=#YOUR CHOICE,
scale=#YOUR CHOICE)
XB = # YOUR CODE
# Plot the PDF's here
# Find the 95% central credible interval for XA here
# Find the 95% central credible interval for XB here
Your answer to which one you would buy and why here.
Case 2: \(\mathbf{E}[X_A] = 1, \mathbf{V}[X_A] = 0.1\) and \(\mathbf{E}[X_B] = 2, \mathbf{V}[X_B] = 0.1\).
Answer:
# Make the random variables here
# Plot the PDF's here
# Find the 95% central credible interval for XA here
# Find the 95% central credible interval for XB here
Your answer to which one you would buy and why here.
Case 3: \(\mathbf{E}[X_A] = 1, \mathbf{V}[X_A] = 0.3\) and \(\mathbf{E}[X_B] = 1.1, \mathbf{V}[X_B] = 0.1\).
Answer:
# Make the random variables here
# Plot the PDF's here
# Find the 95% central credible interval for XA here
# Find the 95% central credible interval for XB here
Your answer to which one you would buy and why here.
Problem 2 - Figuring out which household conserves less energy
In this homework problem, we are going to look at a dataset for which the Normal is not a good fit. In particular, we are going to look at HVAC energy consumption in our high-performance building data.
import requests
import os
def download(url, local_filename=None):
    """
    Downloads the file in the ``url`` and saves it in the current working directory.
    """
    data = requests.get(url)
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)
# The url of the file we want to download
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/temperature_raw.xlsx'
download(url)
import numpy as np
import pandas as pd
df = pd.read_excel('temperature_raw.xlsx')
df = df.dropna(axis=0)
df.date = pd.to_datetime(df['date'], format='%Y-%m-%d')
df
Extract the hvac column for household a5:
hvac_a5 = df[df['household'] == <your-code>][<your-code>] # This does not work, you need to write your own code
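The selection pattern in the template can be sketched on a toy table. The column names `unit` and `load` and all values below are hypothetical and have nothing to do with the dataset; the point is only that the comparison uses `==` to build a boolean mask, and the second index picks a column.

```python
import pandas as pd

# Toy table with hypothetical column names and values
toy = pd.DataFrame({
    "unit": ["u1", "u2", "u1", "u2"],
    "load": [10.0, 12.5, 11.0, 13.5],
})

# `==` yields a boolean mask; indexing with the mask keeps the matching rows,
# and the second index selects a single column as a Series
load_u1 = toy[toy["unit"] == "u1"]["load"]
print(load_u1.values)

# Equivalent one-step form using .loc
load_u1_loc = toy.loc[toy["unit"] == "u1", "load"]
```

Both forms return the same Series; the `.loc` form avoids chained indexing, which matters mostly when you assign to the selection.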
Do the histogram of hvac for household a5:
# Your code here
Use the method of moments to fit a Normal distribution to the hvac data for household a5:
# Your code here
HVAC_a5 = st.norm(loc=<your-code>, scale=<your-code>) # This does not work, you need to write your own code
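For reference, the method of moments for a Normal amounts to matching the mean and the variance of the distribution to the empirical mean and variance of the data. The sketch below illustrates this on synthetic data drawn with known parameters (an assumption made only so the result can be checked); it does not use the hvac measurements.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
# Synthetic stand-in data with known mean 5 and standard deviation 2
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

# Method of moments: match the first two moments of the Normal
mu_hat = data.mean()
sigma_hat = data.std()  # square root of the empirical variance

fitted = st.norm(loc=mu_hat, scale=sigma_hat)
print(mu_hat, sigma_hat)
```

With this many samples the estimates should land close to the values used to generate the data.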
In the same figure, show the histogram of hvac data for household a5 (use density=True to make sure it is normalized) and the PDF of the Normal you just fitted. Is this a good fit? Why do you think we do not get a good fit?
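The plotting mechanics can be sketched as follows, using synthetic data rather than the hvac measurements. The bin count, labels, and random seed are arbitrary choices. The key detail is `density=True`, which rescales the bars so that their total area is one and the histogram sits on the same scale as a PDF.

```python
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=2_000)  # synthetic data
fitted = st.norm(loc=data.mean(), scale=data.std())

fig, ax = plt.subplots()
# density=True rescales bar heights so that the total bar area is 1
heights, edges, _ = ax.hist(data, bins=30, density=True, alpha=0.5, label="data")

# Evaluate the fitted PDF on a grid covering the data range
xs = np.linspace(data.min(), data.max(), 200)
ax.plot(xs, fitted.pdf(xs), label="fitted Normal PDF")
ax.set_xlabel("value")
ax.set_ylabel("probability density")
ax.legend()
```

Without `density=True` the bars show raw counts, and the PDF curve would look flat next to them.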
# Your code here
Your answer here.
Now I am asking you to transform the data in a way that will make them look more Normal. Do the histogram of the logarithm of the hvac data for household a5:
# your code here
Fit a Normal to the logarithm of the hvac data for household a5:
# your code here
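The transform-then-fit step can be sketched on synthetic data. The data below are drawn so that their logarithm is Normal with mean 1 and standard deviation 0.5; these are assumed values chosen only to make the result checkable.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(2)
# Synthetic positive, right-skewed data (not the hvac measurements)
data = rng.lognormal(mean=1.0, sigma=0.5, size=10_000)

# The logarithm is only defined for strictly positive values
log_data = np.log(data)

# Method of moments applied to the transformed data
log_fit = st.norm(loc=log_data.mean(), scale=log_data.std())
print(log_fit.mean(), log_fit.std())
```

Keep in mind that the fitted Normal describes the logarithm of the data, so its mean and standard deviation are in log units.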
In the same figure, show the histogram of the logarithm of the hvac data for household a5 and the PDF of the Normal you just fitted. Does this look like a good fit?
# Your code here
Your answer here.
Now do exactly the same thing as in the previous bullet point for household a3.
# Your code here - as many blocks as you like
Which household consumes more energy, a5 or a3?
# Your code here - if needed
Your answer here.
Problem 3 - Introducing the Log-Normal distribution
In the previous problem, we took the logarithm of hvac in order to obtain a better fit to the Normal.
It turns out that this is a very common practice whenever you have a positive dataset that is skewed in the way we noticed above.
As a matter of fact, there is a distribution called the Log-Normal distribution which is designed to do exactly that.
Below, I show how you could have fitted a Log-Normal distribution directly on the hvac data.
params = st.lognorm.fit(hvac_a5) # Fits the parameters by maximum likelihood (similar in spirit to the method of moments)
HVAC_a5_ln = st.lognorm(*params) # This is the random variable
# Here is how you can sample from it:
HVAC_a5_ln.rvs(size=10)
# Here is how you can evaluate its PDF:
HVAC_a5_ln.pdf(200)
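It can help to see how the fitted parameters relate to the Normal fit of the logarithm. In `scipy.stats.lognorm`, `fit` returns a tuple `(s, loc, scale)`; when `loc` is zero, `s` is the standard deviation of the logarithm and `scale` is the exponential of its mean. The sketch below uses synthetic data and pins `loc` to zero with `floc=0`, which is an assumption of this example and differs from the call above, where `loc` is also fitted.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)
# Synthetic data whose logarithm is Normal with mean 1 and std 0.5
data = rng.lognormal(mean=1.0, sigma=0.5, size=10_000)

# fit returns (shape, loc, scale); floc=0 keeps the location fixed at zero
s, loc, scale = st.lognorm.fit(data, floc=0)

sigma_hat = s            # standard deviation of log(data)
mu_hat = np.log(scale)   # mean of log(data)
print(mu_hat, sigma_hat)
```

If `loc` is left free, the distribution is shifted, and `s` and `scale` no longer map this directly onto the mean and standard deviation of the logarithm.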
In the same figure, plot the PDF of the Log-Normal along with the histogram of the hvac data for a5.
# Your code here - Plot the PDF of HVAC_a5_ln along with the histogram of hvac_a5.
Is the fit with Log-Normal good?
Your answer here.
Recall how, in “Quantiles of the standard Normal”, we used the ppf() function of a random variable to find quantiles. Use this function to find the 0.025-quantile of the Log-Normal we constructed above.
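As a quick refresher, `ppf` is the inverse of the CDF: `ppf(q)` returns the value below which a fraction `q` of the probability lies. A minimal sketch on the standard Normal (not the Log-Normal from this problem):

```python
import scipy.stats as st

Z = st.norm()       # standard Normal
q = Z.ppf(0.025)    # the 0.025-quantile
print(f"{q:.3f}")   # about -1.960

# Applying the CDF to the quantile recovers the probability level
print(Z.cdf(q))
```

The same method exists on every frozen `scipy.stats` random variable, so the call looks identical for a Log-Normal.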
# Your code here
Now find the 95% central credible interval of the Log-Normal variable we constructed above.
# Your code here
Use as many code blocks as you need to fit a Log-Normal to the hvac data for household a3 and then find the 95% central credible interval of the resulting random variable.
# Your code here
By comparing the 95% central credible intervals constructed above, which household, a5 or a3, consumes more energy?
Your answer here.