Homework 12

  • Type your name and email in the “Student details” section below.

  • Develop the code and generate the figures you need to solve the problems using this notebook.

  • For the answers that require a mathematical proof or derivation you can either:

    • Type the answer using the built-in latex capabilities. In this case, simply export the notebook as a pdf and upload it on gradescope; or

    • Print the notebook (after you are done with all the code), write your answers by hand, scan them, combine everything into a single pdf, and upload it on gradescope.

  • The total homework points are 100. Please note that the problems are not weighted equally.

Note

  • This is due before the beginning of the next lecture.

  • Please match all the pages corresponding to each of the questions when you submit on gradescope.

Student details

  • First Name:

  • Last Name:

  • Email:

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(rc={"figure.dpi":100, 'savefig.dpi':300})
sns.set_context('notebook')
sns.set_style("ticks")
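# IPython.display.set_matplotlib_formats is deprecated; use the matplotlib_inline backend instead
from matplotlib_inline.backend_inline import set_matplotlib_formats
set_matplotlib_formats('retina', 'svg')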
import numpy as np
import scipy.stats as st

Problem 1 - Comparing performance of robotic systems

You are considering purchasing a robotic system for manufacturing masks. There are two possibilities, say \(A\) and \(B\). They both produce the same number of masks per day, they cost the same to purchase, and they have the same power and supply costs. However, they are not identical: they have different faulty mask rates. Let \(X_A\) and \(X_B\) be the number of faulty masks you get from each system, respectively, in a given day. For each of the possibilities below:

  1. Use scipy.stats to make two Normal random variables \(X_A\) and \(X_B\) with the right mean and variance.

  2. Plot the PDF of the random variables in the same figure.

  3. Find a 95% central credible interval.

  4. Indicate which robotic system you would buy and why (three choices: \(A\), \(B\), and “I cannot choose”).
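
As a reminder of the mechanics behind steps 1 and 3, here is a minimal sketch. It assumes a hypothetical mean of 3 and variance of 0.5, which are illustrative values and not one of the cases below. The key detail is that `st.norm` expects the standard deviation in `scale`, so a variance has to be converted with a square root first.

```python
import numpy as np
import scipy.stats as st

# Hypothetical values for illustration only (not one of the cases below).
mean, variance = 3.0, 0.5

# st.norm takes the standard deviation as `scale`, so convert the variance.
X = st.norm(loc=mean, scale=np.sqrt(variance))

# 95% central credible interval: 2.5% of the probability in each tail.
lower, upper = X.ppf(0.025), X.ppf(0.975)
print(f"95% central credible interval: [{lower:.3f}, {upper:.3f}]")

# The frozen distribution also offers interval(), which returns the same bounds.
lower_alt, upper_alt = X.interval(0.95)
```

For a Normal, the central interval is symmetric about the mean, so its width depends only on the variance.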

  • Case 1: \(\mathbf{E}[X_A] = 1, \mathbf{V}[X_A] = 0.1\) and \(\mathbf{E}[X_B] = 1, \mathbf{V}[X_B] = 0.2\).

Answer:

# Make the random variables here
XA = st.norm(loc=#YOUR CHOICE,
             scale=#YOUR CHOICE)
XB = # YOUR CODE
# Plot the PDF's here
# Find the 95% central credible interval for XA here
# Find the 95% central credible interval for XB here

Your answer to which one you would buy and why here.

  • Case 2: \(\mathbf{E}[X_A] = 1, \mathbf{V}[X_A] = 0.1\) and \(\mathbf{E}[X_B] = 2, \mathbf{V}[X_B] = 0.1\).

Answer:

# Make the random variables here
# Plot the PDF's here
# Find the 95% central credible interval for XA here
# Find the 95% central credible interval for XB here

Your answer to which one you would buy and why here.

  • Case 3: \(\mathbf{E}[X_A] = 1, \mathbf{V}[X_A] = 0.3\) and \(\mathbf{E}[X_B] = 1.1, \mathbf{V}[X_B] = 0.1\).

Answer:

# Make the random variables here
# Plot the PDF's here
# Find the 95% central credible interval for XA here
# Find the 95% central credible interval for XB here

Your answer to which one you would buy and why here.

Problem 2 - Figuring out which household conserves less energy

In this homework problem, we are going to look at a dataset for which the Normal is not a good fit. In particular, we are going to look at HVAC energy consumption in our high-performance building data.

import requests
import os
def download(url, local_filename=None):
    """
    Download the file at ``url`` and save it in the current working directory.
    """
    data = requests.get(url)
    data.raise_for_status()  # Stop early if the download failed
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)

# The url of the file we want to download
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/temperature_raw.xlsx'
download(url)
import numpy as np
import pandas as pd
df = pd.read_excel('temperature_raw.xlsx')
df = df.dropna(axis=0)
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df
  • Extract the hvac column for household a5:

hvac_a5 = df[df['household'] == <your-code>][<your-code>] # This does not work, you need to write your own code
  • Plot the histogram of hvac for household a5:
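
The row selection above relies on boolean masking. A minimal sketch on a toy table follows; the values and the household labels `a1` and `a2` are made up, and only the column names `household` and `hvac` mirror the dataset.

```python
import pandas as pd

# Toy table with made-up values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "household": ["a1", "a2", "a1", "a2"],
    "hvac": [10.0, 20.0, 30.0, 40.0],
})

# `==` compares element-wise and yields a boolean mask; `=` would be an assignment.
mask = toy["household"] == "a1"

# Index with the mask to keep the matching rows, then pick one column.
hvac_a1 = toy[mask]["hvac"]
print(hvac_a1.to_numpy())
```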

# Your code here
  • Use the method of moments to fit a Normal distribution to the hvac data for household a5:

# Your code here
HVAC_a5 = st.norm(loc=<your-code>, scale=<your-code>) # This does not work, you need to write your own code
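
For reference, the method of moments for a Normal amounts to matching its mean and standard deviation to the sample mean and sample standard deviation. The sketch below uses synthetic stand-in data (an assumption, not the hvac column), and the names `mu_hat` and `sigma_hat` are illustrative.

```python
import numpy as np
import scipy.stats as st

# Synthetic stand-in data (an assumption, not the hvac column).
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Method of moments: match the first two moments of the Normal to the sample.
mu_hat = data.mean()
sigma_hat = data.std()  # square root of the sample variance

fitted = st.norm(loc=mu_hat, scale=sigma_hat)
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```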
  • In the same figure, show the histogram of hvac data for household a5 (use density=True to make sure it is normalized) and the PDF of the Normal you just fitted. Is this a good fit? Why do you think we do not get a good fit?

# Your code here

Your answer here.
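
A minimal sketch of overlaying a fitted PDF on a normalized histogram is given here. It assumes synthetic Gamma-distributed data as a stand-in for a skewed, positive measurement; it is not the hvac column, and the bin count is an arbitrary choice.

```python
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# Synthetic skewed, positive data (an assumption, not the hvac column).
rng = np.random.default_rng(1)
data = rng.gamma(shape=2.0, scale=50.0, size=2000)

fitted = st.norm(loc=data.mean(), scale=data.std())

fig, ax = plt.subplots()
# density=True rescales the bars so their total area is one, like a PDF.
counts, edges, _ = ax.hist(data, bins=40, density=True, alpha=0.5, label="data")
xs = np.linspace(data.min(), data.max(), 200)
ax.plot(xs, fitted.pdf(xs), label="fitted Normal")
ax.set_xlabel("value")
ax.set_ylabel("probability density")
ax.legend()

# The Normal puts probability on negative values that positive data cannot take.
print(f"P(X < 0) under the fitted Normal: {fitted.cdf(0.0):.3f}")
```

Checking how much probability a fitted Normal assigns to impossible values is one quick diagnostic to pair with the visual comparison.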

  • Now I am asking you to transform the data in a way that will make them look more Normal. Plot the histogram of the logarithm of the hvac data for household a5:

# your code here
  • Fit a Normal to the logarithm of the hvac data for household a5:

# your code here
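
To illustrate why a log transform can help, the sketch below uses synthetic right-skewed data (an assumption, not the hvac column) and compares the sample skewness before and after taking logarithms. It then fits a Normal to the log-data by matching the mean and standard deviation.

```python
import numpy as np
import scipy.stats as st

# Synthetic right-skewed, positive data (an assumption, not the hvac column).
rng = np.random.default_rng(2)
data = rng.lognormal(mean=4.0, sigma=0.5, size=2000)

log_data = np.log(data)  # requires strictly positive values

# Skewness near zero indicates a roughly symmetric sample.
print(f"skewness before: {st.skew(data):.2f}, after: {st.skew(log_data):.2f}")

# Fit a Normal to the log-data by matching mean and standard deviation.
log_fit = st.norm(loc=log_data.mean(), scale=log_data.std())
```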
  • In the same figure, show the histogram of the logarithm of the hvac data for household a5 and the PDF of the Normal you just fitted. Does this look like a good fit?

# Your code here

Your answer here.

  • Now repeat the same steps (histogram of the logarithm, Normal fit, and histogram with the fitted PDF) for household a3.

# Your code here - as many blocks as you like
  • Which household consumes more energy, a5 or a3?

# Your code here - if needed

Your answer here.

Problem 3 - Introducing the Log-Normal distribution

In the previous problem, we took the logarithm of hvac in order to obtain a better fit to the Normal. It turns out that this is a very common practice whenever you have a positive dataset that is skewed in the way we noticed above. As a matter of fact, there is a distribution called the Log-Normal distribution which is designed to do exactly that. Below, I show how you could have fitted a Log-Normal distribution directly to the hvac data.

params = st.lognorm.fit(hvac_a5) # Fits by maximum likelihood, similar in spirit to the method of moments
HVAC_a5_ln = st.lognorm(*params) # This is the random variable
# Here is how you can sample from it:
HVAC_a5_ln.rvs(size=10)
# Here is how you can evaluate its PDF:
HVAC_a5_ln.pdf(200)
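
It may help to see what the fitted parameters mean. In scipy, `st.lognorm.fit` returns a tuple `(s, loc, scale)`. The sketch below uses synthetic data (an assumption, not the hvac column) and fixes the location at zero with `floc=0`, which the call above does not do. Under that extra assumption, `s` corresponds to the standard deviation of the log-data and `scale` to the exponential of its mean.

```python
import numpy as np
import scipy.stats as st

# Synthetic positive data (an assumption, not the hvac column).
rng = np.random.default_rng(3)
data = rng.lognormal(mean=4.0, sigma=0.5, size=5000)

# fit() returns (shape, loc, scale); floc=0 pins the location at zero.
s, loc, scale = st.lognorm.fit(data, floc=0)

# With loc = 0: shape is the std of log(X), scale is exp(mean of log(X)).
log_data = np.log(data)
print(f"s = {s:.3f} vs std of log-data = {log_data.std():.3f}")
print(f"log(scale) = {np.log(scale):.3f} vs mean of log-data = {log_data.mean():.3f}")
```

Without `floc=0`, the fit also estimates a shift `loc`, so the parameters no longer map directly onto the mean and standard deviation of the log-data.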
  • In the same figure, plot the PDF of the Log-normal along with the histogram of the hvac data for a5.

# Your code here - Plot the PDF of HVAC_a5_ln along with the histogram of hvac_a5.
  • Is the fit with Log-Normal good?

Your answer here.

  • Recall how in Quantiles of the standard Normal we used the ppf() function of a random variable to find quantiles. Use this function to find the 0.025-quantile of the Log-Normal we constructed above.

# Your code here
  • Now find the 95% central credible interval of the Log-Normal variable we constructed above.

# Your code here
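
As a sketch of how `ppf()` behaves on a skewed distribution, the example below builds a Log-Normal with hypothetical parameters (not fitted to the hvac data) and reads off the two quantiles that bound a 95% central credible interval.

```python
import numpy as np
import scipy.stats as st

# Hypothetical parameters for illustration (not fitted to the hvac data).
X = st.lognorm(s=0.5, loc=0.0, scale=np.exp(4.0))

# ppf is the inverse of the CDF: ppf(q) is the value with probability q below it.
q_low = X.ppf(0.025)
q_high = X.ppf(0.975)
print(f"0.025-quantile: {q_low:.2f}, 0.975-quantile: {q_high:.2f}")

# For a skewed distribution the interval is not symmetric about the mean.
print(f"mean: {X.mean():.2f}, median: {X.median():.2f}")
```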
  • Repeat as many code-blocks as you need to fit a Log-Normal to the hvac data for household a3 and then find the 95% central credible interval of the resulting random variable.

# Your code here
  • By comparing the 95% central credible intervals constructed above, which household, a5 or a3, consumes more energy?

Your answer here.