# Homework 12

Type your name and email in the “Student details” section below.

Develop the code and generate the figures you need to solve the problems using this notebook.

For the answers that require a mathematical proof or derivation, you can either:

- Type the answer using the built-in LaTeX capabilities. In this case, simply export the notebook as a PDF and upload it on Gradescope; or
- Print the notebook (after you are done with all the code), write your answers by hand, scan them, combine your response into a single PDF, and upload it on Gradescope.

The total homework points are 100. Please note that the problems are not weighted equally.

**Note:** This is due before the beginning of the next lecture.

Please match all the pages corresponding to each of the questions when you submit on gradescope.

## Student details

**First Name:**

**Last Name:**

**Email:**

```
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(rc={"figure.dpi":100, 'savefig.dpi':300})
sns.set_context('notebook')
sns.set_style("ticks")
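# IPython.display.set_matplotlib_formats is deprecated; use the matplotlib_inline version
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('retina', 'svg')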
import numpy as np
import scipy.stats as st
```

## Problem 1 - Comparing performance of robotic systems

You are considering purchasing a robotic system for manufacturing masks. There are two possibilities, say \(A\) and \(B\). They both produce the same number of masks per day, they cost the same to purchase, and they have the same power and supply costs. However, they are not identical. They have different faulty mask rates. Let \(X_A\) and \(X_B\) be the number of faulty masks you get from each system, respectively, in a given day. For each of the possibilities below:

- Use `scipy.stats` to make two Normal random variables \(X_A\) and \(X_B\) with the right mean and variance.
- Plot the PDF of the random variables in the same figure.
- Find a 95% central credible interval for each random variable.
- Indicate which robotic system you would buy and why (three choices: \(A\), \(B\), and "I cannot choose").
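As a minimal sketch of the mechanics behind these steps, the block below builds a Normal random variable from a mean and a variance and computes its 95% central credible interval. The values `mean = 3.0` and `variance = 0.25` are made up for illustration and do not correspond to any of the cases below. Keep in mind that `st.norm` expects the standard deviation, not the variance, as its `scale` argument.

```python
import numpy as np
import scipy.stats as st

# Hypothetical values, chosen only for illustration
mean = 3.0
variance = 0.25

# scipy parameterizes the Normal by its standard deviation
X = st.norm(loc=mean, scale=np.sqrt(variance))

# PDF evaluated on a grid (this is what you would pass to plt.plot)
xs = np.linspace(mean - 4 * X.std(), mean + 4 * X.std(), 200)
pdf_values = X.pdf(xs)

# 95% central credible interval: 2.5% of the probability is left in each tail
lower = X.ppf(0.025)
upper = X.ppf(0.975)
print(f"95% central credible interval: [{lower:.3f}, {upper:.3f}]")
```

Plotting `xs` against `pdf_values` for both random variables on the same axes gives the figure requested in each case.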

**Case 1:** \(\mathbf{E}[X_A] = 1, \mathbf{V}[X_A] = 0.1\) and \(\mathbf{E}[X_B] = 1, \mathbf{V}[X_B] = 0.2\).

**Answer:**

```
# Make the random variables here
XA = st.norm(loc=#YOUR CHOICE,
             scale=#YOUR CHOICE)
XB = # YOUR CODE
```

```
# Plot the PDF's here
```

```
# Find the 95% central credible interval for XA here
```

```
# Find the 95% central credible interval for XB here
```

*Your answer to which one you would buy and why here.*

**Case 2:** \(\mathbf{E}[X_A] = 1, \mathbf{V}[X_A] = 0.1\) and \(\mathbf{E}[X_B] = 2, \mathbf{V}[X_B] = 0.1\).

**Answer:**

```
# Make the random variables here
```

```
# Plot the PDF's here
```

```
# Find the 95% central credible interval for XA here
```

```
# Find the 95% central credible interval for XB here
```

*Your answer to which one you would buy and why here.*

**Case 3:** \(\mathbf{E}[X_A] = 1, \mathbf{V}[X_A] = 0.3\) and \(\mathbf{E}[X_B] = 1.1, \mathbf{V}[X_B] = 0.1\).

**Answer:**

```
# Make the random variables here
```

```
# Plot the PDF's here
```

```
# Find the 95% central credible interval for XA here
```

```
# Find the 95% central credible interval for XB here
```

*Your answer to which one you would buy and why here.*

## Problem 2 - Figuring out which household conserves less energy

In this homework problem, we are going to look at a dataset for which the Normal is not a good fit. In particular, we are going to look at HVAC energy consumption in our high-performance building data.

```
import requests
import os

def download(url, local_filename=None):
    """
    Downloads the file in the ``url`` and saves it in the current working directory.
    """
    data = requests.get(url)
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)

# The url of the file we want to download
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/temperature_raw.xlsx'
download(url)

import numpy as np
import pandas as pd
df = pd.read_excel('temperature_raw.xlsx')
df = df.dropna(axis=0)
df.date = pd.to_datetime(df['date'], format='%Y-%m-%d')
df
```

Extract the `hvac` column for household `a5`:

```
hvac_a5 = df[df['household'] == <your-code>][<your-code>] # This does not work, you need to write your own code
```
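As a reminder of how this kind of selection works in pandas, here is a small sketch on a toy data frame. The frame `toy`, its household labels `x1` and `x2`, and its values are hypothetical and are not part of the homework dataset.

```python
import pandas as pd

# Toy data frame with hypothetical values (not the homework dataset)
toy = pd.DataFrame({
    "household": ["x1", "x2", "x1", "x2"],
    "hvac": [10.0, 20.0, 30.0, 40.0],
})

# Boolean mask: True for the rows that belong to household "x1"
mask = toy["household"] == "x1"

# Keep only those rows, then pick a single column
hvac_x1 = toy[mask]["hvac"]
print(hvac_x1.to_numpy())
```

Note the double equals sign: `==` compares each entry to the label and returns a column of booleans, whereas a single `=` would be an assignment and raise a syntax error inside the brackets.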

Plot the histogram of `hvac` for household `a5`:

```
# Your code here
```

Use the method of moments to fit a Normal distribution to the `hvac` data for household `a5`:

```
# Your code here
HVAC_a5 = st.norm(loc=<your-code>, scale=<your-code>) # This does not work, you need to write your own code
```
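For reference, the method of moments for a Normal amounts to matching its mean and variance to the sample mean and sample variance. The sketch below illustrates this on synthetic data; the array `data` and the generator seed are assumptions made for the example and have nothing to do with the `hvac` measurements.

```python
import numpy as np
import scipy.stats as st

# Synthetic data standing in for a real measurement column
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)

# Method of moments: match the first two moments of the Normal to the sample
mu_hat = np.mean(data)
sigma_hat = np.std(data)   # square root of the sample variance

fitted = st.norm(loc=mu_hat, scale=sigma_hat)
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```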

In the same figure, show the histogram of `hvac` data for household `a5` (use `density=True` to make sure it is normalized) and the PDF of the Normal you just fitted. Is this a good fit? Why do you think we do not get a good fit?

```
# Your code here
```
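The mechanics of overlaying a fitted PDF on a normalized histogram could look roughly like the sketch below. It uses synthetic Normal data (an assumption for illustration), so it says nothing about how well the Normal fits the `hvac` data.

```python
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st

# Synthetic data, used only to demonstrate the plotting calls
rng = np.random.default_rng(1)
data = rng.normal(loc=0.0, scale=1.0, size=500)
fitted = st.norm(loc=np.mean(data), scale=np.std(data))

fig, ax = plt.subplots()
# density=True rescales the bars so that their total area is one
heights, edges, _ = ax.hist(data, bins=30, density=True, alpha=0.5, label="data")
# Evaluate the fitted PDF over the range of the data and draw it on the same axes
xs = np.linspace(data.min(), data.max(), 200)
ax.plot(xs, fitted.pdf(xs), label="fitted Normal")
ax.set_xlabel("x")
ax.set_ylabel("probability density")
ax.legend()
```

Without `density=True` the histogram would show raw counts, which live on a different vertical scale than the PDF, and the two curves would not be comparable.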

*Your answer here.*

Now I am asking you to transform the data in a way that will make them look more Normal. Plot the histogram of the logarithm of the `hvac` data for household `a5`:

```
# your code here
```

Fit a Normal to the logarithm of the `hvac` data for household `a5`:

```
# your code here
```
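To see what a log-transform does to positive, right-skewed data in general, consider the sketch below. The synthetic array `data` is an assumption made for illustration; the sample skewness computed with `st.skew` is used here as one rough indicator of symmetry, not as a formal test of Normality.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(2)
# Synthetic positive, right-skewed data (an assumption, not the hvac data)
data = rng.lognormal(mean=1.0, sigma=0.8, size=2000)

log_data = np.log(data)   # requires strictly positive values

# Fit a Normal to the log-data by matching mean and standard deviation
log_fit = st.norm(loc=np.mean(log_data), scale=np.std(log_data))

# Skewness close to zero suggests a more symmetric, Normal-like shape
print(f"skewness before log: {st.skew(data):.2f}")
print(f"skewness after log:  {st.skew(log_data):.2f}")
```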

In the same figure, show the histogram of the **logarithm** of the `hvac` data for household `a5` and the PDF of the Normal you just fitted. Does this look like a good fit?

```
# Your code here
```

*Your answer here.*

Now do exactly the same thing as the previous bullet point for household `a3`.

```
# Your code here - as many blocks as you like
```

Which household consumes more energy, `a5` or `a3`?

```
# Your code here - if needed
```

*Your answer here.*

## Problem 3 - Introducing the Log-Normal distribution

In the previous problem, we took the logarithm of `hvac` in order to obtain a better fit to the Normal.
It turns out that this is a very common practice whenever you have a positive dataset that is skewed in the way we noticed above.
As a matter of fact, there is a distribution called the Log-Normal distribution which is designed to do exactly that.
Below, I show how you could have fitted a Log-Normal distribution directly on the `hvac` data.

```
params = st.lognorm.fit(hvac_a5) # Fits the parameters by maximum likelihood (similar in spirit to the method of moments)
HVAC_a5_ln = st.lognorm(*params) # This is the random variable
# Here is how you can sample from it:
HVAC_a5_ln.rvs(size=10)
```

```
# Here is how you can evaluate its PDF:
HVAC_a5_ln.pdf(200)
```
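It may help to know how the parameters returned by `st.lognorm.fit` relate to the Normal fitted on the log-data. SciPy's `lognorm` has a shape parameter `s`, a `loc`, and a `scale`. When `loc` is zero, `s` is the standard deviation of the logarithm and `scale` is the exponential of its mean. The sketch below checks this on synthetic data. Fixing `floc=0` is an assumption made for the illustration; the call above leaves `loc` free.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)
# Synthetic Log-Normal data (hypothetical, not the hvac data)
data = rng.lognormal(mean=2.0, sigma=0.5, size=5000)

# Fix loc at zero so that only the shape and the scale are fitted
s, loc, scale = st.lognorm.fit(data, floc=0)

log_data = np.log(data)
print(f"s          = {s:.4f}   vs std of log-data  = {np.std(log_data):.4f}")
print(f"log(scale) = {np.log(scale):.4f}   vs mean of log-data = {np.mean(log_data):.4f}")
```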

In the same figure, plot the PDF of the Log-Normal along with the histogram of the `hvac` data for `a5`.

```
# Your code here - Plot the PDF of HVAC_a5_ln along with the histogram of hvac_a5.
```

Is the fit with Log-Normal good?

*Your answer here.*

Recall how in "Quantiles of the standard Normal" we used the `ppf()` function of a random variable to find quantiles. Use this function to find the 0.025-quantile of the Log-Normal we constructed above.

```
# Your code here
```

Now find the 95% central credible interval of the Log-Normal variable we constructed above.

```
# Your code here
```
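As a sketch of how quantiles and central credible intervals work for a skewed distribution, the block below uses a Log-Normal with made-up parameters. The values `s=0.5` and `scale=np.exp(1.0)` are hypothetical and are not fitted to any data.

```python
import numpy as np
import scipy.stats as st

# Hypothetical Log-Normal: log(Y) is Normal with mean 1.0 and standard deviation 0.5
Y = st.lognorm(s=0.5, scale=np.exp(1.0))

q_low = Y.ppf(0.025)    # 0.025-quantile
q_high = Y.ppf(0.975)   # 0.975-quantile
print(f"95% central credible interval: [{q_low:.3f}, {q_high:.3f}]")

# The interval is not symmetric around the median, because the distribution is skewed
print(f"median = {Y.median():.3f}")

# scipy offers the same computation in a single call
print(Y.interval(0.95))
```

The same two `ppf()` calls apply to any frozen `scipy.stats` random variable, including one built from fitted parameters.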

Repeat as many code-blocks as you need to fit a Log-Normal to the `hvac` data for household `a3` and then find the 95% central credible interval of the resulting random variable.

```
# Your code here
```

By comparing the 95% central credible intervals constructed above, which household, `a5` or `a3`, consumes more energy?

*Your answer here.*