Homework 3

  • Type your name and email in the “Student details” section below.

  • Develop the code and generate the figures you need to solve the problems using this notebook.

  • For the answers that require a mathematical proof or derivation you can either:

    • Type the answer using the built-in LaTeX capabilities. In this case, simply export the notebook as a PDF and upload it on Gradescope; or

    • You can print the notebook (after you are done with all the code), write your answers by hand, scan them, combine your responses into a single PDF, and upload it on Gradescope.

  • The total homework points are 100. Please note that the problems are not weighted equally.


  • This is due before the beginning of the next lecture.

  • Please match all the pages corresponding to each of the questions when you submit on Gradescope.

Student details

  • First Name:

  • Last Name:

  • Email:

Problem 1 - High-performance buildings

In this problem we are going to use the pandas library to analyze a real dataset collected as part of the NSF-funded project Sociotechnical Systems to Enable Energy-Aware Residential Communities. You can find the raw dataset here:


Feel free to download it and open it up in Excel.

Uploading data files to Google Colab

The first thing we are going to do in this problem is make this file accessible from this notebook. We could simply use the download() function that I introduced in the lecture. I will put the code at the very last block for your convenience. However, I still want to show you how you can upload your own data. Here we go.

Go to your Google Drive. We need to select a folder to put the files. To keep things simple, let’s just dump everything in “My Drive/Colab Notebooks,” i.e., the same folder that contains the copy of this Jupyter notebook, which you should have already made. Once you have entered this folder in your Google Drive (just double-click on it), drag and drop the temperature_raw.xlsx file in there.

Now we need to make the Google Drive visible from this computational session. We do this by mounting the drive. You need to run this code and follow the instructions:

# The following code does not run unless you are running the notebook on Google Colab
from google.colab import drive
drive.mount('/content/drive')

Finally, change directories so that the data files are in the current working directory:

# Only if you are running on Google Colab...
# Print working directory before changing
%pwd
# Change to the desired directory on Google drive
%cd "/content/drive/My Drive/Colab Notebooks"
# Print working directory after changing
%pwd
# List the contents of the directory - Make sure it does contain the files we want:
%ls

Now you should have access to the file and you can skip the next code block. If you tried and something went wrong, simply run the code block below and it will put the data file in the right location for you.

# Only run this if the steps above failed for some reason.
import requests
import os

def download(url, local_filename=None):
    """Downloads the file in the ``url`` and saves it in the current working directory."""
    data = requests.get(url)
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)

# The url of the file we want to download
url = 'https://github.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/raw/master/data/temperature_raw.xlsx'
download(url)

Loading Excel files

Use the pd.read_excel function to read the temperature_raw.xlsx file. Name the data frame you read df.

import pandas as pd
df = # your code here


Now that you have access to the data frame, let’s explore it. Answer the following questions.

  • Print the first 10 rows of the data frame df:

# Your code here
  • Print the last 10 rows of the data frame df:

# Your code here
  • Print summary statistics of the data set.

# Your code here
  • Notice that there is a column called “date”, but pandas is not aware that this is actually a date. Let’s make it aware of that.

df.date = pd.to_datetime(df['date'], format='%Y-%m-%d')
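As a quick aside, here is a minimal sketch of what pd.to_datetime does, using made-up dates rather than the actual dataset:

```python
import pandas as pd

# A toy frame with dates stored as plain strings (hypothetical values)
toy = pd.DataFrame({'date': ['2020-01-01', '2020-01-02'], 'temp': [70.1, 68.5]})
print(toy['date'].dtype)   # object (i.e., strings)

# Convert the column to proper datetime64 values
toy['date'] = pd.to_datetime(toy['date'], format='%Y-%m-%d')
print(toy['date'].dtype)   # datetime64[ns]

# Now datetime attributes become available through the .dt accessor
print(toy['date'].dt.year.tolist())  # [2020, 2020]
```

Once the conversion is done, pandas can sort, filter, and group by actual dates instead of treating them as opaque text.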
  • Count how many NaN values you have in each column. Hint: Google it!

# your code here
  • Clean the data set by dropping all NaN values. Call the cleaned data df_clean.

df_clean = # your code here
  • Verify that there are no NaN values in df_clean.

# your code here
  • How many unique households do we have and what are their unique names?

  • Hint: Google it!

# your code here
  • Print summary statistics for your cleaned dataset.

# your code here
  • Save your cleaned dataset in a csv file using df_clean.to_csv. Use the name temperature_clean.csv.

# your code here
  • Verify that the file exists by running the following lines of code (does not work on Windows):

!ls -lth temperature_clean.csv
!head temperature_clean.csv
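If you are on Windows (or shell commands are unavailable), a pure-Python sketch achieves the same check. The helper name peek here is my own invention, not a pandas or standard-library function:

```python
import os
from itertools import islice

def peek(filename, n=5):
    """Return the first ``n`` lines of a text file (fewer if the file is short)."""
    with open(filename) as f:
        return [line.rstrip('\n') for line in islice(f, n)]

# Only attempt this if the file was actually written above
if os.path.exists('temperature_clean.csv'):
    print(os.path.getsize('temperature_clean.csv'), 'bytes')
    for line in peek('temperature_clean.csv', 3):
        print(line)
```

This mirrors what `!ls -lth` (file size) and `!head` (first lines) show, but works on any platform.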
  • Save the cleaned dataset in Microsoft Excel format. Look at the pandas.DataFrame documentation for the right function. Call the file temperature_clean.xlsx.

# your code here
  • Download the newly created temperature_clean.xlsx file to your computer and open it up. There is no need to show anything here. Just do it for your own education. If you are running the Jupyter notebook on your own computer, there is nothing to do apart from finding the file. If you are working on Google Colab, see this.

# your code here (if needed)

Problem 2 - Chemical reactions

In this problem, we are going to make a new DataFrame from a matrix of data. We will use the catalysis dataset we introduced in the lecture. As a reminder, this dataset is from a chemical reaction studied in [Katsounaros et al., 2012]. The chemical reaction converts NO3 to N2. We have seven measurements at thirty second intervals. The masses are in grams. We saw in the lecture that the mass of the measured chemical species is not conserved, which suggests that there is at least one more thing we do not measure. The goal of this problem is to create a dataframe with the data and then add a column to it with our guess (prediction) for the mass of the unobserved chemical species.

  • First, let’s make a list with the column names. The names are “Time, NO3, NO2, N2, NH3, N2O.” Make the list and call it column_names. Let me make part of the list and you complete the rest.

column_names = ['Time', 'NO3'] # Complete the rest of the list - Pay attention to
                               # the fact that the column names are strings.
  • Now, let’s put the measured data in a numpy array. Remember that these are grams. The data are:

# There is no point running this code block - I just wanted to put the raw data here

Let me make part of the matrix for you, and you should complete the rest:

import numpy as np

raw_data = np.array(
        # your code here - be careful with the commas

Remember that your matrix must have six columns and seven rows.
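To make sure you got the nesting right, here is a minimal sketch with made-up numbers (a 2×3 matrix, not the catalysis data) showing how np.array treats inner lists as rows and how to check the result with .shape:

```python
import numpy as np

# Each inner list is one row; three entries per row gives three columns
toy = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
])
print(toy.shape)  # (2, 3): two rows, three columns
```

For the homework matrix you would expect raw_data.shape to come out as (7, 6): seven timesteps, six columns.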

  • Let’s put the data so far in a dataframe.

import pandas as pd
df = pd.DataFrame(data= # your code here - the raw data in matrix format
                  columns= # your code here - the column names
  • In the lecture notes, we saw that the sum of the masses at each timestep was not conserved. Let’s calculate how much mass is missing at each timestep. I will guide you through the steps. First, write some code to extract only the masses from the raw data. Get all columns of raw_data except the first one:

mass_data = # your code here
  • Now, write code that sums all the columns for each row:

total_measured_mass = # your code here
  • Now, find how much is the missing mass of the unobserved chemical species X by subtracting the total measured mass from the initial mass (which was 500 grams).

mass_of_X = # your code here

Very nice! You just made your first model! Your model is: \( \text{mass of X} = 500\;\text{grams} - \text{mass of NO3} - \dots - \text{mass of N2O}. \) Notice that we did not use just data to build this model. We used a physical principle (conservation of mass) plus data! It is a special kind of model. We will learn more about models in later lectures!
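The bookkeeping behind this model can be sketched with made-up numbers (two timesteps, three species, not the real measurements):

```python
import numpy as np

# Toy measured masses in grams: 2 timesteps x 3 species (hypothetical values)
toy_mass_data = np.array([
    [400.0, 50.0, 20.0],
    [300.0, 90.0, 60.0],
])

# Sum the species columns within each row (axis=1): total measured per timestep
toy_total = toy_mass_data.sum(axis=1)
print(toy_total)        # [470. 450.]

# Conservation of mass: whatever is missing from the initial 500 g is X
toy_mass_of_X = 500.0 - toy_total
print(toy_mass_of_X)    # [30. 50.]
```

Note the broadcasting in the last step: subtracting an array from the scalar 500.0 applies the subtraction to every timestep at once.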

  • Let’s now make a new dataframe that also contains X. We need to add a column to our original dataframe. This is super simple. You can do it like this:

df['X'] = # your code here
  • Now, save the dataframe to an Excel file.

# your code here