The Python data analysis library

Special thanks to Vanessa Kwarteng for putting this Jupyter notebook together.

Pandas introduction

Pandas is a Python package that provides fast, flexible data structures that let you work with real-life data quickly. More information about the pandas package can be found here

In order to use pandas in Python we need to import the library into the notebook, as shown in the following cell.

import pandas as pd

Loading files

If we want to look at data we have collected for a project or experiment, we need to load that data into the Jupyter notebook. To do this we can import the file using a pandas command. Pandas is able to read various types of files; in this activity we will use pandas to read a CSV (comma-separated values) file.

To import the CSV file in Python we use the command pandas.read_csv. In Python there are some common shorthands we can use; pandas is typically shortened to pd. So we will use pd.read_csv to import the data file that we will work with.
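As a quick self-contained illustration (using an in-memory CSV string instead of a file, with made-up column values), pd.read_csv turns comma-separated text into a DataFrame:

```python
import io

import pandas as pd

# A tiny CSV in a string; read_csv accepts file paths or file-like objects
csv_text = "household,hvac\na1,35.1\na2,63.9"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (2, 2): two rows, two columns
print(list(df.columns))  # ['household', 'hvac']
```

Later in this notebook we pass a file name instead of a string, but the parsing works the same way.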

Dataset

Scenario: A researcher wants to gain some insight about energy usage in apartment homes in a community and provide a grade for each apartment's energy usage. The dataset temp_price.csv that you are given contains HVAC energy usage data collected from 50 apartment homes in one day:

  • The apartment ID number (household)

  • The outdoor temperature for the day (t_out)

  • The indoor temperature of the apartment for the day (t_unit)

  • The amount of kwh consumed by the HVAC system for the day (hvac)

  • The price per kwh used (price)

  • The price per week for using the hvac (Price per week)

  • The price per day for using the hvac (Price per day)

Let’s import the data to explore. This dataset is small, so you will be able to see the changes you are making and become comfortable with cleaning data. In the future you will work with larger datasets.

The dataset can be accessed at this link:

https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/activities/temp_price.csv

If you run the book content on your own machine, then all you have to do is download this file and put it in the same directory as the notebook. If you work on Google Colab, things are a bit more complicated. We will show you how to do it later in the homework problems. For now, let us just sprinkle some magic code that downloads the file and puts it in a spot that Google Colab can see:

import requests
import os
def download(url, local_filename=None):
    """
    Downloads the file in the ``url`` and saves it in the current working directory.
    """
    data = requests.get(url)
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)
   
# The url of the file we want to download
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/temp_price.csv'
download(url)

Now you should have access to the temp_price.csv file from within this notebook and the following code should work:

temp_price = pd.read_csv('temp_price.csv')

Let’s see the contents:

temp_price
household date score t_out t_unit hvac price Price per week Price per day
0 a1 2019-01-06 85 38.599231 71.580704 35.113758 0.17303 6.075734 0.867962
1 a10 2019-01-06 70 38.599231 73.286260 63.949057 0.17303 11.065105 1.580729
2 a11 2019-01-06 61 38.599231 74.252046 147.612108 0.17303 25.541323 3.648760
3 a12 2019-01-06 65 38.599231 73.708482 74.394518 0.17303 12.872483 1.838926
4 a13 2019-01-06 66 38.599231 73.549554 173.095836 0.17303 29.950772 4.278682
5 a14 2019-01-06 84 38.599231 72.346457 7.040034 0.17303 1.218137 0.174020
6 a15 2019-01-06 1 38.599231 81.478472 639.959855 0.17303 110.732254 15.818893
7 a16 2019-01-06 25 38.599231 76.784504 616.805501 0.17303 106.725856 15.246551
8 a2 2019-01-06 74 38.599231 73.160367 322.648083 0.17303 55.827798 7.975400
9 a3 2019-01-06 55 38.599231 74.025149 95.139840 0.17303 16.462047 2.351721
10 a4 2019-01-06 99 38.599231 71.391270 0.000000 0.17303 0.000000 0.000000
11 a5 2019-01-06 78 38.599231 76.724752 70.942532 0.17303 12.275186 1.753598
12 a6 2019-01-06 44 38.599231 74.787475 67.350034 0.17303 11.653576 1.664797
13 a7 2019-01-06 62 38.599231 76.071354 71.177988 0.17303 12.315927 1.759418
14 a8 2019-01-06 65 38.599231 73.935665 50.812378 0.17303 8.792066 1.256009
15 a9 2019-01-06 51 38.599231 75.807242 108.091675 0.17303 18.703103 2.671872
16 b17 2019-01-06 66 38.599231 72.950992 130.509717 0.17303 22.582096 3.226014
17 b18 2019-01-06 90 38.599231 70.947520 60.078379 0.17303 10.395362 1.485052
18 b19 2019-01-06 30 38.599231 74.636062 0.000000 0.17303 0.000000 0.000000
19 b20 2019-01-06 93 38.599231 75.286905 21.320244 0.17303 3.689042 0.527006
20 b21 2019-01-06 16 38.599231 75.749281 0.000000 0.17303 0.000000 0.000000
21 b22 2019-01-06 54 38.599231 78.609127 93.928528 0.17303 16.252453 2.321779
22 b23 2019-01-06 42 38.599231 78.693750 66.335020 0.17303 11.477949 1.639707
23 b24 2019-01-06 89 38.599231 73.677604 68.989156 0.17303 11.937194 1.705313
24 b25 2019-01-06 45 38.599231 77.909474 101.537204 0.17303 17.568982 2.509855
25 b26 2019-01-06 57 38.599231 75.206448 122.327846 0.17303 21.166387 3.023770
26 b27 2019-01-06 -100 38.599231 -100.000000 NaN 0.17303 0.000000 NaN
27 b28 2019-01-06 48 38.599231 71.536830 39.719913 0.17303 6.872737 0.981820
28 b29 2019-01-06 82 38.599231 72.493794 73.090187 0.17303 12.646795 1.806685
29 b30 2019-01-06 83 38.599231 77.639881 17.972378 0.17303 3.109761 0.444252
30 b31 2019-01-06 95 38.599231 70.905208 55.940869 0.17303 9.679448 1.382778
31 b32 2019-01-06 -100 38.599231 -100.000000 NaN 0.17303 0.000000 NaN
32 b33 2019-01-06 85 38.599231 72.888294 35.809398 0.17303 6.196100 0.885157
33 c34 2019-01-06 -100 38.599231 -100.000000 NaN 0.17303 0.000000 NaN
34 c35 2019-01-06 92 38.599231 70.424206 63.946396 0.17303 11.064645 1.580664
35 c36 2019-01-06 47 38.599231 76.082391 104.719999 0.17303 18.119701 2.588529
36 c37 2019-01-06 58 38.599231 74.256548 41.795780 0.17303 7.231924 1.033132
37 c38 2019-01-06 38 38.599231 74.836905 63.938661 0.17303 11.063306 1.580472
38 c39 2019-01-06 56 38.599231 73.872272 68.115254 0.17303 11.785982 1.683712
39 c40 2019-01-06 66 38.599231 75.579861 114.096389 0.17303 19.742098 2.820300
40 c41 2019-01-06 90 38.599231 69.140749 0.000000 0.17303 0.000000 0.000000
41 c42 2019-01-06 -100 38.599231 -100.000000 NaN 0.17303 0.000000 NaN
42 c43 2019-01-06 38 38.599231 78.054390 136.530552 0.17303 23.623881 3.374840
43 c44 2019-01-06 28 38.599231 78.211012 170.058540 0.17303 29.425229 4.203604
44 c45 2019-01-06 66 38.599231 74.781052 197.492980 0.17303 34.172210 4.881744
45 c46 2019-01-06 76 38.599231 72.913740 124.591415 0.17303 21.558053 3.079722
46 c47 2019-01-06 74 38.599231 72.846354 84.631154 0.17303 14.643729 2.091961
47 c48 2019-01-06 53 38.599231 74.623562 279.324525 0.17303 48.331523 6.904503
48 c49 2019-01-06 71 38.599231 72.767163 51.483217 0.17303 8.908141 1.272592
49 c50 2019-01-06 52 38.599231 75.673562 87.482236 0.17303 15.137051 2.162436

Notice that the format is much nicer than a numpy array.

Observations from the dataset so far

The very first column is the index of each entry. Notice how Python counts the rows: it starts with 0 and not 1. This is different from other programming languages that start counting from 1. Then you have columns with various names. The first column is the “household” id, then the “date,” etc.

Let’s continue to explore the data. The following command gives you a summary of the statistics of the data:

temp_price.describe()
score t_out t_unit hvac price Price per week Price per day
count 50.000000 5.000000e+01 50.000000 46.000000 5.000000e+01 50.000000 46.000000
mean 49.300000 3.859923e+01 60.521694 109.128242 1.730300e-01 17.371863 2.697494
std 49.351405 2.871026e-14 47.875493 129.874146 8.411210e-17 22.148354 3.210303
min -100.000000 3.859923e+01 -100.000000 0.000000 1.730300e-01 0.000000 0.000000
25% 44.250000 3.859923e+01 72.562136 50.980088 1.730300e-01 6.365259 1.260155
50% 61.500000 3.859923e+01 73.980407 71.060260 1.730300e-01 11.861588 1.756508
75% 77.500000 3.859923e+01 75.730351 120.269982 1.730300e-01 19.482349 2.972902
max 99.000000 3.859923e+01 81.478472 639.959855 1.730300e-01 110.732254 15.818893

The statistics are shown for each column and they are calculated over rows. min and max are obvious. mean is the average and std is the standard deviation. We will learn about those in a few lectures. The entries 25%, 50%, and 75% are the so-called “quantiles.” These are also statistical quantities about which we are going to learn in a few lectures. count is simply the number of observations per column. Notice that it is 50 for most columns. But for some columns, e.g., hvac, some entries are missing.
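To see how count, mean, min and max behave, here is a minimal sketch on a toy column (the numbers are made up, not from temp_price). Note how count skips the missing value:

```python
import pandas as pd

# One column with three numbers and one missing entry
toy = pd.DataFrame({"hvac": [10.0, 20.0, 30.0, None]})

summary = toy["hvac"].describe()
print(summary["count"])                # 3.0 -- missing values are not counted
print(summary["mean"])                 # 20.0 -- average of the three present values
print(summary["min"], summary["max"])  # 10.0 30.0
```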

Printing specific ranges

In this next section we look at different ways to quickly get some insight about the data and also how to look at specific columns, rows, etc.

If we want to look at specific locations in the data we can do the following:

  • head() - the default displays the first 5 rows

  • tail() - the default displays the last 5 rows

The way you can think about the syntax when using pandas is Dataframe_name.action_to_be_done(params_if_needed)

temp_price.head()
household date score t_out t_unit hvac price Price per week Price per day
0 a1 2019-01-06 85 38.599231 71.580704 35.113758 0.17303 6.075734 0.867962
1 a10 2019-01-06 70 38.599231 73.286260 63.949057 0.17303 11.065105 1.580729
2 a11 2019-01-06 61 38.599231 74.252046 147.612108 0.17303 25.541323 3.648760
3 a12 2019-01-06 65 38.599231 73.708482 74.394518 0.17303 12.872483 1.838926
4 a13 2019-01-06 66 38.599231 73.549554 173.095836 0.17303 29.950772 4.278682
# This will print the first 10 rows
temp_price.head(10)
household date score t_out t_unit hvac price Price per week Price per day
0 a1 2019-01-06 85 38.599231 71.580704 35.113758 0.17303 6.075734 0.867962
1 a10 2019-01-06 70 38.599231 73.286260 63.949057 0.17303 11.065105 1.580729
2 a11 2019-01-06 61 38.599231 74.252046 147.612108 0.17303 25.541323 3.648760
3 a12 2019-01-06 65 38.599231 73.708482 74.394518 0.17303 12.872483 1.838926
4 a13 2019-01-06 66 38.599231 73.549554 173.095836 0.17303 29.950772 4.278682
5 a14 2019-01-06 84 38.599231 72.346457 7.040034 0.17303 1.218137 0.174020
6 a15 2019-01-06 1 38.599231 81.478472 639.959855 0.17303 110.732254 15.818893
7 a16 2019-01-06 25 38.599231 76.784504 616.805501 0.17303 106.725856 15.246551
8 a2 2019-01-06 74 38.599231 73.160367 322.648083 0.17303 55.827798 7.975400
9 a3 2019-01-06 55 38.599231 74.025149 95.139840 0.17303 16.462047 2.351721
# This prints the last 5 rows
temp_price.tail()
household date score t_out t_unit hvac price Price per week Price per day
45 c46 2019-01-06 76 38.599231 72.913740 124.591415 0.17303 21.558053 3.079722
46 c47 2019-01-06 74 38.599231 72.846354 84.631154 0.17303 14.643729 2.091961
47 c48 2019-01-06 53 38.599231 74.623562 279.324525 0.17303 48.331523 6.904503
48 c49 2019-01-06 71 38.599231 72.767163 51.483217 0.17303 8.908141 1.272592
49 c50 2019-01-06 52 38.599231 75.673562 87.482236 0.17303 15.137051 2.162436
# If you wanted to print the first 7 rows (instead of the first 5) you would do this:
temp_price.head(7)
household date score t_out t_unit hvac price Price per week Price per day
0 a1 2019-01-06 85 38.599231 71.580704 35.113758 0.17303 6.075734 0.867962
1 a10 2019-01-06 70 38.599231 73.286260 63.949057 0.17303 11.065105 1.580729
2 a11 2019-01-06 61 38.599231 74.252046 147.612108 0.17303 25.541323 3.648760
3 a12 2019-01-06 65 38.599231 73.708482 74.394518 0.17303 12.872483 1.838926
4 a13 2019-01-06 66 38.599231 73.549554 173.095836 0.17303 29.950772 4.278682
5 a14 2019-01-06 84 38.599231 72.346457 7.040034 0.17303 1.218137 0.174020
6 a15 2019-01-06 1 38.599231 81.478472 639.959855 0.17303 110.732254 15.818893
# Let's print the last 12 rows, what would you do?
temp_price.tail()
household date score t_out t_unit hvac price Price per week Price per day
45 c46 2019-01-06 76 38.599231 72.913740 124.591415 0.17303 21.558053 3.079722
46 c47 2019-01-06 74 38.599231 72.846354 84.631154 0.17303 14.643729 2.091961
47 c48 2019-01-06 53 38.599231 74.623562 279.324525 0.17303 48.331523 6.904503
48 c49 2019-01-06 71 38.599231 72.767163 51.483217 0.17303 8.908141 1.272592
49 c50 2019-01-06 52 38.599231 75.673562 87.482236 0.17303 15.137051 2.162436

Cleaning the data

In the temp_price file you notice that some cells have -100 under the score column, -100.0 under t_unit, or a NaN value for hvac. There are various reasons for this, but the details are not important here. We will focus on the NaN.

When you see NaN this means that the value is Not a Number; this is how pandas displays a missing entry. For this example, when we see NaN that means that this unit is missing over 70% of its data for the week. This most likely means that the sensor for the unit needs to be replaced.

Having these -100, -100.0 and NaN values can mess with the interpretation of our analysis, so we want to clean the data to remove the corresponding observations.
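Note that dropna() only detects NaN; a sentinel like -100 is an ordinary number as far as pandas is concerned. One common approach, sketched here on a toy frame (not part of this activity's required steps), is to first replace the sentinel with NaN and then drop:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"score": [85, -100, 61],
                    "hvac": [35.1, np.nan, 147.6]})

# Turn the -100 sentinel into a proper missing value, then drop incomplete rows
cleaned = toy.replace(-100, np.nan).dropna(axis=0)
print(cleaned.shape)  # (2, 2) -- the bad middle row is gone
```

In our dataset the -100 rows happen to also have NaN in hvac, so dropna() alone removes them, as we will see below.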

To do this we will use dropna() to remove the rows that have NaN. To get a better understanding of how dropna() works we can look at the documentation for it: pandas dropna docs

  • Look at the documentation to see what the axis needs to be in order to remove rows vs. columns

temp_no_null_rows = temp_price.dropna(axis=0)
# Print data with no NaN rows
temp_no_null_rows
household date score t_out t_unit hvac price Price per week Price per day
0 a1 2019-01-06 85 38.599231 71.580704 35.113758 0.17303 6.075734 0.867962
1 a10 2019-01-06 70 38.599231 73.286260 63.949057 0.17303 11.065105 1.580729
2 a11 2019-01-06 61 38.599231 74.252046 147.612108 0.17303 25.541323 3.648760
3 a12 2019-01-06 65 38.599231 73.708482 74.394518 0.17303 12.872483 1.838926
4 a13 2019-01-06 66 38.599231 73.549554 173.095836 0.17303 29.950772 4.278682
5 a14 2019-01-06 84 38.599231 72.346457 7.040034 0.17303 1.218137 0.174020
6 a15 2019-01-06 1 38.599231 81.478472 639.959855 0.17303 110.732254 15.818893
7 a16 2019-01-06 25 38.599231 76.784504 616.805501 0.17303 106.725856 15.246551
8 a2 2019-01-06 74 38.599231 73.160367 322.648083 0.17303 55.827798 7.975400
9 a3 2019-01-06 55 38.599231 74.025149 95.139840 0.17303 16.462047 2.351721
10 a4 2019-01-06 99 38.599231 71.391270 0.000000 0.17303 0.000000 0.000000
11 a5 2019-01-06 78 38.599231 76.724752 70.942532 0.17303 12.275186 1.753598
12 a6 2019-01-06 44 38.599231 74.787475 67.350034 0.17303 11.653576 1.664797
13 a7 2019-01-06 62 38.599231 76.071354 71.177988 0.17303 12.315927 1.759418
14 a8 2019-01-06 65 38.599231 73.935665 50.812378 0.17303 8.792066 1.256009
15 a9 2019-01-06 51 38.599231 75.807242 108.091675 0.17303 18.703103 2.671872
16 b17 2019-01-06 66 38.599231 72.950992 130.509717 0.17303 22.582096 3.226014
17 b18 2019-01-06 90 38.599231 70.947520 60.078379 0.17303 10.395362 1.485052
18 b19 2019-01-06 30 38.599231 74.636062 0.000000 0.17303 0.000000 0.000000
19 b20 2019-01-06 93 38.599231 75.286905 21.320244 0.17303 3.689042 0.527006
20 b21 2019-01-06 16 38.599231 75.749281 0.000000 0.17303 0.000000 0.000000
21 b22 2019-01-06 54 38.599231 78.609127 93.928528 0.17303 16.252453 2.321779
22 b23 2019-01-06 42 38.599231 78.693750 66.335020 0.17303 11.477949 1.639707
23 b24 2019-01-06 89 38.599231 73.677604 68.989156 0.17303 11.937194 1.705313
24 b25 2019-01-06 45 38.599231 77.909474 101.537204 0.17303 17.568982 2.509855
25 b26 2019-01-06 57 38.599231 75.206448 122.327846 0.17303 21.166387 3.023770
27 b28 2019-01-06 48 38.599231 71.536830 39.719913 0.17303 6.872737 0.981820
28 b29 2019-01-06 82 38.599231 72.493794 73.090187 0.17303 12.646795 1.806685
29 b30 2019-01-06 83 38.599231 77.639881 17.972378 0.17303 3.109761 0.444252
30 b31 2019-01-06 95 38.599231 70.905208 55.940869 0.17303 9.679448 1.382778
32 b33 2019-01-06 85 38.599231 72.888294 35.809398 0.17303 6.196100 0.885157
34 c35 2019-01-06 92 38.599231 70.424206 63.946396 0.17303 11.064645 1.580664
35 c36 2019-01-06 47 38.599231 76.082391 104.719999 0.17303 18.119701 2.588529
36 c37 2019-01-06 58 38.599231 74.256548 41.795780 0.17303 7.231924 1.033132
37 c38 2019-01-06 38 38.599231 74.836905 63.938661 0.17303 11.063306 1.580472
38 c39 2019-01-06 56 38.599231 73.872272 68.115254 0.17303 11.785982 1.683712
39 c40 2019-01-06 66 38.599231 75.579861 114.096389 0.17303 19.742098 2.820300
40 c41 2019-01-06 90 38.599231 69.140749 0.000000 0.17303 0.000000 0.000000
42 c43 2019-01-06 38 38.599231 78.054390 136.530552 0.17303 23.623881 3.374840
43 c44 2019-01-06 28 38.599231 78.211012 170.058540 0.17303 29.425229 4.203604
44 c45 2019-01-06 66 38.599231 74.781052 197.492980 0.17303 34.172210 4.881744
45 c46 2019-01-06 76 38.599231 72.913740 124.591415 0.17303 21.558053 3.079722
46 c47 2019-01-06 74 38.599231 72.846354 84.631154 0.17303 14.643729 2.091961
47 c48 2019-01-06 53 38.599231 74.623562 279.324525 0.17303 48.331523 6.904503
48 c49 2019-01-06 71 38.599231 72.767163 51.483217 0.17303 8.908141 1.272592
49 c50 2019-01-06 52 38.599231 75.673562 87.482236 0.17303 15.137051 2.162436

What do you notice?

  • Do you see any NaN or -100 values in the dataset?

  • Do we still have the same number of indices?

We can use DataFrame.shape to compare the table before and after removing the NaN rows, to double check.

# shape of data after cleaning
temp_no_null_rows.shape
(46, 9)
# shape of data before cleaning
temp_price.shape
(50, 9)

The shape attribute gives the number of rows and columns of the dataset.

Note

When dealing with columns, you typically want to keep everything lowercase and remove spaces. This makes it easier to type the column names later on. So, to rename an existing column in the dataframe we can do the following: df.rename(columns={'current_name' : 'new_name'}, inplace=True)

Let’s rename Price per week to week_price and Price per day to daily_price:

# Changing the title of columns
clean_data = temp_no_null_rows.rename(columns={'Price per week': 'week_price',
                                               'Price per day': 'daily_price'})
# See that the column names have changed
clean_data.head()
household date score t_out t_unit hvac price week_price daily_price
0 a1 2019-01-06 85 38.599231 71.580704 35.113758 0.17303 6.075734 0.867962
1 a10 2019-01-06 70 38.599231 73.286260 63.949057 0.17303 11.065105 1.580729
2 a11 2019-01-06 61 38.599231 74.252046 147.612108 0.17303 25.541323 3.648760
3 a12 2019-01-06 65 38.599231 73.708482 74.394518 0.17303 12.872483 1.838926
4 a13 2019-01-06 66 38.599231 73.549554 173.095836 0.17303 29.950772 4.278682

Now, you may be wondering what exactly is:

columns={'Price per week': 'week_price',
         'Price per day': 'daily_price'}

This defines a dictionary. It is a special Python type called dict. See Sec. 5.5 here. But let’s also run it separately:

columns={'Price per week': 'week_price',
         'Price per day': 'daily_price'}
columns
{'Price per week': 'week_price', 'Price per day': 'daily_price'}
type(columns)
dict
columns['Price per week']
'week_price'
columns['Price per day']
'daily_price'
columns.keys()
dict_keys(['Price per week', 'Price per day'])

So, rename looks at the dictionary for keys that are column names and then replaces the corresponding column name with whatever the dictionary indicates.
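One detail worth noting: the Note above used inplace=True, while the cell we just ran used assignment instead. Here is a small sketch (with a made-up column) of the difference:

```python
import pandas as pd

df = pd.DataFrame({"Price per day": [1.0, 2.0]})

# Option 1: assignment -- rename returns a NEW frame; df itself is unchanged
renamed = df.rename(columns={"Price per day": "daily_price"})
print(list(renamed.columns))  # ['daily_price']
print(list(df.columns))       # ['Price per day']

# Option 2: inplace=True -- df itself is modified and rename returns None
df.rename(columns={"Price per day": "daily_price"}, inplace=True)
print(list(df.columns))       # ['daily_price']
```

Assignment is generally preferred because it keeps the original dataframe around in case you need it.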

Now that we have clean data, we can explore the data set with the following actions:

  • df['column_name'] - to select a single column from the dataframe

  • df[['column_name1', 'column_name7']] - to select a list of columns from the dataframe

  • df.loc[rowindex1] - to select a single row from the dataframe

  • df.loc[[rowindex1, rowindex2]] - to select a list of rows from the dataframe

Let’s give some examples.

Get the household column:

clean_data['household']
0      a1
1     a10
2     a11
3     a12
4     a13
5     a14
6     a15
7     a16
8      a2
9      a3
10     a4
11     a5
12     a6
13     a7
14     a8
15     a9
16    b17
17    b18
18    b19
19    b20
20    b21
21    b22
22    b23
23    b24
24    b25
25    b26
27    b28
28    b29
29    b30
30    b31
32    b33
34    c35
35    c36
36    c37
37    c38
38    c39
39    c40
40    c41
42    c43
43    c44
44    c45
45    c46
46    c47
47    c48
48    c49
49    c50
Name: household, dtype: object

Select multiple columns:

clean_data[['household', 'daily_price']]
household daily_price
0 a1 0.867962
1 a10 1.580729
2 a11 3.648760
3 a12 1.838926
4 a13 4.278682
5 a14 0.174020
6 a15 15.818893
7 a16 15.246551
8 a2 7.975400
9 a3 2.351721
10 a4 0.000000
11 a5 1.753598
12 a6 1.664797
13 a7 1.759418
14 a8 1.256009
15 a9 2.671872
16 b17 3.226014
17 b18 1.485052
18 b19 0.000000
19 b20 0.527006
20 b21 0.000000
21 b22 2.321779
22 b23 1.639707
23 b24 1.705313
24 b25 2.509855
25 b26 3.023770
27 b28 0.981820
28 b29 1.806685
29 b30 0.444252
30 b31 1.382778
32 b33 0.885157
34 c35 1.580664
35 c36 2.588529
36 c37 1.033132
37 c38 1.580472
38 c39 1.683712
39 c40 2.820300
40 c41 0.000000
42 c43 3.374840
43 c44 4.203604
44 c45 4.881744
45 c46 3.079722
46 c47 2.091961
47 c48 6.904503
48 c49 1.272592
49 c50 2.162436

In the above example you are giving a list of column names as an index. This is why you have two sets of square brackets. The first is for indexing the data frame. The second is for the list of columns. Maybe this will help you understand what is going on:

column_list = ['household', 'daily_price']
clean_data[column_list]
household daily_price
0 a1 0.867962
1 a10 1.580729
2 a11 3.648760
3 a12 1.838926
4 a13 4.278682
5 a14 0.174020
6 a15 15.818893
7 a16 15.246551
8 a2 7.975400
9 a3 2.351721
10 a4 0.000000
11 a5 1.753598
12 a6 1.664797
13 a7 1.759418
14 a8 1.256009
15 a9 2.671872
16 b17 3.226014
17 b18 1.485052
18 b19 0.000000
19 b20 0.527006
20 b21 0.000000
21 b22 2.321779
22 b23 1.639707
23 b24 1.705313
24 b25 2.509855
25 b26 3.023770
27 b28 0.981820
28 b29 1.806685
29 b30 0.444252
30 b31 1.382778
32 b33 0.885157
34 c35 1.580664
35 c36 2.588529
36 c37 1.033132
37 c38 1.580472
38 c39 1.683712
39 c40 2.820300
40 c41 0.000000
42 c43 3.374840
43 c44 4.203604
44 c45 4.881744
45 c46 3.079722
46 c47 2.091961
47 c48 6.904503
48 c49 1.272592
49 c50 2.162436

Select a row:

clean_data.loc[30]
household             b31
date           2019-01-06
score                  95
t_out           38.599231
t_unit          70.905208
hvac            55.940869
price             0.17303
week_price       9.679448
daily_price      1.382778
Name: 30, dtype: object

Select certain rows:

clean_data.loc[[25,39, 45]]
household date score t_out t_unit hvac price week_price daily_price
25 b26 2019-01-06 57 38.599231 75.206448 122.327846 0.17303 21.166387 3.023770
39 c40 2019-01-06 66 38.599231 75.579861 114.096389 0.17303 19.742098 2.820300
45 c46 2019-01-06 76 38.599231 72.913740 124.591415 0.17303 21.558053 3.079722

Select a slice of rows:

clean_data.loc[25:36]
household date score t_out t_unit hvac price week_price daily_price
25 b26 2019-01-06 57 38.599231 75.206448 122.327846 0.17303 21.166387 3.023770
27 b28 2019-01-06 48 38.599231 71.536830 39.719913 0.17303 6.872737 0.981820
28 b29 2019-01-06 82 38.599231 72.493794 73.090187 0.17303 12.646795 1.806685
29 b30 2019-01-06 83 38.599231 77.639881 17.972378 0.17303 3.109761 0.444252
30 b31 2019-01-06 95 38.599231 70.905208 55.940869 0.17303 9.679448 1.382778
32 b33 2019-01-06 85 38.599231 72.888294 35.809398 0.17303 6.196100 0.885157
34 c35 2019-01-06 92 38.599231 70.424206 63.946396 0.17303 11.064645 1.580664
35 c36 2019-01-06 47 38.599231 76.082391 104.719999 0.17303 18.119701 2.588529
36 c37 2019-01-06 58 38.599231 74.256548 41.795780 0.17303 7.231924 1.033132
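One thing to watch out for: unlike ordinary Python slicing, .loc slices by index label and includes both endpoints (which is why row 36 appears above). A small sketch of the difference, on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({"x": [10, 20, 30, 40]})

# .loc slices by label and includes BOTH ends: rows 1, 2 and 3
print(df.loc[1:3, "x"].tolist())   # [20, 30, 40]

# .iloc slices by position, Python-style, excluding the right end
print(df.iloc[1:3]["x"].tolist())  # [20, 30]
```

Also notice in the output above that rows dropped by dropna() (such as 26 and 31) simply do not appear; the slice covers whatever labels remain between 25 and 36.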

And there are many more things that you can easily do with pandas. There are so many that you could spend weeks on this library alone. Our suggestion is to learn pandas as needed when you experiment with data files. Typically, if you think about what you need to do, you can find how to do it in pandas with a good Google search.

Questions

  • Write pandas code that extracts the week_price column from clean_data.

# Your code here
  • Write pandas code that gives you the hvac and the price columns.

# Your code here