The Python data analysis library¶
Special thanks to Vanessa Kwarteng for putting this Jupyter notebook together.
Pandas introduction¶
Pandas is a Python package that provides fast, flexible data structures that let you work with real-life data quickly. More information about the pandas package can be found here.
In order to use pandas in Python, we need to import the library into the notebook, as shown in the following cell.
import pandas as pd
Loading files¶
If we want to look at data we have collected for a project or experiment, we need to load that data into the Jupyter notebook. To do this we can import the file using a pandas command. Pandas is able to read various types of files; in this activity we will be using pandas to read a csv (comma separated values) file.
To import the csv file in Python we use the command pandas.read_csv. In Python there are some common shorthands we can use: pandas is typically shortened to pd. So we will use pd.read_csv to import the data file that we will work with.
Dataset¶
Scenario: A researcher wants to gain some insight about energy usage in apartment homes in a community and provide a grade for each apartment's energy usage. The dataset temp_price.csv that you are given contains HVAC energy usage data collected from 50 apartment homes in one day. It includes:

- The apartment ID number (household)
- The outdoor temperature for the day (t_out)
- The indoor temperature of the apartment for the day (t_unit)
- The amount of kWh consumed by the HVAC system for the day (hvac)
- The price per kWh used (price)
- The price per week for using the HVAC (Price per week)
- The price per day for using the HVAC (Price per day)
Let’s import the data to explore. This dataset is small, so you will be able to see the changes you are making and become comfortable with cleaning data. In the future you will work with larger data sets.
The dataset can be accessed at this link: https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/temp_price.csv
If you run the book content on your PC, then all you have to do is download this file and put it in the same directory as the book. If you work on Google Colab, things are a bit more complicated. We will show you how to do it later in the homework problems. For now, let us just sprinkle some magic code that downloads the file and puts it in a spot that Google Colab can see:
import requests
import os

def download(url, local_filename=None):
    """
    Downloads the file in the ``url`` and saves it in the current working directory.
    """
    data = requests.get(url)
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)
# The url of the file we want to download
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/temp_price.csv'
download(url)
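If you want to double-check that the download worked, here is a minimal sanity check using only the standard library (not part of the original notebook):

import os

# The file should now exist in the current working directory
print(os.path.exists('temp_price.csv'))
print(os.path.getsize('temp_price.csv'), 'bytes')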
Now you should have access to the temp_price.csv file from within this notebook, and the following code should work:
temp_price = pd.read_csv('temp_price.csv')
Let’s see the contents:
temp_price
 | household | date | score | t_out | t_unit | hvac | price | Price per week | Price per day |
---|---|---|---|---|---|---|---|---|---|
0 | a1 | 2019-01-06 | 85 | 38.599231 | 71.580704 | 35.113758 | 0.17303 | 6.075734 | 0.867962 |
1 | a10 | 2019-01-06 | 70 | 38.599231 | 73.286260 | 63.949057 | 0.17303 | 11.065105 | 1.580729 |
2 | a11 | 2019-01-06 | 61 | 38.599231 | 74.252046 | 147.612108 | 0.17303 | 25.541323 | 3.648760 |
3 | a12 | 2019-01-06 | 65 | 38.599231 | 73.708482 | 74.394518 | 0.17303 | 12.872483 | 1.838926 |
4 | a13 | 2019-01-06 | 66 | 38.599231 | 73.549554 | 173.095836 | 0.17303 | 29.950772 | 4.278682 |
5 | a14 | 2019-01-06 | 84 | 38.599231 | 72.346457 | 7.040034 | 0.17303 | 1.218137 | 0.174020 |
6 | a15 | 2019-01-06 | 1 | 38.599231 | 81.478472 | 639.959855 | 0.17303 | 110.732254 | 15.818893 |
7 | a16 | 2019-01-06 | 25 | 38.599231 | 76.784504 | 616.805501 | 0.17303 | 106.725856 | 15.246551 |
8 | a2 | 2019-01-06 | 74 | 38.599231 | 73.160367 | 322.648083 | 0.17303 | 55.827798 | 7.975400 |
9 | a3 | 2019-01-06 | 55 | 38.599231 | 74.025149 | 95.139840 | 0.17303 | 16.462047 | 2.351721 |
10 | a4 | 2019-01-06 | 99 | 38.599231 | 71.391270 | 0.000000 | 0.17303 | 0.000000 | 0.000000 |
11 | a5 | 2019-01-06 | 78 | 38.599231 | 76.724752 | 70.942532 | 0.17303 | 12.275186 | 1.753598 |
12 | a6 | 2019-01-06 | 44 | 38.599231 | 74.787475 | 67.350034 | 0.17303 | 11.653576 | 1.664797 |
13 | a7 | 2019-01-06 | 62 | 38.599231 | 76.071354 | 71.177988 | 0.17303 | 12.315927 | 1.759418 |
14 | a8 | 2019-01-06 | 65 | 38.599231 | 73.935665 | 50.812378 | 0.17303 | 8.792066 | 1.256009 |
15 | a9 | 2019-01-06 | 51 | 38.599231 | 75.807242 | 108.091675 | 0.17303 | 18.703103 | 2.671872 |
16 | b17 | 2019-01-06 | 66 | 38.599231 | 72.950992 | 130.509717 | 0.17303 | 22.582096 | 3.226014 |
17 | b18 | 2019-01-06 | 90 | 38.599231 | 70.947520 | 60.078379 | 0.17303 | 10.395362 | 1.485052 |
18 | b19 | 2019-01-06 | 30 | 38.599231 | 74.636062 | 0.000000 | 0.17303 | 0.000000 | 0.000000 |
19 | b20 | 2019-01-06 | 93 | 38.599231 | 75.286905 | 21.320244 | 0.17303 | 3.689042 | 0.527006 |
20 | b21 | 2019-01-06 | 16 | 38.599231 | 75.749281 | 0.000000 | 0.17303 | 0.000000 | 0.000000 |
21 | b22 | 2019-01-06 | 54 | 38.599231 | 78.609127 | 93.928528 | 0.17303 | 16.252453 | 2.321779 |
22 | b23 | 2019-01-06 | 42 | 38.599231 | 78.693750 | 66.335020 | 0.17303 | 11.477949 | 1.639707 |
23 | b24 | 2019-01-06 | 89 | 38.599231 | 73.677604 | 68.989156 | 0.17303 | 11.937194 | 1.705313 |
24 | b25 | 2019-01-06 | 45 | 38.599231 | 77.909474 | 101.537204 | 0.17303 | 17.568982 | 2.509855 |
25 | b26 | 2019-01-06 | 57 | 38.599231 | 75.206448 | 122.327846 | 0.17303 | 21.166387 | 3.023770 |
26 | b27 | 2019-01-06 | -100 | 38.599231 | -100.000000 | NaN | 0.17303 | 0.000000 | NaN |
27 | b28 | 2019-01-06 | 48 | 38.599231 | 71.536830 | 39.719913 | 0.17303 | 6.872737 | 0.981820 |
28 | b29 | 2019-01-06 | 82 | 38.599231 | 72.493794 | 73.090187 | 0.17303 | 12.646795 | 1.806685 |
29 | b30 | 2019-01-06 | 83 | 38.599231 | 77.639881 | 17.972378 | 0.17303 | 3.109761 | 0.444252 |
30 | b31 | 2019-01-06 | 95 | 38.599231 | 70.905208 | 55.940869 | 0.17303 | 9.679448 | 1.382778 |
31 | b32 | 2019-01-06 | -100 | 38.599231 | -100.000000 | NaN | 0.17303 | 0.000000 | NaN |
32 | b33 | 2019-01-06 | 85 | 38.599231 | 72.888294 | 35.809398 | 0.17303 | 6.196100 | 0.885157 |
33 | c34 | 2019-01-06 | -100 | 38.599231 | -100.000000 | NaN | 0.17303 | 0.000000 | NaN |
34 | c35 | 2019-01-06 | 92 | 38.599231 | 70.424206 | 63.946396 | 0.17303 | 11.064645 | 1.580664 |
35 | c36 | 2019-01-06 | 47 | 38.599231 | 76.082391 | 104.719999 | 0.17303 | 18.119701 | 2.588529 |
36 | c37 | 2019-01-06 | 58 | 38.599231 | 74.256548 | 41.795780 | 0.17303 | 7.231924 | 1.033132 |
37 | c38 | 2019-01-06 | 38 | 38.599231 | 74.836905 | 63.938661 | 0.17303 | 11.063306 | 1.580472 |
38 | c39 | 2019-01-06 | 56 | 38.599231 | 73.872272 | 68.115254 | 0.17303 | 11.785982 | 1.683712 |
39 | c40 | 2019-01-06 | 66 | 38.599231 | 75.579861 | 114.096389 | 0.17303 | 19.742098 | 2.820300 |
40 | c41 | 2019-01-06 | 90 | 38.599231 | 69.140749 | 0.000000 | 0.17303 | 0.000000 | 0.000000 |
41 | c42 | 2019-01-06 | -100 | 38.599231 | -100.000000 | NaN | 0.17303 | 0.000000 | NaN |
42 | c43 | 2019-01-06 | 38 | 38.599231 | 78.054390 | 136.530552 | 0.17303 | 23.623881 | 3.374840 |
43 | c44 | 2019-01-06 | 28 | 38.599231 | 78.211012 | 170.058540 | 0.17303 | 29.425229 | 4.203604 |
44 | c45 | 2019-01-06 | 66 | 38.599231 | 74.781052 | 197.492980 | 0.17303 | 34.172210 | 4.881744 |
45 | c46 | 2019-01-06 | 76 | 38.599231 | 72.913740 | 124.591415 | 0.17303 | 21.558053 | 3.079722 |
46 | c47 | 2019-01-06 | 74 | 38.599231 | 72.846354 | 84.631154 | 0.17303 | 14.643729 | 2.091961 |
47 | c48 | 2019-01-06 | 53 | 38.599231 | 74.623562 | 279.324525 | 0.17303 | 48.331523 | 6.904503 |
48 | c49 | 2019-01-06 | 71 | 38.599231 | 72.767163 | 51.483217 | 0.17303 | 8.908141 | 1.272592 |
49 | c50 | 2019-01-06 | 52 | 38.599231 | 75.673562 | 87.482236 | 0.17303 | 15.137051 | 2.162436 |
Notice that the format is much nicer than a numpy array.
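As a side note, if you ever need the raw numpy array back (say, to reuse numpy code you already have), pandas can hand it over. A minimal sketch, using the numeric columns:

# Extract the numeric columns as a plain numpy array
temp_price[['t_out', 't_unit', 'hvac']].to_numpy()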
Observations from the dataset so far¶
The very first column is the index of each entry. Notice how Python counts the rows: it starts with 0 and not 1. This is different from some other programming languages that start counting from 1. Then you have columns with various names. The first named column is the “household” id, then the “date,” etc.
Let’s continue to explore the data. The following function gives you a summary of the statistics of the data:
temp_price.describe()
 | score | t_out | t_unit | hvac | price | Price per week | Price per day |
---|---|---|---|---|---|---|---|
count | 50.000000 | 5.000000e+01 | 50.000000 | 46.000000 | 5.000000e+01 | 50.000000 | 46.000000 |
mean | 49.300000 | 3.859923e+01 | 60.521694 | 109.128242 | 1.730300e-01 | 17.371863 | 2.697494 |
std | 49.351405 | 2.871026e-14 | 47.875493 | 129.874146 | 8.411210e-17 | 22.148354 | 3.210303 |
min | -100.000000 | 3.859923e+01 | -100.000000 | 0.000000 | 1.730300e-01 | 0.000000 | 0.000000 |
25% | 44.250000 | 3.859923e+01 | 72.562136 | 50.980088 | 1.730300e-01 | 6.365259 | 1.260155 |
50% | 61.500000 | 3.859923e+01 | 73.980407 | 71.060260 | 1.730300e-01 | 11.861588 | 1.756508 |
75% | 77.500000 | 3.859923e+01 | 75.730351 | 120.269982 | 1.730300e-01 | 19.482349 | 2.972902 |
max | 99.000000 | 3.859923e+01 | 81.478472 | 639.959855 | 1.730300e-01 | 110.732254 | 15.818893 |
The statistics are shown for each column and they are calculated over the rows. min and max are obvious. mean is the average and std is the standard deviation. We will learn about those in a few lectures. The entries 25%, 50%, and 75% are the so-called “quantiles.” These are also statistical quantities which we are going to learn about in a few lectures. count is simply the number of observations per column. Notice that it is 50 for most columns, but for some columns, e.g., hvac, some entries are missing.
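Each of these summary statistics can also be computed one column at a time. A quick sketch for the hvac column:

# Individual statistics for the hvac column
print(temp_price['hvac'].count())        # number of non-missing observations
print(temp_price['hvac'].mean())         # average
print(temp_price['hvac'].std())          # standard deviation
print(temp_price['hvac'].quantile(0.5))  # the 50% quantile (median)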
Printing specific ranges¶
In this next section we look at different ways to quickly get some insight about the data and also how to look at specific columns, rows, etc.
If we want to look at specific locations in the data we can do the following:
- head() - by default displays the first 5 rows
- tail() - by default displays the last 5 rows
The way you can think about the syntax when using pandas is Dataframe_name.action_to_be_done(params_if_needed)
temp_price.head()
 | household | date | score | t_out | t_unit | hvac | price | Price per week | Price per day |
---|---|---|---|---|---|---|---|---|---|
0 | a1 | 2019-01-06 | 85 | 38.599231 | 71.580704 | 35.113758 | 0.17303 | 6.075734 | 0.867962 |
1 | a10 | 2019-01-06 | 70 | 38.599231 | 73.286260 | 63.949057 | 0.17303 | 11.065105 | 1.580729 |
2 | a11 | 2019-01-06 | 61 | 38.599231 | 74.252046 | 147.612108 | 0.17303 | 25.541323 | 3.648760 |
3 | a12 | 2019-01-06 | 65 | 38.599231 | 73.708482 | 74.394518 | 0.17303 | 12.872483 | 1.838926 |
4 | a13 | 2019-01-06 | 66 | 38.599231 | 73.549554 | 173.095836 | 0.17303 | 29.950772 | 4.278682 |
# This will print the first 10 rows
temp_price.head(10)
 | household | date | score | t_out | t_unit | hvac | price | Price per week | Price per day |
---|---|---|---|---|---|---|---|---|---|
0 | a1 | 2019-01-06 | 85 | 38.599231 | 71.580704 | 35.113758 | 0.17303 | 6.075734 | 0.867962 |
1 | a10 | 2019-01-06 | 70 | 38.599231 | 73.286260 | 63.949057 | 0.17303 | 11.065105 | 1.580729 |
2 | a11 | 2019-01-06 | 61 | 38.599231 | 74.252046 | 147.612108 | 0.17303 | 25.541323 | 3.648760 |
3 | a12 | 2019-01-06 | 65 | 38.599231 | 73.708482 | 74.394518 | 0.17303 | 12.872483 | 1.838926 |
4 | a13 | 2019-01-06 | 66 | 38.599231 | 73.549554 | 173.095836 | 0.17303 | 29.950772 | 4.278682 |
5 | a14 | 2019-01-06 | 84 | 38.599231 | 72.346457 | 7.040034 | 0.17303 | 1.218137 | 0.174020 |
6 | a15 | 2019-01-06 | 1 | 38.599231 | 81.478472 | 639.959855 | 0.17303 | 110.732254 | 15.818893 |
7 | a16 | 2019-01-06 | 25 | 38.599231 | 76.784504 | 616.805501 | 0.17303 | 106.725856 | 15.246551 |
8 | a2 | 2019-01-06 | 74 | 38.599231 | 73.160367 | 322.648083 | 0.17303 | 55.827798 | 7.975400 |
9 | a3 | 2019-01-06 | 55 | 38.599231 | 74.025149 | 95.139840 | 0.17303 | 16.462047 | 2.351721 |
# This prints the last 5 rows
temp_price.tail()
 | household | date | score | t_out | t_unit | hvac | price | Price per week | Price per day |
---|---|---|---|---|---|---|---|---|---|
45 | c46 | 2019-01-06 | 76 | 38.599231 | 72.913740 | 124.591415 | 0.17303 | 21.558053 | 3.079722 |
46 | c47 | 2019-01-06 | 74 | 38.599231 | 72.846354 | 84.631154 | 0.17303 | 14.643729 | 2.091961 |
47 | c48 | 2019-01-06 | 53 | 38.599231 | 74.623562 | 279.324525 | 0.17303 | 48.331523 | 6.904503 |
48 | c49 | 2019-01-06 | 71 | 38.599231 | 72.767163 | 51.483217 | 0.17303 | 8.908141 | 1.272592 |
49 | c50 | 2019-01-06 | 52 | 38.599231 | 75.673562 | 87.482236 | 0.17303 | 15.137051 | 2.162436 |
# If you wanted to print the first 7 rows (instead of the first 5) you would do this:
temp_price.head(7)
 | household | date | score | t_out | t_unit | hvac | price | Price per week | Price per day |
---|---|---|---|---|---|---|---|---|---|
0 | a1 | 2019-01-06 | 85 | 38.599231 | 71.580704 | 35.113758 | 0.17303 | 6.075734 | 0.867962 |
1 | a10 | 2019-01-06 | 70 | 38.599231 | 73.286260 | 63.949057 | 0.17303 | 11.065105 | 1.580729 |
2 | a11 | 2019-01-06 | 61 | 38.599231 | 74.252046 | 147.612108 | 0.17303 | 25.541323 | 3.648760 |
3 | a12 | 2019-01-06 | 65 | 38.599231 | 73.708482 | 74.394518 | 0.17303 | 12.872483 | 1.838926 |
4 | a13 | 2019-01-06 | 66 | 38.599231 | 73.549554 | 173.095836 | 0.17303 | 29.950772 | 4.278682 |
5 | a14 | 2019-01-06 | 84 | 38.599231 | 72.346457 | 7.040034 | 0.17303 | 1.218137 | 0.174020 |
6 | a15 | 2019-01-06 | 1 | 38.599231 | 81.478472 | 639.959855 | 0.17303 | 110.732254 | 15.818893 |
# Let's print the last 12 rows. What would you do? Pass 12 to tail():
temp_price.tail(12)
 | household | date | score | t_out | t_unit | hvac | price | Price per week | Price per day |
---|---|---|---|---|---|---|---|---|---|
38 | c39 | 2019-01-06 | 56 | 38.599231 | 73.872272 | 68.115254 | 0.17303 | 11.785982 | 1.683712 |
39 | c40 | 2019-01-06 | 66 | 38.599231 | 75.579861 | 114.096389 | 0.17303 | 19.742098 | 2.820300 |
40 | c41 | 2019-01-06 | 90 | 38.599231 | 69.140749 | 0.000000 | 0.17303 | 0.000000 | 0.000000 |
41 | c42 | 2019-01-06 | -100 | 38.599231 | -100.000000 | NaN | 0.17303 | 0.000000 | NaN |
42 | c43 | 2019-01-06 | 38 | 38.599231 | 78.054390 | 136.530552 | 0.17303 | 23.623881 | 3.374840 |
43 | c44 | 2019-01-06 | 28 | 38.599231 | 78.211012 | 170.058540 | 0.17303 | 29.425229 | 4.203604 |
44 | c45 | 2019-01-06 | 66 | 38.599231 | 74.781052 | 197.492980 | 0.17303 | 34.172210 | 4.881744 |
45 | c46 | 2019-01-06 | 76 | 38.599231 | 72.913740 | 124.591415 | 0.17303 | 21.558053 | 3.079722 |
46 | c47 | 2019-01-06 | 74 | 38.599231 | 72.846354 | 84.631154 | 0.17303 | 14.643729 | 2.091961 |
47 | c48 | 2019-01-06 | 53 | 38.599231 | 74.623562 | 279.324525 | 0.17303 | 48.331523 | 6.904503 |
48 | c49 | 2019-01-06 | 71 | 38.599231 | 72.767163 | 51.483217 | 0.17303 | 8.908141 | 1.272592 |
49 | c50 | 2019-01-06 | 52 | 38.599231 | 75.673562 | 87.482236 | 0.17303 | 15.137051 | 2.162436 |
Cleaning the data¶
In the temp_price file you notice that some cells have -100 under the score column, -100.0 under the t_unit column, or a NaN value for hvac. There are various reasons for this, but the details around it are not helpful at this point. We will focus on the NaN values.

When you see NaN, it means that the value is Not a Number; this is the pythonic notation for a missing value. For this example, a NaN means that the unit is missing over 70% of its data for the week. This most likely means that the sensor for the unit needs to be replaced.

Having these -100, -100.0 and NaN values can mess with the interpretation of our analysis, so we want to clean the data by removing the corresponding observations.
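Before dropping anything, it helps to see how much is missing. A quick sketch that counts the NaN entries per column and the -100 sentinels in the score column:

# Count the NaN entries in each column
print(temp_price.isna().sum())

# Count how many times the -100 sentinel appears in the score column
print((temp_price['score'] == -100).sum())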
To do this we will use dropna() to remove the rows that have NaN. To get a better understanding of how dropna() works, we can look at the documentation for it: pandas dropna docs. Look at the documentation to see what the axis argument needs to be in order to remove rows vs. columns.
temp_no_null_rows = temp_price.dropna(axis=0)
# Print data with no NaN rows
temp_no_null_rows
 | household | date | score | t_out | t_unit | hvac | price | Price per week | Price per day |
---|---|---|---|---|---|---|---|---|---|
0 | a1 | 2019-01-06 | 85 | 38.599231 | 71.580704 | 35.113758 | 0.17303 | 6.075734 | 0.867962 |
1 | a10 | 2019-01-06 | 70 | 38.599231 | 73.286260 | 63.949057 | 0.17303 | 11.065105 | 1.580729 |
2 | a11 | 2019-01-06 | 61 | 38.599231 | 74.252046 | 147.612108 | 0.17303 | 25.541323 | 3.648760 |
3 | a12 | 2019-01-06 | 65 | 38.599231 | 73.708482 | 74.394518 | 0.17303 | 12.872483 | 1.838926 |
4 | a13 | 2019-01-06 | 66 | 38.599231 | 73.549554 | 173.095836 | 0.17303 | 29.950772 | 4.278682 |
5 | a14 | 2019-01-06 | 84 | 38.599231 | 72.346457 | 7.040034 | 0.17303 | 1.218137 | 0.174020 |
6 | a15 | 2019-01-06 | 1 | 38.599231 | 81.478472 | 639.959855 | 0.17303 | 110.732254 | 15.818893 |
7 | a16 | 2019-01-06 | 25 | 38.599231 | 76.784504 | 616.805501 | 0.17303 | 106.725856 | 15.246551 |
8 | a2 | 2019-01-06 | 74 | 38.599231 | 73.160367 | 322.648083 | 0.17303 | 55.827798 | 7.975400 |
9 | a3 | 2019-01-06 | 55 | 38.599231 | 74.025149 | 95.139840 | 0.17303 | 16.462047 | 2.351721 |
10 | a4 | 2019-01-06 | 99 | 38.599231 | 71.391270 | 0.000000 | 0.17303 | 0.000000 | 0.000000 |
11 | a5 | 2019-01-06 | 78 | 38.599231 | 76.724752 | 70.942532 | 0.17303 | 12.275186 | 1.753598 |
12 | a6 | 2019-01-06 | 44 | 38.599231 | 74.787475 | 67.350034 | 0.17303 | 11.653576 | 1.664797 |
13 | a7 | 2019-01-06 | 62 | 38.599231 | 76.071354 | 71.177988 | 0.17303 | 12.315927 | 1.759418 |
14 | a8 | 2019-01-06 | 65 | 38.599231 | 73.935665 | 50.812378 | 0.17303 | 8.792066 | 1.256009 |
15 | a9 | 2019-01-06 | 51 | 38.599231 | 75.807242 | 108.091675 | 0.17303 | 18.703103 | 2.671872 |
16 | b17 | 2019-01-06 | 66 | 38.599231 | 72.950992 | 130.509717 | 0.17303 | 22.582096 | 3.226014 |
17 | b18 | 2019-01-06 | 90 | 38.599231 | 70.947520 | 60.078379 | 0.17303 | 10.395362 | 1.485052 |
18 | b19 | 2019-01-06 | 30 | 38.599231 | 74.636062 | 0.000000 | 0.17303 | 0.000000 | 0.000000 |
19 | b20 | 2019-01-06 | 93 | 38.599231 | 75.286905 | 21.320244 | 0.17303 | 3.689042 | 0.527006 |
20 | b21 | 2019-01-06 | 16 | 38.599231 | 75.749281 | 0.000000 | 0.17303 | 0.000000 | 0.000000 |
21 | b22 | 2019-01-06 | 54 | 38.599231 | 78.609127 | 93.928528 | 0.17303 | 16.252453 | 2.321779 |
22 | b23 | 2019-01-06 | 42 | 38.599231 | 78.693750 | 66.335020 | 0.17303 | 11.477949 | 1.639707 |
23 | b24 | 2019-01-06 | 89 | 38.599231 | 73.677604 | 68.989156 | 0.17303 | 11.937194 | 1.705313 |
24 | b25 | 2019-01-06 | 45 | 38.599231 | 77.909474 | 101.537204 | 0.17303 | 17.568982 | 2.509855 |
25 | b26 | 2019-01-06 | 57 | 38.599231 | 75.206448 | 122.327846 | 0.17303 | 21.166387 | 3.023770 |
27 | b28 | 2019-01-06 | 48 | 38.599231 | 71.536830 | 39.719913 | 0.17303 | 6.872737 | 0.981820 |
28 | b29 | 2019-01-06 | 82 | 38.599231 | 72.493794 | 73.090187 | 0.17303 | 12.646795 | 1.806685 |
29 | b30 | 2019-01-06 | 83 | 38.599231 | 77.639881 | 17.972378 | 0.17303 | 3.109761 | 0.444252 |
30 | b31 | 2019-01-06 | 95 | 38.599231 | 70.905208 | 55.940869 | 0.17303 | 9.679448 | 1.382778 |
32 | b33 | 2019-01-06 | 85 | 38.599231 | 72.888294 | 35.809398 | 0.17303 | 6.196100 | 0.885157 |
34 | c35 | 2019-01-06 | 92 | 38.599231 | 70.424206 | 63.946396 | 0.17303 | 11.064645 | 1.580664 |
35 | c36 | 2019-01-06 | 47 | 38.599231 | 76.082391 | 104.719999 | 0.17303 | 18.119701 | 2.588529 |
36 | c37 | 2019-01-06 | 58 | 38.599231 | 74.256548 | 41.795780 | 0.17303 | 7.231924 | 1.033132 |
37 | c38 | 2019-01-06 | 38 | 38.599231 | 74.836905 | 63.938661 | 0.17303 | 11.063306 | 1.580472 |
38 | c39 | 2019-01-06 | 56 | 38.599231 | 73.872272 | 68.115254 | 0.17303 | 11.785982 | 1.683712 |
39 | c40 | 2019-01-06 | 66 | 38.599231 | 75.579861 | 114.096389 | 0.17303 | 19.742098 | 2.820300 |
40 | c41 | 2019-01-06 | 90 | 38.599231 | 69.140749 | 0.000000 | 0.17303 | 0.000000 | 0.000000 |
42 | c43 | 2019-01-06 | 38 | 38.599231 | 78.054390 | 136.530552 | 0.17303 | 23.623881 | 3.374840 |
43 | c44 | 2019-01-06 | 28 | 38.599231 | 78.211012 | 170.058540 | 0.17303 | 29.425229 | 4.203604 |
44 | c45 | 2019-01-06 | 66 | 38.599231 | 74.781052 | 197.492980 | 0.17303 | 34.172210 | 4.881744 |
45 | c46 | 2019-01-06 | 76 | 38.599231 | 72.913740 | 124.591415 | 0.17303 | 21.558053 | 3.079722 |
46 | c47 | 2019-01-06 | 74 | 38.599231 | 72.846354 | 84.631154 | 0.17303 | 14.643729 | 2.091961 |
47 | c48 | 2019-01-06 | 53 | 38.599231 | 74.623562 | 279.324525 | 0.17303 | 48.331523 | 6.904503 |
48 | c49 | 2019-01-06 | 71 | 38.599231 | 72.767163 | 51.483217 | 0.17303 | 8.908141 | 1.272592 |
49 | c50 | 2019-01-06 | 52 | 38.599231 | 75.673562 | 87.482236 | 0.17303 | 15.137051 | 2.162436 |
What do you notice?
Do you see any NaN or -100 values in the dataset?
Do we still have the same number of indices?
We can use DataFrame.shape to compare the table before and after the removal of the NaN rows, just to double check.
# shape of data after cleaning
temp_no_null_rows.shape
(46, 9)
# shape of data before cleaning
temp_price.shape
(50, 9)
The shape attribute gives the number of rows and columns of the dataset.
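For contrast, here is a sketch of what axis=1 would do: it drops the columns that contain NaN instead of the rows. Since hvac and Price per day are the columns with missing entries, we would expect to lose exactly those two:

# Drop the *columns* that contain NaN instead of the rows
temp_no_null_cols = temp_price.dropna(axis=1)
temp_no_null_cols.shape  # expect (50, 7): all rows kept, two columns gone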
Note
When dealing with columns, you typically want to keep everything lowercase and remove spaces. This makes it easier to type the column names later on. So, to rename an existing column in the dataframe we can do the following: df.rename(columns={'current_name' : 'new_name'}, inplace=True)
Let’s rename Price per week to week_price and Price per day to daily_price:
# Changing the title of columns
clean_data = temp_no_null_rows.rename(columns={'Price per week': 'week_price',
'Price per day': 'daily_price'})
# See the the column names have changed
clean_data.head()
 | household | date | score | t_out | t_unit | hvac | price | week_price | daily_price |
---|---|---|---|---|---|---|---|---|---|
0 | a1 | 2019-01-06 | 85 | 38.599231 | 71.580704 | 35.113758 | 0.17303 | 6.075734 | 0.867962 |
1 | a10 | 2019-01-06 | 70 | 38.599231 | 73.286260 | 63.949057 | 0.17303 | 11.065105 | 1.580729 |
2 | a11 | 2019-01-06 | 61 | 38.599231 | 74.252046 | 147.612108 | 0.17303 | 25.541323 | 3.648760 |
3 | a12 | 2019-01-06 | 65 | 38.599231 | 73.708482 | 74.394518 | 0.17303 | 12.872483 | 1.838926 |
4 | a13 | 2019-01-06 | 66 | 38.599231 | 73.549554 | 173.095836 | 0.17303 | 29.950772 | 4.278682 |
Now, you may be wondering what exactly this is:
columns={'Price per week': 'week_price',
'Price per day': 'daily_price'}
This defines a dictionary. It is a special Python type called dict. See Sec. 5.5 here.
But let’s also run it separately:
columns={'Price per week': 'week_price',
'Price per day': 'daily_price'}
columns
{'Price per week': 'week_price', 'Price per day': 'daily_price'}
type(columns)
dict
columns['Price per week']
'week_price'
columns['Price per day']
'daily_price'
columns.keys()
dict_keys(['Price per week', 'Price per day'])
So, rename looks at the dictionary for keys that are column names and then replaces the corresponding column name with whatever the dictionary indicates.
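By the way, rename also accepts a function instead of a dictionary. Here is a sketch (the variable name auto_renamed is just for illustration) that applies the lowercase-and-underscores convention from the note above to every column at once:

# Apply a function to every column name: lowercase it and replace spaces
auto_renamed = temp_no_null_rows.rename(
    columns=lambda name: name.lower().replace(' ', '_')
)
auto_renamed.columns

This would give price_per_week and price_per_day rather than the shorter names we picked above.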
Now that we have clean data, we can explore the data set with the following actions:
- df['column_name'] - to select a single column from the dataframe
- df[['column_name1', 'column_name7']] - to select a list of columns from the dataframe
- df.loc[rowindex1] - to select a single row from the dataframe
- df.loc[[rowindex1, rowindex2]] - to select a list of rows from the dataframe
Let’s give some examples.
Get the household column:
clean_data['household']
0 a1
1 a10
2 a11
3 a12
4 a13
5 a14
6 a15
7 a16
8 a2
9 a3
10 a4
11 a5
12 a6
13 a7
14 a8
15 a9
16 b17
17 b18
18 b19
19 b20
20 b21
21 b22
22 b23
23 b24
24 b25
25 b26
27 b28
28 b29
29 b30
30 b31
32 b33
34 c35
35 c36
36 c37
37 c38
38 c39
39 c40
40 c41
42 c43
43 c44
44 c45
45 c46
46 c47
47 c48
48 c49
49 c50
Name: household, dtype: object
Select multiple columns:
clean_data[['household', 'daily_price']]
 | household | daily_price |
---|---|---|
0 | a1 | 0.867962 |
1 | a10 | 1.580729 |
2 | a11 | 3.648760 |
3 | a12 | 1.838926 |
4 | a13 | 4.278682 |
5 | a14 | 0.174020 |
6 | a15 | 15.818893 |
7 | a16 | 15.246551 |
8 | a2 | 7.975400 |
9 | a3 | 2.351721 |
10 | a4 | 0.000000 |
11 | a5 | 1.753598 |
12 | a6 | 1.664797 |
13 | a7 | 1.759418 |
14 | a8 | 1.256009 |
15 | a9 | 2.671872 |
16 | b17 | 3.226014 |
17 | b18 | 1.485052 |
18 | b19 | 0.000000 |
19 | b20 | 0.527006 |
20 | b21 | 0.000000 |
21 | b22 | 2.321779 |
22 | b23 | 1.639707 |
23 | b24 | 1.705313 |
24 | b25 | 2.509855 |
25 | b26 | 3.023770 |
27 | b28 | 0.981820 |
28 | b29 | 1.806685 |
29 | b30 | 0.444252 |
30 | b31 | 1.382778 |
32 | b33 | 0.885157 |
34 | c35 | 1.580664 |
35 | c36 | 2.588529 |
36 | c37 | 1.033132 |
37 | c38 | 1.580472 |
38 | c39 | 1.683712 |
39 | c40 | 2.820300 |
40 | c41 | 0.000000 |
42 | c43 | 3.374840 |
43 | c44 | 4.203604 |
44 | c45 | 4.881744 |
45 | c46 | 3.079722 |
46 | c47 | 2.091961 |
47 | c48 | 6.904503 |
48 | c49 | 1.272592 |
49 | c50 | 2.162436 |
In the above example you are giving a list of column names as an index. This is why there are two sets of square brackets: the first is for the data frame index, and the second is for the list of columns. Maybe this will help you understand what is going on (see also the type check after the table below):
column_list = ['household', 'daily_price']
clean_data[column_list]
 | household | daily_price |
---|---|---|
0 | a1 | 0.867962 |
1 | a10 | 1.580729 |
2 | a11 | 3.648760 |
3 | a12 | 1.838926 |
4 | a13 | 4.278682 |
5 | a14 | 0.174020 |
6 | a15 | 15.818893 |
7 | a16 | 15.246551 |
8 | a2 | 7.975400 |
9 | a3 | 2.351721 |
10 | a4 | 0.000000 |
11 | a5 | 1.753598 |
12 | a6 | 1.664797 |
13 | a7 | 1.759418 |
14 | a8 | 1.256009 |
15 | a9 | 2.671872 |
16 | b17 | 3.226014 |
17 | b18 | 1.485052 |
18 | b19 | 0.000000 |
19 | b20 | 0.527006 |
20 | b21 | 0.000000 |
21 | b22 | 2.321779 |
22 | b23 | 1.639707 |
23 | b24 | 1.705313 |
24 | b25 | 2.509855 |
25 | b26 | 3.023770 |
27 | b28 | 0.981820 |
28 | b29 | 1.806685 |
29 | b30 | 0.444252 |
30 | b31 | 1.382778 |
32 | b33 | 0.885157 |
34 | c35 | 1.580664 |
35 | c36 | 2.588529 |
36 | c37 | 1.033132 |
37 | c38 | 1.580472 |
38 | c39 | 1.683712 |
39 | c40 | 2.820300 |
40 | c41 | 0.000000 |
42 | c43 | 3.374840 |
43 | c44 | 4.203604 |
44 | c45 | 4.881744 |
45 | c46 | 3.079722 |
46 | c47 | 2.091961 |
47 | c48 | 6.904503 |
48 | c49 | 1.272592 |
49 | c50 | 2.162436 |
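One more remark on the brackets: a single column name returns a pandas Series, while a list of column names (even a list with one entry) returns a DataFrame. A quick way to see the difference:

# Single brackets give a Series; a list of columns gives a DataFrame
print(type(clean_data['household']))
print(type(clean_data[['household']]))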
Select a row:
clean_data.loc[30]
household b31
date 2019-01-06
score 95
t_out 38.599231
t_unit 70.905208
hvac 55.940869
price 0.17303
week_price 9.679448
daily_price 1.382778
Name: 30, dtype: object
Select certain rows:
clean_data.loc[[25,39, 45]]
 | household | date | score | t_out | t_unit | hvac | price | week_price | daily_price |
---|---|---|---|---|---|---|---|---|---|
25 | b26 | 2019-01-06 | 57 | 38.599231 | 75.206448 | 122.327846 | 0.17303 | 21.166387 | 3.023770 |
39 | c40 | 2019-01-06 | 66 | 38.599231 | 75.579861 | 114.096389 | 0.17303 | 19.742098 | 2.820300 |
45 | c46 | 2019-01-06 | 76 | 38.599231 | 72.913740 | 124.591415 | 0.17303 | 21.558053 | 3.079722 |
Select a slice of rows:
clean_data.loc[25:36]
 | household | date | score | t_out | t_unit | hvac | price | week_price | daily_price |
---|---|---|---|---|---|---|---|---|---|
25 | b26 | 2019-01-06 | 57 | 38.599231 | 75.206448 | 122.327846 | 0.17303 | 21.166387 | 3.023770 |
27 | b28 | 2019-01-06 | 48 | 38.599231 | 71.536830 | 39.719913 | 0.17303 | 6.872737 | 0.981820 |
28 | b29 | 2019-01-06 | 82 | 38.599231 | 72.493794 | 73.090187 | 0.17303 | 12.646795 | 1.806685 |
29 | b30 | 2019-01-06 | 83 | 38.599231 | 77.639881 | 17.972378 | 0.17303 | 3.109761 | 0.444252 |
30 | b31 | 2019-01-06 | 95 | 38.599231 | 70.905208 | 55.940869 | 0.17303 | 9.679448 | 1.382778 |
32 | b33 | 2019-01-06 | 85 | 38.599231 | 72.888294 | 35.809398 | 0.17303 | 6.196100 | 0.885157 |
34 | c35 | 2019-01-06 | 92 | 38.599231 | 70.424206 | 63.946396 | 0.17303 | 11.064645 | 1.580664 |
35 | c36 | 2019-01-06 | 47 | 38.599231 | 76.082391 | 104.719999 | 0.17303 | 18.119701 | 2.588529 |
36 | c37 | 2019-01-06 | 58 | 38.599231 | 74.256548 | 41.795780 | 0.17303 | 7.231924 | 1.033132 |
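Note that loc slices by index label, not by position: both endpoints (25 and 36) are included, and the labels removed by dropna (26, 31, and 33) are simply skipped. To slice by position instead, pandas provides iloc; a quick sketch:

# iloc uses integer positions in the cleaned frame; as with Python lists,
# the endpoint is excluded
clean_data.iloc[25:36]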
And there are many more things that you can easily do with pandas. There are so many that you could spend weeks on this library alone. Our suggestion is to learn pandas as needed while you experiment with data files. Typically, if you think about what you need to do, you can find out how to do it in pandas with a good Google search.
Questions¶
Write pandas code that extracts the week_price column from clean_data.
# Your code here
Write pandas code that gives you the hvac and the price columns.
# Your code here