The Python Data Analysis Library#
Pandas is Python library that provides fast, flexible and data structures that lets you work real life data quickly. Typically, we import the library like this:
import pandas as pd
Let’s load the catalysis.csv
file we created before. I will download it for you first:
!curl -O https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-239-intro-to-data-science/master/data/catalysis.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 261 100 261 0 0 544 0 --:--:-- --:--:-- --:--:-- 544
The file should now be in the working directory. Let’s check that:
!ls catalysis.csv
catalysis.csv
We can now load the file into a pandas DataFrame:
data = pd.read_csv('catalysis.csv')
data
# Time | NO3 | NO2 | N2 | NH3 | N2O | |
---|---|---|---|---|---|---|
0 | 0 | 500.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1 | 30 | 250.95 | 107.32 | 18.51 | 3.33 | 4.98 |
2 | 60 | 123.66 | 132.33 | 74.85 | 7.34 | 20.14 |
3 | 90 | 84.47 | 98.81 | 166.19 | 13.14 | 42.10 |
4 | 120 | 30.24 | 38.74 | 249.78 | 19.54 | 55.98 |
5 | 150 | 27.94 | 10.42 | 292.32 | 24.07 | 60.65 |
6 | 180 | 13.54 | 6.11 | 309.50 | 27.26 | 62.54 |
DataFrame is a two-dimensional data structure that can store data of different types (e.g., numbers, strings, booleans) in a table format. It is similar to a spreadsheet. The columns can have names. Here are the column names:
data.columns
Index(['# Time', 'NO3', 'NO2', 'N2', 'NH3', 'N2O'], dtype='object')
We do not particularly like the fact that the first column is has the “#” symbol. Let’s remove it by renaming the column:
data = data.rename(columns={"# Time": "Time"})
data.columns
Index(['Time', 'NO3', 'NO2', 'N2', 'NH3', 'N2O'], dtype='object')
That’s better!
Let’s summarize the data. The function DataFrame.describe()
provides a summary of the data. It shows the mean, standard deviation, minimum, maximum, and quartiles of the data:
data.describe()
Time | NO3 | NO2 | N2 | NH3 | N2O | X | |
---|---|---|---|---|---|---|---|
count | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 | 7.000000 |
mean | 90.000000 | 147.257143 | 56.247143 | 158.735714 | 13.525714 | 35.198571 | 89.035714 |
std | 64.807407 | 175.825979 | 55.215843 | 129.625326 | 10.498362 | 26.626079 | 44.289176 |
min | 0.000000 | 13.540000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 45.000000 | 29.090000 | 8.265000 | 46.680000 | 5.335000 | 12.560000 | 82.825000 |
50% | 90.000000 | 84.470000 | 38.740000 | 166.190000 | 13.140000 | 42.100000 | 95.290000 |
75% | 135.000000 | 187.305000 | 103.065000 | 271.050000 | 21.805000 | 58.315000 | 110.315000 |
max | 180.000000 | 500.000000 | 132.330000 | 309.500000 | 27.260000 | 62.540000 | 141.680000 |
How can we extract the data pertaining to “NO3”? Easy!
data['NO3']
0 500.00
1 250.95
2 123.66
3 84.47
4 30.24
5 27.94
6 13.54
Name: NO3, dtype: float64
What type of object is this? Let’s check:
type(data['NO3'])
pandas.core.series.Series
Aha! It’s a pandas Series! You can go to the documentation to see what you can do with it.
What if you want to extract the data pertaining to “NO3” and “NO2”? You can do this by passing a list of column names to the dataframe:
data[['NO3', 'NO2']]
NO3 | NO2 | |
---|---|---|
0 | 500.00 | 0.00 |
1 | 250.95 | 107.32 |
2 | 123.66 | 132.33 |
3 | 84.47 | 98.81 |
4 | 30.24 | 38.74 |
5 | 27.94 | 10.42 |
6 | 13.54 | 6.11 |
What if I ask for a column that does not exist? Try it out to see the error.
Another useful operation is to extract a subset of the dataframe based on some condition. For example, let’s extract the data for the first 90 minutes:
data[data['Time'] <= 90]
Time | NO3 | NO2 | N2 | NH3 | N2O | X | |
---|---|---|---|---|---|---|---|
0 | 0 | 500.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1 | 30 | 250.95 | 107.32 | 18.51 | 3.33 | 4.98 | 114.91 |
2 | 60 | 123.66 | 132.33 | 74.85 | 7.34 | 20.14 | 141.68 |
3 | 90 | 84.47 | 98.81 | 166.19 | 13.14 | 42.10 | 95.29 |
What is going on here? What on earth is data['Time'] <= 90
? Let’s print it to see:
data['Time'] <= 90
0 True
1 True
2 True
3 True
4 False
5 False
6 False
Name: Time, dtype: bool
It looks like a boolean Series
object. It contains True
and False
values. When you pass it to the dataframe, it only keeps the rows where the condition is True
. This is neat!
What about getting the rows corresponding to times between 60 and 150 minutes? This is how you can do it:
data[(data['Time'] >= 60) & (data['Time'] <= 150)]
Time | NO3 | NO2 | N2 | NH3 | N2O | X | |
---|---|---|---|---|---|---|---|
2 | 60 | 123.66 | 132.33 | 74.85 | 7.34 | 20.14 | 141.68 |
3 | 90 | 84.47 | 98.81 | 166.19 | 13.14 | 42.10 | 95.29 |
4 | 120 | 30.24 | 38.74 | 249.78 | 19.54 | 55.98 | 105.72 |
5 | 150 | 27.94 | 10.42 | 292.32 | 24.07 | 60.65 | 84.60 |
Pay attention to the parentheses. They are necessary! The character &
is the logical AND operator. It is applied element-wise to the Series
objects. If it sees a True
and a False
, it returns False
. Only if it sees two True
s, it returns True
. Note that this is different from the and
operator in Python. If you try using and
instead of &
, you will get an error.
Let’s do an OR operation. For example, let’s get the rows where the time is either less than 60 minutes or greater than 150 minutes:
data[(data['Time'] < 60) | (data['Time'] > 150)]
Time | NO3 | NO2 | N2 | NH3 | N2O | X | |
---|---|---|---|---|---|---|---|
0 | 0 | 500.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1 | 30 | 250.95 | 107.32 | 18.51 | 3.33 | 4.98 | 114.91 |
6 | 180 | 13.54 | 6.11 | 309.50 | 27.26 | 62.54 | 81.05 |
The |
operator is the logical OR operator. It is applied element-wise to the Series
objects. If it sees a True
or a False
, it returns True
. Only if it sees two False
s, it returns False
.
How do we do a NOT operation? The operator we need is ~
. For example, let’s get the rows where the time is not between 60 and 150 minutes:
data[~(data['Time'] >= 60) & (data['Time'] <= 150)]
Time | NO3 | NO2 | N2 | NH3 | N2O | X | |
---|---|---|---|---|---|---|---|
0 | 0 | 500.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1 | 30 | 250.95 | 107.32 | 18.51 | 3.33 | 4.98 | 114.91 |
Let’s repeat the total mass calculation that we did before:
total_mass = data.iloc[:, 1:].sum(axis=1)
total_mass
0 500.00
1 385.09
2 358.32
3 404.71
4 394.28
5 415.40
6 418.95
dtype: float64
The result of such manipulations is of type Series:
type(total_mass)
pandas.core.series.Series
Mass is not conserved in this reaction. This doesn’t sound right. There should be another species that we are not taking into account. Let’s call that species “X”. The mass of that species is the total mass minus the mass of the other species:
mass_X = 500 - data.iloc[:, 1:].sum(axis=1)
mass_X
0 0.00
1 114.91
2 141.68
3 95.29
4 105.72
5 84.60
6 81.05
dtype: float64
Let’s add a column to the dataframe:
data['X'] = mass_X
data
Time | NO3 | NO2 | N2 | NH3 | N2O | X | |
---|---|---|---|---|---|---|---|
0 | 0 | 500.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
1 | 30 | 250.95 | 107.32 | 18.51 | 3.33 | 4.98 | 114.91 |
2 | 60 | 123.66 | 132.33 | 74.85 | 7.34 | 20.14 | 141.68 |
3 | 90 | 84.47 | 98.81 | 166.19 | 13.14 | 42.10 | 95.29 |
4 | 120 | 30.24 | 38.74 | 249.78 | 19.54 | 55.98 | 105.72 |
5 | 150 | 27.94 | 10.42 | 292.32 | 24.07 | 60.65 | 84.60 |
6 | 180 | 13.54 | 6.11 | 309.50 | 27.26 | 62.54 | 81.05 |
Let’s save our work to a new file:
data.to_csv('catalysis-with-X.csv')
Here are the contents of the new file:
!cat catalysis-with-X.csv
,Time,NO3,NO2,N2,NH3,N2O,X
0,0,500.0,0.0,0.0,0.0,0.0,0.0
1,30,250.95,107.32,18.51,3.33,4.98,114.91000000000003
2,60,123.66,132.33,74.85,7.34,20.14,141.68
3,90,84.47,98.81,166.19,13.14,42.1,95.28999999999996
4,120,30.24,38.74,249.78,19.54,55.98,105.71999999999997
5,150,27.94,10.42,292.32,24.07,60.65,84.60000000000002
6,180,13.54,6.11,309.5,27.26,62.54,81.05000000000001
You can save the file in different formats. For example, you can save it in Excel format:
data.to_excel('catalysis-with-X.xlsx')
If you are working on Google Colab, download the file to your local machine. If you are working on your own computer, locate the file, and open it with Excel.