The Python Data Analysis Library

The Python Data Analysis Library#

Pandas is Python library that provides fast, flexible and data structures that lets you work real life data quickly. Typically, we import the library like this:

import pandas as pd

Let’s load the catalysis.csv file we created before. I will download it for you first:

!curl -O https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-239-intro-to-data-science/master/data/catalysis.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   258  100   258    0     0    742      0 --:--:-- --:--:-- --:--:--   741

The file should now be in the working directory. Let’s check that:

!ls catalysis.csv

catalysis.csv

We can now load the file into a pandas DataFrame:

data = pd.read_csv('catalysis.csv')
data

	Time	NO3	NO2	N2	NH3	N2O
0	0	500.00	0.00	0.00	0.00	0.00
1	30	250.95	107.32	18.51	3.33	4.98
2	60	123.66	132.33	74.85	7.34	20.14
3	90	84.47	98.81	166.19	13.14	42.10
4	120	30.24	38.74	249.78	19.54	55.98
5	150	27.94	10.42	292.32	24.07	60.65
6	180	13.54	6.11	309.50	27.26	62.54

DataFrame is a two-dimensional data structure that can store data of different types (e.g., numbers, strings, booleans) in a table format. It is similar to a spreadsheet. The columns can have names. Here are the column names:

data.columns

Index(['# Time', 'NO3', 'NO2', 'N2', 'NH3', 'N2O'], dtype='object')

Let’s summarize the data. The function DataFrame.describe() provides a summary of the data. It shows the mean, standard deviation, minimum, maximum, and quartiles of the data:

data.describe()

	Time	NO3	NO2	N2	NH3	N2O
count	7.000000	7.000000	7.000000	7.000000	7.000000	7.000000
mean	90.000000	147.257143	56.247143	158.735714	13.525714	35.198571
std	64.807407	175.825979	55.215843	129.625326	10.498362	26.626079
min	0.000000	13.540000	0.000000	0.000000	0.000000	0.000000
25%	45.000000	29.090000	8.265000	46.680000	5.335000	12.560000
50%	90.000000	84.470000	38.740000	166.190000	13.140000	42.100000
75%	135.000000	187.305000	103.065000	271.050000	21.805000	58.315000
max	180.000000	500.000000	132.330000	309.500000	27.260000	62.540000

How can we extract the data pertaining to “NO3”? Easy!

data['NO3']

  500.00
  250.95
  123.66
   84.47
   30.24
   27.94
   13.54
Name: NO3, dtype: float64

What type of object is this? Let’s check:

type(data['NO3'])

pandas.core.series.Series

Aha! It’s a pandas Series! You can go to the documentation to see what you can do with it.

What if you want to extract the data pertaining to “NO3” and “NO2”? You can do this by passing a list of column names to the dataframe:

data[['NO3', 'NO2']]

	NO3	NO2
0	500.00	0.00
1	250.95	107.32
2	123.66	132.33
3	84.47	98.81
4	30.24	38.74
5	27.94	10.42
6	13.54	6.11

What if I ask for a column that does not exist? Try it out to see the error.

Another useful operation is to extract a subset of the dataframe based on some condition. For example, let’s extract the data for the first 90 minutes:

data[data['Time'] <= 90]

	Time	NO3	NO2	N2	NH3	N2O
0	0	500.00	0.00	0.00	0.00	0.00
1	30	250.95	107.32	18.51	3.33	4.98
2	60	123.66	132.33	74.85	7.34	20.14
3	90	84.47	98.81	166.19	13.14	42.10

What is going on here? What on earth is data['Time'] <= 90? Let’s print it to see:

data['Time'] <= 90

   True
   True
   True
   True
  False
  False
  False
Name: Time, dtype: bool

It looks like a boolean Series object. It contains True and False values. When you pass it to the dataframe, it only keeps the rows where the condition is True. This is neat!

What about getting the rows corresponding to times between 60 and 150 minutes? This is how you can do it:

data[(data['Time'] >= 60) & (data['Time'] <= 150)]

	Time	NO3	NO2	N2	NH3	N2O
2	60	123.66	132.33	74.85	7.34	20.14
3	90	84.47	98.81	166.19	13.14	42.10
4	120	30.24	38.74	249.78	19.54	55.98
5	150	27.94	10.42	292.32	24.07	60.65

Pay attention to the parentheses. They are necessary! The character & is the logical AND operator. It is applied element-wise to the Series objects. If it sees a True and a False, it returns False. Only if it sees two Trues, it returns True. Note that this is different from the and operator in Python. If you try using and instead of &, you will get an error.

Let’s do an OR operation. For example, let’s get the rows where the time is either less than 60 minutes or greater than 150 minutes:

data[(data['Time'] < 60) | (data['Time'] > 150)]

	Time	NO3	NO2	N2	NH3	N2O
0	0	500.00	0.00	0.00	0.00	0.00
1	30	250.95	107.32	18.51	3.33	4.98
6	180	13.54	6.11	309.50	27.26	62.54

The | operator is the logical OR operator. It is applied element-wise to the Series objects. If it sees a True or a False, it returns True. Only if it sees two Falses, it returns False.

How do we do a NOT operation? The operator we need is ~. For example, let’s get the rows where the time is not between 60 and 150 minutes:

data[~(data['Time'] >= 60) & (data['Time'] <= 150)]

	Time	NO3	NO2	N2	NH3	N2O
0	0	500.00	0.00	0.00	0.00	0.00
1	30	250.95	107.32	18.51	3.33	4.98

Let’s repeat the total mass calculation that we did before:

total_mass = data.iloc[:, 1:].sum(axis=1)
total_mass

  500.00
  385.09
  358.32
  404.71
  394.28
  415.40
  418.95
dtype: float64

The result of such manipulations is of type Series:

type(total_mass)

pandas.core.series.Series

Mass is not conserved in this reaction. This doesn’t sound right. There should be another species that we are not taking into account. Let’s call that species “X”. The mass of that species is the total mass minus the mass of the other species:

mass_X = 500 - data.iloc[:, 1:].sum(axis=1)
mass_X

    0.00
  114.91
  141.68
   95.29
  105.72
   84.60
   81.05
dtype: float64

Let’s add a column to the dataframe:

data['X'] = mass_X
data

	Time	NO3	NO2	N2	NH3	N2O	X
0	0	500.00	0.00	0.00	0.00	0.00	0.00
1	30	250.95	107.32	18.51	3.33	4.98	114.91
2	60	123.66	132.33	74.85	7.34	20.14	141.68
3	90	84.47	98.81	166.19	13.14	42.10	95.29
4	120	30.24	38.74	249.78	19.54	55.98	105.72
5	150	27.94	10.42	292.32	24.07	60.65	84.60
6	180	13.54	6.11	309.50	27.26	62.54	81.05

Let’s save our work to a new file:

data.to_csv('catalysis-with-X.csv')

Here are the contents of the new file:

!cat catalysis-with-X.csv

,Time,NO3,NO2,N2,NH3,N2O,X
0,0,500.0,0.0,0.0,0.0,0.0,0.0
1,30,250.95,107.32,18.51,3.33,4.98,114.91000000000003
2,60,123.66,132.33,74.85,7.34,20.14,141.68
3,90,84.47,98.81,166.19,13.14,42.1,95.28999999999996
4,120,30.24,38.74,249.78,19.54,55.98,105.71999999999997
5,150,27.94,10.42,292.32,24.07,60.65,84.60000000000002
6,180,13.54,6.11,309.5,27.26,62.54,81.05000000000001

You can save the file in different formats. For example, you can save it in Excel format:

data.to_excel('catalysis-with-X.xlsx')

If you are working on Google Colab, download the file to your local machine. If you are working on your own computer, locate the file, and open it with Excel.