Comma-separated values file format

Comma-separated values file format

How do we store data on a computer? Obviously, there is a plethora of data storing formats. For example, Microsoft Excel has a format for storing data. There are also several high-performance binary data formats which are common, e.g., HDF5. However, by far the most ubiquitous data format is the “comma-separated values” file format – or simply the csv file format. The csv file format can be used to store a matrix. Here is how:

  • Each row of the file contains numbers separated by commas. There must be as many entries in a row as matrix columns.

  • If you start a row with “#” it will be ignored when reading the file.

Let’s look at an example of such a file from a catalysis experiment [Katsounaros et al., 2012]. Each row corresponds to a different time instant. Columns 1 to 5 give the mass in grams of a chemical species at that instant of the chemical reaction.

# contents of catalysis.csv
# Time,NO3,NO2,N2,NH3,N2O
0,500.00,0.00,0.00,0.00,0.00
30,250.95,107.32,18.51,3.33,4.98
60,123.66,132.33,74.85,7.34,20.14
90,84.47,98.81,166.19,13.14,42.10
120,30.24,38.74,249.78,19.54,55.98
150,27.94,10.42,292.32,24.07,60.65
180,13.54,6.11,309.50,27.26,62.54
# end of file

You do not have this file on your working directory, so let me create if for you using a bit of Python magick:

with open('catalysis.csv', 'w') as fd:
    fd.write(
"""
# Time,NO3,NO2,N2,NH3,N2O
0,500.00,0.00,0.00,0.00,0.00
30,250.95,107.32,18.51,3.33,4.98
60,123.66,132.33,74.85,7.34,20.14
90,84.47,98.81,166.19,13.14,42.10
120,30.24,38.74,249.78,19.54,55.98
150,27.94,10.42,292.32,24.07,60.65
180,13.54,6.11,309.50,27.26,62.54
"""
    )

Now you have it. If you are working on Google Colab (or your Mac or Linux machine) you can verify that the file exists:

!ls
catalysis.csv         intro.md
csv-files.ipynb       python-matrices.ipynb

It is just a text file. You can see its contents like this:

!cat catalysis.csv
# Time,NO3,NO2,N2,NH3,N2O
0,500.00,0.00,0.00,0.00,0.00
30,250.95,107.32,18.51,3.33,4.98
60,123.66,132.33,74.85,7.34,20.14
90,84.47,98.81,166.19,13.14,42.10
120,30.24,38.74,249.78,19.54,55.98
150,27.94,10.42,292.32,24.07,60.65
180,13.54,6.11,309.50,27.26,62.54

Let’s use numpy to load the file. Here is how:

import numpy as np
data = np.loadtxt('catalysis.csv', delimiter=',')
data
array([[  0.  , 500.  ,   0.  ,   0.  ,   0.  ,   0.  ],
       [ 30.  , 250.95, 107.32,  18.51,   3.33,   4.98],
       [ 60.  , 123.66, 132.33,  74.85,   7.34,  20.14],
       [ 90.  ,  84.47,  98.81, 166.19,  13.14,  42.1 ],
       [120.  ,  30.24,  38.74, 249.78,  19.54,  55.98],
       [150.  ,  27.94,  10.42, 292.32,  24.07,  60.65],
       [180.  ,  13.54,   6.11, 309.5 ,  27.26,  62.54]])

That’s it. You can now do whatever you want with the data. For example, let’s see if mass is conserved in this reaction. First, take everything except the first column (the first column is time, all the rest are masses):

data[:, 1:]
array([[500.  ,   0.  ,   0.  ,   0.  ,   0.  ],
       [250.95, 107.32,  18.51,   3.33,   4.98],
       [123.66, 132.33,  74.85,   7.34,  20.14],
       [ 84.47,  98.81, 166.19,  13.14,  42.1 ],
       [ 30.24,  38.74, 249.78,  19.54,  55.98],
       [ 27.94,  10.42, 292.32,  24.07,  60.65],
       [ 13.54,   6.11, 309.5 ,  27.26,  62.54]])

Now, let’s sum all columns for each row:

data[:, 1:].sum(axis=1)
array([500.  , 385.09, 358.32, 404.71, 394.28, 415.4 , 418.95])

We see that the mass is 500 grams initially, and then it decreases. This means that there is one (or more) intermediate chemical species that are have not been measured! You cannot violate the conservation of mass (at these engergies).

Saving data in csv files

Let me show you how simple it is to save data in csv files. First, let us create some synthetic data at random:

np.random.seed(12345) # to get the same numbers
X = np.random.randn(10, 3)
X
array([[-2.04707659e-01,  4.78943338e-01, -5.19438715e-01],
       [-5.55730304e-01,  1.96578057e+00,  1.39340583e+00],
       [ 9.29078767e-02,  2.81746153e-01,  7.69022568e-01],
       [ 1.24643474e+00,  1.00718936e+00, -1.29622111e+00],
       [ 2.74991633e-01,  2.28912879e-01,  1.35291684e+00],
       [ 8.86429341e-01, -2.00163731e+00, -3.71842537e-01],
       [ 1.66902531e+00, -4.38569736e-01, -5.39741446e-01],
       [ 4.76985010e-01,  3.24894392e+00, -1.02122752e+00],
       [-5.77087303e-01,  1.24121276e-01,  3.02613562e-01],
       [ 5.23772068e-01,  9.40277775e-04,  1.34380979e+00]])

We have 10 rows and 3 columns (whatever they mean). To save them in a csv file, we do:

np.savetxt('random_data.csv', X, delimiter=',')

Here is the file:

!ls random_data.csv
random_data.csv

and here are the contents:

!cat random_data.csv
-2.047076594847129505e-01,4.789433380575482402e-01,-5.194387150567381095e-01
-5.557303043474900450e-01,1.965780572502714207e+00,1.393405832972990366e+00
9.290787674371767113e-02,2.817461528302024898e-01,7.690225676118387366e-01
1.246434736386282216e+00,1.007189357583004874e+00,-1.296221109112263514e+00
2.749916334321240163e-01,2.289128789353159221e-01,1.352916835165449694e+00
8.864293405915888435e-01,-2.001637309660397435e+00,-3.718425371402543922e-01
1.669025309524870604e+00,-4.385697358355719055e-01,-5.397414455216628104e-01
4.769850104122995149e-01,3.248943919430754779e+00,-1.021227524355596827e+00
-5.770873030407671633e-01,1.241212756734077355e-01,3.026135619125113752e-01
5.237720681504165476e-01,9.402777753328851306e-04,1.343809793614132220e+00