Selecting dataframe rows that satisfy a boolean expression

We are now going to put to use what we learned about Python boolean expressions to extract rows from a dataframe that satisfy certain criteria.

Extract rows that satisfy single boolean expression

Let’s do this by example. Let’s load again the temperature_raw.xlsx dataset we played with in Models are functions.

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(rc={"figure.dpi":100, 'savefig.dpi':300})
sns.set_context('notebook')
sns.set_style("ticks")
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina', 'svg')
import requests
import os
def download(url, local_filename=None):
    """
    Downloads the file in the ``url`` and saves it in the current working directory.
    """
    data = requests.get(url)
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)
   
# The url of the file we want to download
url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/temperature_raw.xlsx'
download(url)
import pandas as pd
df = pd.read_excel('temperature_raw.xlsx')
df = df.dropna(axis=0)
df.date = pd.to_datetime(df['date'], format='%Y-%m-%d')
df.head()
household date score t_out t_unit hvac
0 a1 2018-01-07 100.0 4.283373 66.693229 246.473231
1 a10 2018-01-07 100.0 4.283373 66.356134 5.492116
2 a11 2018-01-07 58.0 4.283373 71.549132 402.094327
3 a12 2018-01-07 64.0 4.283373 73.429514 211.692244
4 a13 2018-01-07 100.0 4.283373 63.923937 0.850536

Recall that we made a simple model between the weekly average of the consumed energy in kWh hvac and the average weekly external temperature t_out. The model was:

def hvac_model(t_out, a, b, t_out_min=60):
    """
    A naïve model of weekly HVAC energy consumption (kWh) as a function of external temperature t_out.
    The mathematical form of the model is:
    
        hvac = a * (t_out - t_out_min)^2 + b
    
    Arguments:
    t_out      -   The average external temperature in degrees F (average over a week).
    a          -   A parameter to be calibrated using observed data. In units of kWh / (deg F)^2.
    b          -   Another parameter to be calibrated using observed data. This is in units of kWh.
                   It is the energy consumption when the HVAC system is not used.
    t_out_min  -   The external temperature above at which the occupants feel comfortable without using
                   their HVAC system.
    
    Returns: The weekly HVAC energy consumption in kWh.
    """
    return a * (t_out - t_out_min) ** 2 + b

and the predictions looked like this:

import numpy as np

a = 0.1 # in kWh / F^2
b = 50 # in kWh
fig, ax = plt.subplots()

# First the scatter plot of all the data we have
ax.scatter(df['t_out'], df['hvac'], label='Observed data')
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')

# Now pick some temperatures to use our model on:
t_outs = np.linspace(df['t_out'].min(), df['t_out'].max(), 100)
predicted_hvac = hvac_model(t_outs, a, b)
ax.plot(t_outs, predicted_hvac, 'r', label='Model predictons')
plt.legend(loc='best');
../_images/python-pandas-selections_5_0.svg

Clearly, this model represents some sort of average behavior. Let’s use see how specific households perform. Here is how many households we have:

df['household'].unique()
array(['a1', 'a10', 'a11', 'a12', 'a13', 'a16', 'a2', 'a3', 'a4', 'a5',
       'a6', 'a7', 'a8', 'a9', 'b17', 'b18', 'b19', 'b20', 'b21', 'b22',
       'b23', 'b24', 'b25', 'b26', 'b28', 'b29', 'b30', 'b31', 'b33',
       'c34', 'c35', 'c36', 'c37', 'c38', 'c39', 'c40', 'c41', 'c42',
       'c43', 'c44', 'c45', 'c46', 'c47', 'c48', 'c49', 'c50', 'a15',
       'a14', 'b32', 'b27'], dtype=object)

How can we extract the data for say household a1? You can do it as follows. First, notice that you can use a boolean expression directly on the column household. The result is an an array with True or False indicating the value of the boolean expression on the corresponding rows. Here it is:

df['household'] == 'a1'
0        True
1       False
2       False
3       False
4       False
        ...  
5643    False
5644    False
5646    False
5647    False
5649    False
Name: household, Length: 4887, dtype: bool

Now, if you feed this array of booleans to the dataframe, it will understand that you only want to keep the rows that are True. Here it is:

df[df['household'] == 'a1']
household date score t_out t_unit hvac
0 a1 2018-01-07 100.0 4.283373 66.693229 246.473231
50 a1 2018-01-14 98.0 33.444172 67.812649 116.952821
100 a1 2018-01-21 100.0 19.584102 66.512783 116.128585
150 a1 2018-01-28 96.0 41.076513 69.575546 61.086148
200 a1 2018-02-04 96.0 30.065774 70.085743 61.349374
... ... ... ... ... ... ...
4700 a1 2019-10-27 97.0 54.829439 73.619056 8.560911
4750 a1 2019-11-03 98.0 45.925446 72.456811 14.331006
4800 a1 2019-11-10 97.0 40.175345 72.450988 10.756552
4850 a1 2019-11-17 99.0 32.199504 71.451835 11.828865
4900 a1 2019-11-24 97.0 39.327257 70.896187 8.457236

82 rows × 6 columns

So, there are 82 rows corresponding to this household. If you wanted to find rows of another household, say b17, you do this:

df[df['household'] == 'b17']
household date score t_out t_unit hvac
16 b17 2018-01-07 73.0 4.283373 74.913484 330.891743
66 b17 2018-01-14 70.0 33.444172 75.113492 163.867247
116 b17 2018-01-21 69.0 19.584102 74.962054 296.645428
166 b17 2018-01-28 70.0 41.076513 74.628770 146.174333
216 b17 2018-02-04 79.0 30.065774 74.340575 244.354832
... ... ... ... ... ... ...
5416 b17 2020-02-02 92.0 33.655308 70.938269 101.666995
5466 b17 2020-02-09 95.0 38.123983 71.553150 109.245547
5516 b17 2020-02-16 92.0 28.548760 71.428274 182.109547
5566 b17 2020-02-23 92.0 33.930233 69.286012 139.912270
5616 b17 2020-02-25 95.0 43.642388 72.251038 8.293659

113 rows × 6 columns

So, we have 113 rows for household b17. The difference in the entries is likely due to sensor malfunction. Now, let’s redo our scatter plot but using different colors for a1 and b17.

fig, ax = plt.subplots()

# First the scatter plot of all the data we have.
# Notice the keyword argument aplha=0.25. This makes the points transparent.
# alpha=1 means opaque and alpha=0 means invisible. So, something in between
# is transparent.
ax.scatter(df['t_out'], df['hvac'], label='All households', alpha=0.25)

# Do the scatter plot for a1
a1_df = df[df['household'] == 'a1']
ax.scatter(a1_df['t_out'], a1_df['hvac'], label='Household a1', alpha=0.5)
# and the scatter plot for b17
b17_df = df[df['household'] == 'b17']
ax.scatter(b17_df['t_out'], b17_df['hvac'], label='Household b17', alpha=0.5)

# The labels
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')

# Plot also our model.
predicted_hvac = hvac_model(t_outs, a, b)
ax.plot(t_outs, predicted_hvac, 'r', label='Model predictons')
plt.legend(loc='best');
../_images/python-pandas-selections_15_0.svg

Questions

  • Modify the plot above to add data from household c36.

Plotting timeseries data

Since we have extracted the data for units a1 and b17, we have a good opportunity to demonstrate another set of useful plots that you can do with dataframes. We can now plot t_unit as a function of the date for each one of these units. Let’s do it step by step. First, see this plot:

a1_df.plot(x='date', y='t_unit', label='Household a1');
../_images/python-pandas-selections_18_0.svg

Now, there are some data missing during Spring of 2019 and the dataframe is automatically interpolating. We could change the style of the plot so that we see the individual data points.

a1_df.plot(x='date', y='t_unit', label='Household a1', style='.-');
../_images/python-pandas-selections_20_0.svg

Sometimes, you may want to use your own figure object so that you can plot more stuff in it (e.g., titles, labels). You can do this by passing the ax keyword argument:

fig, ax = plt.subplots()
a1_df.plot(x='date', y='t_unit', label='Household a1', style='.-', ax=ax)
ax.set_ylabel('Unit temperature (degrees F)');
../_images/python-pandas-selections_22_0.svg

And here is how you can add the temperature of another unit:

fig, ax = plt.subplots()
a1_df.plot(x='date', y='t_unit', label='Household a1', style='.-', ax=ax)
b17_df.plot(x='date', y='t_unit', label='Household b17', style='d--', ax=ax)
ax.set_ylabel('Unit temperature (degrees F)');
../_images/python-pandas-selections_24_0.svg

Questions

  • In the figure of t_unit vs date above, add data from household c36.

  • For households a1, b17 and c36 plot hvac as a function of the date. Hint: Change the keyword y in Dataframe.plot().

Avoid writing a lot of code by using loops

So, far we manually extracted the data we need for each unit. Now, I will show you how you can do it for as many usints as you like using a for loop. Let’s redo our scatter plot for a list of households:

fig, ax = plt.subplots()
ax.scatter(df['t_out'], df['hvac'], label='All households', alpha=0.25)

# Loop over household names
for household_name in ['a1', 'a2', 'a3', 'a4', 'a5']:
    # Extract the relevant data
    df_tmp = df[df['household'] == household_name]
    # Plot the relevant data
    ax.scatter(df_tmp['t_out'], df_tmp['hvac'], label='Household ' + household_name, alpha=0.5)

ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')
plt.legend(loc='best');
../_images/python-pandas-selections_27_0.svg

Questions

  • Modify the code above so that you plot the scatter plot for households c34 to c359.

  • For the same households as above, plot hvac as a function of date.

Extract rows that satisfy more complicated boolean expressions

There is really no lilmit to how complicated the boolean expression you use can be. Let me give you some examples. Let’s say that we want to see how the selection of t_unit (temperature setpoint) affects energy consumption during the heating season. We are on heating season for sure when the external temperature is smaller than 55 degrees F. Okay, so we need to extract all the rows with df['t_unit'] < 55. Let’s do that:

df_heating = df[df['t_out'] < 55]
df_heating
household date score t_out t_unit hvac
0 a1 2018-01-07 100.0 4.283373 66.693229 246.473231
1 a10 2018-01-07 100.0 4.283373 66.356134 5.492116
2 a11 2018-01-07 58.0 4.283373 71.549132 402.094327
3 a12 2018-01-07 64.0 4.283373 73.429514 211.692244
4 a13 2018-01-07 100.0 4.283373 63.923937 0.850536
... ... ... ... ... ... ...
5643 c44 2020-02-25 59.0 43.642388 76.494637 19.135139
5644 c45 2020-02-25 87.0 43.642388 71.165052 30.794281
5646 c47 2020-02-25 97.0 43.642388 68.603287 5.339391
5647 c48 2020-02-25 92.0 43.642388 73.429239 18.040078
5649 c50 2020-02-25 59.0 43.642388 77.716955 14.405155

2741 rows × 6 columns

Now, we are going to draw the same scatter plot, but we are going to color the points according to the unit temperature t_unit. Let’s have pick the rows with t_unit between 70 and 72 degrees F.

df_heating_70to72 = df_heating[df_heating['t_unit'] >= 70 and
                               df_heating['t_unit'] < 72]
df_heating_70to72.head()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-2b2ecdc642cb> in <module>
----> 1 df_heating_70to72 = df_heating[df_heating['t_unit'] >= 70 and
      2                                df_heating['t_unit'] < 72]
      3 df_heating_70to72.head()

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/generic.py in __nonzero__(self)
   1440     @final
   1441     def __nonzero__(self):
-> 1442         raise ValueError(
   1443             f"The truth value of a {type(self).__name__} is ambiguous. "
   1444             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

The problem here is that the on the left and on the right of the logical and operator you have arrays of booleans not just booleans. So, Python does not know what to do with them. The correct operator to use to represent logical AND in this situation is & (which you can also read as AND). Let’s try again. This time we are going to get a really big error message. I’m going to hide it.

df_heating_70to72 = df_heating[df_heating['t_unit'] >= 70 &
                               df_heating['t_unit'] < 72]
df_heating_70to72.head()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/ops/array_ops.py in na_logical_op(x, y, op)
    264         #  (xint or xbool) and (yint or bool)
--> 265         result = op(x, y)
    266     except TypeError:

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/ops/roperator.py in rand_(left, right)
     51 def rand_(left, right):
---> 52     return operator.and_(right, left)
     53 

TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/ops/array_ops.py in na_logical_op(x, y, op)
    278             try:
--> 279                 result = libops.scalar_binop(x, y, op)
    280             except (

pandas/_libs/ops.pyx in pandas._libs.ops.scalar_binop()

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'double'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-28-530d5ff944c3> in <module>
----> 1 df_heating_70to72 = df_heating[df_heating['t_unit'] >= 70 &
      2                                df_heating['t_unit'] < 72]
      3 df_heating_70to72.head()

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/ops/common.py in new_method(self, other)
     63         other = item_from_zerodim(other)
     64 
---> 65         return method(self, other)
     66 
     67     return new_method

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/arraylike.py in __rand__(self, other)
     61     @unpack_zerodim_and_defer("__rand__")
     62     def __rand__(self, other):
---> 63         return self._logical_method(other, roperator.rand_)
     64 
     65     @unpack_zerodim_and_defer("__or__")

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in _logical_method(self, other, op)
   4987         rvalues = extract_array(other, extract_numpy=True)
   4988 
-> 4989         res_values = ops.logical_op(lvalues, rvalues, op)
   4990         return self._construct_result(res_values, name=res_name)
   4991 

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/ops/array_ops.py in logical_op(left, right, op)
    353         filler = fill_int if is_self_int_dtype and is_other_int_dtype else fill_bool
    354 
--> 355         res_values = na_logical_op(lvalues, rvalues, op)
    356         # error: Cannot call function of unknown type
    357         res_values = filler(res_values)  # type: ignore[operator]

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/ops/array_ops.py in na_logical_op(x, y, op)
    286             ) as err:
    287                 typ = type(y).__name__
--> 288                 raise TypeError(
    289                     f"Cannot perform '{op.__name__}' with a dtyped [{x.dtype}] array "
    290                     f"and scalar of type [{typ}]"

TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]

The problem here is that Python is the order in which the expression is evaluated. The operator & has a special meaning in Python (we are not going to get into it) and it has high priority. So, python tries to evaluate first 68 & df_heating['t_unit'] which is meanigless. What needs to happen here is that is should first evaluate df_heating['t_unit'] >= 68 to get an array of booleans, then evaluate df_heating['t_unit'] < 70 to get another array of booleans, and the use & to logically compare the arrays pointwise. We can enforce this order using parentheses. Whatever Python expression is enclosed in parentheses will be evaluated first. So, here is the correct way to do it:

df_heating_70to72 = df_heating[(df_heating['t_unit'] >= 70) &
                               (df_heating['t_unit'] < 72)]
df_heating_70to72.head()
household date score t_out t_unit hvac
2 a11 2018-01-07 58.0 4.283373 71.549132 402.094327
17 b18 2018-01-07 87.0 4.283373 70.608333 368.153883
25 b26 2018-01-07 97.0 4.283373 70.014178 36.645884
29 b30 2018-01-07 99.0 4.283373 71.423206 23.679263
34 c35 2018-01-07 96.0 4.283373 70.558912 349.025111

And let’s plot the scatter plot highlighting the different unit temeprature ranges with different colors:

fig, ax = plt.subplots()

ax.scatter(df_heating['t_out'], df_heating['hvac'], label='All households', alpha=0.25)
ax.scatter(df_heating_70to72['t_out'], df_heating_70to72['hvac'], 
           label='t_unit in [70, 72)', alpha=0.5)
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')
plt.legend(loc='best');
../_images/python-pandas-selections_38_0.svg

Let’s finish this section by demonstrating logical OR for selecting rows. As you may have guessed or does not work. We need to use the operator |. And yes, we need to enclose everything in parentheses. Let’s select the entries that exhibit bad energy consuming behavior. That would be units that have a very high t_unit (say above 78) during heating or a very low t_unit during cooling (say below 70). Let’s see how much energy they consume.

df_bad_behavior = df[((df['t_out'] < 55) & (df['t_unit'] > 78)) |
                     ((df['t_out'] >= 55) & (df['t_unit'] < 70))]
df_bad_behavior.head()
household date score t_out t_unit hvac
15 a9 2018-01-07 39.0 4.283373 78.709433 495.227311
22 b23 2018-01-07 44.0 4.283373 78.618808 359.167522
65 a9 2018-01-14 41.0 33.444172 78.557192 321.834468
72 b23 2018-01-14 43.0 33.444172 78.684325 197.924672
115 a9 2018-01-21 41.0 19.584102 78.572321 431.236407
fig, ax = plt.subplots()

ax.scatter(df['t_out'], df['hvac'], label='All households', alpha=0.25)
ax.scatter(df_bad_behavior['t_out'], df_bad_behavior['hvac'], 
           label='Bad behavior', alpha=0.5)

ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')
plt.legend(loc='best');
../_images/python-pandas-selections_41_0.svg

Notice that they are some entries that exhibit bad behavior without being penalized by excessive energy consumption. This is because of a third variable that affects energy consumption, i.e., the location of the unit in the building. Here is what is happening:

  • Units that are at the corners of the building expose at least two sides to the external environment.

  • Units that are on the top floor expose at least two sides to the external environment.

  • Units that are at a corner and on the top floor expose three sides to the external environment.

  • The rest of the units expose a single side to the external environment. The first three groups (and especially the third group) are loosing a lot of energy to the environment and their energy consumption will in general be higher. The final group is basically insulated from the environment and on top of that they may be getting heat energy from their neighbors. Behaving badly in a top floor, corner unit will be reflected on your energy bill. Behaving badly in an insulated unit may have no effect on your energy bill.

Let’s end with logical NOT. Say we wanted to negate the previous selection so that we highlight all those who behaved well. For this, you use the ~ operator in front of the logical expression you want to negate. Like this:

df_good_behavior = df[~(((df['t_out'] < 55) & (df['t_unit'] > 78)) |
                        ((df['t_out'] >= 55) & (df['t_unit'] < 70)))]
df_bad_behavior.head()
household date score t_out t_unit hvac
15 a9 2018-01-07 39.0 4.283373 78.709433 495.227311
22 b23 2018-01-07 44.0 4.283373 78.618808 359.167522
65 a9 2018-01-14 41.0 33.444172 78.557192 321.834468
72 b23 2018-01-14 43.0 33.444172 78.684325 197.924672
115 a9 2018-01-21 41.0 19.584102 78.572321 431.236407

Again, pay attention to the parentheses because they are critical for the correct evaluation of the expression.

fig, ax = plt.subplots()

ax.scatter(df['t_out'], df['hvac'], label='All households', alpha=0.25)
ax.scatter(df_good_behavior['t_out'], df_good_behavior['hvac'], 
           label='Good behavior', color='green', alpha=0.5)

ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')
plt.legend(loc='best');
../_images/python-pandas-selections_46_0.svg

Questions

  • The final histogram highlighting units with good behavior is not very useful because the range of t_unit temperatures we considered as good is too wide. Rerun the code block above using a narrower range of temperatures. Say a bad choice for heating is t_unit > 72 and a bad for cooling is t_unit < 75. Then, redo the scatter plot.