Selecting dataframe rows that satisfy a boolean expression#

We are now going to put to use what we learned about Python boolean expressions to extract rows from a dataframe that satisfy certain criteria.

Extract rows that satisfy single boolean expression#

Let’s do this by example. Let’s load again the temperature_raw.xlsx dataset we played with in Models are functions.

Hide code cell source
MAKE_BOOK_FIGURES=False

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
import seaborn as sns
sns.set_context("paper")
sns.set_style("ticks")

def set_book_style():
    plt.style.use('seaborn-v0_8-white') 
    sns.set_style("ticks")
    sns.set_palette("deep")

    mpl.rcParams.update({
        # Font settings
        'font.family': 'serif',  # For academic publishing
        'font.size': 8,  # As requested, 10pt font
        'axes.labelsize': 8,
        'axes.titlesize': 8,
        'xtick.labelsize': 7,  # Slightly smaller for better readability
        'ytick.labelsize': 7,
        'legend.fontsize': 7,
        
        # Line and marker settings for consistency
        'axes.linewidth': 0.5,
        'grid.linewidth': 0.5,
        'lines.linewidth': 1.0,
        'lines.markersize': 4,
        
        # Layout to prevent clipped labels
        'figure.constrained_layout.use': True,
        
        # Default DPI (will override when saving)
        'figure.dpi': 600,
        'savefig.dpi': 600,
        
        # Despine - remove top and right spines
        'axes.spines.top': False,
        'axes.spines.right': False,
        
        # Remove legend frame
        'legend.frameon': False,
        
        # Additional trim settings
        'figure.autolayout': True,  # Alternative to constrained_layout
        'savefig.bbox': 'tight',    # Trim when saving
        'savefig.pad_inches': 0.1   # Small padding to ensure nothing gets cut off
    })

def save_for_book(fig, filename, is_vector=True, **kwargs):
    """
    Save a figure with book-optimized settings.
    
    Parameters:
    -----------
    fig : matplotlib figure
        The figure to save
    filename : str
        Filename without extension
    is_vector : bool
        If True, saves as vector at 1000 dpi. If False, saves as raster at 600 dpi.
    **kwargs : dict
        Additional kwargs to pass to savefig
    """    
    # Set appropriate DPI and format based on figure type
    if is_vector:
        dpi = 1000
        ext = '.pdf'
    else:
        dpi = 600
        ext = '.tif'
    
    # Save the figure with book settings
    fig.savefig(f"{filename}{ext}", dpi=dpi, **kwargs)


def make_full_width_fig():
    return plt.subplots(figsize=(4.7, 2.9), constrained_layout=True)

def make_half_width_fig():
    return plt.subplots(figsize=(2.35, 1.45), constrained_layout=True)

if MAKE_BOOK_FIGURES:
    set_book_style()
make_full_width_fig = make_full_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()
make_half_width_fig = make_half_width_fig if MAKE_BOOK_FIGURES else lambda: plt.subplots()
!curl -O 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-239-intro-to-data-science/master/data/temperature_raw.xlsx'

import pandas as pd
df = pd.read_excel('temperature_raw.xlsx')
df = df.dropna(axis=0)
df.date = pd.to_datetime(df['date'], format='%Y-%m-%d')
df.head()
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  277k  100  277k    0     0   937k      0 --:--:-- --:--:-- --:--:--  939k
household date score t_out t_unit hvac
0 a1 2018-01-07 100.0 4.283373 66.693229 246.473231
1 a10 2018-01-07 100.0 4.283373 66.356134 5.492116
2 a11 2018-01-07 58.0 4.283373 71.549132 402.094327
3 a12 2018-01-07 64.0 4.283373 73.429514 211.692244
4 a13 2018-01-07 100.0 4.283373 63.923937 0.850536

Recall that we made a simple model between the weekly average of the consumed energy in kWh hvac and the average weekly external temperature t_out. The model was:

def hvac_model(t_out, a, b, t_out_min=60):
    """
    A naïve model of weekly HVAC energy consumption (kWh) as a function of external temperature t_out.
    The mathematical form of the model is:
    
        hvac = a * (t_out - t_out_min)^2 + b
    
    Arguments:
    t_out      -   The average external temperature in degrees F (average over a week).
    a          -   A parameter to be calibrated using observed data. In units of kWh / (deg F)^2.
    b          -   Another parameter to be calibrated using observed data. This is in units of kWh.
                   It is the energy consumption when the HVAC system is not used.
    t_out_min  -   The external temperature above at which the occupants feel comfortable without using
                   their HVAC system.
    
    Returns: The weekly HVAC energy consumption in kWh.
    """
    return a * (t_out - t_out_min) ** 2 + b

and the predictions looked like this:

Hide code cell source
import numpy as np

a = 0.1 # in kWh / F^2
b = 50 # in kWh
fig, ax = make_full_width_fig()

# First the scatter plot of all the data we have
ax.scatter(df['t_out'], df['hvac'], label='Observed data', s=0.5)
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')

# Now pick some temperatures to use our model on:
t_outs = np.linspace(df['t_out'].min(), df['t_out'].max(), 100)
predicted_hvac = hvac_model(t_outs, a, b)
ax.plot(t_outs, predicted_hvac, 'r', label='Model predictons')
plt.legend(loc='best');
../_images/a8875a6aee49c6be3f4468305d7ba09d9e71bf941f76ad3a6241a24fb5766006.svg

Clearly, this model represents some sort of average behavior. Let’s use see how specific households perform. Here is how many households we have:

df['household'].unique()
array(['a1', 'a10', 'a11', 'a12', 'a13', 'a16', 'a2', 'a3', 'a4', 'a5',
       'a6', 'a7', 'a8', 'a9', 'b17', 'b18', 'b19', 'b20', 'b21', 'b22',
       'b23', 'b24', 'b25', 'b26', 'b28', 'b29', 'b30', 'b31', 'b33',
       'c34', 'c35', 'c36', 'c37', 'c38', 'c39', 'c40', 'c41', 'c42',
       'c43', 'c44', 'c45', 'c46', 'c47', 'c48', 'c49', 'c50', 'a15',
       'a14', 'b32', 'b27'], dtype=object)

How can we extract the data for say household a1? You can do it as follows. First, notice that you can use a boolean expression directly on the column household. The result is an an array with True or False indicating the value of the boolean expression on the corresponding rows. Here it is:

df['household'] == 'a1'
0        True
1       False
2       False
3       False
4       False
        ...  
5643    False
5644    False
5646    False
5647    False
5649    False
Name: household, Length: 4887, dtype: bool

Now, if you feed this array of booleans to the dataframe, it will understand that you only want to keep the rows that are True. Here it is:

df[df['household'] == 'a1'].round(2)
household date score t_out t_unit hvac
0 a1 2018-01-07 100.0 4.28 66.69 246.47
50 a1 2018-01-14 98.0 33.44 67.81 116.95
100 a1 2018-01-21 100.0 19.58 66.51 116.13
150 a1 2018-01-28 96.0 41.08 69.58 61.09
200 a1 2018-02-04 96.0 30.07 70.09 61.35
... ... ... ... ... ... ...
4700 a1 2019-10-27 97.0 54.83 73.62 8.56
4750 a1 2019-11-03 98.0 45.93 72.46 14.33
4800 a1 2019-11-10 97.0 40.18 72.45 10.76
4850 a1 2019-11-17 99.0 32.20 71.45 11.83
4900 a1 2019-11-24 97.0 39.33 70.90 8.46

82 rows × 6 columns

So, there are 82 rows corresponding to this household. If you wanted to find rows of another household, say b17, you do this:

df[df['household'] == 'b17'].round(2)
household date score t_out t_unit hvac
16 b17 2018-01-07 73.0 4.28 74.91 330.89
66 b17 2018-01-14 70.0 33.44 75.11 163.87
116 b17 2018-01-21 69.0 19.58 74.96 296.65
166 b17 2018-01-28 70.0 41.08 74.63 146.17
216 b17 2018-02-04 79.0 30.07 74.34 244.35
... ... ... ... ... ... ...
5416 b17 2020-02-02 92.0 33.66 70.94 101.67
5466 b17 2020-02-09 95.0 38.12 71.55 109.25
5516 b17 2020-02-16 92.0 28.55 71.43 182.11
5566 b17 2020-02-23 92.0 33.93 69.29 139.91
5616 b17 2020-02-25 95.0 43.64 72.25 8.29

113 rows × 6 columns

So, we have 113 rows for household b17. The difference in the entries is likely due to sensor malfunction. Now, let’s redo our scatter plot but using different colors for a1 and b17.

fig, ax = make_full_width_fig()

ax.scatter(df['t_out'], df['hvac'], label='All households', s=0.5)

# Do the scatter plot for a1
a1_df = df[df['household'] == 'a1']
ax.scatter(a1_df['t_out'], a1_df['hvac'], label='Household a1', s=10, marker='s')
# and the scatter plot for b17
b17_df = df[df['household'] == 'b17']
ax.scatter(b17_df['t_out'], b17_df['hvac'], label='Household b17', s=10, marker='d')

# The labels
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')

# Plot also our model.
predicted_hvac = hvac_model(t_outs, a, b)
ax.plot(t_outs, predicted_hvac, 'r', label='Model predictons')
plt.legend(loc='best')

save_for_book(fig, 'ch6.fig1');
../_images/75e319edcf724be28e8c9a10e3c96c0766915d9f01a24e5145b2277e5869eaab.svg

Notice the difference in behavior between the two households. Household a1 is using less energy during the heating season, but more during the cooling season.

Questions#

  • Modify the plot above to add data from household c36.

Plotting timeseries data#

Since we have extracted the data for units a1 and b17, we have a good opportunity to demonstrate another set of useful plots that you can do with dataframes. We can now plot t_unit as a function of the date for each one of these units. Let’s do it step by step. First, see this plot:

fig, ax = plt.subplots()
a1_df.plot(x='date', y='t_unit', label='Household a1', ax=ax)
<Axes: xlabel='date'>
../_images/1e96e1ce38b5a53a3084514892782de66a7f7acb5f60846df8fb55f9057910cf.svg

Now, there are some data missing during Spring of 2019 and the dataframe is automatically interpolating. We could change the style of the plot so that we see the individual data points.

fig, ax = plt.subplots()
a1_df.plot(x='date', y='t_unit', label='Household a1', style='.-', fig=fig, ax=ax);
../_images/7d4c05ec6216365c83a8e5e6529a63c01db766baac8ca2ade0bfe2c23b913548.svg

And here is how you can add the temperature of another unit:

fig, ax = make_full_width_fig()
a1_df.plot(x='date', y='t_unit', label='Household a1', style='.-', ax=ax)
b17_df.plot(x='date', y='t_unit', label='Household b17', style='d--', ax=ax)
ax.set_ylabel('Unit temperature (degrees F)')
save_for_book(fig, 'ch6.fig2');
../_images/51df8825890c869a7c49fbdb6d79087631a1072c0baad889b04e18aec40587d0.svg

We confirm that

Questions#

  • In the figure of t_unit vs date above, add data from household c36.

  • For households a1, b17 and c36 plot hvac as a function of the date.

So far we manually extracted the data we needed for each unit. Now, I will show you how you can do it for as many units as you like using a for loop. Let’s say we have a list of households we want to plot. We need to plot the data for each household using a different marker. How can we loop over pairs of household names and markers? The most elegant way to do this is by using the zip function:

households = ['a1', 'a2', 'a3', 'a4', 'a5']
markers = ['o', 's', 'd', 'p', 'x']
for h, m in zip(households, markers):
    print(h, m)
a1 o
a2 s
a3 d
a4 p
a5 x

The zip function takes two lists and returns a list of pairs. Actually, you get back a object called an \emph{iterator}, but you can think of it as a list of pairs. Each pair contains one element from the first list and one from the second. Let’s redo our scatter plot for a list of households:

fig, ax = make_full_width_fig()
ax.scatter(df['t_out'], df['hvac'], label='All households', s=0.5)
households = ['a1', 'a2', 'a3', 'a4', 'a5']
markers = ['o', 's', 'd', 'p', 'x']
for h, m in zip(households, markers):
    df_tmp = df[df['household'] == h]
    ax.scatter(df_tmp['t_out'], df_tmp['hvac'], label='Household ' + h, marker=m)
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')
plt.legend(loc='best')
save_for_book(fig, 'ch6.fig3');
../_images/71c1c15d14d45a64087d625a3219c123557ae8cf94f36bf541501ab9f499332f.svg

Questions#

  • Modify the code above so that you plot the scatter plot for the first 5 of the c?? households.

  • For the same households as above, plot hvac as a function of date.

Extract rows that satisfy more complicated boolean expressions#

Let’s say that we want to see how the selection of t_unit (temperature setpoint) affects hvac (energy consumption) during the heating season. We are on heating season for sure when the external temperature is smaller than 55 degrees F. So, we need to extract all the rows with df['t_out'] < 55.

df_heating = df[df['t_out'] < 55]
df_heating.round(2)
household date score t_out t_unit hvac
0 a1 2018-01-07 100.0 4.28 66.69 246.47
1 a10 2018-01-07 100.0 4.28 66.36 5.49
2 a11 2018-01-07 58.0 4.28 71.55 402.09
3 a12 2018-01-07 64.0 4.28 73.43 211.69
4 a13 2018-01-07 100.0 4.28 63.92 0.85
... ... ... ... ... ... ...
5643 c44 2020-02-25 59.0 43.64 76.49 19.14
5644 c45 2020-02-25 87.0 43.64 71.17 30.79
5646 c47 2020-02-25 97.0 43.64 68.60 5.34
5647 c48 2020-02-25 92.0 43.64 73.43 18.04
5649 c50 2020-02-25 59.0 43.64 77.72 14.41

2741 rows × 6 columns

Now, we are going to draw the scatter plot between t_out and hvac, but we are going to color the points according to the unit temperature t_unit. Let’s first pick the rows with t_unit between 70 and 72 degrees F.

df_heating_70to72 = df_heating[df_heating['t_unit'] >= 70 and
                               df_heating['t_unit'] < 72]
df_heating_70to72.head()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/3n/r5vj11ss7lzcdl10vfhb_mw00000gs/T/ipykernel_93904/3118749747.py in ?()
----> 1 df_heating_70to72 = df_heating[df_heating['t_unit'] >= 70 and
      2                                df_heating['t_unit'] < 72]
      3 df_heating_70to72.head()

~/.pyenv/versions/3.12.5/lib/python3.12/site-packages/pandas/core/generic.py in ?(self)
   1575     @final
   1576     def __nonzero__(self) -> NoReturn:
-> 1577         raise ValueError(
   1578             f"The truth value of a {type(self).__name__} is ambiguous. "
   1579             "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
   1580         )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

The problem here is that on the left and on the right of the logical and operator you have arrays of booleans not just booleans. Python does not know what to do with them. The correct operator to use to represent logical AND in this situation is & (which you can also read as AND). Let’s try again. This time we are going to get a really big error message. I’m going to hide it.

df_heating_70to72 = df_heating[df_heating['t_unit'] >= 70 &
                               df_heating['t_unit'] < 72]
df_heating_70to72.head()
Hide code cell output
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/.pyenv/versions/3.12.5/lib/python3.12/site-packages/pandas/core/ops/array_ops.py:362, in na_logical_op(x, y, op)
    353 try:
    354     # For exposition, write:
    355     #  yarr = isinstance(y, np.ndarray)
   (...)
    360     # Then Cases where this goes through without raising include:
    361     #  (xint or xbool) and (yint or bool)
--> 362     result = op(x, y)
    363 except TypeError:

File ~/.pyenv/versions/3.12.5/lib/python3.12/site-packages/pandas/core/roperator.py:54, in rand_(left, right)
     53 def rand_(left, right):
---> 54     return operator.and_(right, left)

TypeError: ufunc 'bitwise_and' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
File ~/.pyenv/versions/3.12.5/lib/python3.12/site-packages/pandas/core/ops/array_ops.py:376, in na_logical_op(x, y, op)
    375 try:
--> 376     result = libops.scalar_binop(x, y, op)
    377 except (
    378     TypeError,
    379     ValueError,
   (...)
    382     NotImplementedError,
    383 ) as err:

File ops.pyx:180, in pandas._libs.ops.scalar_binop()

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'double'

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
Cell In[72], line 1
----> 1 df_heating_70to72 = df_heating[df_heating['t_unit'] >= 70 &
      2                                df_heating['t_unit'] < 72]
      3 df_heating_70to72.head()

File ~/.pyenv/versions/3.12.5/lib/python3.12/site-packages/pandas/core/ops/common.py:76, in _unpack_zerodim_and_defer.<locals>.new_method(self, other)
     72             return NotImplemented
     74 other = item_from_zerodim(other)
---> 76 return method(self, other)

File ~/.pyenv/versions/3.12.5/lib/python3.12/site-packages/pandas/core/arraylike.py:74, in OpsMixin.__rand__(self, other)
     72 @unpack_zerodim_and_defer("__rand__")
     73 def __rand__(self, other):
---> 74     return self._logical_method(other, roperator.rand_)

File ~/.pyenv/versions/3.12.5/lib/python3.12/site-packages/pandas/core/series.py:6130, in Series._logical_method(self, other, op)
   6127 lvalues = self._values
   6128 rvalues = extract_array(other, extract_numpy=True, extract_range=True)
-> 6130 res_values = ops.logical_op(lvalues, rvalues, op)
   6131 return self._construct_result(res_values, name=res_name)

File ~/.pyenv/versions/3.12.5/lib/python3.12/site-packages/pandas/core/ops/array_ops.py:454, in logical_op(left, right, op)
    450 else:
    451     # i.e. scalar
    452     is_other_int_dtype = lib.is_integer(rvalues)
--> 454 res_values = na_logical_op(lvalues, rvalues, op)
    456 # For int vs int `^`, `|`, `&` are bitwise operators and return
    457 #   integer dtypes.  Otherwise these are boolean ops
    458 if not (left.dtype.kind in "iu" and is_other_int_dtype):

File ~/.pyenv/versions/3.12.5/lib/python3.12/site-packages/pandas/core/ops/array_ops.py:385, in na_logical_op(x, y, op)
    377         except (
    378             TypeError,
    379             ValueError,
   (...)
    382             NotImplementedError,
    383         ) as err:
    384             typ = type(y).__name__
--> 385             raise TypeError(
    386                 f"Cannot perform '{op.__name__}' with a dtyped [{x.dtype}] array "
    387                 f"and scalar of type [{typ}]"
    388             ) from err
    390 return result.reshape(x.shape)

TypeError: Cannot perform 'rand_' with a dtyped [float64] array and scalar of type [bool]

The problem here is the order in which the expression is evaluated. The operator & has a special meaning in Python (we are not going to get into it) and it has high priority. So, python tries to evaluate first 68 & df_heating['t_unit'] which is meanigless. What needs to happen here is that is should first evaluate df_heating['t_unit'] >= 68 to get an array of booleans, then evaluate df_heating['t_unit'] < 70 to get another array of booleans, and the use & to logically compare the arrays pointwise. We can enforce this order using parentheses. Whatever Python expression is enclosed in parentheses will be evaluated first. So, here is the correct way to do it:

df_heating_70to72 = df_heating[(df_heating['t_unit'] >= 70) &
                               (df_heating['t_unit'] < 72)]
df_heating_70to72.head().round(2)
household date score t_out t_unit hvac
2 a11 2018-01-07 58.0 4.28 71.55 402.09
17 b18 2018-01-07 87.0 4.28 70.61 368.15
25 b26 2018-01-07 97.0 4.28 70.01 36.65
29 b30 2018-01-07 99.0 4.28 71.42 23.68
34 c35 2018-01-07 96.0 4.28 70.56 349.03

Here is the plot:

fig, ax = make_full_width_fig()
ax.scatter(df_heating['t_out'], df_heating['hvac'], label='All households', s=0.5)
ax.scatter(df_heating_70to72['t_out'], df_heating_70to72['hvac'], 
           label='t_unit in [70, 72)', marker='s', s=2)
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')
plt.legend(loc='best')
save_for_book(fig, 'ch6.fig4');
../_images/117dc7b7917c74b1f19d9ddb7c95875a71597d988302ecbc0b5efd5b6a77aa0f.svg

We see that the units with t_unit between 70 and 72 degrees F tend to consume less energy. There are some outliers, but the trend is clear. What can explain the outliers though? How can two households with the same internal temperature t_unit have very different energy consumption? There are other variables here that we are not observing. For example, the location of the unit in the building. A unit that is at the corner of the building will have different heat loss than a unit that is in the middle of the building. Another explanation is that the occupants may be opening the windows more often in one of the appartments. Such variables are called \emph{confounding variables}. You should always be on the lookout for them.

Let’s demonstrate a logical-or for selecting rows. As you may have guessed or does not work. We need to use the operator |. And yes, we need to enclose everything in parentheses. Let’s select the entries that highlight wasteful energy consuming behavior. That would be units that have a very high t_unit (say above 78 degrees F) during heating or a very low t_unit during cooling (say below 70 degrees F). Let’s see how much energy they consume.

df_bad_behavior = df[((df['t_out'] < 55) & (df['t_unit'] > 78)) |
                     ((df['t_out'] >= 55) & (df['t_unit'] < 70))]
df_bad_behavior.head().round(2)
household date score t_out t_unit hvac
15 a9 2018-01-07 39.0 4.28 78.71 495.23
22 b23 2018-01-07 44.0 4.28 78.62 359.17
65 a9 2018-01-14 41.0 33.44 78.56 321.83
72 b23 2018-01-14 43.0 33.44 78.68 197.92
115 a9 2018-01-21 41.0 19.58 78.57 431.24
fig, ax = make_full_width_fig()
ax.scatter(df['t_out'], df['hvac'], label='All households', s=0.5)
ax.scatter(df_bad_behavior['t_out'], df_bad_behavior['hvac'], 
           label='Bad behavior', marker='s', s=2)
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')
plt.legend(loc='best')
save_for_book(fig, 'ch6.fig5');
../_images/613b4d54f29099e32689946c14320b638b739a1f69dc2e2f670ba7d3ca3fb1c8.svg

Notice that they are some entries that exhibit bad behavior without being penalized with excessive energy consumption. Can you think of a confounding variable that could explain this? Here is what is happening:

  • Units that are at the corners of the building expose at least two sides to the external environment.

  • Units that are on the top floor expose at least two sides to the external environment.

  • Units that are at a corner and on the top floor expose three sides to the external environment.

  • The rest of the units expose a single side to the external environment. The first three groups (and especially the third group) are loosing a lot of energy to the environment and their energy consumption will in general be higher. The final group is basically insulated from the environment and on top of that they may be getting heat energy from their neighbors. Behaving badly in a top floor, corner unit will be reflected on your energy bill. Behaving badly in an insulated unit may have no effect on your energy bill.

Let’s end with logical-not for selecting rows. Say we wanted to negate the previous selection so that we highlight all those who behaved well. For this, you use the ~ operator in front of the logical expression you want to negate. Like this:

df_good_behavior = df[~(((df['t_out'] < 55) & (df['t_unit'] > 72)) |
                     ((df['t_out'] >= 55) & (df['t_unit'] < 75)))]
df_good_behavior.head().round(2)
household date score t_out t_unit hvac
0 a1 2018-01-07 100.0 4.28 66.69 246.47
1 a10 2018-01-07 100.0 4.28 66.36 5.49
2 a11 2018-01-07 58.0 4.28 71.55 402.09
4 a13 2018-01-07 100.0 4.28 63.92 0.85
7 a16 2018-01-07 56.0 4.28 66.91 529.84

Again, pay attention to the parentheses because they are critical for the correct evaluation of the expression.

fig, ax = make_full_width_fig()
ax.scatter(df['t_out'], df['hvac'], label='All households', s=0.5)
ax.scatter(df_good_behavior['t_out'], df_good_behavior['hvac'], 
           label='Good behavior', color=sns.color_palette()[2], marker='s', s=2)
ax.set_xlabel('t_out (F)')
ax.set_ylabel('hvac (kWh)')
plt.legend(loc='best')
save_for_book(fig, 'ch6.fig6');
../_images/042621ecc2d4af3c86f017be1d071aa2f429938abdfe7bcfccfab705ea0ad6d3.svg

Most households behaving well consume less than 100 kWh of energy per week. However, there are some units that are using excellent temperature set points, but they are still consuming a lot of energy. This could be because of the confounding variables we discussed earlier. For example, the unit could be in a corner and the occupants could be opening the windows more often.