{ "cells": [ { "cell_type": "markdown", "id": "be8cb81f", "metadata": {}, "source": [ "(lecture07:estimating-probabilities-from-data)=\n", "# Estimating probabilities from data - Bootstrapping\n", "\n", "You can use the same idea we used in simulations to estimate probabilities from experiments.\n", "So, if $I$ is the background information and $A$ is a logical proposition that is experimentally testable, then\n", "\n", "$$\n", "p(A|I) \\approx \\frac{\\text{Number of times}\\;A\\;\\text{is True under}\\;I\\;\\text{in}\\;N\\;\\text{experiments}}{N}.\n", "$$\n", "\n", "There is a catch here.\n", "The experiments must be done *independently*.\n", "This means that you should prepare any apparatus you are using in exactly the same way for all experiments, and that no experiment should affect any other in any way.\n", "Most of the experiments we run in a lab are independent.\n", "However, this assumption may be wrong for data collected in the wild." ] }, { "cell_type": "markdown", "id": "4629d289", "metadata": {}, "source": [ "(lecture07:example-high-performance buildings)=\n", "## Example - Estimating the probability of excessive energy use\n", "\n", "Let's try this in practice using the high-performance building dataset.\n", "I'm importing the libraries and loading the data below." 
] }, { "cell_type": "code", "execution_count": 6, "id": "32ca28be", "metadata": { "tags": [ "hide-input", "hide-output" ] }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "%matplotlib inline\n", "import seaborn as sns\n", "sns.set(rc={\"figure.dpi\":100, 'savefig.dpi':300})\n", "sns.set_context('notebook')\n", "sns.set_style(\"ticks\")\n", "# IPython.display.set_matplotlib_formats is deprecated; use the matplotlib_inline version\n", "from matplotlib_inline.backend_inline import set_matplotlib_formats\n", "set_matplotlib_formats('retina', 'svg')\n", "import requests\n", "import os\n", "def download(url, local_filename=None):\n", "    \"\"\"\n", "    Download the file at ``url`` and save it in the current working directory.\n", "    \"\"\"\n", "    data = requests.get(url)\n", "    data.raise_for_status()  # stop early on HTTP errors\n", "    if local_filename is None:\n", "        local_filename = os.path.basename(url)\n", "    with open(local_filename, 'wb') as fd:\n", "        fd.write(data.content)\n", "\n", "# The url of the file we want to download\n", "url = 'https://raw.githubusercontent.com/PurdueMechanicalEngineering/me-297-intro-to-data-science/master/data/temperature_raw.xlsx'\n", "download(url)\n", "import numpy as np\n", "import pandas as pd\n", "df = pd.read_excel('temperature_raw.xlsx')\n", "# Drop rows with missing values and parse the dates\n", "df = df.dropna(axis=0)\n", "df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')" ] }, { "cell_type": "markdown", "id": "b13a1133", "metadata": {}, "source": [ "The background information $I$ is as follows:\n", "\n", "> A random household is picked on a random week during the heating season.\n", "> The heating season is defined to be the time of the year during which the \n", "> weekly average of the external temperature is less than 55 degrees F.\n", "\n", "The logical proposition $A$ is:\n", "\n", "> The weekly HVAC energy consumption of the household exceeds 400 kWh.\n", "\n", "We start by selecting the subset of the data that pertains to the heating season." ] }, { "cell_type": "code", "execution_count": 3, "id": "6ffb31fa", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | household | date | score | t_out | t_unit | hvac |
---|---|---|---|---|---|---|
0 | a1 | 2018-01-07 | 100.0 | 4.283373 | 66.693229 | 246.473231 |
1 | a10 | 2018-01-07 | 100.0 | 4.283373 | 66.356134 | 5.492116 |
2 | a11 | 2018-01-07 | 58.0 | 4.283373 | 71.549132 | 402.094327 |
3 | a12 | 2018-01-07 | 64.0 | 4.283373 | 73.429514 | 211.692244 |
4 | a13 | 2018-01-07 | 100.0 | 4.283373 | 63.923937 | 0.850536 |
... | ... | ... | ... | ... | ... | ... |
5643 | c44 | 2020-02-25 | 59.0 | 43.642388 | 76.494637 | 19.135139 |
5644 | c45 | 2020-02-25 | 87.0 | 43.642388 | 71.165052 | 30.794281 |
5646 | c47 | 2020-02-25 | 97.0 | 43.642388 | 68.603287 | 5.339391 |
5647 | c48 | 2020-02-25 | 92.0 | 43.642388 | 73.429239 | 18.040078 |
5649 | c50 | 2020-02-25 | 59.0 | 43.642388 | 77.716955 | 14.405155 |
2741 rows × 6 columns
\n", "\n", " | household | date | score | t_out | t_unit | hvac |
---|---|---|---|---|---|---|
5456 | a15 | 2020-02-09 | 80.0 | 38.123983 | 75.235615 | 103.460516 |
2384 | c35 | 2018-12-02 | 98.0 | 36.919444 | 70.374578 | 103.333777 |
5075 | b26 | 2019-12-15 | 78.0 | 36.130754 | 73.945914 | 75.234305 |
2470 | b21 | 2018-12-16 | 22.0 | 35.100620 | 76.782192 | 0.000000 |
645 | c46 | 2018-04-01 | 77.0 | 45.607391 | 72.822520 | 46.597443 |
... | ... | ... | ... | ... | ... | ... |
5243 | c44 | 2020-01-05 | 53.0 | 42.495313 | 77.833209 | 117.672719 |
5033 | c34 | 2019-12-08 | 94.0 | 37.535392 | 69.029183 | 173.136284 |
2763 | a7 | 2019-01-27 | 45.0 | 21.504117 | 77.148845 | 260.506823 |
2877 | b28 | 2019-02-10 | 49.0 | 38.192808 | 73.968347 | 12.313202 |
5022 | b23 | 2019-12-08 | 66.0 | 37.535392 | 77.679315 | 73.887601 |
500 rows × 6 columns
\n", "