Homework 7

  • Type your name and email in the “Student details” section below.

  • Develop the code and generate the figures you need to solve the problems using this notebook.

  • For the answers that require a mathematical proof or derivation you can either:

    • Type the answer using the built-in latex capabilities. In this case, simply export the notebook as a pdf and upload it on gradescope; or

    • You can print the notebook (after you are done with all the code), write your answers by hand, scan, turn your response to a single pdf, and upload on gradescope.

  • The total homework points are 100. Please note that the problems are not weighed equally.

Note

  • This is due before the beginning of the next lecture.

  • Please match all the pages corresponding to each of the questions when you submit on gradescope.

# Here are some modules that you may need - please run this block of code:
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_context('paper')
import numpy as np
import scipy
import scipy.stats as st
# A helper function for downloading files
import requests
import os
def download(url, local_filename=None):
    """
    Downloads the file in the ``url`` and saves it in the current working directory.
    """
    data = requests.get(url)
    if local_filename is None:
        local_filename = os.path.basename(url)
    with open(local_filename, 'wb') as fd:
        fd.write(data.content)

Student details

  • First Name:

  • Last Name:

  • Email:

Problem 1 - Blackjack probabilities

Blackjack is a popular card game. The background information \(I\) captures the basic rules of the game relevant to this problem:

We have a deck of 52 cards. The deck includes: Four versions aces (A); Four versions of each number from 2 to 10; Four versions of the figures J, Q, and K. In blackjack all the cards are associated with a number. The cards that have a number on them are associated with that number. The figures J, Q, and K are associated with the number 10. The aces A can either be the number 1 or the number 11. The deck of cards is shuffled adequately.

Now consider the logical proposition \(A\) (blackjack):

You draw two cards at random from the deck without replacement. You either have two aces (AA) or the maximum sum of the numbers associated with the cards is 21. For example: (10, A), (J, A), etc.

2.A - Finding the probability of \(A\) using the principle of inssuficient reason

  • Find the number of ways in which you can choose two unique cards from the deck. Hint: Google “N choose k”.

Answer:

Your answer here.

  • Find the number ways in which you can get two cards that sum to 21. Hint: Enumerate all possibilities.

Answer:

Your answer here.

  • Find the probability that you pick two cards that sum to 21, i.e., find \(p(A|I)\). Hint: Use the principle of insufficient reason.

Answer:

Your answer here.

2.B - Estimating the probability of A by simmulation

In this problem, we are going to use Monte Carlo simulations to estimate the probability of picking two cards that sum to 21, i.e., \(p(A|I)\). Basically, we are going to simulate the process of picking these two cards. First, let’s start by making all the different cards that appear in a deck of 52. In what follows, I use the following conventions:

  • ‘d’ stands for ‘diamonds’.

  • ‘h’ stands for ‘hearts’.

  • ‘s’ stands for ‘spades’.

  • ‘c’ stands for ‘clubs’. Numbers stand for themselves. And finally:

  • ‘A’ for ‘ace’.

  • ‘J’ for ‘jack’.

  • ‘Q’ for ‘queen’.

  • ‘K’ for ‘king’. For example, this is if you see the string ‘2h’ then this is the ‘two of hearts’. If you see ‘Ad’ this is the ‘Ace of diamonds’. And so on. Let’s make a deck of cards:

deck = []
for n in ['A', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K']:
    for c in ['d', 'h', 's', 'c']:
        card = n + c
        deck.append(card)
print(deck)

We can use numpy.random.shuffle to shuffle the deck in place. Here is how:

np.random.shuffle(deck)
print(deck)

Once the deck is shuffled, you can pick two cards at random by just picking the first two cards of the deck:

my_cards = deck[:2]
print(my_cards)

Now, let’s write a function that calculates the sum of the cards. I wrote the function so that it only works with two cards. It will always use 11 for aces.

def count_cards(cards):
    """
    Counts cards according to blackjack conventions.
    
    Arguments:
    cards   -   Two cards. They much be a string from a deck.
    
    Returns: The blackjack value of the cards.
    """
    assert len(cards) == 2, 'This only works for two cards.'
    s = 0
    for c in cards:
        n = c[0]
        if n == 'A':
            s += 11
        elif n == 'J' or n == 'Q' or n == 'K':
            s += 10
        elif len(c) == 3: # this is the case of '10d', '10h', etc.
            s += 10
        else:
            s += int(n)
    return s

Let’s test it a few times:

print(my_cards)
print(count_cards(my_cards))

Do it ten times at random:

for i in range(10):
    np.random.shuffle(deck)
    my_cards = deck[:2]
    sum_of_cards = count_cards(my_cards)
    print(my_cards, ' sum to: ', sum_of_cards)
  • Now, we have everything we need. Complete the following code which use the Monte Carlo method to estimate the probability of randomly picking two cards that sum to 21. Feel free to experiment with the number of simulations so that you get an accurate estimate.

# The number of experiments you want to simulate
num_exp = 1000
# This is a list in which we are going to put the result
# of each "experiment". We will record a 1 (one) if the experiment
# is successful (cards sum to 21) and a 0 (zero) otherwise
result = []
# Loop over experiments
for i in range(num_exp):
    # YOUR CODE HERE (shuffle the deck)
    my_cards = # YOUR CODE HERE (pick the first two cards from the deck)
    sum_of_cards = # YOUR CODE HERE (find the sum of the cards)
    # YOUR CODE HERE (write a conditional statement that appends 1
    #                 to result if the sum of cards is 21 and 0 otherwise)
p_A_g_I = # YOUR CODE HERE (use result to estimate the probability of getting two
        #                 cards that sum to 21)
print('p(A|I) ~= {0:1.5f}'.format(p_A_g_I))
  • Plot the estimate of \(p(A|I)\) as a function of the number of experiments. In the same plot, use a red dashed line to mark the true value of \(p(A|I)\) based on your answer to the very first question. Hint: See the discussion at the very end of Estimating probabilities by simulation.

# Your code here

Problem 2 - Predicting the probability of major earthquakes in Southern California

We will use the Southern California Earthquake Data Center catalog in this problem. The catalog contains all earthquakes recorded from 1932 until now in Southern California. Do not worry about how I get the data. Just run the code and it will produce a nice dataframe that you can play with. Our goal is to estimate the probability of a major earthquake (to be defined below) somewhere in Southern California during a given year.

First, let’s download the data and put them in a dataframe.

for year in range(1932, 2021):
    print('Downloading year', year)
    url = 'https://raw.githubusercontent.com/SCEDC/SCEDC-catalogs/master/SCEC_DC/{0:d}.catalog'.format(year)
    download(url)

Each one of these is a csv file. We will put them all in the same daframe for your convenience:

import pandas as pd
list_of_dfs = []
for year in range(1932, 2021):
    print('Reading: ', filename)
    filename = '{0:d}.catalog'.format(year)
    df_year = pd.read_csv(filename, delim_whitespace=True, comment='#',
                      names=['Date', 'Hour', 'ET', 'GT', 'MAG', 'M', 'LAT', 'LON',
                               'DEPTH', 'Q', 'EVID', 'NPH', 'NGRM'])
    df_year.Date = pd.to_datetime(df_year['Date'], format='%Y/%m/%d')
    list_of_dfs.append(df_year)
df = pd.concat(list_of_dfs, ignore_index=True)
df['Year'] = pd.DatetimeIndex(df['Date']).year
df

Each row in this dataframe corresponds to an earthquake event that happened between 1/1/1932 and 12/31/2020. The meaning of the columns is explained here. But for the purposes of this problem we will only need information from the following columns:

  • Year: This is the year of the event.

  • ET: This is the type of the event. There are various types of events. For example, the seismometers may pick more than earthquakes, e.g., explosions. We are only intersted in earthquake events which are labeled by eq.

  • MAG: This is the magnitude of the event.

Let’s play with the data set to gain some experience. First, let’s extract all data for a random year. Say, year 2019.

df_2019 = df[df['Year'] == 2019]
df_2019

Out of these, we only care about earthquake events. So, let’s filter out everything else:

df_2019_eq = df_2019[df_2019['ET'] == 'eq']
df_2019_eq

Now, let’s see if there was at least one major earthquake during 2019:

test_mag = df_2019_eq['MAG'] >= 6
test_mag

Is there at least one True value in this array?

test_mag.value_counts()

There are exactly 5 major earthquakes. You can extract the number like this:

test_mag.value_counts()[True]

So, to test whether or not there was a major earthquake you need to do:

True in test_mag.value_counts().keys()
  • Now, we will use bootstrapping to estimate the probability of a major earthquake during a randomly picked year. Follow the instructions below completing the code where necessary.

def estimate_probability_of_major_earthquake_during_year(num_years, df):
    """
    Estimate the probability of major earthquake in a random year.
    
    Arguments:
    num_years    -    The number of years to pick at random.
    df           -    The dataframe containing all the observed events.
    
    Returns: The number of years in which we had at least one major earthquake divided by the num_years.
    """
    num_major_eqs = 0
    for i in range(num_years):
        # Pick a year at random between 1932 and 2020
        y = np.random.randint(1932, 2021)
        # Extract all the events that happened in that year
        df_y = # YOUR CODE HERE
        # Find all earthquake events
        df_y_eq = # YOUR CODE HERE
        # Test if there is at least one major earthquake in this year
        test_mag = # YOUR CODE HERE
        test_mag_counts = test_mag.value_counts()
        # Test if there is at least one major event in this year
        # and increase num_major_eqs by one if yes
        if True in test_mag.value_counts():
            num_major_eqs += 1
    return num_major_eqs / num_years

Use the following lines to test your code. We run it 10 times. Notice that everytime you get a slighlty different estimate.

for i in range(10):
    p_major_eq = estimate_probability_of_major_earthquake_during_year(50, df)
    print('p_major_eq = {0:1.2f}'.format(p_major_eq))
# A place to store the estimates
p_major_eqs = []
# Put 1000 estimates in there
for i in range(200):
    print(i)
    p_major_eq = # your code here
    p_major_eqs.append(p_major_eq)
# And now do the histogram
# Your code here