Throughout the semester, we have looked at different levels of processing in AI. One remaining piece that has not been discussed in-depth is decision making. In this week's lectures, we are gaining some knowledge of Bayesian Hierarchical Modelling — a decision making process that can be used as a read-out layer for an artificial neural network, or as a standalone decision algorithm. Given a collection of datapoints that can be arranged in some hierarchy within a feature space, Bayesian Hierarchical Modelling lets us assess the data in a probabilistic manner, at different levels of the hierarchy. The catch is that we must select a prior that accurately describes the data we may see in operation. If we do a bad job in selecting the priors, our predictions will be affected — but the practical implications can vary depending on the scenario. In this assignment we will investigate this effect using a scenario involving the production of chocolate chip cookies in five different locations.

For this assignment, record your responses to the following activities in the README.md file in the homework08 folder of your assignments GitLab repository and push your work by 11:59 PM Wednesday, December 11.

Activity 0: Branching

As discussed in class, each homework assignment must be completed in its own git branch; this will allow you to separate the work of each assignment and for you to use the merge request workflow.

To create a homework08 branch in your local repository, follow the instructions below:

$ cd path/to/cse-40171-fa19-assignments                                                     # Go to assignments repository

$ git remote add upstream https://gitlab.com/wscheirer/cse-40171-fa19-assignments           # Switch back over to the main class repository

$ git fetch upstream                                                                        # Toggle the upstream branch

$ git pull upstream master                                                                  # Pull the files for homework08

$ git checkout -b homework08                                                                # Create homework08 branch and check it out

$ cd homework08                                                                             # Go into homework08 folder

Once these commands have been successfully performed, you are now ready to add, commit, and push any work required for this assignment.

Activity 1: Install Dependencies and Explore Cookie Factory Data (10 Points)

Consider the following scenario. There is an international corporation producing chocolate chip cookies with factories in five locations spread out around the world. Each factory uses the same recipe to produce its cookies, but you, as a cookie quality control specialist, suspect that there may be some difference in the distribution of the number of chips in cookies produced at each location. This leaves you with two immediate levels of analysis to consider:

  1. Assume that the distribution across the locations is similar, since all cookies come from the same recipe. This means that you would not model each location separately.
  2. Assume that not all locations are the same, and that there is some (perhaps small) difference in chip distributions between factories. This means that you would want to model each location separately.
Such a scenario lends itself nicely to Bayesian Hierarchical Modelling. To build such models, we will turn to the PyMC3 probabilistic programming package for Python.

For this assignment, use Python 3 and install the following dependencies (if you don't already have them):

pip3 install numpy
pip3 install scipy
pip3 install matplotlib
pip3 install pandas
pip3 install pymc3

All of the data you need can be found in this file. Create a directory in the current directory you are working from called data/ and place the cookies.dat file within in. Now you are ready to start working with the cookie factory data.

Begin by loading the data into a pandas data frame:

>>> import pandas as pd
>>> import numpy as np
>>> import scipy.stats as ss
>>> import matplotlib.pyplot as plt
>>> import pymc3 as pm
>>> cookies = pd.read_csv('data/cookies.dat', sep = ' ')

The dataframe contains two columns: one for the number of chips, and one for the factory location (an integer between 1 and 5). Verify that the head of the file corresponds to the following:

>>> cookies.head()
    chips location
0     12     1
1     12     1
2     6     1
3     13     1
4     12     1

Now that you have the data, plot the data hierarchy as described above. Generate a histogram with 10 bins showing the distribution of chocolate chips for all cookies produced by all five factories. Also generate a histogram with 10 bins showing the distribution of chocolate chips for all cookies produced by each individual factory. Save these plots and turn them in as your answer for this activity. Now that you have an idea about the shape of the distributions at both levels of the hierarchy, we can make some decisions about the priors for the Bayesian Hierarchical Model.

Activity 2: Generate Three Different Priors (30 Points)

Let's define a hierarchical model for our data. The chocolate chip counts can be modeled by a Poisson distribution with parameter λ. And λ can be conditioned by a gamma prior: chips∼Poiss(λ),     λ∼Γ(a, b).

The shape and scale of the gamma prior can be parameterized with mean μ and variance σ2: a = μ22, b = μ/σ2.

Importantly, priors can be imposed for these parameters: μ∼Γ(a,b), σ∼Exp(c). Not all priors are equally as effective, of course. Let's look at two different priors, where one is better than the other. Run the following code and examine the prior distributions produced. Based on your results from Activity 1, which set of priors is better for our cookie data? Record your answer to this question (including your reasoning) in your README.md file.

x = np.linspace(0,50)

# First set of priors
fig = plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.plot(x, ss.gamma.pdf(x,a=2,scale=9), 'r-')
plt.title('Prior Distribution for mu \n Gamma Density Function for a={} and b={}'.format(2,1/9))

plt.subplot(1, 2, 2)
x1 = np.linspace(0,10)
plt.plot(x1, ss.expon.pdf(x1,0.8), 'r-')
plt.title('Prior Distribution for sigma2 \n Exponential Density Function for c={}'.format(0.8))
plt.xlim(1,10)

plt.show()

# Second set of priors
fig = plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.plot(x, ss.gamma.pdf(x,a=2,scale=2), 'r-')
plt.title('Prior Distribution for mu \n Gamma Density Function for a={} and b={}'.format(2,1/2))

plt.subplot(1, 2, 2)
x2 = np.linspace(0,10)
plt.plot(x2, ss.expon.pdf(x2,2), 'r-')
plt.title('Prior Distribution for sigma2 \n Exponential Density Function for c={}'.format(2))
plt.xlim(1,10)

plt.show()

Better priors exist for the cookie data we are working with. Can you find a set that is a better match to the data? Turn in the plots for your priors as part of your answer to this activity. Explain why you chose the priors that you did in your README.md file.

Activity 3: Fit Three Different Models Based on the Priors (20 Points)

Now it is time to invoke PyMC3 to fit a model for each of the three sets of priors from Activity 2. Use the following code snippet as a template for producing the three models. Each time you do this, the model itself (which you will need for the prediction activities below) will be stored in the trace variable. This code will produce a series of plots. Save the set of plots that corresponds to each set of priors and turn them in as your answer to this activity.

model = pm.Model()

with model:

    # Prior distribution for mu (first set of priors from Activity 2).
    mu = pm.Gamma('mu', alpha=2.0, beta=1.0/9)

    # Prior distribution for sigma2 (first set of priors from Activity 2).
    sigma = pm.Exponential('sigma', 0.8)

    # Parametrization for the shape parameter.
    alpha = mu**2/sigma**2

    # Parametrization for the scale parameter.
    beta = mu/sigma**2

    # Prior distribution for lambda.
    lam = pm.Gamma('lam', alpha=alpha, beta=beta,
        shape=cookies.location.values.max())

    # Likelihood function for the data.
    chips = [pm.Poisson('chips_{}'.format(i),lam[i],
        observed=cookies[cookies.location==i+1].chips.values)
        for i in range(cookies.location.values.max())]

    # Parameters of the simulation:
    # Number of iterations and independent chains.
    n_draws, n_chains = 1000, 3

    n_sim = n_draws*n_chains

    trace = pm.sample(draws=n_draws, chains=n_chains)
    pm.plot_posterior(trace)
    plt.show()

Activity 4: Make Predictions About Cookie Production in Known Locations (20 Points)

Now that we know how to fit models based on our priors, we can use those models to make predictions about known cookie producing locations. Assume that trace is from fitting with the first set of priors. The code below will compute the posterior probability that the next cookie in Location 2 has fewer than 3 chips (a poor quality cookie!). This is accomplished by first generating a number of samples from a Poisson distribution for each value for the Location 2 simulation. The distribution of chips at Location 2 will be plotted, so you can check if the prediction makes sense.

# Generate n_sim samples of a Poisson distribution for
# each value for the lam_1 (location 2) simulation.
y_pred_location_2 = np.random.poisson(lam=trace['lam'][:,1] , size=n_sim)

# Probability the next cookie in location has less than 3 chips.
print((y_pred_location_2<3).astype(int).mean())

plt.figure()
plt.hist(y_pred_location_2, bins=20)
plt.title('Chocolate Chips Distribution for Location 2');
plt.show()

For each of your three models, run the above sampling and prediction process five times, and record the mean and standard deviation of the predicted probabilities for fewer than 10 chips at Location 5. Record these summary statistics in your README.md file. Is there any significant difference in the results from the different models? Record the answer to this question in your README.md file.

Activity 5: Make Predictions About A New Location for Cookie Production in Hierarchical Fashion (20 Points)

With our models we can now ask questions at different levels of the model hierarchy. For instance, let's ask the following question: what is the posterior probability that a new cookie factory we are opening has λ > 12? Here is some code that will get us the samples we need to make a prediction:

# Posterior distribution of for a an b
# from the simulated values of mu and sigma2.
post_a = trace['mu']**2/trace['sigma']**2
post_b = trace['mu']/trace['sigma']**2

# Generate samples of a gamma distribution
# with these generated parameters of a and b.
lambda_pred_dist = np.random.gamma(post_a,1/post_b,n_sim)

Using a model that was fit with the first set of priors, make a prediction that answers the question above. Record the posterior probability in your README.md file. Also plot the predicted λ distribution and turn it in as part of your answer to this question.

Now let's move down the hierarchy and ask a different question: what is the posterior probability for a cookie produced in a new location to have more than 12 chocolate chips? Here is some new code that will get us the samples we need to make a prediction about a new cookie factory:

# Posterior distribution of the chips.
# Here we use the values of lambda obtained above.
cookies_pred_dist = np.random.poisson(lam=lambda_pred_dist, size=n_sim)

Using a model that was fit with the first set of priors, make a prediction that answers the question above. Record the posterior probability in your README.md file. Also plot the predicted chocolate chip distribution and turn it in as part of your answer to this question.

Feedback

If you have any questions, comments, or concerns regarding the course, please provide your feedback at the end of your README.md.

Submission

To submit your assignment, please commit your work to the homework08 folder of your homework08 branch in your assignment's GitLab repository:

$ cd path/to/cse-40171-fa19-assignments   # Go to assignments repository
$ git checkout master                     # Make sure we are in master branch
$ git pull --rebase                       # Make sure we are up-to-date with GitLab
$ git checkout -b homework08              # Create homework08 branch and check it out
$ cd homework08                           # Go to homework08 directory
...
$ $EDITOR README.md                       # Edit appropriate README.md
$ git add README.md                       # Mark changes for commit
$ git commit -m "homework08: complete"    # Record changes
...
$ git push -u origin homework08           # Push branch to GitLab

Procedure for submitting your work: create a merge request by the process that is described here, but make sure to change the target branch from wscheirer/cse-40171-fa19-assignments to your personal fork's master branch so that your code is not visible to other students. Additionally, assign this merge request to our TA (sabraha2) and add wscheirer as an approver (so all class staff can track your submission).