The goal of this homework assignment is for you to gain some experience using the R environment for statistical computing. Specifically, we will look at how estimated probabilities differ when different distributions are fit to the same collection of data. R is available for Linux, macOS and Windows. Feel free to choose any environment that you prefer. If you are using Debian or Ubuntu Linux, you can install R with the command sudo apt-get install r-base. An Ubuntu virtual machine may be the easiest way to get up and running.
For this assignment, record your responses to the following activities in the README.md file in the homework04 folder of your assignments GitLab repository and push your work by 11:59 PM Friday, October 12.
To create a homework04 branch in your local repository, follow the steps below:
$ cd path/to/cse-40171-fa18-assignments    # Go to assignments repository
$ git remote add upstream https://gitlab.com/wscheirer/cse-40171-fa18-assignments    # Add the main class repository as the upstream remote
$ git fetch upstream                        # Fetch the latest changes from upstream
$ git pull upstream master                  # Pull the files for homework04
$ git checkout -b homework04                # Create homework04 branch and check it out
$ cd homework04                             # Go into homework04 folder
In class, we spent a bit of time discussing central tendency models. The most common model is the Gaussian (the familiar bell-shaped curve), defined by two parameters: mean and standard deviation. The central limit theorem tells us that the arithmetic mean of a sufficiently large number of i.i.d. variables, each with a well-defined expected value and variance, will approximately follow a Gaussian distribution, regardless of the underlying distribution. This can be demonstrated by repeatedly drawing a set of samples from a specified distribution and recording the mean of each set.
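To make the "regardless of the underlying distribution" part concrete, here is a minimal sketch that draws from a skewed exponential distribution rather than a Gaussian. The set sizes, seed, and variable names are illustrative assumptions and are not the settings required by the activities below.

```r
# Illustrative sketch: means of samples from a skewed (exponential) distribution.
set.seed(42)                      # fixed seed so the run is repeatable

n_sets   <- 2000                  # number of sample sets (illustrative)
set_size <- 500                   # values drawn per set (illustrative)

# replicate() evaluates the expression n_sets times and collects the results
# in a numeric vector, one mean per sample set.
sample_means <- replicate(n_sets, mean(rexp(set_size, rate = 1)))

# An Exponential(rate = 1) variable has mean 1 and standard deviation 1, so
# the CLT predicts means centred near 1 with spread near 1 / sqrt(set_size).
cat("mean of means:", mean(sample_means), "\n")
cat("sd of means:  ", sd(sample_means),
    " predicted:", 1 / sqrt(set_size), "\n")

# The histogram should look roughly bell shaped even though the
# underlying exponential distribution is strongly skewed.
hist(sample_means, breaks = 40,
     main = "Means of exponential samples", xlab = "Sample mean")
```

Using replicate() avoids managing a loop index by hand; an explicit for loop that fills a preallocated vector would work equally well.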
Write an R script (call it central.R) that samples 1,000 values from a Gaussian distribution (i.e., standard normal distribution) with a mean of 0 and standard deviation of 1, computes the mean for that set of samples, and stores it in a vector entry. Repeat this process 10,000 times. If you are new to R, check out this intro for an explanation of the language constructs and syntax.
Now plot the histogram of the means stored in your vector. Where is the peak of the curve? Why is the peak of the curve where it is? Record your answer in your README.md file, including the plot.
What happens if we compute and store the maximum of each set of samples, instead of the mean? Write an R script (call it evt.R) that again samples 1,000 values from a Gaussian distribution with a mean of 0 and standard deviation of 1, computes the maximum for that set of samples, and stores it in a vector entry. Repeat this process 10,000 times.
Now plot the histogram of the maximums stored in your vector. Where is the peak of the curve? Does the central limit theorem hold in this case? Record your answer in your README.md file, including the plot.
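When comparing histogram shapes by eye, it can help to put a number on the asymmetry. The sketch below computes sample skewness for two vectors built from exponential samples (not the Gaussian samples of this activity). The helper sample_skewness() is a hypothetical name introduced here, and the sizes and seed are illustrative. A value near zero suggests a roughly symmetric histogram, while a clearly positive value suggests a longer right tail.

```r
# Hypothetical helper: sample skewness (third standardized moment).
sample_skewness <- function(v) {
  centered <- v - mean(v)
  mean(centered^3) / mean(centered^2)^1.5
}

set.seed(7)                       # fixed seed so the run is repeatable
n_sets   <- 2000                  # number of sample sets (illustrative)
set_size <- 200                   # values drawn per set (illustrative)

# One statistic per sample set: the maximum in one vector, the mean in another.
sample_maxima <- replicate(n_sets, max(rexp(set_size, rate = 1)))
sample_means  <- replicate(n_sets, mean(rexp(set_size, rate = 1)))

cat("skewness of maxima:", sample_skewness(sample_maxima), "\n")
cat("skewness of means: ", sample_skewness(sample_means), "\n")
```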
Let's look at an alternative distribution that is more helpful when modeling a collection of maximum values. For many problems within AI, the Gaussian is the distribution of choice for statistical modeling. And indeed, it does approximately apply in many scenarios. However, there are scenarios where it will drastically underestimate the probability of an event occurring. An example of this is a Gaussian fit to the vector of maximum values from above. To avoid this modeling pitfall, let's examine the Generalized Extreme Value (GEV) Distribution. Just as the central limit theorem indicates that the random variables generated from certain stochastic processes follow Gaussian distributions, statistical Extreme Value Theory indicates that, given a well-behaved overall distribution of values, the distribution of the maximum or minimum values can assume only a limited number of forms. The GEV encapsulates this theory.
The GEV is defined by three parameters: location, scale and shape. In this activity, we will fit two distributions, a Gaussian and a GEV, to the means and maximums, letting R automatically estimate the parameters. First, let's fit the Gaussian. In the first script you wrote for Activity 1, add some code to estimate the mean and standard deviation of the vector of the mean values. You can do this by first loading the R MASS library: library("MASS"). Then call the function fitdistr() to estimate the parameters.
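As a rough illustration of the fitting call, the sketch below fits a Gaussian to made-up data (not the vector of means from Activity 1) and reads the estimates back out of the returned object. The data, seed, and variable names are assumptions for the example.

```r
library(MASS)                     # provides fitdistr()

set.seed(1)                       # fixed seed so the run is repeatable
values <- rnorm(5000, mean = 10, sd = 2)   # made-up data for illustration only

# Maximum-likelihood fit of a Gaussian to the data.
fit <- fitdistr(values, "normal")

# fit$estimate is a named numeric vector with entries "mean" and "sd".
est_mean <- fit$estimate[["mean"]]
est_sd   <- fit$estimate[["sd"]]
cat("estimated mean:", est_mean, " estimated sd:", est_sd, "\n")
```

Reading the estimates by name, rather than by position, keeps later code readable and avoids hardcoding the parameter values.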
Given this central tendency model, we can use the model's Cumulative Distribution Function (CDF) to ask for the probability that a variable x takes on a value less than or equal to a specified point q. The R function pnorm() gives us access to the Gaussian CDF via the two parameters we just estimated. Let's assume that we are interested in the upper tail of the distribution, meaning the probabilities returned by the function are P(x > q) = 1 - CDF(q). For this, the lower.tail argument of pnorm() should be set to FALSE. Call pnorm() for the following values of q: 0, 1, 2, 3, 4. pnorm() should not use hardcoded parameters; make sure it picks up the set of parameters from fitdistr() each time it is run. Record the probability scores in your README.md file.
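The relationship between the lower and upper tails can be checked with a short sketch. The parameter values and query points below are placeholders chosen for illustration; in your script they would come from the fit rather than being typed in.

```r
# Placeholder parameters for illustration; a real script would take these
# from the fitted model instead of typing them in.
est_mean <- 10
est_sd   <- 2

q_values <- c(8, 10, 12, 14)      # illustrative query points

# Upper tail: P(x > q). Lower tail (the default): P(x <= q).
upper <- pnorm(q_values, mean = est_mean, sd = est_sd, lower.tail = FALSE)
lower <- pnorm(q_values, mean = est_mean, sd = est_sd)

# The two tails sum to 1 at every query point.
print(data.frame(q = q_values, lower_tail = lower, upper_tail = upper))
```

Passing lower.tail = FALSE is generally preferable to computing 1 - pnorm(q) by hand, since it keeps more precision when the tail probability is very small.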
Now let us estimate the probabilities for the same values of q, but this time the model will use the GEV. Add a GEV fit to your second script from Activity 1 by first loading the R SpatialExtremes library: library("SpatialExtremes"). If your R environment is missing this library, it's easy to install from the R command line interface: install.packages("SpatialExtremes"). Then call the function gevmle() to estimate the parameters for the vector of maximum scores. For the GEV CDF, we can call the pgev() function. Its syntax is almost the same as that of pnorm(), with the difference that it takes the three GEV parameters. Its last argument (lower.tail) should also be FALSE. Calling this function for the same five values we used above (and making sure it automatically picks up the estimated parameters from gevmle()), what are the probability scores? What does this mean compared to the results of the Gaussian model for the same data? Can you think of an example of an AI application where this difference would be important? Record your answers to these questions in your README.md file.
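To see what an upper-tail GEV probability computes, here is a hand-written base R sketch of the standard GEV CDF, F(q) = exp(-(1 + shape * (q - loc) / scale)^(-1 / shape)), with the Gumbel form exp(-exp(-(q - loc) / scale)) when shape is 0. The function gev_upper_tail() is a hypothetical helper for illustration, not part of SpatialExtremes, and the parameter values are placeholders rather than fitted estimates. In your script, use gevmle() and pgev() as described above.

```r
# Hypothetical helper: upper-tail probability P(x > q) under a GEV with the
# standard location / scale / shape parameterization (an assumption here).
gev_upper_tail <- function(q, loc, scale, shape) {
  z <- (q - loc) / scale
  if (abs(shape) < 1e-12) {
    a <- exp(-z)                          # Gumbel limit (shape = 0)
  } else {
    t <- pmax(1 + shape * z, 0)           # clamp points outside the support
    a <- t^(-1 / shape)
  }
  -expm1(-a)                              # equals 1 - exp(-a), computed stably
}

# Placeholder parameters for illustration only; not fitted values.
loc   <- 3.0
scale <- 0.3
shape <- -0.05

q_values <- c(3, 3.5, 4, 4.5)             # illustrative query points
gev_tail <- gev_upper_tail(q_values, loc, scale, shape)
print(data.frame(q = q_values, upper_tail = gev_tail))
```

The shape parameter controls how quickly the upper tail decays, which is why a three-parameter GEV can assign noticeably different tail probabilities than a two-parameter Gaussian.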
If you have any questions, comments, or concerns regarding the course, please provide your feedback at the end of your README.md.
To submit your assignment, please commit your R scripts and README.md to the homework04 branch in your assignment's GitLab repository:
$ cd path/to/cse-40171-fa18-assignments    # Go to assignments repository
$ git checkout master                       # Make sure we are in master branch
$ git pull --rebase                         # Make sure we are up-to-date with GitLab
$ git checkout -b homework04                # Create homework04 branch and check it out
$ cd homework04                             # Go to homework04 directory
...
$ $EDITOR README.md                         # Edit appropriate README.md
$ git add README.md                         # Mark changes for commit
$ git commit -m "homework04: complete"      # Record changes
...
$ git push -u origin homework04             # Push branch to GitLab
Procedure for submitting your work: create a merge request by the process that is described here, but make sure to change the target branch from wscheirer/cse-40171-fa18-assignments to your personal fork's master branch so that your code is not visible to other students. Additionally, assign this merge request to your TA and add wscheirer, agraese, and AndroidKitKat as approvers (so all class staff can track your submission). Your assigned TA is agraese if your last name starts with A through Ki, or AndroidKitKat if your last name starts with Kl through W.