The goal of this homework assignment is for you to gain some experience using the R environment for statistical computing. Specifically, we will look at the differences in probability outcomes when different distributions are considered for a collection of data. R is available for Linux, MacOS and Windows. Feel free to choose any environment that you prefer. If you are using Debian or Ubuntu Linux, R is available via the command `sudo apt-get install r-base`. An Ubuntu virtual machine may be the easiest way to get up and running.

For this assignment, record your responses to the following activities in the `README.md` file in the `homework04` folder of your **assignments** GitLab repository and push your work by **11:59 PM Friday, October 12**.

As discussed in class, each **homework** assignment must be
completed in its own git branch; this will allow you to separate the work
of each assignment and for you to use the merge request workflow.

To create a `homework04` branch in your local repository, follow the instructions below:

```bash
$ cd path/to/cse-40171-fa18-assignments                                            # Go to assignments repository
$ git remote add upstream https://gitlab.com/wscheirer/cse-40171-fa18-assignments  # Add the main class repository as a remote
$ git fetch upstream                                                               # Fetch the upstream branches
$ git pull upstream master                                                         # Pull the files for homework04
$ git checkout -b homework04                                                       # Create homework04 branch and check it out
$ cd homework04                                                                    # Go into homework04 folder
```

Once these commands have been successfully performed, you are now ready to add, commit, and push any work required for this assignment.

In class, we spent a bit of time discussing central tendency models. The most common model is the Gaussian (the familiar bell-shaped curve), defined by two parameters: mean and standard deviation. The central limit theorem tells us that the arithmetic mean of a sufficiently large number of i.i.d. variables, each with a well-defined expected value and variance, will approximately follow a Gaussian distribution, regardless of the underlying distribution. This can be demonstrated by sampling from a specified distribution and taking the mean of each sample set, repeated *n* times.

Write an R script (call it `central.R`) that samples 1,000 values from a Gaussian distribution (i.e., the standard normal distribution, with a mean of 0 and standard deviation of 1), computes the **mean** for that set of samples, and stores it in a vector entry. Repeat this process 10,000 times. If you are new to R, check out this intro for an explanation of the language constructs and syntax.
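The sampling loop can be sketched as follows (a minimal version of `central.R`; the seed and the use of `replicate()` are implementation choices, not requirements):

```r
# central.R -- sketch: means of repeated samples from a standard normal
set.seed(42)                      # optional, for reproducibility

n_sets <- 10000                   # number of sample sets
set_size <- 1000                  # samples per set

# Draw 1,000 standard normal values, take their mean; repeat 10,000 times
means <- replicate(n_sets, mean(rnorm(set_size, mean = 0, sd = 1)))

# Histogram for the next step of the activity
hist(means, breaks = 50, main = "Means of 10,000 sample sets")
```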

Now plot the histogram of the means stored in your vector. Where is the peak of the curve? Why is the peak of the curve where it is? Record your answer in your `README.md` file, including the plot.

What happens if we compute and store the **maximum** of each set of samples, instead of the mean? Write an R script (call it `evt.R`) that again samples 1,000 values from a Gaussian distribution with a mean of 0 and standard deviation of 1, computes the maximum for that set of samples, and stores it in a vector entry. Repeat this process 10,000 times.
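A minimal sketch of `evt.R`, identical to the first script except that `max()` replaces `mean()`:

```r
# evt.R -- sketch: maxima of repeated samples from a standard normal
set.seed(42)                      # optional, for reproducibility

n_sets <- 10000                   # number of sample sets
set_size <- 1000                  # samples per set

# Draw 1,000 standard normal values, take their maximum; repeat 10,000 times
maxima <- replicate(n_sets, max(rnorm(set_size, mean = 0, sd = 1)))

# Histogram for the next step of the activity
hist(maxima, breaks = 50, main = "Maxima of 10,000 sample sets")
```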

Now plot the histogram of the maximums stored in your vector. Where is the peak of the curve? Does the central limit theorem hold in this case? Record your answer in your `README.md` file, including the plot.

Let's look at an alternative distribution that is more helpful when modeling a collection of maximum values. For many problems within AI, the Gaussian is the distribution of choice for statistical modeling. And indeed, it does approximately apply in many scenarios. However, there are scenarios where it will drastically underestimate the probability of an event occurring. An example of this is the above Gaussian fit to the vector of maximum values. To avoid this modeling pitfall, let's examine the Generalized Extreme Value (GEV) Distribution. Just as the central limit theorem indicates that the random variables generated from certain stochastic processes follow Gaussian distributions, statistical Extreme Value Theory indicates that given a well-behaved overall distribution of values, the distribution of the maximum or minimum values can assume only limited forms. The GEV encapsulates this theory.

The GEV is defined by three parameters: location, scale and shape. In this activity, we will fit two distributions, a Gaussian and a GEV, to the means and maximums, letting R automatically estimate the parameters. First, let's fit the Gaussian. In the first script you wrote for Activity 1, add some code to estimate the mean and standard deviation of the vector of mean values. You can do this by first loading the R MASS library: `library("MASS")`. Then call the function `fitdistr()` to estimate the parameters.
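The fitting step might look like this (a sketch; the vector of means is regenerated here so the snippet runs standalone, but in your script you would reuse the `means` vector you already built):

```r
library(MASS)                     # provides fitdistr()

# Regenerate a vector of sample means so this snippet is self-contained
set.seed(42)
means <- replicate(10000, mean(rnorm(1000)))

# Fit a Gaussian to the vector of sample means via maximum likelihood
fit <- fitdistr(means, "normal")

fit$estimate                      # named vector with entries "mean" and "sd"
```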

Given this central tendency model, we can ask what the probability is for a variable *x* taking on a value that is less than or equal to a specified value point *q* by using the model's Cumulative Distribution Function (CDF). The R function `pnorm()` gives us access to the Gaussian CDF via the two parameters we just estimated. Let's assume that we are interested in the upper tail of the distribution, meaning the probabilities returned by the function are *P(x > q) = 1 - CDF(q)*. For this, the `lower.tail` parameter of the function `pnorm()` should be set to `FALSE`. Call `pnorm()` for the following values of *q*: 0, 1, 2, 3, 4. `pnorm()` should not use hardcoded parameters; make sure it picks up the set of parameters from `fitdistr()` each time it is run. Record the probability scores in `README.md`.
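One way to query the upper tail (a sketch that regenerates the fit so it runs standalone; in your script, reuse the object returned by `fitdistr()` rather than hardcoding its values):

```r
library(MASS)                     # provides fitdistr()

# Regenerate the means and the Gaussian fit so this snippet is self-contained
set.seed(42)
means <- replicate(10000, mean(rnorm(1000)))
fit <- fitdistr(means, "normal")

# Upper-tail probabilities P(x > q), pulling the parameters from the fit
# rather than hardcoding them
q <- c(0, 1, 2, 3, 4)
probs <- pnorm(q, mean = fit$estimate["mean"], sd = fit$estimate["sd"],
               lower.tail = FALSE)
probs
```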

Now let us estimate the probabilities for the same random variables, but this time the model will use the GEV. Add a GEV fit to your second script from Activity 1 by first loading the R SpatialExtremes library: `library("SpatialExtremes")`. If your R environment is missing this library, it's easy to install from the R command line interface: `install.packages("SpatialExtremes")`. Then call the function `gevmle()` to estimate the parameters for the vector of maximum scores. For the GEV CDF, we can call the `pgev()` function. Its syntax is almost the same as that of `pnorm()`, with the difference that it takes the three GEV parameters. Its `lower.tail` argument should also be set to `FALSE`. Calling this function for the same five random variables we used above (and making sure it automatically picks up the estimated parameters from `gevmle()`), what are the probability scores? What does this mean compared to the results of the Gaussian model for the same data? Can you think of an example of an AI application where this difference would be important? Record your answers to these questions in `README.md`.
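The GEV portion might be sketched like this (standalone version; in your script you would reuse the `maxima` vector from `evt.R`). Note that `SpatialExtremes` is not part of base R and must be installed first:

```r
library(SpatialExtremes)          # provides gevmle() and pgev()

# Regenerate the vector of maxima so this snippet is self-contained
set.seed(42)
maxima <- replicate(10000, max(rnorm(1000)))

# Maximum-likelihood estimates of the GEV location, scale, and shape
params <- gevmle(maxima)

# Upper-tail probabilities P(x > q) under the fitted GEV, pulling the
# parameters from gevmle() rather than hardcoding them
q <- c(0, 1, 2, 3, 4)
pgev(q, params["loc"], params["scale"], params["shape"], lower.tail = FALSE)
```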

If you have any questions, comments, or concerns regarding the course, please provide your feedback at the end of your `README.md`.

To submit your assignment, please commit your R scripts and `README.md` to the `homework04` folder of your `homework04` branch in your **assignments** GitLab repository:

```bash
$ cd path/to/cse-40171-fa18-assignments  # Go to assignments repository
$ git checkout master                    # Make sure we are in master branch
$ git pull --rebase                      # Make sure we are up-to-date with GitLab
$ git checkout -b homework04             # Create homework04 branch and check it out
$ cd homework04                          # Go to homework04 directory
...
$ $EDITOR README.md                      # Edit appropriate README.md
$ git add README.md                      # Mark changes for commit
$ git commit -m "homework04: complete"   # Record changes
...
$ git push -u origin homework04          # Push branch to GitLab
```

Procedure for submitting your work: create a merge request by the process that is described here, but make sure to change the **target branch** from wscheirer/cse-40171-fa18-assignments to **your personal fork's master branch** so that your code is not visible to other students. Additionally, assign this merge request to your TA and add wscheirer, agraese, and AndroidKitKat as approvers (so all class staff can track your submission). Your assigned TA is agraese if you have a last name starting with A through Ki, or AndroidKitKat if you have a last name starting with Kl through W.