from n00b to pro

Jack Davis

Andrew Henrey

FROM N00B TO PRO

PURPOSECreate a simulator from scratch that:

• Generates data from a variety of distributions

• Makes a response variable from a known function of the data (plus an error term)

• Constructs a linear model that estimates the coefficients of the function

• Repeats generation and modeling many times to compare the average estimates of the linear model to the known parameters.

• Package the whole thing nicely into a function that we can call in a single line in later work.

• If you’re experienced, the commands themselves may seem trivial

OUTLINE• 1) Learning how to learn

• 2) Randomly Generating Data

• 3) Data Frames and Manipulation

• 4) Linear Models

• BREAK – Quality of presenter improves

• 5) Running loops

• 6) Function Definition

• 7) More advanced function topics

• 8) Using functions

• 9) A short simulation study

LEARNING HOW TO LEARN – JACK DAVIS• Google CRAN Packages to get the package list

• From here you can get a description of every command in a package.

• ??<term> searches for commands related to <term>

• ??plot will find commands related to plot

• ?<command> calls up the help file for that command

• ?abline gives the help file for the abline() command.

LEARNING HOW TO LEARN – JACK DAVIS• Exercises:

• Name one function in the darts game package.

• What is the e-mail of the author of the Texas Holdem simulation package?

• (Bonus) Tell the author about your day via e-mail; s/he likes hearing from fans.

• Find a function to make a histogram

• Find some example code on the heatmap() command.

RANDOMLY GENERATING DATA – JACK DAVIS• The r<dist> commands randomly generate data from a distribution

• rnorm( n , mean, sd) Generates from normal distribution (default N(0,1))

• rexp( n, rate)• rbinom( n, size, prob)• rt( n, df) From Student’s T. (Mean is zero, so setting a mean is up to you)

• set.seed() Allows you to generate the same data every time, so you or others can verify work.

RANDOMLY GENERATING DATA – JACK DAVIS• Set a random seed

• Generate a vector of 50 values from the Normal (mean=10,sd=4) distribution, name the vector x1.

• Do the same with

• Poisson ( lambda = 5), named x2,

• Exponential (rate = 1/7) named x3,

• Student’s t distribution (df =5), with a mean of 5, named x4,

• Normal (mean=0, sd=20), named err

• Make a new variable y, let it be 3 + 20x1 + 15x2 – 12x3 – 10x4 + err

DATA FRAMES – JACK DAVIS• data.frame() makes a dataframe object of the vectors listed in the ()

• The advantage of having a data frame is that it can be treated as a single object..

• Data frames, models, and even matrix decompositions can be objects in R.

• You can call parts of objects by name using $

• model$coef or model$coefficient will bring up the estimated coefficients

• If no such aspect exists, then you’ll get a null response.

• Example: Andrew$height

DATA FRAMES – JACK DAVIS• Exercises:

• Make a data.frame() of x1,x2,x3,x4, and y Name it dat

• (if you’re stuck from the last part, run “Q3-dataframethis.txt” first)

• Use index indicators like dat[4,3], dat [2:7,3], dat [4,], and dat [4,-1] to get

• The 3rd row, 5th entry of dat

• The 2nd – 7th values of the 5th column

• The entire 3rd row

• The 3rd row without the 1st entry

LINEAR MODELS – JACK DAVIS• The results of the lm() function are an object.

• Example: mod = lm(y ~ x1 + I(x2^2) + x1:x2, data=dat)

• Useful aspects

• mod$fitted• mod$residuals

• Useful functions

• summary(mod)• predict(mod, newdata)

LINEAR MODELS – JACK DAVIS• Use the lm command to create a linear model of y as a function of x1,x2,x3, and x4

additively using dat data, name it mod. (No interactions or transformations)

• Get the summary of mod

• Display the estimated coefficients with no other values.

BREAK• This slide unintentionally left 98% blank

OUTLINE• 1) Learning how to learn

• 2) Randomly Generating Data

• 3) Data Frames and Manipulation

• 4) Linear Models

• BREAK – Quality of presenter improves deteriorates

• 5) Running loops

• 6) Function Definition

• 7) More advanced function topics

• 8) Using functions

• 9) A short simulation study

RUNNING YOU FOR A LOOP – ANDREW HENREY• Similar to other programming languages, loops in R allow you to repeat the same block of

code several times

• Unlike other programming languages, large loops in R are exceedingly slow

• Any loop of less than about 100,000 total iterations is not going to give you much trouble in terms of time

RUNNING YOU FOR A LOOP – ANDREW HENREY• An R loop that executes a million commands takes about a second. Conditions vary wildly

• Generating 100,000 data sets of size 50,000 and looping through the dataset to calculate a mean for each one would take longer to run than Jack Davis heading up Burnaby Mountain (ouch)

Sup d00dz late for tutorial Jack

RUNNING YOU FOR A LOOP – ANDREW HENREY• Loop syntax:

for (i in 1:n)

{

#TellVicEverything

}

RUNNING YOU FOR A LOOP – ANDREW HENREY• No need to run from 1:K

• Can use an arbitrary vector instead

• Runs for length(vect) iterations

• Takes on the ith value of the vector each iteration

• e.g.

• V = c(1,5,3,-6)• for (count in V) {print(count);}• ## 1, 5 , 3 , -6

RUNNING YOU FOR A LOOP – ANDREW HENREY• Exercises:

• A) Define a variable runs to be the number 10,000

• B) Define a matrix() called mat with 5 columns and runs rows (10,000 rows)

• C) Put a for() loop around the code found in q5-loopthis.txt . Loop from 1:runs.

Use index indicators like a[k,] to save the estimated coefficients of the model in a new row of mat.

OR, if you think you are a total coding BOSS, then put the loop around your code in parts 2-4 that generates data and finds the linear model estimates of the betas.

FUNCTION DEFINITION – ANDREW HENREY• Functions are a slightly abstract concept

• Mathematics: f(x) = x2+4x-16

• Computing: mean(x) = sum(x)/length(x) – All 3 are functions!

• Functions map INPUTS to OUTPUT

• Possibly no inputs

• In one way or another, always some form of output

• Example functions:

• SORT, MEDIAN, OPTIM / NLM, LM/GLM

FUNCTION DEFINITION – ANDREW HENREY• Function syntax

• Simple function:

F = function()

{

return (5)

}

>> F()

5

FUNCTION DEFINITION – ANDREW HENREY• Exercises:

• Make a function out of the code you wrote in part 5. The syntax should be similar to the previous slide. The function:

• Should be called simulate.lm

• Should include everything needed to generate the data several times, find a linear model, and extract the coefficients

• Does NOT take any inputs

• Should return the matrix of 10,000 runs of coefficients

• Use the function and save the results to a matrix called test

• If nothing is working (), you can use the example code in q6 – function this.txt

ADVANCED FUNCTIONS – ANDREW HENREY• A more complicated example:

MSE = function(X=c(0,3,11),Y)

{

return (mean((X-Y)^2))

}

• Observe that X has default values

>>MSE(Y=c(4,5,6))

15

ADVANCED FUNCTIONS – ANDREW HENREY• If an input argument to a function has default values, you don’t have to specify them when

calling the function

• If an input argument has no default values, running the function without specifying them gives you an error

ADVANCED FUNCTIONS – ANDREW HENREY• Exercises:

• Modify simulate.lm() by adding input parameters. Include:

• nruns, the number of runs in the simulation, with no default

• seed, the random initial seed of the simulation, defaulting to 1337

• verbose, a Boolean true/false to report progress, defaulting to FALSE (caps matter)

• Set runs to nruns at the beginning of your function

• Use set.seed(seed) in your function

• Add code that prints out how far along you are in the loop, but only when verbose is true

• Run this new function to overwrite the old one

USING FUNCTIONS – ANDREW HENREY• Exercises:

• A) Run your simulate.lm() with 25000 runs. Store these results as a variable called betas

• B) Use hist() on the first column of betas to see the sampling distribution of the intercept

• C) Use summary(), mean() , and sd() on this column as well

• D) Use par(mfrow=c(2,2)) and then some hist() commands to display histograms of the other four sampling distributions in a 2x2 grid

• E) Compare your results to the known values (The means of the sampling distributions to the true values, and the standard deviations to the estimated values in Q4)

SIMULATION STUDY – ANDREW HENREY• Idea: You have a binomial experiment with 9 successes and 3 failures.

• You would like to construct a 95% CI for the true proportion of successes

• You DON’T know whether the normal approximation is appropriate

• How can we find out whether or not it’s OK?

SIMULATION STUDY – ANDREW HENREY• Overall procedure:

• Construct a LOT of samples from a population with 12 trials and p=0.75

• For each sample, calculate the 95% CI using the normal approximation

• For each sample, see whether the CI overlaps with 0.75

• Count the number of samples for which the CI overlaps with 0.75

• The proportion of the samples that have a CI that overlaps is called the “true coverage probability”

• If the true coverage probability is close to 95% , the normal approximation to the sampling distribution of p is a good one.

SIMULATION STUDY – ANDREW HENREY• Steps:

• Generate 100,000 samples of binomial(12,0.75) data using rbinom()• For each sample, calculate the usual estimate of p

• For each sample, calculate SE = sqrt(p*(1-p)/12)• For each sample, calculate the lower and upper bounds of the 95% CI

• Find out how many intervals actually contain 0.75

• Optional: Look at hist(x) to gain intuition of why the normal approximation isn’t perfect

THE END• Leave plz tyvm

from n00b to pro

Documents

data jack davisset

data frames jack davisdata

data3 data frames

data jack davisthe r

lm function

known function

normal mean

frame of x1