r introduction and training patrick gurian, drexel university camra 1st qmra summer institute august...

27
R Introduction and Training Patrick Gurian, Drexel University CAMRA 1st QMRA Summer Institute August 7, 2006

Upload: eunice-wright

Post on 03-Jan-2016

216 views

Category:

Documents


0 download

TRANSCRIPT

R Introduction and Training

Patrick Gurian, Drexel UniversityCAMRA

1st QMRA Summer InstituteAugust 7, 2006

Objective

• Learn maximum likelihood estimation-a method of fitting statistical models

• Learn the R statistical programming language

• All in 2 hours!– Idea is to make you aware of techniques and

tools

Statistical Model Estimation

• Statistical models all contain parameters

• Parameters are constants that can be “tuned” to make a general class of models applicable to a specific dataset

Example: Normal distribution

PDF is:

f(x) = 1/{σ(2π)1/2} exp{(x-µ)2/ 2σ2}

µ and σ are constants that define a particular normal distribution

We want to pick values that match a given dataset

One simple way is arithmetic mean=µ

s=σ (method of moments)

But there are other ways including…

Maximum likelihood estimation

• Assume a probability model

• Calculate the probability (likelihood) or obtaining the observed data

• Now adjust parameter values until you find the values that maximize the probability of obtaining the observed data

Likelihood Function

• Observe data x, a vector of values

• Assume some pdf f(x|θ)

• Where θ is a vector of model parameters

• Probability of any particular value is f(xi| θ) where i is an index indicating a particular observation in the data

Likelihood of the data

• Generally assume data is independent and identically distributed

• Same f(x) for all data• For independent data:

prob [Event A ∩ Event B] = prob[A] Prob[B] • So multiple probability of individual observations

to get joint probability of data

• L=π f(xi| θ)

• Now find θ that maximizes L

MLE example

• A team plays 3 games: W L L

• Binomial model: what is p?

• L=(3:1)p(1-p)2

• Suppose we know sequence of wins and losses then we can say

• L=p(1-p)2

MLE example (cont)

• Suppose p=.5

• L=0.5^2*0.5=0.125

Example for class

• Now suppose p=.3

• What is likelihood of data?

• Do we prefer p=0.5 or p=0.3?

Answer

• L=0.3*0.7*0.7=0.147

• Likelihood of data is greater with p=0.3 than with p=0.5

• We prefer p=0.3 as a parameter estimate to 0.5

Example: Maximizing the likelihood

dL/dp= 3(1-p)2 - 6p(1-p)

Find maximum at dL/dp=0

0=3(1-p)2 - 6p(1-p)

0=1-p -2p

3p=1

p=1/3

MLE example: Conclusion

Can verify that this is a maxima by looking at second derivative

Note that method of moments would give us p=x/n=1/3

So we get the same result by both methods

Ln Likelihood

• Product of many numbers each <1 is quite small

• Often easiest to work with ln (L)• Since ln is a monotonic transformation of

L, the largest ln (L) will correspond to the largest L value

Ln L=ln π f(xi| θ) Applying log laws

Ln L=Σ ln f(xi| θ)

Parameter uncertainty in MLE

• MLE gives us a general way to estimate parameters of many types of models

• But how do we make inferences about these parameters?

• There is a general method for large sample sizes– Huge advantage of MLE approach

Information Matrix

I = {d2ln(L) / dθ12 d2ln(L)/dθ1dθ2 ….

d2ln(L)/dθ1dθ2 d2ln(L)/dθ2

2 …..

d2ln(L)/dθ1dθk … …. d2ln(L)/dθk2 }

k = number of parametersAll the second derivatives of ln(L) with respect

to parameters

Interpreting I• Note that at MLE dL/ dθi =0 for all i

– L(θ^) is at a maximum or peak• Large second derivative indicates a sharp peak

– Parameter value of θ^ + Δ is much less likely than θ^

• Small second derivative indicates a gradual slope– Parameter value of θ^ + Δ is almost as likely

as θ^

Uncertainty of MLE parameters

θ^ ~N(θ, I-1)

MLE parameter estimates are asymptotically normal and unbiased with variance given by the inverse of the information matrix

Once you have the variance of θ^i use a Z

distribution to test hypotheses about θi and set CI for true value of θi

MLE in practice

• Usually work with log likelihood

• Often work with models for which it is not tractable develop analytical solutions

• Instead use numerical analysis to identify maximum likelihood through gradient search methods and invert I to get standard errors of parameters

• Need a software tool…

R language

• Freely available software– http://www.r-project.org/

• R is a programming language, not a set of pre-programmed routines with drop down menus– It operates from a command line prompt and has

exact syntax requirements

• Large library of functions developed by users• Develop your own code specific to your needs

R Resources

• An Introduction to R– available from the help menu

• Searchable help menu

• From command line: help(<function>)

• Simple R

• R for Beginners

• R for Dummies-no such luck

Strategies for dealing with R

• Use another software for standard analyses (JMP, SPSS, etc.)

• Have your favorite text editor open when you are working, if you actually succeed in getting a command to work, save it for posterity!

• Remember R offers incredible flexibility and speed

• Remember you’re not the only one R is driving up a tree

Your assignment

• Data on Cryptosporidium oocyst concentrations from Table 6-5, page 176 of Haas, Rose, and Gerba

Assignment (cont)

• Assume data follows a Poisson distribution

• Use MLE to fit parameter estimates to this data

• For practice use numerical approach even thought closed form solution exists for this case

• Repeat with negative binomial distribution

Poisson Distribution

• Independently occurring discrete random data– Often used for microbes– Assumes they don’t systematically cluster or spread

out

• Think of each glass of water as a independent trial that may or may not have a bug in it

• Think of each sip from each glass of water as an independent trial that may or may not have a bug in it.

Negative Binomial Distribution

• Another distribution to describe the number of occurrences of discrete random events

• Negative binomial has two parameters which allows for more flexibility, ability to fit more complex data

Next Session

• We will talk about comparing these two models

• Expand on uncertainty analysis

• For now learn how to fit them