10 throws - arbeitxcelab.net/rm/wp-content/uploads/2010/03/week9.pdf · 2 3 4 5 6 7 8 9 10 11 12...

[Figure: histograms of the total of two dice (x-axis: dice total, 2 to 12; y-axis: frequency), one panel each for 10, 100, 1000, and 1e+05 throws.]

Proportion of throws that total 7 at each sample size:

> length( p[p==7] ) / throws

10 throws:     [1] 0.1
100 throws:    [1] 0.14
1000 throws:   [1] 0.195
1e+05 throws:  [1] 0.17865
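The console line above uses a vector p of dice totals and a count throws, neither of which is defined in the surviving text. A minimal sketch, assuming p holds the totals of two fair six-sided dice (the helper name throw_dice is hypothetical, not from the slides):

```r
# Hypothetical helper: totals of `throws` throws of two fair six-sided dice.
throw_dice <- function( throws ) {
    sample( 1:6 , throws , replace=TRUE ) + sample( 1:6 , throws , replace=TRUE )
}

set.seed(1)   # arbitrary seed, only so the run is repeatable
for ( throws in c( 10 , 100 , 1000 , 1e5 ) ) {
    p <- throw_dice( throws )
    # proportion of throws that total 7, as in the console line above
    cat( throws , "throws:" , length( p[p==7] ) / throws , "\n" )
}
```

With fair dice the exact chance of a 7 is 6/36, about 0.167, so the simulated proportions should settle near that value as the number of throws grows.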

resampling & empirical likelihood estimation

• most statistical estimation is premised on repeated “experiments”:

• if the data were generated many times, what is the expected outcome?

• instead of actually repeating, write a formula that computes the expectation (and the likelihood of the observed data)

resampling & empirical likelihood estimation

• what if we can’t write the formula?

• sometimes impossible

• often very hard

• can simulate repeated sampling and compute the expectation EMPIRICALLY (sometimes called “Monte Carlo” likelihood estimation)

binomial example

• frequencies of dead tadpoles (again) in pools of 5

• what is chance of death?

• easy problem, but suppose we can’t write the formula...

[Figure: histogram of the number of dead tadpoles per pool (x-axis: 0 to 5; y-axis: 0 to 30).]

binomial example

• even when we can’t write a likelihood expression, we can usually simulate data, conditional on parameters

• strategy

• (1) generate a datum, conditional on parameters

• (2) do (1) a bunch of times

• (3) observe the frequency of the real data in the distribution from (2). This is the likelihood estimate.
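The three-step strategy can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the author's code: the observed datum (1 dead tadpole), the seed, and the variable names are chosen here for the example.

```r
set.seed(2)          # arbitrary seed for repeatability
prob <- 0.4          # parameter value being evaluated
size <- 5            # tadpoles per pool
R    <- 10000        # number of simulated replicates

# (1) and (2): generate a datum conditional on the parameters, R times
e <- rbinom( R , size=size , prob=prob )

# (3): frequency of the observed datum (here, 1 dead tadpole) among the simulations
observed <- 1
lik.sim  <- length( e[e==observed] ) / R

# the binomial case has a formula, so the estimate can be checked against it
lik.true <- dbinom( observed , size=size , prob=prob )
c( simulated=lik.sim , analytical=lik.true )
```

Because the binomial likelihood is known, dbinom() serves only as a check here; in a problem without a formula, lik.sim would be the only estimate available.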

binomial example

prob = 0.4, 10 000 replicates

[Figure: histogram of the simulated counts (x-axis: 0 to 5).]

Likelihood of 1:

> length( k[k==1] )/10000
[1] 0.2626

empirical likelihood estimation

• at each set of parameter values, need to simulate the distribution

• may need many replicates to get a smooth picture of the likelihood surface

• be careful of returning zero (0) likelihoods. NO EVENT should ever have zero chance of happening.
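The zero-likelihood warning is easy to trigger: with few replicates, a possible outcome may never be simulated, its frequency is 0, and its log is -Inf. The sketch below shows one possible guard, add-one smoothing. The guard is an assumption made for illustration; the slides do not say which remedy the author uses.

```r
set.seed(3)   # arbitrary seed for repeatability
R <- 99
# few replicates and a small prob make a count of 5 very unlikely to be simulated
e <- rbinom( R , size=5 , prob=0.1 )

x <- 0:5
raw.freq <- sapply( x , function(y) sum( e == y ) / R )

# Assumed guard (not from the slides): add one pseudo-count to every
# possible outcome, so that no outcome has estimated probability zero.
n.outcomes    <- length( x )
smoothed.freq <- sapply( x , function(y) ( sum( e == y ) + 1 ) / ( R + n.outcomes ) )

log( raw.freq )        # may contain -Inf
log( smoothed.freq )   # always finite
```

The smoothed frequencies still sum to 1, and the distortion they introduce shrinks as R grows.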


• This function does the same thing as dbinom(), but it does it via simulation.

dsimbinom <- function( x , prob , size , log=TRUE , R=99 ) {
    e <- rbinom( R , prob=prob , size=size )          # R simulated counts
    p <- sapply( x , function(y) length(e[e==y])/R )  # frequency of each x in e
    if ( log ) log(p) else p
}


[Figure: -logLik against prob (0.2 to 0.6), estimated with 99, 999, and 9999 replicates. The red curve is the real analytical likelihood function.]
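The comparison in the figure can be approximated numerically. The sketch below repeats a dsimbinom() definition so that it runs on its own, simulates data as in the slides, and measures how far the simulated -logLik curve lies from the analytical one. The grid of prob values, the seed, and the helper names nll.sim and err are assumptions made for this example.

```r
# Simulation-based log-density, repeated here so the sketch is self-contained
dsimbinom <- function( x , prob , size , log=TRUE , R=99 ) {
    e <- rbinom( R , prob=prob , size=size )
    p <- sapply( x , function(y) length(e[e==y])/R )
    if ( log ) log(p) else p
}

set.seed(4)                                   # arbitrary seed
k     <- rbinom( 100 , size=5 , prob=0.4 )    # simulated data
probs <- seq( 0.2 , 0.6 , by=0.02 )           # grid of prob values (assumed)

# -logLik of the data at each prob, for a given number of replicates R
nll.sim <- function( R ) {
    sapply( probs , function(pr) -sum( dsimbinom( k , prob=pr , size=5 , R=R ) ) )
}
nll.true <- sapply( probs , function(pr) -sum( dbinom( k , prob=pr , size=5 , log=TRUE ) ) )

nll.99   <- nll.sim( 99 )     # may contain Inf where a value was never simulated
nll.9999 <- nll.sim( 9999 )

# mean absolute distance from the analytical curve, over finite values only
err <- function( nll ) mean( abs( nll - nll.true )[ is.finite( nll ) ] )
c( R99=err( nll.99 ) , R9999=err( nll.9999 ) )
```

The distance for 9999 replicates is typically much smaller than for 99, which is the numerical counterpart of the smoother curve in the figure.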

empirical likelihood estimation

• “jaggies” are bad. It helps to use SIMULATED ANNEALING (SA) (method="SANN")

• SA hill-climbs, like most algorithms, but also climbs DOWN, with slowly decreasing probability (as it “cools”)

library(bbmle)   # provides mle2()

k <- rbinom( 100 , size=5 , prob=0.4 )

> sum(k)/500
[1] 0.388

m.prob <- mle2( k ~ dbinom( prob=1/(1+exp(z)) , size=5 ) , start=list(z=0) )

m.sim <- mle2( k ~ dsimbinom( prob=1/(1+exp(z)) , size=5 , R=999 ) , start=list(z=0) , method="SANN" )

[Figure: -logLik against prob (0.2 to 0.6).]

> logit(coef(m.prob))
    z
0.388
> logit(coef(m.sim))
       z
0.390865

• logit() here is a helper whose definition is not shown. Given prob=1/(1+exp(z)) in the model, it presumably converts z back to prob as 1/(1+exp(z)), which is not the textbook logit.
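The same idea can be tried without mle2(), using base R's optim() with method="SANN" on a simulated -logLik. This is a sketch under assumptions: the objective nll.sim, the add-one smoothing inside it, the seed, and maxit=2000 are choices made for this example, not the author's settings.

```r
set.seed(5)                                # arbitrary seed
k <- rbinom( 100 , size=5 , prob=0.4 )     # simulated data, as in the slides

# noisy objective: simulated -logLik of the data at z, where prob = 1/(1+exp(z))
nll.sim <- function( z , R=999 ) {
    prob <- 1/(1+exp(z))
    e <- rbinom( R , size=5 , prob=prob )
    # add-one smoothing (an assumption, not in the slides) keeps the value finite
    freq <- sapply( 0:5 , function(y) ( sum( e == y ) + 1 ) / ( R + 6 ) )
    -sum( log( freq[ k + 1 ] ) )           # counts 0..5 index entries 1..6
}

# simulated annealing tolerates the jagged surface better than pure hill-climbing
fit <- optim( par=0 , fn=nll.sim , method="SANN" , control=list( maxit=2000 ) )

prob.hat <- 1/(1+exp( fit$par ))
c( simulated.annealing=prob.hat , closed.form=sum(k)/500 )
```

Because the objective is re-simulated at every evaluation, repeated runs give slightly different estimates; the closed-form value sum(k)/500 is printed alongside as a reference.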

more complex example

• beta-binomial distribution:

• binomial probabilities sampled from a beta distribution

• has an analytical solution, but we’ll do it empirically now

[Figure: beta distributed chances of mortality (x-axis: probability of death, 0 to 1; y-axis: probability of probability of death). Example draws: p1 = 0.4, p2 = 0.65, p3 = 0.12 (40%/60%, 65%/35%, 12%/88%).]

[Figure: binomial trials determine actual deaths in each pool (x-axis: count of dead tadpoles in pool, 0 to 10; y-axis: number of pools).]

binomial vs. beta-binomial, p = 0.5:

rbinom( 1000 , prob=0.5 , size=10 )

rbetabinom( 1000 , shape1=0.9 , shape2=0.9 , size=10 )

[Figure: the beta density (x-axis: probability of death) and histograms of counts for the binomial and the beta-binomial (x-axis: 0 to 10; y-axis: number of pools).]

binomial vs. beta-binomial, p = 0.5:

rbinom( 1000 , prob=0.5 , size=10 )

rbetabinom( 1000 , shape1=2 , shape2=2 , size=10 )

[Figure: the beta density (x-axis: probability of death) and histograms of counts for the binomial and the beta-binomial (x-axis: 0 to 10; y-axis: number of pools).]
heterogeneous tadpoles

[Figure: four beta densities over 0 to 1, with a = 2, b = 2; a = 0.7, b = 0.7; a = 1, b = 2; a = 1, b = 0.7.]

empirical beta-binomial

• out of 5 tadpoles, how many dead?

• assume that mortality correlated WITHIN pools

[Figure: histogram of dead tadpoles per pool (x-axis: dead tadpoles, 0 to 5; y-axis: frequency, 0 to 25).]


dsimbetabinom <- function( x , shape1 , shape2 , size , log=TRUE , R=99 ) {
    # sample R probabilities from the beta distribution
    p <- rbeta( R , shape1=shape1 , shape2=shape2 )
    # sample each event from p
    e <- rbinom( R , size=size , prob=p )
    # observe frequency of each x in distribution of e
    f <- sapply( x , function(y) length(e[e==y])/R )
    if ( log ) log(f) else f
}
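Since the beta-binomial does have an analytical form, the simulation approach can be checked against it. The sketch below writes the probability with base R's choose() and beta() and compares it with simulated frequencies. The function name dbetabinom.exact, the shape values, and the seed are assumptions made for this example.

```r
# analytical beta-binomial probability, using base R's choose() and beta()
dbetabinom.exact <- function( x , shape1 , shape2 , size ) {
    choose( size , x ) * beta( x + shape1 , size - x + shape2 ) / beta( shape1 , shape2 )
}

set.seed(7)   # arbitrary seed
R <- 1e5
p <- rbeta( R , shape1=2 , shape2=2 )      # one probability per replicate
e <- rbinom( R , size=5 , prob=p )         # one simulated count per replicate

sim   <- sapply( 0:5 , function(y) length( e[e==y] ) / R )
exact <- dbetabinom.exact( 0:5 , shape1=2 , shape2=2 , size=5 )
round( rbind( simulated=sim , analytical=exact ) , 3 )
```

With this many replicates the two rows should agree to about two decimal places.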


• The analytical way:

> library(emdbook)
> m.prob <- mle2( k ~ dbetabinom( size=5 , shape1=exp(s1) , shape2=exp(s2) ) , start=list( s1=1,s2=1 ) )

> exp(coef(m.prob))
      s1       s2
1.967890 2.010719

• The empirical way:

> m.sim <- mle2( k ~ dsimbetabinom( shape1=exp(s1) , shape2=exp(s2) , size=5 , R=999 ) , start=list( s1=1 , s2=1 ) , method="SANN" )

> exp(coef(m.sim))
      s1       s2
2.013872 1.964869


• Problems that require empirical likelihood methods

• complex phylogenetic models

• complex population structure models

• almost all Bayesian analyses

• almost all network models

• many “mixed effects” models

• many time series models

bootstrapping

• a special kind of resampling aimed at estimating the variance of an estimate (confidence intervals)

• suppose we can’t estimate confidence from the likelihood surface (can’t write a formula, perhaps)

• can treat the sample like a population, and take many samples of the same size from it

• theory tells us that as sample size increases, the variance in resampled estimates converges to the true variance

bootstrapping

• (1) sample n data from original size n sample, WITH REPLACEMENT

• (2) do (1) many times

• (3) as n increases, histogram from (2) approaches true likelihood surface

• (4) find values of parameter in histogram that mark different confidence limits

bootstrap estimates

• simplest confidence intervals are just read from the histogram

• e.g. 95% intervals

• low: value just above 2.5% of the values

• high: value just above 97.5% of the values
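The resampling steps and the percentile reading of the interval can be sketched in base R. To keep the example free of model fitting, it uses the closed-form estimate of a binomial probability; the helper name est and the seed are assumptions made for this example.

```r
set.seed(8)                                 # arbitrary seed
k <- rbinom( 100 , size=5 , prob=0.3 )      # original sample of 100 pools
n <- length( k )

# estimate of prob; for the binomial this has a closed form
est <- function( x ) sum( x ) / ( 5 * length( x ) )

# (1) resample n data with replacement, (2) repeat many times
boot.est <- replicate( 999 , est( sample( k , n , replace=TRUE ) ) )

# read the 95% limits off the resampled estimates
ci <- quantile( boot.est , probs=c( 0.025 , 0.975 ) )
est( k )
ci
```

A histogram of boot.est, for example hist( boot.est ), gives the picture from which the limits are read.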

[Figure: Histogram of 1/(1 + exp(b$t)) (x-axis: 0.25 to 0.35; y-axis: Frequency, 0 to 200).]

bootstrapping

• Original data:

k <- rbinom( 100 , size=5 , prob=0.3 )

• Estimate parameter:

m <- mle2( k ~ dbinom( prob=logit(z) , size=5 ) , start=list(z=0) )

bootstrapping

• Resample 999 sets of data from the original data, and re-estimate the MLE for each:

plist <- replicate( 999 , coef(mle2( sample(k,100,TRUE) ~ dbinom( prob=logit(z) , size=5 ) , start=list(z=coef(m)[1]) , method="Nelder-Mead" ) ) )

logit( quantile( plist , probs=c(0.025,0.975) ) )

bootstrapping

• Can find the 95% confidence interval just by cutting off the lower and upper 2.5%

• Here: 0.251, 0.335

• confint() gives: 0.259, 0.339

[Figure: Histogram of logit(plist) (x-axis: logit(plist), 0.25 to 0.35; y-axis: Frequency, 0 to 200).]

bootstrapping

• More complicated models are easier to do with the boot library.

• Consider modeling log brain mass against log body mass, for various species (at right).

[Figure: scatterplot of log brain mass (y-axis, 0 to 8) against log body mass (x-axis, 0 to 10), with the label "Dinosaurs".]

bootstrapping

[Figure: scatterplot of log brain mass (y-axis, 0 to 8) against log body mass (x-axis, 0 to 10).]

plot( log(d$brain) ~ log(d$body) , xlab="log body mass" , ylab="log brain mass" )

abline( lm( log(d$brain) ~ log(d$body) ) , col="red" )

bootstrapping

• Now write a function that accepts the original data and a collection of row numbers as parameters:

f.coef <- function( d , i ) {
    # make a new data frame that contains the resampled rows in i
    nd <- d[i,]
    # fit our model to the resampled data
    m <- lm( log(brain) ~ log(body) , data=nd )
    # return coefficients
    coef(m)
}

bootstrapping

• Then tell the boot library to resample and collect coefficients from that function:

library(boot)
boot.animals <- boot( d , f.coef , R=9999 )

General form:

boot.object <- boot( ORIGINAL.DATA , YOUR.FUNCTION , R=NUM.RESAMPLES )
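To see roughly what the resampling of rows involves, the loop can be written by hand in base R. The data frame d used in the slides is not shown, so the sketch builds a synthetic stand-in; the sample size, the true slope of 0.6, the noise level, and the seed are all assumptions made for this example.

```r
set.seed(9)   # arbitrary seed
# Synthetic stand-in for the animals data; the real data frame d is not shown
n     <- 30
body  <- exp( rnorm( n , mean=3 , sd=2 ) )
brain <- exp( 2 + 0.6 * log( body ) + rnorm( n , sd=0.5 ) )
d     <- data.frame( body=body , brain=brain )

# same shape as the function in the slides: data plus a vector of row numbers
f.coef <- function( d , i ) {
    nd <- d[i,]
    coef( lm( log(brain) ~ log(body) , data=nd ) )
}

# draw row numbers with replacement, call the function, collect the results
R <- 999
t.star <- t( replicate( R , f.coef( d , sample( nrow(d) , replace=TRUE ) ) ) )

# percentile interval for the slope (second coefficient)
ci.slope <- quantile( t.star[,2] , probs=c( 0.025 , 0.975 ) )
ci.slope
```

The boot library adds bookkeeping and further interval types on top of this basic loop, which is why it is more convenient for complicated models.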

bootstrapping

plot( boot.animals , index=2 )

[Figure, left panel: histogram of the resampled beta coefficients ("Histogram of t"; x-axis: t*, 0.0 to 0.8; y-axis: Density).]

[Figure, right panel: comparison of the resampled distribution to the normal (x-axis: Quantiles of Standard Normal, -4 to 4; y-axis: t*, 0.0 to 0.8).]

bootstrapping

• Convenient function to extract confidence intervals:

> boot.ci( boot.animals , type="perc" , index=2 )
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 9999 bootstrap replicates

CALL :
boot.ci(boot.out = boot.animals, type = "perc", index = 2)

Intervals :
Level     Percentile
95%   ( 0.2905,  0.7491 )
Calculations and Intervals on Original Scale

> confint( lm( log(d$brain) ~ log(d$body) ) )
                2.5 %    97.5 %
(Intercept) 1.7056829 3.4041133
log(d$body) 0.3353152 0.6566742
