
The Triangle of Statistical Inference: Likelihood

Data

Scientific Model

Probability Model

Inference

An example...

The Data: xᵢ = measurements of DBH (diameter at breast height) on 50 trees; yᵢ = measurements of crown radius on those trees

The Scientific Model: yᵢ = α + βxᵢ + ε (a linear relationship, with two parameters (α, β) and an error term ε, the residuals)

The Probability Model: ε is normally distributed, with E[ε] = 0 and variance estimated from the observed variance of the residuals...

Data

Scientific Model (hypothesis)

Probability Model

Inference


So what is likelihood, and what is it good for?

1. Probability based ("inverse probability"): a "mathematical quantity that appears to be appropriate for measuring our order of preference among different possible populations but does not in fact obey the laws of probability" (R.A. Fisher).

2. The foundation of the theory of statistics.

3. Enables comparison of alternate models.

So what is likelihood, and what is it good for?

Scientific hypotheses cannot be treated as outcomes of trials (probabilities) because we will never have the full set of possible outcomes.

However, we can calculate the probability of obtaining the results, given our model (scientific hypothesis): P(data|model).

Likelihood is proportional to this probability.

Likelihood is proportional to probability

P(data | hypothesis θ) ∝ L(θ | data)

P(data | θ) = k L(θ | data)

In plain English: "The likelihood (L) of the set of parameters (θ) in the scientific model, given the data (x), is proportional to the probability of observing the data, given the parameters..."

{and this probability is something we can calculate, using the appropriate underlying probability model (i.e. a PDF)}

Parameter values can specify your hypotheses

P(dataᵢ | θ) = k L(θ | data)

Probability: the parameter is fixed and the data are variable. What is the probability of observing the data if our model and parameters are correct?

Likelihood: the parameter is variable and the data are fixed. What is the likelihood of the parameter, given the data?

General Likelihood Function

L(θ|x) = c·g(x|θ)

where L(θ|x) is the likelihood function, x are the data (xᵢ), θ are the parameters in the probability model, and g is a probability density function or discrete density function.

c is a constant, and is thus unimportant for comparing alternate hypotheses or models, as long as the data remain constant.

[Figure: a probability density function; x-axis from −4 to 5, y-axis: probability, 0 to 0.4]

General Likelihood Function

L(θ|x) = ∏ᵢ₌₁ⁿ g(xᵢ|θ)

where L(θ|x) is the likelihood function, xᵢ are the data, θ are the parameters in the probability model, and g is a probability density function or discrete density function.

[Figure: a probability density function; x-axis from −4 to 5, y-axis: probability, 0 to 0.4]

The parameters of the pdf are determined by the data and by the value of the parameters in the scientific model!!

Likelihood Axiom

"Within the framework of a statistical model, a set of data supports one statistical hypothesis better than another if the likelihood of the first hypothesis, on the data, exceeds the likelihood of the second hypothesis." (Edwards 1972)

How to derive a likelihood function: Binomial

Event: 10 trees die out of a population of 50.
Question: What is the mortality rate (p)?

Probability density function: g(x) = (n choose x) p^x (1 − p)^(n−x)

Likelihood: L(p|x) = c·g(x|p), so
L(p|10) = g(10|p) = (50 choose 10) p^10 (1 − p)^40

The most likely parameter value is 10/50 = 0.20.
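This is easy to check numerically. A minimal R sketch (R is the language used later in these slides), evaluating the binomial likelihood over a grid of candidate values of p:

x <- 10; n <- 50                      # 10 trees die out of a population of 50
p <- seq(0.01, 0.99, by = 0.01)       # candidate mortality rates
lik <- dbinom(x, size = n, prob = p)  # L(p | x), up to the constant c
p[which.max(lik)]                     # returns 0.2, the maximum likelihood estimate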

Likelihood Profile: Binomial

L(p|10) = g(10|p) = (50 choose 10) p^10 (1 − p)^40

[Figure: likelihood profile; x-axis: value of the estimated parameter (p), 0 to 1; y-axis: likelihood, −2×10⁻¹² to 1.6×10⁻¹¹, peaking at p = 0.2]

The model (parameter p) is defined by the data!!

An example: Can we predict tree fecundity as a function of tree size?

The Data: xᵢ = measurements of DBH on 50 trees; yᵢ = counts of seeds produced by those trees

The Scientific Model: yᵢ = f(DBHᵢ; β) (an exponential relationship, with one parameter β, and an error term ε)

The Probability Model: the data follow a Poisson distribution, with E[x] = λ and variance = λ


Iterative process

1. Pick a value for the parameter β in your scientific model (recall that the scientific model is an exponential function of DBH).

2. For each data point, calculate the expected (predicted) value for that value of β.

3. Calculate the probability of observing what you observed, given that parameter value and your probability model.

4. Multiply the probabilities of the individual observations.

5. Go back to step 1 until you find the maximum likelihood estimate of the parameter (see the R sketch below).
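A minimal R sketch of this loop. The data here are simulated purely for illustration, and the specific exponential form exp(β·DBH) is an assumption standing in for the scientific model above:

# Hypothetical data: DBH (cm) and seed counts for 50 trees
set.seed(1)
dbh   <- runif(50, 5, 40)
seeds <- rpois(50, lambda = exp(0.05 * dbh))

beta.grid <- seq(0.01, 0.10, by = 0.001)          # step 1: candidate values of beta
loglik <- sapply(beta.grid, function(b) {
  pred <- exp(b * dbh)                            # step 2: predicted value for each tree
  sum(dpois(seeds, lambda = pred, log = TRUE))    # steps 3 and 4, done on the log scale
})
beta.grid[which.max(loglik)]                      # step 5: approximate MLE of beta

Step 4's product is computed here as a sum of logs, for the reasons given a few slides below.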


Likelihood Poisson Process

P(X = x) = e^(−λ) λ^x / x!    with E[x] = λ

First pass…

Model: yᵢ = f(DBHᵢ; β)
Predicted = 0.0617; Observed = 2

[Figure: probabilities for a Poisson random variable with E[x₁] = 0.0617; x-axis: number of seeds (0, 1, 2), y-axis: probability, 0 to 1]

Do for n observations……

P(X = x) = e^(−λ) λ^x / x!    with E[x] = λ = pred

For one observation: P(X = 2) = e^(−pred) pred² / 2! = 0.0017

Pick a new value of beta...

[Figure: probabilities for a Poisson random variable with E[x₁] = 0.498; x-axis: number of seeds (0 to 4), y-axis: probability, 0 to 0.7]

Do for n observations……

P(X = x) = e^(−λ) λ^x / x!    with E[x] = λ = pred

For one observation: P(X = 2) = e^(−pred) pred² / 2! = 0.075

Model: yᵢ = f(DBHᵢ; β), with a new value of β
Predicted = 0.498; Observed = 2

Probability and Likelihood

1. Multiplying many small probabilities is not convenient from a computational point of view (the product quickly underflows toward zero).

2. Instead, we take the log of each probability, sum the logs, and maximize that sum (see the R illustration below).

3. This gives us the Maximum Likelihood Estimate of the parameter.
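A quick R illustration of the computational point, reusing the simulated Poisson data from the sketch above (the model form and the data are still assumptions):

set.seed(1)
dbh   <- runif(50, 5, 40)
seeds <- rpois(50, lambda = exp(0.05 * dbh))
pred  <- exp(0.05 * dbh)                  # predicted means for one value of beta
prod(dpois(seeds, pred))                  # product of probabilities: a vanishingly small number
sum(dpois(seeds, pred, log = TRUE))       # log-likelihood: same information, numerically stable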

Likelihood Profile

[Figure: log-likelihood profile for the model yᵢ = f(DBHᵢ; β); x-axis: beta, 0 to 1.2; y-axis: log-likelihood, −170.5 to −165.0, with the ML estimate marked at the peak]

Model comparison


The Data: xᵢ = measurements of DBH on 50 trees; yᵢ = counts of seeds produced by those trees

The Scientific Models:
yᵢ = f(DBHᵢ; β) (an exponential relationship, with one parameter β)
OR
yᵢ = β·DBHᵢ (a linear relationship, with one parameter β)

The Probability Model: the data follow a Poisson distribution, with E[x] = λ and variance = λ

Model comparison


The Data: xᵢ = measurements of DBH on 50 trees; yᵢ = counts of seeds produced by those trees

The Scientific Model: yᵢ = f(DBHᵢ; β) (an exponential relationship, with one parameter β)

The Probability Models:
The data follow a Poisson distribution, with E[x] = λ and variance = λ
OR
The data follow a negative binomial distribution, with E[x] = m and clumping parameter k (the variance is defined by m and k, which are estimated).

Determination of the appropriate likelihood function

FIRST PRINCIPLES
1. Proportions: Binomial
2. Several categories: Multinomial
3. Count events: Poisson, Negative binomial
4. Continuous data, additive processes: Normal
5. Quantities from multiplicative processes: Lognormal, Gamma

EMPIRICAL
1. Examine the residuals.
2. Test different probability distributions for the model errors.

Probability models can be thought of as competing hypotheses in exactly the same way that different parameter values (structural models) are competing hypotheses.

Likelihood functions: An aside about logarithms

Basic log operations:

log(a·b) = log(a) + log(b)
log(a/b) = log(a) − log(b)
log(bᵃ) = a·log(b)
log_a(a) = 1

Taking the logarithm of a number in base a is the inverse of raising a to a power. Example: log₁₀(1000) = 3, because 10³ = 1000.

Poisson Likelihood Function

Discrete density function: P(X = x) = e^(−λ) λ^x / x!    with E[X] = λ = Variance

Likelihood: L(λ|x) = ∏ᵢ₌₁ⁿ e^(−λ) λ^(xᵢ) / xᵢ!

Log-likelihood(λ|x) = Σᵢ₌₁ⁿ [xᵢ ln(λ) − λ − ln(xᵢ!)]

Negative Binomial Distribution Likelihood Function

Discrete density function:
Pr(X = n) = [Γ(k + n) / (Γ(k) n!)] · (m/(m + k))^n · (k/(m + k))^k
with E[X] = m and Variance = m + m²/k

Likelihood:
L(m, k | x) = ∏ᵢ₌₁ᴺ [Γ(k + nᵢ) / (Γ(k) nᵢ!)] · (m/(m + k))^(nᵢ) · (k/(m + k))^k

Log-likelihood(m, k | x) = Σᵢ₌₁ᴺ [ln Γ(k + nᵢ) − ln(nᵢ!) + nᵢ ln(m) − (nᵢ + k) ln(m + k)] + N[k ln(k) − ln Γ(k)]

k is an estimated parameter!!
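A sketch of fitting both parameters in R; R's dnbinom uses this same ecological parameterization, with mu = m and size = k. The count data are simulated for illustration:

set.seed(2)
counts <- rnbinom(100, mu = 4, size = 1.5)     # hypothetical overdispersed counts
negll  <- function(par)                        # par = c(m, k)
  -sum(dnbinom(counts, mu = par[1], size = par[2], log = TRUE))
fit <- optim(c(mean(counts), 1), negll,        # start at the sample mean and k = 1
             method = "L-BFGS-B", lower = c(0.01, 0.01))
fit$par                                        # maximum likelihood estimates of m and k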

Normal Distribution Likelihood Function

Prob. density function:
f(x) = [1/√(2πσ²)] · exp(−(x − μ)² / (2σ²))
with E[x] = μ and Variance = σ²

Likelihood:
L(μ, σ | x) = ∏ᵢ₌₁ⁿ [1/√(2πσ²)] · exp(−(xᵢ − μ)² / (2σ²))

Log-likelihood(μ, σ | x) = −(n/2)[ln(2π) + ln(σ²)] − Σᵢ₌₁ⁿ (xᵢ − μ)² / (2σ²)
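The same recipe in R, again with simulated data; dnorm(..., log = TRUE) is exactly the log of the density above:

set.seed(3)
x <- rnorm(100, mean = 10, sd = 2)             # hypothetical continuous data
negll <- function(par)                         # par = c(mu, sigma)
  -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
optim(c(mean(x), sd(x)), negll,
      method = "L-BFGS-B", lower = c(-Inf, 0.001))$par   # MLEs of mu and sigma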

Lognormal Distribution Likelihood Function

Prob. density function:
f(x) = [1/(x√(2πσ²))] · exp(−(ln(x) − μ)² / (2σ²))

Likelihood:
L(μ, σ | x) = ∏ᵢ₌₁ⁿ [1/(xᵢ√(2πσ²))] · exp(−(ln(xᵢ) − μ)² / (2σ²))

Log-likelihood(μ, σ | x) = −(n/2)[ln(2π) + ln(σ²)] − Σᵢ₌₁ⁿ ln(xᵢ) − Σᵢ₌₁ⁿ (ln(xᵢ) − ln(x̂ᵢ))² / (2σ²)

with E[ln(x)] = ln(x̂) and σ² = Σᵢ₌₁ⁿ (ln(xᵢ) − ln(x̂ᵢ))² / n

Gamma Distribution Likelihood Function

Prob. density function:
f(x) = [1/(Γ(a) s^a)] · x^(a−1) · e^(−x/s)

a = shape parameter, s = scale parameter
E[x] = as,  Var[x] = as²

Log-likelihood(a, s | x) = Σᵢ₌₁ⁿ [−a ln(s) − ln Γ(a) + (a − 1) ln(xᵢ) − xᵢ/s]

Exponential Distribution Likelihood Function

Prob. density function: f(x) = λ e^(−λx)
with E[x] = 1/λ and Variance = 1/λ²

Likelihood: L(λ|x) = ∏ᵢ₌₁ⁿ λ e^(−λxᵢ)

Log-likelihood(λ|x) = n ln(λ) − λ Σᵢ₌₁ⁿ xᵢ

Evaluating the strength of evidence for the MLE

Now that you have an MLE, how should you evaluate it?

Two purposes of support/confidence intervals

• Measure of support for alternate parameter estimates.

• Help with fitting when something goes wrong.

Methods of calculating support intervals

• Bootstrapping

• Likelihood curves and profiles

Bootstrapping

• Resample the data with replacement, re-estimate the parameter on each resampled dataset, and record how often the estimate falls within a given interval (see the R sketch below).

• Frequentist approach: If I sampled my data a large number of times, what would my confidence in the estimate be?
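A minimal bootstrap sketch in R, reusing the binomial mortality example (10 deaths out of 50 trees); for that model the MLE of p is simply the mean of the 0/1 data:

set.seed(4)
deaths <- c(rep(1, 10), rep(0, 40))            # the 50-tree data as 0/1 outcomes
boot.p <- replicate(1000, mean(sample(deaths, replace = TRUE)))  # re-estimate p on each resample
quantile(boot.p, c(0.025, 0.975))              # bootstrap 95% interval for p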

General method

• Draw the likelihood curve (one parameter), surface (two parameters), or n-dimensional space (n parameters).

• Figure out how much the likelihood changes as the parameter of interest moves away from the MLE.

Strength of evidence for particular parameter estimates – “Support”

• Likelihood provides an objective measure of the strength of evidence for different parameter estimates...

Log-likelihood = “Support” (Edwards 1992)

[Figure: log-likelihood ("support") curve; x-axis: parameter estimate, 2.0 to 2.8; y-axis: log-likelihood, −155 to −147]

Asymptotic vs. Simultaneous M-Unit Support Limits

• Asymptotic: hold all other parameters at their MLE values, and systematically vary the remaining parameter until the likelihood declines by a chosen amount (m).

[Figure: log-likelihood curve (x-axis: parameter estimate, 2.0 to 2.8; y-axis: log-likelihood, −155 to −147), with the maximum likelihood estimate at the peak and the 2-unit support interval marked where the curve has dropped 2 units below the maximum]

What should "m" be? (1.92 is a good number, and is roughly analogous to a 95% CI.)
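For a one-parameter model this is a few lines in R; here is the binomial profile from earlier, with m = 1.92:

p  <- seq(0.001, 0.999, by = 0.001)                # fine grid of candidate p values
ll <- dbinom(10, size = 50, prob = p, log = TRUE)  # log-likelihood profile for p
range(p[ll >= max(ll) - 1.92])                     # approximate 1.92-unit support interval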

An aside on the Likelihood Ratio Test

• Twice the difference in log-likelihoods (R) follows a chi-square distribution, with degrees of freedom equal to the difference in the number of parameters between models A and B:

R = 2[log L(Y|M_A) − log L(Y|M_B)]
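For example, in R (the log-likelihood values here are hypothetical):

llA <- -165.2                            # log-likelihood of model A (one extra parameter)
llB <- -168.9                            # log-likelihood of the simpler model B
R   <- 2 * (llA - llB)
pchisq(R, df = 1, lower.tail = FALSE)    # p-value for the improvement from the extra parameter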

Asymptotic vs. Simultaneous M-Unit Support Limits

• Simultaneous: resampling method. Draw a very large number of random sets of parameters and calculate the log-likelihood of each; the m-unit simultaneous support limits for parameter xᵢ are the upper and lower limits among the sets that do not differ from the maximum by more than m units of support.

• Alternatively, set the focal parameter to a range of values and, for each value, optimize the likelihood over all of the other parameters.

In practice, this can require an enormous number of iterations if there are more than a few parameters.

Asymptotic vs. Simultaneous Support Limits

[Figure: a hypothetical likelihood surface for 2 parameters (axes: Parameter 1, Parameter 2), showing the 2-unit drop in support and both the asymptotic and the simultaneous 2-unit support limits for P1]

Other measures of strength of evidence for different parameter estimates

• Edwards (1992; Chapter 5): various measures of the "shape" of the likelihood surface in the vicinity of the MLE... How pointed is the peak?

Evaluating Support for Parameter Estimates

• Traditional confidence intervals and standard errors of the parameter estimates can be generated from the Hessian matrix

– Hessian = matrix of second partial derivatives of the likelihood function with respect to parameters, evaluated at the maximum likelihood estimates

– Also called the “Information Matrix” by Fisher

– Provides a measure of the steepness of the likelihood surface in the region of the optimum

– Evaluated at the MLE, it is the observed information matrix

– Can be generated in R using optim

An example from R

• The Hessian matrix (when maximizing a log-likelihood) is the matrix of second partial derivatives of the log-likelihood function, evaluated at the maximum likelihood estimates; its negative is a numerical approximation of Fisher's Information Matrix. It is thus a measure of how steeply the likelihood surface drops away as you move from the MLE.
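The slides do not show the call that produced res; here is a sketch of one way to produce such an object, with simulated data and a normal model whose parameters correspond to the a, b, and sd labels in the output below:

set.seed(5)
x <- runif(100, 0, 50)
y <- 2 + 0.1 * x + rnorm(100, sd = 1)          # hypothetical linear data
loglik <- function(par)                        # par = c(a, b, sd)
  sum(dnorm(y, mean = par[1] + par[2] * x, sd = par[3], log = TRUE))
res <- optim(c(a = 1, b = 0.2, sd = 2), loglik,
             method = "L-BFGS-B", lower = c(-Inf, -Inf, 0.001),
             control = list(fnscale = -1),     # fnscale = -1 makes optim maximize
             hessian = TRUE)                   # return the Hessian at the optimum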

> res$hessian
           a          b       sd
a   -150.182  -2758.360   -0.201
b  -2758.360 -67984.416   -5.925
sd    -0.202     -5.926 -299.422

The Hessian CI

• Now invert the negative of the Hessian matrix to get the matrix of parameter variance and covariance

• The square roots of the diagonals of the inverted negative Hessian are the standard errors

• Are we reverting to a frequentist framework?

> solve(-1*res$hessian)
               a             b            sd
a   2.613229e-02 -1.060277e-03  3.370998e-06
b  -1.060277e-03  5.772835e-05 -4.278866e-07
sd  3.370998e-06 -4.278866e-07  3.339775e-03

(and ±1.96 × S.E. gives an approximate 95% CI)

> sqrt(diag(solve(-1*res$hessian)))
       a        b       sd
  0.1616 0.007597  0.05779
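Putting the pieces together with the res object sketched above:

se <- sqrt(diag(solve(-1 * res$hessian)))   # standard errors from the inverted negative Hessian
cbind(lower = res$par - 1.96 * se,          # approximate 95% confidence interval
      upper = res$par + 1.96 * se)          # for each parameter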

Some references

Edwards, A.W.F. 1972. Likelihood. Cambridge University Press.

Feller, W. 1968. An Introduction to Probability Theory and Its Applications. Wiley & Sons.
