Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference


Page 1: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Jared Tobin (BSc, MAS)

Department of Statistics, The University of Auckland, Auckland, New Zealand

February 24, 2011


Page 2: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

./helloWorld

I’m from St. John’s, Canada


Page 3: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

./helloWorld

It’s a charming city known for

A street containing the most pubs per square foot in North America

What could be the worst weather on the planet

These characteristics are probably related..


Page 4: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

./tellThemAboutMe

I recently completed my Master’s degree in Applied Statistics at Memorial University.

I’m also a Senior Research Analyst with the Government of Newfoundland & Labrador.

This basically means I do a lot of programming and statistics..

(thank you for R!)

Here at Auckland, my main supervisor is Russell.

I’m also affiliated with Fisheries & Oceans Canada (DFO) via my co-supervisor, Noel.


Page 5: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

./whatImDoingHere

Today I’ll be talking about my Master’s research, as well as what I plan to work on during my PhD studies here in Auckland.

So let’s get started.


Page 6: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Likelihood inference

Consider a probabilistic model Y ∼ f(y; ζ), ζ ∈ R, and a sample y of size n from f(y; ζ).

How can we estimate ζ from the sample y?

Everybody knows maximum likelihood.. (right?)

It’s the cornerstone of frequentist inference, and remains quite popular today.

(ask Russell)


Page 7: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Likelihood inference

Define

L(\zeta; y) = \prod_{j=1}^{n} f(y_j; \zeta)

and call it the likelihood function of ζ, given the sample y.

If we take R to be the (non-extended) real line, then ζ ∈ R lets us define ζ̂ = \arg\max_{\zeta} L(\zeta; y) to be the maximum likelihood estimator (or MLE) of ζ.

We can think of ζ̂ as the value of ζ that maximizes the probability of observing the sample y.
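
(An aside that isn’t on the original slide: a minimal R sketch of computing an MLE numerically, here for an exponential rate parameter; the simulated data and the search interval are arbitrary choices.)

  # Minimal sketch: numerical MLE of an exponential rate parameter
  set.seed(1)
  y <- rexp(50, rate = 2)                            # simulated sample, true zeta = 2
  loglik <- function(zeta) sum(dexp(y, rate = zeta, log = TRUE))
  zeta.hat <- optimize(loglik, c(1e-6, 20), maximum = TRUE)$maximum
  c(numerical = zeta.hat, closed.form = 1 / mean(y)) # the two should agree closely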


Page 8: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Nuisance parameter models

Now consider a probabilistic model Y ∼ f(y; ζ) where

\zeta = (\theta, \psi), \quad \zeta \in \mathbb{R}^P, \quad \theta \in \mathbb{R}, \quad \psi \in \mathbb{R}^{P-1}

If we are only interested in θ, then ψ is called a nuisance parameter, or incidental parameter.

It turns out that these things are aptly named..


Page 9: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

The nuisance parameter problem: example

Take a very simple example: an H-strata model with two observations per stratum. Y_h1 and Y_h2 are iid N(µ_h, σ²) random variables, h = 1, . . . , H, and we are interested in estimating σ².

Define µ = (µ_1, . . . , µ_H). Assuming σ² is known, the log-likelihood for µ is

l(\mu; \sigma^2, y) = \sum_h \left\{ -\log 2\pi\sigma^2 - \frac{(y_{h1} - \mu_h)^2 + (y_{h2} - \mu_h)^2}{2\sigma^2} \right\}

and the MLE for the hth stratum, µ̂_h, is that stratum’s sample mean (y_h1 + y_h2)/2.


Page 10: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Example, continued..

To estimate σ², however, we must use that estimate for µ.

It is common to use the profile likelihood, defined for this example as

l^{(P)}(\sigma^2; \hat\mu, y) = \sup_{\mu} l(\mu, \sigma^2; y)

to estimate σ². Maximizing yields

\hat\sigma^2 = \frac{1}{H} \sum_h \frac{(y_{h1} - \hat\mu_h)^2 + (y_{h2} - \hat\mu_h)^2}{2}

as the MLE for σ².


Page 11: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Example, continued..

Let’s check for bias.. let S_h² = [(Y_h1 − µ̂_h)² + (Y_h2 − µ̂_h)²]/2 and note that S_h² = (Y_h1 − Y_h2)²/4 and σ̂² = H^{-1} \sum_h S_h².

Some algebra shows that

E[S_h^2] = \frac{1}{4}\left( \operatorname{var} Y_{h1} + \mu_h^2 + \operatorname{var} Y_{h2} + \mu_h^2 - 2E[Y_{h1} Y_{h2}] \right)

and since Y_h1, Y_h2 are independent, E[Y_h1 Y_h2] = µ_h², so that E[S_h²] = σ²/2.


Page 12: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Example, continued..

Put it together and we have

E[\hat\sigma^2] = \frac{1}{H} \sum_h E[S_h^2] = \frac{1}{H} \sum_h \frac{\sigma^2}{2} = \frac{\sigma^2}{2}

No big deal - everyone knows the MLE for σ² is biased..

But notice the implication for consistency..

\lim_{n \to \infty} P(|\hat\sigma^2 - \sigma^2| < \varepsilon) = 0
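
(Another aside, not on the original slide: a small R simulation that makes the bias visible; H, σ², the stratum means, and the number of replications are arbitrary choices.)

  # Sketch: the Neyman-Scott bias of the MLE of sigma^2 (H strata, 2 obs each)
  set.seed(42)
  H <- 500; sigma2 <- 4
  mle <- replicate(200, {
    mu <- rnorm(H, 0, 10)                          # arbitrary stratum means
    y1 <- rnorm(H, mu, sqrt(sigma2))
    y2 <- rnorm(H, mu, sqrt(sigma2))
    mean((y1 - y2)^2 / 4)                          # sigma.hat^2 = H^-1 * sum_h S_h^2
  })
  mean(mle)                                        # close to sigma2 / 2 = 2, not 4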


Page 13: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Neyman-Scott problems

This result isn’t exactly new..

It was described by Neyman & Scott as early as 1948.

Hence the name: this type of problem is typically known as a Neyman-Scott problem in the literature.

The problem is that one of the required regularity conditions is not met. We usually require that the dimension of (µ, σ²) remain constant for increasing sample size..

But notice that n = \sum_{h=1}^{H} 2 = 2H, so n → ∞ iff H → ∞ iff dim(µ) → ∞.


Page 14: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

The profile likelihood

In a general setting where Y ∼ f(y; ψ, θ) with nuisance parameter ψ, consider the partial information i_{θθ|ψ} = i_{θθ} − i_{θψ} i_{ψψ}^{-1} i_{ψθ}.

It can be shown that the profile expected information i^{(P)}_{θθ} is first-order equivalent to i_{θθ}..

so i_{θθ|ψ} < i^{(P)}_{θθ} in an asymptotic sense.

In other words, the profile likelihood places more weight on information about θ than it ’ought’ to.


Page 15: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Two-index asymptotics for the profile likelihood

Take a general stratified model again.. Y_h ∼ f(y_h; ψ_h, θ) with H strata. Remove the per-stratum sample size restriction of n_h = 2.

Now, if we let both n_h and H approach infinity, we get different results depending on the speed at which n_h → ∞ and H → ∞.

If n_h → ∞ faster than H does, we have that θ̂ − θ = O_p(n^{-1/2}).

If H → ∞ faster, the same difference is of order O_p(n_h^{-1}).

In other words, we can probably expect the profile likelihood to make relatively poor estimates of θ if H > n_h on average.


Page 16: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Alternatives to using the profile likelihood

This type of model comes up a lot in practice.. (example later)

What solutions are available to tackle the nuisance parameter problem?

For the normal nuisance parameter model, the method of moments estimator is an option (and happens to be unbiased & consistent).
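
(A sketch of my own reading of this, not spelled out on the slide: one such moment-based construction uses E[(Y_h1 − Y_h2)²] = 2σ².)

  # Sketch: a moment estimator of sigma^2 for the H-strata model, built from
  # E[(Y_h1 - Y_h2)^2] = 2 * sigma^2; unbiased, and consistent as H grows
  mom.sigma2 <- function(y1, y2) mean((y1 - y2)^2) / 2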


Page 17: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Alternatives to using the profile likelihood

But what if the moments themselves involve multiple parameters?

... then it may be difficult or impossible to construct a method of moments estimator.

Also, likelihood-based estimators generally have desirable statistical properties, and it would be nice to retain those.

There may be a way to patch up the problem with ‘standard’ ML in these models..

hint: there is. my plane ticket was $2300 CAD, so I’d better have something entertaining to tell you..


Page 18: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Motivating example

This research started when looking at a stock assessment problem.

Take the waters off the coast of Newfoundland & Labrador, which historically supported the largest fishery of Atlantic cod in the world.

(keyword: historically)


Page 19: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Motivating example (continued)

Here’s the model: divide these waters (the stock area) into N equally-sized sampling units, where the jth unit, j = 1, . . . , N, contains λ_j fish. Each sampling unit corresponds to the area over the ocean bottom covered by a standardized trawl tow made at a fixed speed and duration.

Then the total number of fish in the stock is λ = \sum_{j=1}^{N} \lambda_j, and we want to estimate this.

In practice we estimate a measure of trawlable abundance. We weight λ by the probability of catching a fish on any given tow, q, and estimate µ = qλ.


Page 20: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Motivating example (continued)

DFO conducts two research trawl surveys on these waters every year using a stratified random sampling scheme..


Page 21: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Motivating example (continued)

For each stratum h, h = 1, . . . , H, we model an observed catch as Y_h ∼ negbin(µ_h, k).

The negative binomial mass function is

P(Y_h = y_h; \mu_h, k) = \frac{\Gamma(y_h + k)}{\Gamma(y_h + 1)\,\Gamma(k)} \left( \frac{\mu_h}{\mu_h + k} \right)^{y_h} \left( \frac{k}{\mu_h + k} \right)^{k}

and it has mean µ_h and variance µ_h + k^{-1} µ_h².
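
(Not on the slide, but useful for reference: this is the parameterization R’s dnbinom uses with size = k and mu = µ_h; a quick sketch checking the mean-variance relationship, with arbitrary values of mu and k.)

  # Sketch: negbin(mu, k) above matches R's dnbinom(y, size = k, mu = mu)
  mu <- 5; k <- 1.5
  y  <- 0:500                                   # support truncated far into the tail
  p  <- dnbinom(y, size = k, mu = mu)
  c(mean = sum(y * p),
    variance = sum(y^2 * p) - sum(y * p)^2,
    mu.plus.mu2.over.k = mu + mu^2 / k)         # the variance should match this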


Page 22: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Motivating example (continued)

We have a nuisance parameter model.. if we want to make interval estimates for µ, we must estimate the dispersion parameter k.

Breezing through the literature will suggest any number of increasingly esoteric ways to do this..

method of moments

pseudo-likelihood

optimal quadratic estimating equations

extended quasi-likelihood

adjusted extended quasi-likelihood

double extended quasi-likelihood

etc.

WTF


Page 23: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Motivating example (continued)

Which of these is best??

(and why are there so many??)

Noel and I wrote a paper that tried to answer the first question..

.. unfortunately, we also wound up adding another long-winded estimator to the list, and so increased the scope of the second.

We coined something called the ‘adjusted double-extended quasi-likelihood’ or ADEQL estimator, which performed best in our simulations.


Page 24: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Motivating example (fini)

When writing my Master’s thesis, I wanted to figure out why this estimator worked as well as it did.

And what exactly are all the other ones?

This involved looking into functional approximations and likelihood asymptotics..

.. but I managed to uncover some fundamental answers that simplified the whole nuisance parameter estimation mumbo jumbo.


Page 25: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Conditional inference

For simplicity, recall the general nonstratified nuisance parameter model.. i.e. Y ∼ f(y; ψ, θ).

Start with some theory. Let (t_1, t_2) be jointly sufficient for (ψ, θ) and let a be ancillary.

If we could factorize the likelihood as

L(\psi, \theta) \approx L(\theta; t_2 \mid a)\, L(\psi, \theta; t_1 \mid t_2, a)

or

L(\psi, \theta) \approx L(\theta; t_2 \mid t_1, a)\, L(\psi, \theta; t_1 \mid a)

then we could maximize L(θ; t_2 | a), called the marginal likelihood, or L(θ; t_2 | t_1, a), the conditional likelihood, to obtain an estimate of θ.


Page 26: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Conditional inference

Each of these functions conditions on statistics that contain all of the information about θ, and negligible information about ψ.

They seek to eliminate the effect of ψ when estimating θ, and thus theoretically solve the nuisance parameter problem.

(both the marginal and conditional likelihoods are special cases of Cox’s partial likelihood function)


Page 27: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Approximate conditional inference?

Disclaimer: theory vs. practice

It is pretty much impossible to show that a factorization like this even exists in practice.

The best we can usually do is try to approximate the conditional or marginal likelihood.


Page 28: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Approximations

There are many, many ways to approximate functions and integrals..

We’ll briefly touch on an important one..

(recall the title of this talk for a hint)

.. the saddlepoint approximation, which is a highly accurate approximation to density functions.

Often it’s capable of outperforming more computationally demanding methods, e.g. Metropolis-Hastings/Gibbs MCMC.

For a few cases (normal, gamma, inverse Gaussian densities), it’s even exact.


Page 29: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Laplace approximation

Familiar with the Laplace approximation? We need it first. We’re interested in \int_a^b f(y; \theta)\, dy for some a < b, and the idea is to use e^{-g(y;\theta)}, for g(y; \theta) = -\log f(y; \theta), to do it.

Truncate a Taylor expansion of e^{-g(y;\theta)} about ỹ, where ỹ minimizes g (equivalently, maximizes f) on (a, b).

Then integrate over (a, b).. we wind up integrating the kernel of a N(ỹ, 1/g''(ỹ)) density and get

\int_a^b f(y; \theta)\, dy \approx \exp\{-g(\tilde y)\} \left\{ \frac{2\pi}{g''(\tilde y)} \right\}^{1/2}

It works because the value of the integral depends mainly on g(ỹ) (the function’s minimum on (a, b), i.e. where f is largest) and g''(ỹ) (its curvature there).
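
(A throwaway R sketch of the idea, not from the slides: approximate the integral of an unnormalized Gamma(5, rate 2) kernel over (0, ∞) and compare with numerical integration; the kernel is an arbitrary choice and g'' is worked out by hand for it.)

  # Sketch: Laplace approximation to an integral vs. numerical integration
  f <- function(y) y^4 * exp(-2 * y)               # unnormalized Gamma(5, rate = 2) kernel
  g <- function(y) -log(f(y))                      # g = -log f, as above
  ytilde  <- optimize(g, c(1e-8, 50))$minimum      # minimizer of g, i.e. the mode of f
  gpp     <- 4 / ytilde^2                          # g''(y) = 4 / y^2 for this particular f
  laplace <- exp(-g(ytilde)) * sqrt(2 * pi / gpp)
  c(laplace = laplace, exact = integrate(f, 0, Inf)$value)   # about 0.735 vs 0.75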


Page 30: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Saddlepoint approximation

We can then do some nifty math in order to refine Taylor’s approximation to a function. Briefly, we want to relate the cumulant generating function to its corresponding density by creating an approximate inverse mapping.

For K(t) the cumulant generating function, the moment generating function can be written

e^{K(t)} = \int_{\mathcal{Y}} e^{ty + \log f(y;\theta)}\, dy

so fix t and let g(t, y) = -ty - \log f(y; \theta). Laplace’s approximation yields

e^{K(t)} \approx \sqrt{\frac{2\pi}{g''(t, y_t)}}\; e^{t y_t} f(y_t; \theta)

where y_t solves g'(t, y_t) = 0.

Page 31: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Saddlepoint approximation

Sparing the nitty gritty, we do some solving and rearranging (particularly involving the saddlepoint equation K'(t) = y), and we come up with

f(y_t; \theta) \approx \left[ 2\pi K''(t) \right]^{-1/2} \exp\{ K(t) - t y_t \}

where y_t solves the saddlepoint equation.

This guy is called the unnormalized saddlepoint approximation to f, and is typically denoted f̃.
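
(A little R sketch of my own, not from the slides: the saddlepoint approximation to a Gamma(α, 1) density, where K(t) = −α log(1 − t) and the saddlepoint equation solves in closed form; α is arbitrary. The constant ratio to the true density is the ‘exact once renormalized’ behaviour mentioned earlier.)

  # Sketch: unnormalized saddlepoint approximation to a Gamma(alpha, 1) density
  alpha <- 3
  sadpt <- function(y) {
    that <- 1 - alpha / y                        # solves K'(t) = alpha / (1 - t) = y
    K    <- -alpha * log(1 - that)
    Kpp  <- alpha / (1 - that)^2
    (2 * pi * Kpp)^(-1/2) * exp(K - that * y)
  }
  y <- seq(0.5, 10, by = 0.5)
  round(sadpt(y) / dgamma(y, shape = alpha), 4)  # constant ratio (~1.028): exact once renormalized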


Page 32: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Saddlepoint approximation

We can normalize the saddlepoint approximation by using c = \int_{\mathcal{Y}} \tilde f(y; \theta)\, dy.

Call f^*(y; \theta) = c^{-1} \tilde f(y; \theta) the renormalized saddlepoint approximation.

How does it perform relative to Taylor’s approximation?

For a sample of size n, Taylor’s approximation is typically accurate to O(n^{-1/2}).. the unnormalized saddlepoint approximation is accurate to O(n^{-1}), while the renormalized version does even better at O(n^{-3/2}).

If small samples are involved, the saddlepoint approximation can make a big difference.


Page 33: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

The p∗ formula

We could use the saddlepoint approximation to directly approximate the marginal likelihood.

(or could we? where would we start?)

Best to continue from Barndorff-Nielsen’s idea.. he and Cox did some particularly horrific math in the 80’s and came up with a second-order approximation to the distribution of the MLE.

Briefly, it involves taking a regular exponential family and then using an unnormalized saddlepoint approximation to approximate the distribution of a minimal sufficient statistic.. make a particular reparameterization and renormalize, and you get the p∗ formula:

p^*(\hat\theta; \theta) = \kappa(\theta)\, |j(\hat\theta)|^{1/2}\, \frac{L(\theta; y)}{L(\hat\theta; y)}

where κ is a renormalizing constant (depending on θ) and j is the observed information.
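
(An illustration of my own, not from the slides: for n iid exponential observations with rate θ, the MLE is θ̂ = 1/ȳ and its exact density is known, so the formula can be checked directly in R; here κ is taken as the unrenormalized constant (2π)^{-1/2}, and n and θ are arbitrary.)

  # Sketch: p* for the MLE of an exponential rate, vs. the exact density of theta.hat
  n <- 10; theta <- 2
  pstar <- function(x) {                         # x plays the role of theta.hat
    lr <- (theta / x)^n * exp(-n * (theta / x - 1))     # L(theta; y) / L(theta.hat; y)
    (2 * pi)^(-1/2) * sqrt(n) / x * lr                  # |j(theta.hat)|^(1/2) = sqrt(n) / x
  }
  # exact density: theta.hat = n / S with S = sum(y) ~ Gamma(n, rate = theta)
  exact <- function(x) dgamma(n / x, shape = n, rate = theta) * n / x^2
  x <- seq(0.5, 6, by = 0.5)
  round(pstar(x) / exact(x), 4)                  # constant ratio ~1.008: renormalizing kappa makes it exact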


Page 34: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Putting likelihood asymptotics to work

So how does the p∗ formula help us approximate the marginal likelihood?

Let t = (t_1, t_2) be a minimal sufficient statistic and u be a statistic such that both (ψ̂, u) and (ψ̂, θ̂) are one-to-one transformations of t, with the distribution of u depending only on θ.

Barndorff-Nielsen (who else) showed that the marginal density of u can be written as

f(u; \theta) = \left\{ f(\hat\psi, \hat\theta;\, \psi, \theta) \left| \frac{\partial(\hat\psi, \hat\theta)}{\partial(\hat\psi, u)} \right| \right\} \Big/ \left\{ f(\hat\psi_\theta;\, \psi, \theta \mid u) \left| \frac{\partial \hat\psi_\theta}{\partial \hat\psi} \right| \right\}


Page 35: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Putting likelihood asymptotics to work

It suffices that we know

\left| \frac{\partial \hat\psi}{\partial \hat\psi_\theta} \right| = \left| j_{\psi\psi}(\hat\lambda_\theta) \right| \left| l_{\psi;\hat\psi}(\hat\lambda_\theta; \hat\psi, u) \right|^{-1}

where λ̂_θ = (θ, ψ̂_θ) and |l_{ψ;ψ̂}(λ̂_θ; ψ̂, u)| is the determinant of a sample space derivative, defined as the matrix \partial^2 l(\lambda; \hat\lambda, u) / \partial\psi\, \partial\hat\psi^T.

We don’t need to worry about the other term |\partial(\hat\psi, \hat\theta) / \partial(\hat\psi, u)|. It doesn’t depend on θ.


Page 36: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Putting likelihood asymptotics to work

We can use the p∗ formula to approximate both f(\hat\psi, \hat\theta; \psi, \theta) and f(\hat\psi_\theta; \psi, \theta \mid u). Doing so, we get

L(\theta; u) \propto \frac{ |j(\hat\psi, \hat\theta)|^{1/2} }{ |j_{\psi\psi}(\hat\psi_\theta, \theta)|^{1/2} } \; \frac{L(\psi, \theta)}{L(\hat\psi, \hat\theta)} \; \frac{L(\hat\psi_\theta, \theta)}{L(\psi, \theta)} \; |j_{\psi\psi}(\hat\psi_\theta, \theta)| \; |l_{\psi;\hat\psi}(\hat\psi_\theta, \theta)|^{-1}

\propto L(\hat\psi_\theta, \theta) \; |j_{\psi\psi}(\hat\psi_\theta, \theta)|^{1/2} \; |l_{\psi;\hat\psi}(\hat\psi_\theta, \theta)|^{-1}

= L^{(P)}(\theta; \hat\psi_\theta) \; |j_{\psi\psi}(\theta, \hat\psi_\theta)|^{1/2} \; |l_{\psi;\hat\psi}(\hat\psi_\theta, \theta)|^{-1}


Page 37: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Modified profile likelihood (MPL)

Taking the logarithm, we get

l(\theta; u) \approx l^{(P)}(\theta; \hat\psi_\theta) + \frac{1}{2} \log \left| j_{\psi\psi}(\theta, \hat\psi_\theta) \right| - \log \left| l_{\psi;\hat\psi}(\theta, \hat\psi_\theta) \right| ,

known as the modified profile likelihood for θ and denoted l^{(M)}(θ).

As it’s based on the saddlepoint approximation, it is a highly accurate approximation to the marginal likelihood l(θ; u) and thus (from before) L(θ; t_2 | a).

In cases where the marginal or conditional likelihood does not exist, it can be thought of as an approximate conditional likelihood for θ.


Page 38: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Two-index asymptotics for the MPL

Recall the stratified model with H strata.. how does the modified profile likelihood perform in a two-index asymptotic setting?

If n_h → ∞ faster than H, we have a similar bound as before: θ̂^{(M)} − θ = O_p(n^{-1/2}).

The difference this time is that n_h must only increase without bound faster than H^{1/3}, which is a much weaker condition.

If H → ∞ faster than n_h, then we have a boost in performance over the profile likelihood in that θ̂^{(M)} − θ = O_p(n_h^{-2}) (as opposed to O_p(n_h^{-1})).


Page 39: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Modified profile likelihood (MPL)

The profile observed information term j_{ψψ}(θ, ψ̂_θ) in the MPL corrects the profile likelihood’s habit of putting excess information on θ.

What about the sample space derivative term l_{ψ;ψ̂}?

.. this preserves the structure of the parameterization. If θ and ψ are not parameter orthogonal, this term ensures that parameterization invariance holds.

What if θ and ψ are parameter orthogonal?


Page 40: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Adjusted profile likelihood (APL)

If θ and ψ are orthogonal, we can do without the sample space derivative..

.. we can define

l^{(A)}(\theta) = l^{(P)}(\theta) - \frac{1}{2} \log \left| j_{\psi\psi}(\theta, \hat\psi_\theta) \right|

as the adjusted profile likelihood, which is equivalent to the MPL when θ and ψ are parameter orthogonal.

As a special case of the MPL, the APL has comparable performance as long as θ and ψ are approximately orthogonal.
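
(To tie this back to the Neyman-Scott example, a small R sketch of my own: with µ as the nuisance parameter each stratum contributes 2/σ² to j_µµ, so the adjustment is −(H/2) log(2/σ²), and maximizing the adjusted profile log-likelihood recovers the familiar unbiased, REML-type estimator, while the plain profile likelihood lands near σ²/2.)

  # Sketch: profile vs. adjusted profile log-likelihood for sigma^2
  # (H-strata normal model, 2 observations per stratum, mu_h as nuisance)
  set.seed(1)
  H <- 200; sigma2 <- 4
  mu <- rnorm(H, 0, 10)
  y1 <- rnorm(H, mu, sqrt(sigma2)); y2 <- rnorm(H, mu, sqrt(sigma2))
  d  <- (y1 - y2)^2 / 2                          # per-stratum sum of squares about mu.hat
  lp <- function(s2) -H * log(2 * pi * s2) - sum(d) / (2 * s2)     # profile log-likelihood
  la <- function(s2) lp(s2) - 0.5 * H * log(2 / s2)                # APL: minus (1/2) log |j_mumu|
  c(profile  = optimize(lp, c(0.01, 100), maximum = TRUE)$maximum, # near sigma2 / 2
    adjusted = optimize(la, c(0.01, 100), maximum = TRUE)$maximum) # near sigma2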


Page 41: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

MPL vs APL

It’s interesting to note the nature of the difference between the MPL and APL..

While the MPL arises via the p∗ formula, the APL can actually be derived via a lower-order Laplace approximation to the integrated likelihood

L(\theta) = \int_{\mathbb{R}} L(\psi, \theta)\, d\psi \approx \exp\left\{ l^{(P)}(\theta; \hat\psi_\theta) \right\} \left\{ \frac{-2\pi}{ \left. \partial^2 l(\psi, \theta) / \partial\psi^2 \right|_{\psi = \hat\psi_\theta} } \right\}^{1/2} = L^{(P)}(\theta; \hat\psi_\theta) \left[ 2\pi\, j_{\psi\psi}(\hat\psi_\theta)^{-1} \right]^{1/2}


Page 42: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

MPL vs APL

In practice we can often get away with using the APL.

May require assuming that θ and ψ are parameter orthogonal, but this is often the case anyway (e.g. joint mean/dispersion GLMs, mixed models via REML).

In particular, if θ is a scalar, then an orthogonal reparameterization can always be found.

This broad applicability means that the adjustment term -\frac{1}{2} \log \left| j_{\psi\psi}(\theta, \hat\psi_\theta) \right| can be used in GLMs, quasi-GLMs, HGLMs, etc.


Page 43: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Getting back to the problem..

In Noel’s and my paper, we compared a bunch of estimators for the negative binomial dispersion parameter k.. the most relevant methods to us are

maximum (profile) likelihood (ML)

adjusted profile likelihood (AML)

extended quasi-likelihood (EQL)

adjusted extended quasi-likelihood (AEQL)

double extended quasi-likelihood (DEQL)

adjusted double extended quasi-likelihood (ADEQL)

What insight did the whole likelihood asymptotics exercise shed on this?

It showed two branches of estimators and developed a theoretical hierarchy in each..


Page 44: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Insight

The EQL function is actually a saddlepoint approximation to an exponential family likelihood..

q^{+(P)}(k) = \sum_{h,i} \left\{ \left[ y_{hi} \log \bar y_h + (y_{hi} + k) \log \frac{y_{hi} + k}{\bar y_h + k} \right] - \frac{1}{2} \log(y_{hi} + k) + \frac{1}{2} \log k - \frac{y_{hi}}{12 k (y_{hi} + k)} \right\}

.. so it should perform similarly to (but worse than) the MLE.

The double extended quasi-likelihood function is actually the EQL function for the strata mean model. And the AEQL function is actually an approximation to the adjusted profile likelihood.. so the adjusted profile likelihood should intuitively perform better.


Page 45: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Insight

In our paper, the results didn’t exactly follow this theoretical pattern..


Page 46: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Insight

.. but in that paper we had capped estimates of k at 10k.

I decided to throw out (and not resample) nonconverging estimates of k in my thesis.

This means I had some information about how many estimates failed to converge, but those estimates didn’t throw off my simulation averages.

Sure enough, upon doing that, the estimators performed according to the theoretical hierarchy.


Page 47: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Estimator performance (thesis)

Table: Average performance measures across all factor combinations, by estimator.

                    ML      AML    EQL     CREQL  LNEQL  CTEQL
Avg. abs. % bias    109.00  30.00  110.00  31.00  33.00  26.00
Avg. MSE            10.21   0.47   10.22   0.47   1.58   0.55
Avg. prop. NC       0.09    0.00   0.09    0.00   0.02   0.00

Table: Ranks of estimator by criterion. Overall rank is calculated as the ranked average of all other ranks.

                    ML      AML    EQL     CREQL  LNEQL  CTEQL
Avg. abs. % bias    5       2      6       3      4      1
Avg. MSE            5       1      6       2      4      3
Avg. prop. NC       5.5     2      5.5     2      4      2
Overall rank        5       1      6       3      4      2


Page 48: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

End of the story..

The odd estimator out was the one we originally called ADEQL in our paper. It’s the k-root of this guy:

\sum_{h,i} \left\{ 2k \log \frac{\bar y_h + k}{y_{hi} + k} + \frac{k}{y_{hi} + k} - \frac{y_{hi} (y_{hi} + 2k)}{6 k (y_{hi} + k)^2} \right\} - (n - H) = 0
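
(Taking the reconstructed display above at face value, a sketch of my own showing how one could solve it for k numerically in R; the simulated stratified negative binomial data and the search bracket are arbitrary choices.)

  # Sketch: solve the estimating equation above for k with uniroot
  set.seed(7)
  H <- 20; nh <- 5; k.true <- 2
  mu   <- rgamma(H, shape = 4, rate = 1)         # arbitrary stratum means
  y    <- lapply(mu, function(m) rnbinom(nh, size = k.true, mu = m))
  ybar <- sapply(y, mean); n <- H * nh
  lhs <- function(k) {
    per.stratum <- mapply(function(yh, yb)
      sum(2 * k * log((yb + k) / (yh + k)) + k / (yh + k) -
          yh * (yh + 2 * k) / (6 * k * (yh + k)^2)), y, ybar)
    sum(per.stratum) - (n - H)
  }
  uniroot(lhs, c(0.1, 50))$root                  # the CTEQL-style estimate of k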

The whole saddlepoint deal revealed the DEQL part was really just EQL.. so really it’s an adjustment of an approximation to the profile likelihood, and the adjustment itself is degrees-of-freedom based.

It performed very well; best in our paper, and second only to the adjusted profile likelihood in my thesis. I have since called it the Cadigan-Tobin EQL (or CTEQL) estimator for k.


Page 49: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Future research direction

Integrated likelihood

Should the adjusted profile likelihood be adopted as a ‘standard’ way to remove nuisance parameters?

How does the adjusted profile likelihood compare to the MPL if we depart from parameter orthogonality?

If we do find poor performance under parameter non-orthogonality, how difficult is it to approximate a sample space derivative in general?

Can autodiff assist with this? Or is there some neat saddlepoint-like approximation that will do the trick?


Page 50: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Acronym treadmill..

Russell calls integrated likelihood ’GREML’, for Generalized REstricted Maximum Likelihood.

Tacking on ‘INference’ shows the dangers of acronymization..

We’re oh-so-close to GREMLIN..

(hopefully that won’t be my greatest contribution to statistics.. )


Page 51: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Future research direction

Other interests

Machine learning

Information geometry and asymptotics

Quantum information theory and L2-norm probability (?)

I would be happy to work with anyone in any of these areas!


Page 52: Saddlepoint approximations, likelihood asymptotics, and approximate conditional inference

Contact

Email: [email protected]
jahredtobin

Website/Blog: http://jtobin.ca

MAS Thesis: http://jtobin.ca/jTobin MAS thesis.pdf
