big model, big data

Post on 16-Apr-2017


To avoid fainting, keep repeating “It’s only a model”...

Daniel Simpson

Department of Mathematical Sciences, University of Bath

Outline

Amuse-bouche

What is this? A model for ants?!

Moisture is the essence of wetness, and wetness is the essence of beauty.

With low power comes great responsibility

Miss Jackson if you’re nasty

The Ganzfeld effect

“We are tied down to a language which makes up in obscurity what it lacks in style.”

Never mind the big data, here come the big models

“Eternity is a terrible thought. I mean, where’s it going to end?”

Folk definition
A model becomes big when the methods I want to use no longer work.

- Solving “big models” requires investment in modelling (priors and likelihoods).

- Solving “big models” requires investment in computation (scalable, tailored solutions).

- Solving “big models” requires compromise (exactness isn’t an option).

“We’re more of the love, blood, and rhetoric school”

Question for the ages
Is my model really “big” if it only has one infinite-dimensional parameter?

- I spend a lot of effort trying to convince myself that my models aren’t too big

- Most of the games that I play are with space, time, or some horrid combination of the two

- I am addicted to GAMs

- I was once at a party where all the cool kids were doing inverse problems. I swear I didn’t inhale.


An archipelago off the coast of Finland
(with Haakon Bakka, Håvard Rue, and Jarno Vanhatalo)

- Smelt (other species: herring, perch, pikeperch)

- Commercially important fish species

- Survey data on fish larvae abundance

- Complex models, high uncertainty

An archipelago off the coast of Finland

[Figure: map of the study area, with x and y coordinate axes]

I’ll tell you what I want (what I really really want)

The questions for this dataset revolve around conservation (e.g. should we protect some regions?).

Statistical questions
- “Interpolation” (where are the smelt?)
- “Extrapolation” (where would we expect the smelt to be in a similar, but different, environment?)
- Model the nonlinear effect of environmental variables (scenario forecasts).


What’s a homophone between friends?

Antisocial behaviour in Wales. Social behaviour in whales.

It’s kinda like a point pattern...
(Finn Lindgren, Fabian Bachl, Joyce Yuan, David Borchers, Janine Illian)

- You can treat whale pods as a point process

- Partially observed along transects (unknown detection probability!)

- Each observed pod has a noisy measurement of size

Gimme! Gimme! Gimme! (A Man After Midnight)
(I’d already used the Spice Girls one...)

The scientific questions are “How many whales are there?” and “Where are they?”.

Statistical questions:
- How do you estimate the detection probability?
- How do you do inference with a filtered point process?
- How do you deal with error in the mark observation?


Like any good academic statistician, I will now present the abstract problem
(And never return to the interesting one)

Observed field: yᵢ ∼ π(yᵢ | η, θ)

Latent field: η ∼ N(0, Q⁻¹)

Parameters: θ ∼ π(θ)

- The observations yᵢ, i = 1, …, N, can be non-Gaussian, multivariate, etc.

- θ contains any non-Gaussian parameters

- η ∈ ℝⁿ contains everything that is jointly Gaussian.

Big N = “Big Data”. Big n = “Big Model”
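A toy simulation of this three-layer structure. All sizes, the diagonal precision matrix, and the Poisson observation model here are hypothetical choices, purely to illustrate the roles of N, n, η, and θ:

```python
import numpy as np

rng = np.random.default_rng(0)

n, N = 5, 100                              # n = latent ("model") size, N = data size

# Parameters theta: here just a fixed precision scale
tau = 2.0

# Latent field eta ~ N(0, Q^{-1}), with the simplest possible precision Q = tau * I
Q = tau * np.eye(n)
eta = rng.multivariate_normal(np.zeros(n), np.linalg.inv(Q))

# Non-Gaussian observations y_i ~ pi(y_i | eta, theta): a Poisson GLM sketch
A = rng.normal(size=(N, n)) / np.sqrt(n)   # design matrix linking data to latent field
y = rng.poisson(np.exp(A @ eta))           # counts depending on the Gaussian layer
```

Note that N (the number of rows of A) and n (the length of η) can grow independently, which is exactly the big-data versus big-model distinction above.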

How do we control the chaos?

Eight out of ten* dentists recommend using Bayes to perform meaningful inference on your big model!

* indicates prior belief. May conflict with data.

“My n is ∞!” “Well my n is eight times ∞!”

It is not hard to make your model big.
- Each non-linear component in a GAM is infinite dimensional
- Every spatial or spatio-temporal GRF is infinite dimensional
- Random effects and “region” effects add up...

A vital point
Any prior π(η | θ) adds an enormous amount of information.

“The word "optimal" brings a lot of dreary baggage that these authors may be too young to remember and would do well to avoid.”

A toy problem is GP regression

yᵢ ∼ N(x(sᵢ), 1).

- sᵢ, i = 1, …, n, known

- Unknown function x(·) taken a priori to be a Gaussian process (GP), i.e.

  (x(s₁), x(s₂), …, x(sₙ))ᵀ ∼ N(0, Σ)

- Covariance matrix Σᵢⱼ = kθ(sᵢ, sⱼ)

Result
If the GP prior represents a genuine and correct a priori belief about the smoothness of the true function x₀(·), estimators based on the posterior will be asymptotically optimal.
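A minimal numerical sketch of this toy problem. The squared-exponential kernel, its scale, and the true function here are illustrative assumptions, not the talk’s; the posterior mean uses the standard Gaussian conditioning formula:

```python
import numpy as np

rng = np.random.default_rng(42)

# Known design points and noisy observations y_i ~ N(x(s_i), 1)
n = 50
s = np.sort(rng.uniform(0.0, 1.0, n))
x_true = np.sin(2 * np.pi * s)                 # stand-in for the true x_0(.)
y = x_true + rng.normal(0.0, 1.0, n)

# GP prior covariance Sigma_ij = k(s_i, s_j), squared-exponential kernel
def k(a, b, kappa=50.0):
    return np.exp(-kappa * (a[:, None] - b[None, :]) ** 2)

Sigma = k(s, s)

# Posterior mean of (x(s_1), ..., x(s_n)) under unit noise:
# E[x | y] = Sigma (Sigma + I)^{-1} y
post_mean = Sigma @ np.linalg.solve(Sigma + np.eye(n), y)
```

The posterior mean shrinks the raw observations toward a smooth curve, which is the sense in which the prior “adds an enormous amount of information”.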


“except as a source of approximations [...] asymptotics have no practical utility at best and are misleading at worst”

- So to make the asymptotics work as we get more data, we need to specify the GP smoothness correctly

- Awkward!

- A work-around: choose a very smooth GP, e.g.

  k(s, t) = exp(−κ(s − t)²)

  and put a good prior on κ (van der Vaart & van Zanten, 2009)

- Bayes to the rescue!

Smooth operator

There’s a small problem...

- Computing the prior density requires the calculation of xᵀΣ⁻¹x

- Observation: fᵀΣg ≈ ∫ f(s) [∫ K(s, t) g(t) dt] ds

- So Σ “is” the discrete version of the integral operator with kernel K(·, ·).

- So the eigenvalues of Σ are going to be related to the power spectrum of x(·) (i.e. the Fourier transform of K(0, t))

Just Dropped In (To See What Condition My Condition Was In)

Result (Schaback, 1995)

cond₂(Σ) = O(hᵈ exp(kd²/h²))

- h is the minimum separation between two points in the design.
- d is the dimension of s (take it to be 1).

- By the usual interpretation of the condition number, we lose O(h⁻²) digits of accuracy every time we compute xᵀΣ⁻¹x.

- We can think of this as an uncertainty principle: as we get more statistically accurate, we become unable to compute the solution.

- Note: things are better for the posterior covariance matrix under Gaussian observations. But you have to be very, very careful!
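A quick numerical check of this blow-up, for the squared-exponential kernel with d = 1; the domain and equally spaced design are arbitrary illustrative choices:

```python
import numpy as np

# Condition number of Sigma_ij = exp(-(s_i - s_j)^2) as the spacing h shrinks
conds = []
for h in [0.5, 0.25, 0.125, 0.0625]:
    s = np.arange(0.0, 5.0, h)                       # equally spaced design
    Sigma = np.exp(-(s[:, None] - s[None, :]) ** 2)
    conds.append(np.linalg.cond(Sigma))
    # digits of accuracy lost scale like log10(cond), i.e. like h^{-2}
    print(f"h = {h:6.4f}   n = {s.size:3d}   cond_2(Sigma) = {conds[-1]:10.3e}")
```

Even on this small example the matrix is numerically singular long before the design is dense, which is the uncertainty principle above in action.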

I want you, but I don’t need you

So we hit a snag with being optimal... But it’s not entirely catastrophic.

- Use lower smoothness (better conditioning)
- or use a different basis
- or ask a question: what’s breaking my big model?

Where it all went wrong
The problems come in resolving the high-frequency behaviour. Do we care about this?

Just nip the tip

Replace x(·) with xₙ(·) = ∑ᵢ₌₁ⁿ xᵢφᵢ(·), where x ∼ N(0, Σₙ)

- Fourier basis (truncated Karhunen-Loève expansion)
- Wavelets
- Piecewise polynomials.

Piecewise linear approximation of surfaces

NB: The basis functions are only non-zero on a small part of the domain.
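A minimal sketch of such a local basis: equally spaced piecewise-linear “hat” functions. The knot count and the test function are illustrative choices:

```python
import numpy as np

def hat_basis(s, knots):
    # phi_i(s) is 1 at knots[i], decays linearly to 0 at the neighbouring
    # knots, and is exactly zero elsewhere -- local support
    h = knots[1] - knots[0]               # equally spaced knots assumed
    return np.clip(1.0 - np.abs(s[:, None] - knots[None, :]) / h, 0.0, None)

knots = np.linspace(0.0, 1.0, 11)
s = np.linspace(0.0, 1.0, 201)
Phi = hat_basis(s, knots)                 # 201 x 11 basis matrix

# x_n(s) = sum_i x_i phi_i(s); taking x_i = f(knot_i) interpolates f
f = lambda t: np.sin(2.0 * np.pi * t)
x_n = Phi @ f(knots)
max_err = np.max(np.abs(x_n - f(s)))
```

Because each φᵢ overlaps only its neighbours, the resulting matrices are sparse and banded, which is precisely what makes this basis computationally friendly.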

“Which is a kind of integrity, if you look on every exit being an entrance somewhere else.”

Result
The error in posterior functionals induced by using a finite-dimensional basis instead of a GP of optimal smoothness is of the same order as the approximation error.

- Sometimes our big models are bigger than we need

- By using a finite basis, we separate the world into things we care about and things we don’t

- We trade (asymptotic) bias for robustness against mis-specification

- (We still need to watch the conditioning, though)


No Replicates, Mo’ Problems

- Presence-only data occurs frequently in ecology

- Simplest question to ask: how does covariate (xxx) change the local risk of a sighting?

- Basically, is a covariate effect “significant”?

- One big problem: no possibility of replicates.

Protium tenuifolium (4294 trees)

[Figure: map of the 4294 tree locations in the study plot]

A useful example: Log-Gaussian Cox processes

The likelihood in the most boring case is

log π(Y | x(s)) = |Ω| − ∫_Ω Λ(s) ds + ∑_{sᵢ ∈ Y} log Λ(sᵢ),

where Y is the set of observed locations, Λ(s) = exp(x(s)), and x(s) is a Gaussian random field.

This is very different from the Gaussian examples: it requires the field everywhere!

If you liked it then you should’ve put a grid on it

An approximate likelihood

NB: The number of points in a region R is Poisson distributed with mean ∫_R Λ(s) ds.

- Divide the ‘observation window’ into rectangles.

- Let yᵢ be the number of points in rectangle i; then

  yᵢ | xᵢ, θ ∼ Po(e^{xᵢ}),

- The log-risk surface is replaced with

  x | θ ∼ N(µ(θ), Σ(θ)).
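A sketch of this counting approximation on simulated data. The point pattern, grid size, and constant log-risk surface are all hypothetical, and the cell area is written out explicitly here rather than absorbed into xᵢ:

```python
import numpy as np

rng = np.random.default_rng(1)

# A hypothetical observed point pattern Y in the unit square
pts = rng.uniform(0.0, 1.0, size=(200, 2))

# Divide the observation window into an m x m grid and count points per cell
m = 8
counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1],
                              bins=m, range=[[0, 1], [0, 1]])
y = counts.ravel()                         # y_i = number of points in cell i

def grid_loglik(x, y, cell_area):
    # y_i | x_i ~ Po(|cell| * exp(x_i)); the log(y_i!) constant is dropped
    mu = cell_area * np.exp(x)
    return np.sum(y * np.log(mu) - mu)

# A constant log-risk surface; log(200) is the MLE for a homogeneous process
x0 = np.full(m * m, np.log(len(pts)))
ll = grid_loglik(x0, y, 1.0 / m**2)
```

Replacing the integral over Ω with a sum over cells is what turns the “field everywhere” problem into an ordinary Poisson count likelihood.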

Andersonia heterophylla: 55 × 55 grid
(Figure from Sigrunn Holbek Sørbye, University of Tromsø, “Spatial point patterns - simple case studies”)

But does this lead to valid inference?

Yes—we have perturbation bounds.

- Loosely, the error in the likelihood is transferred exactly (order of magnitude) to the Hellinger distance between the true posterior and the computed posterior.

- This is conditional on parameters.

- For the LGCP example, it follows that, for smooth enough fields x(s), the error is O(n⁻¹).

The approximation turns an impossible problem into a difficult, but still useful, problem.

Covariate strength

[Figure: estimated covariate effects for Slope, Al, Cu, Fe, Mn, P, Zn, N, and pH; vertical axis from −0.4 to 0.6]

Covariate strength (with spatial effect)

[Figure: the corresponding covariate effects once a spatial effect is included; same covariates and vertical axis]

Oh dear!

- Adding a spatial random effect, which accounts for “un-modelled covariates”, massively changes the scientific conclusions

- One solution: make the spatial effect orthogonal to the covariates
  - Pro: cannot “steal” significance
  - Cons: interpretability, poor coverage

- This is basically the “maximal covariate effect”

- Without replicates, we cannot calibrate the smoothing parameter to get coverage.

Subjective Bayes to the rescue!

- Key idea: if we can interpret the model, we can talk about the credible intervals as updates of knowledge

- The random field has two parameters: one controlling the range (unimportant) and one controlling the in-cell variance (IMPORTANT!)

- A prior on the variance can be constructed such that

  Pr(std(xᵢ) > U) < α

- Changing U changes the interpretation:
  the effect of Aluminium is significantly negative when U < 1, but the credible interval crosses zero for all U > 1.

- We can relate U to the “degrees of freedom”...
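One concrete way to get such a tail bound, in the spirit of the penalised-complexity priors cited at the end, is an exponential prior on the standard deviation: if σ ∼ Exp(λ) then Pr(σ > U) = exp(−λU), so λ = −log(α)/U achieves Pr(σ > U) = α. As a sketch:

```python
import math

def pc_rate(U, alpha):
    # Rate of an exponential prior on sigma so that Pr(sigma > U) = alpha
    return -math.log(alpha) / U

lam = pc_rate(U=1.0, alpha=0.05)
tail = math.exp(-lam * 1.0)        # Pr(sigma > U) under this prior
```

Doubling U halves λ, making the prior more permissive, which is exactly the sense in which changing U changes the interpretation of the credible intervals.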


Different random effect strengths

Advantages

- Once again, an interpretable prior allows us to control our inference in a sensible way

- We can talk about updating knowledge

- Explicitly conditioning on the prior allows us to communicate modelling assumptions

- Interpretation without appeals to asymptotics (but well behaved if more observations come)

- Prior and interpretation can/should be made independent of the lattice

Disadvantages

Challenges

An additive model for the effect of Age, Blood Pressure and Cholesterol on the probability of having a heart attack.

[Diagram: each covariate effect gᵢ mixes a linear term (β₁ × Age, β₂ × BP, β₃ × CR, weight 1 − φᵢ) with a smooth term (f₁(Age), f₂(BP), f₃(CR), weight φᵢ); the three effects g₁(Age), g₂(BP), g₃(CR) then combine with weights w₁, w₂, w₃ into g(Age, BP, CR)]

How do we build π(w1,w2,w3, φ1, φ2, φ3) to avoid over-fitting?


You gotta get a gimmick

A Savage quotation
You should build your model as big as an elephant.

A von Neumann quote
With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

Placating pugilistic pachyderms
Priors that Penalise Complexity Prevent Poor Performance

- Under everything, this was a talk about setting prior distributions

- This is hard.

- This is even harder for big models

- We must constantly guard against the Ganzfeld effect

- While being flexible enough to find things that are there

- Big models are more than just a computational challenge; they require a great deal of new investment in our modelling infrastructure.

Mayer, Khairy, and Howard, Am. J. Phys. 78, 648 (2010)

References

- D. P. Simpson, H. Rue, T. G. Martins, A. Riebler, and S. H. Sørbye (2014). Penalising model component complexity: A principled, practical approach to constructing priors. arXiv:1403.4630.

- D. P. Simpson, J. Illian, F. K. Lindgren, S. H. Sørbye, and H. Rue (2015). Going off grid: Computationally efficient inference for log-Gaussian Cox processes. Biometrika, forthcoming. (arXiv:1111.0641)

- G.-A. Fuglstad, D. Simpson, F. Lindgren, and H. Rue (2015). Interpretable Priors for Hyperparameters for Gaussian Random Fields. arXiv:15xx:xxx.
