Big model, big data


TRANSCRIPT

Page 1: Big model, big data

To avoid fainting, keep repeating “It’s only a model”...

Daniel Simpson

Department of Mathematical Sciences
University of Bath

Page 2: Big model, big data

Outline

Amuse-bouche

What is this? A model for ants?!

Moisture is the essence of wetness, and wetness is the essence of beauty.

With low power comes great responsibility

Miss Jackson if you’re nasty

The Ganzfeld effect

Page 3: Big model, big data

“We are tied down to a language which makes up in obscurity what it lacks in style.”

Never mind the big data, here come the big models

Page 4: Big model, big data

“Eternity is a terrible thought. I mean, where’s it going to end?”

Folk definition
A model becomes big when the methods I want to use no longer work.

• Solving “big models” requires investment in modelling (priors and likelihoods).

• Solving “big models” requires investment in computation (scalable, tailored solutions).

• Solving “big models” requires compromise (exactness isn’t an option).

Page 5: Big model, big data

“We’re more of the love, blood, and rhetoric school”

Question for the ages
Is my model really “big” if it only has one infinite-dimensional parameter?

• I spend a lot of effort trying to convince myself that my models aren’t too big

• Most of the games that I play are with space, time, or some horrid combination of the two

• I am addicted to GAMs

• I was once at a party where all the cool kids were doing inverse problems. I swear I didn’t inhale.

Page 6: Big model, big data

Outline

Amuse-bouche

What is this? A model for ants?!

Moisture is the essence of wetness, and wetness is the essence of beauty.

With low power comes great responsibility

Miss Jackson if you’re nasty

The Ganzfeld effect

Page 7: Big model, big data

An archipelago off the coast of Finland
(with Haakon Bakka, Håvard Rue, and Jarno Vanhatalo)

• Smelt (other: herring, perch, pikeperch)

• Commercially important fish species

• Survey data on fish larvae abundance

• Complex models, high uncertainty

Page 8: Big model, big data

An archipelago off the coast of Finland

[Figure: map of the survey region; x and y coordinates span roughly 200–1600 by 200–1200]

Page 9: Big model, big data

I’ll tell you what I want (what I really really want)

The questions for this dataset revolve around conservation (e.g. should we protect some regions?).

Statistical questions
• “Interpolation” (where are the smelt?)
• “Extrapolation” (where would we expect the smelt to be in a similar, but different environment?)
• Model the nonlinear effect of environmental variables (scenario forecasts).

Page 10: Big model, big data

Outline

Amuse-bouche

What is this? A model for ants?!

Moisture is the essence of wetness, and wetness is the essence of beauty.

With low power comes great responsibility

Miss Jackson if you’re nasty

The Ganzfeld effect

Page 11: Big model, big data

What’s a homophone between friends?

Antisocial behaviour in Wales. Social behaviour in whales.

Page 12: Big model, big data

It’s kinda like a point pattern...
(Finn Lindgren, Fabian Bachl, Joyce Yuan, David Borchers, Janine Illian)

• You can treat whale pods as a point process

• Partially observed along transects (unknown detection probability!)

• Each observed pod has a noisy measurement of size

Page 13: Big model, big data

Gimme! Gimme! Gimme! (A Man After Midnight)
(I’d already used the Spice Girls one...)

The scientific questions are “How many whales are there?” and “Where are they?”.

Statistical questions:
• How do you estimate the detection probability?
• How do you do inference with a filtered point process?
• How do you deal with error in the mark observation?

Page 14: Big model, big data

Outline

Amuse-bouche

What is this? A model for ants?!

Moisture is the essence of wetness, and wetness is the essence of beauty.

With low power comes great responsibility

Miss Jackson if you’re nasty

The Ganzfeld effect

Page 15: Big model, big data

Like any good academic statistician, I will now present the abstract problem
(And never return to the interesting one)

Observed field: y_i ∼ π(y_i | η, θ)

Latent field: η ∼ N(0, Q^{−1})

Parameters: θ ∼ π(θ)

• {y_i}_{i=1}^N can be non-Gaussian, multivariate, etc.

• θ contains any non-Gaussian parameter

• η ∈ R^n contains everything that is jointly Gaussian.

Big N = “Big Data”. Big n = “Big Model”.
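(A minimal sketch, not from the talk: one forward simulation of this three-level structure, under toy assumptions that θ is a single log-precision, Q = τI, and the observations are Poisson with a log link.)

```python
# A minimal sketch of one draw from the generic latent Gaussian model.
# The specific choices (theta = one log-precision, Q = tau * I, Poisson
# observations) are illustrative assumptions, not the talk's model.
import numpy as np

rng = np.random.default_rng(0)
n = 10  # dimension of the latent Gaussian field eta

# Parameters: theta ~ pi(theta); here a standard normal log-precision.
log_tau = rng.normal()
tau = np.exp(log_tau)

# Latent field: eta ~ N(0, Q^{-1}) with the toy choice Q = tau * I.
eta = rng.normal(0.0, 1.0 / np.sqrt(tau), size=n)

# Observations: y_i ~ pi(y_i | eta, theta); here Poisson with a log link.
y = rng.poisson(np.exp(eta))
print(y)
```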

Page 16: Big model, big data

How do we control the chaos?

Eight out of Ten* dentists recommend using Bayes to perform meaningful inference on your big model!

* indicates prior belief. May conflict with data.

Page 17: Big model, big data

“My n is ∞!” “Well my n is eight times ∞!”

It is not hard to make your model big.
• Each non-linear component in a GAM is infinite dimensional
• Every spatial or spatio-temporal GRF is infinite dimensional
• Random effects and “region” effects add up...

A vital point
Any prior π(η | θ) adds an enormous amount of information.

Page 18: Big model, big data

“The word ‘optimal’ brings a lot of dreary baggage that these authors may be too young to remember and would do well to avoid.”

A toy problem is GP regression:

y_i ∼ N(x(s_i), 1).

• {s_i}_{i=1}^n known

• Unknown function x(·) taken a priori to be a Gaussian process (GP), i.e.

(x(s_1), x(s_2), . . . , x(s_n))^T ∼ N(0, Σ)

• Covariance matrix Σ_ij = k_θ(s_i, s_j)

Result
If the GP prior represents a genuine and correct a priori belief about the smoothness of the true function x_0(·), estimators based on the posterior will be asymptotically optimal.
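(A sketch of this toy problem, my illustration rather than the talk’s code; it assumes the squared-exponential kernel that appears two slides later, with a made-up κ and a made-up true function.)

```python
# GP regression sketch: y_i ~ N(x(s_i), 1). The posterior mean at the
# design points is Sigma (Sigma + I)^{-1} y. Kernel and kappa are assumed.
import numpy as np

def k(s, t, kappa=10.0):
    """Squared-exponential kernel k(s, t) = exp(-kappa (s - t)^2)."""
    return np.exp(-kappa * (s - t) ** 2)

rng = np.random.default_rng(1)
s = np.linspace(0.0, 1.0, 50)          # known design points s_1, ..., s_n
x_true = np.sin(2.0 * np.pi * s)       # stand-in for the true function x_0
y = x_true + rng.normal(size=s.size)   # unit-variance Gaussian noise

Sigma = k(s[:, None], s[None, :])      # Sigma_ij = k(s_i, s_j)
post_mean = Sigma @ np.linalg.solve(Sigma + np.eye(s.size), y)
print(np.round(post_mean[:5], 2))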


Page 20: Big model, big data

“except as a source of approximations [...] asymptotics have no practical utility at best and are misleading at worst”

• So to make the asymptotics work as we get more data, we need to specify the GP smoothness correctly

• Awkward!

• A workaround: choose a very smooth GP, e.g.

k(s, t) = exp(−κ(s − t)^2),

and put a good prior on κ (van der Vaart & van Zanten, 2009)

• Bayes to the rescue!

Page 21: Big model, big data

Smooth operator

There’s a small problem...

• Computing the prior density requires the calculation of x^T Σ^{−1} x

• Observation: f^T Σ g ≈ ∫ f(s) [∫ K(s, t) g(t) dt] ds

• So Σ “is” the discrete version of the integral operator with kernel K(·, ·).

• So the eigenvalues of Σ are going to be related to the power spectrum of x(·) (i.e. the Fourier transform of K(0, t))
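(A quick numerical check of this observation, with an assumed kernel and arbitrary test functions: on a uniform grid with spacing h, h^2 f^T Σ g is a Riemann sum for the double integral, so refining the grid should barely change its value.)

```python
# Check (illustrative kernel and functions) that h^2 * f^T Sigma g is a
# Riemann sum for the double integral of f(s) K(s, t) g(t) over [0, 1]^2:
# coarse and fine grids should give nearly the same number.
import numpy as np

def K(s, t, kappa=10.0):
    return np.exp(-kappa * (s - t) ** 2)   # assumed kernel

def quad_form(n):
    s = np.linspace(0.0, 1.0, n)
    h = s[1] - s[0]
    f, g = np.exp(-s), s ** 2              # arbitrary test functions
    Sigma = K(s[:, None], s[None, :])      # Sigma_ij = K(s_i, s_j)
    return h * h * (f @ Sigma @ g)

print(quad_form(100), quad_form(1000))     # agree to roughly two digits
```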

Page 22: Big model, big data

Just Dropped In (To See What Condition My Condition Was In)

Result (Schaback, 1995)

cond_2(Σ) = O(h^{−d} exp(kd^2/h^2))

• h is the minimum separation between two points in the design.
• d is the dimension of s (take it to be 1)

• By the usual interpretation of the condition number, we lose O(h^{−2}) digits of accuracy every time we compute x^T Σ^{−1} x

• We can think of this as an uncertainty principle. As we get more statistically accurate, we are unable to compute the solution.

• Note: things are better for the posterior covariance matrix under Gaussian observations. But you have to be very, very careful!
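(A sketch of that uncertainty principle, again assuming the squared-exponential kernel: as h shrinks, the condition number of Σ explodes, and once it passes about 1e16 double precision has no accurate digits left.)

```python
# Condition number of the squared-exponential covariance matrix as the
# minimum separation h = 1/(n-1) shrinks. kappa = 10 is an assumption.
import numpy as np

def K(s, t, kappa=10.0):
    return np.exp(-kappa * (s - t) ** 2)

for n in [10, 20, 40, 80]:
    s = np.linspace(0.0, 1.0, n)
    Sigma = K(s[:, None], s[None, :])
    # Beyond cond ~ 1e16 the reported value itself is no longer reliable.
    print(f"n = {n:3d}, h = {1/(n-1):.3f}, cond = {np.linalg.cond(Sigma):.2e}")
```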

Page 23: Big model, big data

I want you, but I don’t need you

So we hit a snag with being optimal... But it’s not entirely catastrophic.

• Use lower smoothness (better conditioning)
• or use a different basis
• or, ask a question: what’s breaking my big model?

Where it all went wrong
The problems come in resolving the high-frequency behaviour. Do we care about this?

Page 24: Big model, big data

Just nip the tip

Replace x(·) with x_n(·) = ∑_{i=1}^n x_i φ_i(·), where x ∼ N(0, Σ_n)

• Fourier basis (truncated Karhunen–Loève expansion)
• Wavelets
• Piecewise polynomials
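(A small sketch of the piecewise-polynomial option, with illustrative simplifications: hat basis functions at a set of knots make evaluating x_n(s) = ∑ x_i φ_i(s) plain linear interpolation, and the weights are taken i.i.d. rather than drawn from a structured Σ_n.)

```python
# Piecewise-linear basis expansion x_n(s) = sum_i x_i phi_i(s).
# With hat functions phi_i centred at the knots, evaluation is linear
# interpolation of the weights. Weights are iid N(0, 1) purely for
# illustration; the slide's x ~ N(0, Sigma_n) would correlate them.
import numpy as np

rng = np.random.default_rng(2)
knots = np.linspace(0.0, 1.0, 8)   # one hat function per knot
x = rng.normal(size=knots.size)    # basis weights

def x_n(s):
    return np.interp(s, knots, x)  # evaluates the hat-function expansion

print(x_n(np.array([0.1, 0.5, 0.9])))
```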

Page 25: Big model, big data

Piecewise linear approximation of surfaces

NB: The basis functions are only non-zero on a small part of the domain.

Page 26: Big model, big data

“Which is a kind of integrity, if you look on every exit being an entrance somewhere else.”

Result
The error in posterior functionals induced by using a finite-dimensional basis instead of a GP of optimal smoothness is of the same order as the approximation error.

• Sometimes our big models are bigger than we need

• By using a finite basis, we separate the world into things we care about and things we don’t

• We trade (asymptotic) bias for robustness against mis-specification

• (We still need to watch the conditioning, though)

Page 27: Big model, big data

Outline

Amuse-bouche

What is this? A model for ants?!

Moisture is the essence of wetness, and wetness is the essence of beauty.

With low power comes great responsibility

Miss Jackson if you’re nasty

The Ganzfeld effect

Page 28: Big model, big data

No Replicates, Mo’ Problems

• Presence-only data occurs frequently in ecology

• Simplest question to ask: how does covariate (xxx) change the local risk of a sighting?

• Basically, is a covariate effect “significant”?

• One big problem: no possibility of replicates.

Page 29: Big model, big data

Protium tenuifolium (4294 trees)

[Figure: map of tree locations in a roughly 1000 × 500 plot]

Page 30: Big model, big data

A useful example: Log-Gaussian Cox processes

The likelihood in the most boring case is

log π(Y | x(s)) = |Ω| − ∫_Ω Λ(s) ds + ∑_{s_i ∈ Y} log Λ(s_i),

where Y is the set of observed locations, Λ(s) = exp(x(s)), and x(s) is a Gaussian random field.

This is very different from the Gaussian examples: it requires the field everywhere!

Page 31: Big model, big data

If you liked it then you should’ve put a grid on it

Page 32: Big model, big data

An approximate likelihood

NB: The number of points in a region R is Poisson distributed with mean ∫_R Λ(s) ds.

• Divide the ‘observation window’ into rectangles.

• Let y_i be the number of points in rectangle i. Then

y_i | x_i, θ ∼ Po(e^{x_i}),

• The log-risk surface is replaced with

x | θ ∼ N(µ(θ), Σ(θ)).

[Figure: Andersonia heterophylla point pattern on a 55 × 55 grid (Sigrunn Holbek Sørbye, University of Tromsø)]
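(A sketch of the counting approximation under toy assumptions of my own: a unit square window, uniformly simulated points, and a known constant log-risk surface; bin the points into rectangles and evaluate the Poisson likelihood cell by cell.)

```python
# Gridded approximate log-likelihood for a point pattern. Everything here
# (unit window, constant latent field, simulated points) is a toy
# assumption; in the real model x comes from N(mu(theta), Sigma(theta)).
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(3)
pts = rng.uniform(size=(500, 2))           # observed locations in [0, 1]^2

nx = ny = 10                               # rectangles in each direction
counts, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=[nx, ny])
counts = counts.astype(int)                # y_i = points in rectangle i

cell_area = (1.0 / nx) * (1.0 / ny)
x = np.full((nx, ny), np.log(500.0))       # constant log-risk, so that
mean = np.exp(x) * cell_area               # each cell mean is 5 points
print(poisson.logpmf(counts, mean).sum())  # approximate log-likelihood
```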

Page 33: Big model, big data

But does this lead to valid inference?

Yes—we have perturbation bounds.

• Loosely, the error in the likelihood is transferred exactly (order of magnitude) to the Hellinger distance between the true posterior and the computed posterior.

• This is conditional on parameters.

• For the LGCP example, it follows that, for smooth enough fields x(s), the error is O(n^{−1}).

The approximation turns an impossible problem into a difficult, but still useful, problem.

Page 34: Big model, big data

Covariate strength

[Figure: posterior covariate effects (roughly −0.4 to 0.6) for Slope, Al, Cu, Fe, Mn, P, Zn, N, pH]

Page 35: Big model, big data

Covariate strength (with spatial effect)

[Figure: posterior effects for the same covariates (Slope, Al, Cu, Fe, Mn, P, Zn, N, pH) once a spatial random effect is included]

Page 36: Big model, big data

Oh dear!

• Adding a spatial random effect, which accounts for “un-modelled covariates”, massively changes the scientific conclusions

• One solution: make the spatial effect orthogonal to the covariates
  • Pro: cannot “steal” significance
  • Cons: interpretability, poor coverage

• This is basically the “maximal covariate effect”

• Without replicates, we cannot calibrate the smoothing parameter to get coverage.

Page 37: Big model, big data

Subjective Bayes to the rescue!

• Key idea: if we can interpret the model, we can talk about the credible intervals as updates of knowledge

• The random field has two parameters: one controlling the range (unimportant) and one controlling the in-cell variance (IMPORTANT!)

• A prior on the variance can be constructed such that

Pr(std(x_i) > U) < α

• Changing U changes the interpretation:

  The effect of Aluminium is significantly negative when U < 1, but the credible interval crosses zero for all U > 1.

• We can relate U to the “degrees of freedom”...
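(The PC-prior construction in the Simpson et al. paper cited at the end puts an exponential prior on the standard deviation, so the statement Pr(std > U) = α pins down its rate; a sketch with made-up U and α, plus a Monte Carlo check.)

```python
# Calibrating an exponential prior on the standard deviation so that
# Pr(sigma > U) = alpha (the slide's inequality, taken with equality).
# U = 1 and alpha = 0.01 are made-up illustrative values.
import numpy as np

def pc_rate(U, alpha):
    """Rate lambda with Pr(sigma > U) = exp(-lambda * U) = alpha."""
    return -np.log(alpha) / U

U, alpha = 1.0, 0.01
lam = pc_rate(U, alpha)

rng = np.random.default_rng(4)
sigma = rng.exponential(scale=1.0 / lam, size=100_000)
print(lam, (sigma > U).mean())   # empirical tail probability ~ alpha
```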

Page 43: Big model, big data

Different random effect strengths

Page 44: Big model, big data

Advantages

• Once again, an interpretable prior allows us to control our inference in a sensible way

• We can talk about updating knowledge

• Explicitly conditioning on the prior allows us to communicate modelling assumptions

• Interpretation without appeals to asymptotics (but well behaved if more observations come)

• Prior and interpretation can/should be made independent of the lattice

Page 45: Big model, big data

Disadvantages

Page 46: Big model, big data

Challenges

An additive model for the effect of Age, Blood Pressure (BP) and Cholesterol (CR) on the probability of having a heart attack.

[Diagram: each covariate effect is a two-component mixture, e.g. g_1(Age) combines the linear term β_1 × Age and the flexible term f_1(Age) with weights 1 − φ_1 and φ_1; likewise g_2(BP) with φ_2 and g_3(CR) with φ_3. The combined effect g(Age, BP, CR) mixes g_1, g_2, g_3 with weights w_1, w_2, w_3.]

How do we build π(w_1, w_2, w_3, φ_1, φ_2, φ_3) to avoid over-fitting?

Page 47: Big model, big data

Outline

Amuse-bouche

What is this? A model for ants?!

Moisture is the essence of wetness, and wetness is the essence of beauty.

With low power comes great responsibility

Miss Jackson if you’re nasty

The Ganzfeld effect

Page 48: Big model, big data

You gotta get a gimmick

A Savage quotation
You should build your model as big as an elephant.

A von Neumann quote
With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.

Page 49: Big model, big data

Placating pugilistic pachyderms
Priors that Penalise Complexity Prevent Poor Performance

• Underneath everything, this was a talk about setting prior distributions

• This is hard.

• This is even harder for big models

• We must constantly guard against the Ganzfeld effect

• While being flexible enough to find things that are there

• Big models are more than just a computational challenge; they require a great deal of new investment in our modelling infrastructure.

Mayer, Khairy, and Howard, Am. J. Phys. 78, 648 (2010)

Page 50: Big model, big data

References

• D. P. Simpson, H. Rue, T. G. Martins, A. Riebler, and S. H. Sørbye (2014). Penalising model component complexity: A principled, practical approach to constructing priors. arXiv:1403.4630.

• D. P. Simpson, J. Illian, F. K. Lindgren, S. H. Sørbye, and H. Rue (2015). Going off grid: Computationally efficient inference for log-Gaussian Cox processes. Biometrika, forthcoming. (arXiv:1111.0641)

• G.-A. Fuglstad, D. Simpson, F. Lindgren, and H. Rue (2015). Interpretable priors for hyperparameters for Gaussian random fields. arXiv:15xx:xxx.