Rimini discussion

The 21st Bayesian Century
“The 21st Century belongs to Bayes”, as argued by a discussion on Bayesian testing and Bayesian model choice
Christian P. Robert
Université Paris Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian
http://xianblog.wordpress.com
July 1, 2009


DESCRIPTION

Temporary and preliminary version of the slides of my talk at the Rimini workshop on Bayesian Econometrics, July 1, 2009

TRANSCRIPT

Page 1: Rimini discussion

The 21st Bayesian Century

“The 21st Century belongs to Bayes”, as argued by a discussion on Bayesian testing and Bayesian model choice

Christian P. Robert

Université Paris Dauphine and CREST-INSEE
http://www.ceremade.dauphine.fr/~xian

http://xianblog.wordpress.com

July 1, 2009

Page 2: Rimini discussion

The 21st Bayesian Century

A consequence of Bayesian statistics being given a proper name is that it encourages too much historical deference from people who think that the bibles of Jeffreys, de Finetti, Jaynes, and others have all the answers.

—Gelman, Bayesian Analysis 3(3), 2008

Page 3: Rimini discussion

The 21st Bayesian Century

Outline

Anyone not shocked by the Bayesian theory of inference has not understood it

Senn, BA, 2008

Introduction

Tests and model choice

Bayesian Calculations

A Defense of the Bayesian Choice

Page 4: Rimini discussion

The 21st Bayesian Century

Introduction

Vocabulary and concepts

Bayesian inference is a coherent mathematical theory but I don’t trust it in scientific applications.

Gelman, BA, 2008

Introduction
 Models
 The Bayesian framework
 Improper prior distributions
 Noninformative prior distributions

Tests and model choice

Bayesian Calculations

A Defense of the Bayesian Choice

Page 5: Rimini discussion

The 21st Bayesian Century

Introduction

Models

Parametric model

Bayesians promote the idea that a multiplicity of parameters can be handled via hierarchical, typically exchangeable, models, but it seems implausible that this could really work automatically [instead of] giving reasonable answers using minimal assumptions.

Gelman, BA, 2008

Observations x1, . . . , xn generated from a probability distribution

fi(xi|θi, x1, . . . , xi−1) = fi(xi|θi, x1:i−1)

x = (x1, . . . , xn) ∼ f(x|θ), θ = (θ1, . . . , θn)

Associated likelihood

ℓ(θ|x) = f(x|θ)

[inverted density & starting point]

Page 6: Rimini discussion

The 21st Bayesian Century

Introduction

Models

And [B] nonparametrics?!
Equally very active and definitely very 21st, thank you, but not mentioned in this talk!

7th Workshop on Bayesian Nonparametrics - Collegio... http://bnpworkshop.carloalberto.org/

21 - 25 June 2009, Moncalieri

The 7th Workshop on Bayesian Nonparametrics will be held at the Collegio Carlo Alberto from June 21 to 25, 2009. The Collegio is a Research Institution housed in a historical building located in Moncalieri on the outskirts of Turin, Italy.

The meeting will feature the latest developments in the area and will cover a wide variety of both theoretical and applied topics, such as: foundations of the Bayesian nonparametric approach, construction and properties of prior distributions, asymptotics, interplay with probability theory and stochastic processes, statistical modelling, computational algorithms, and applications in machine learning, biostatistics, bioinformatics, economics and econometrics.

The Workshop will be structured in 4 tutorials on special topics, a series of invited talks and contributed poster sessions.


Page 7: Rimini discussion

The 21st Bayesian Century

Introduction

The Bayesian framework

Bayes theorem 101

Bayes theorem = inversion of probabilities

If A and E are events such that P(E) ≠ 0, P(A|E) and P(E|A) are related by

P(A|E) = P(E|A)P(A) / [P(E|A)P(A) + P(E|Aᶜ)P(Aᶜ)] = P(E|A)P(A) / P(E)

[Thomas Bayes (?)]
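The inversion formula is easy to check numerically. A minimal sketch (not from the slides), using hypothetical disease-testing numbers chosen only for illustration:

```python
# Bayes theorem: P(A|E) = P(E|A)P(A) / (P(E|A)P(A) + P(E|A^c)P(A^c))
# Hypothetical setting: A = "has condition", E = "test positive".
def posterior(p_A, p_E_given_A, p_E_given_not_A):
    """Invert P(E|A) into P(A|E) via Bayes theorem."""
    num = p_E_given_A * p_A
    denom = num + p_E_given_not_A * (1 - p_A)
    return num / denom

# prior P(A) = 0.01, sensitivity 0.95, false positive rate 0.05
print(posterior(0.01, 0.95, 0.05))  # ≈ 0.161
```

Even with a highly sensitive test, the low prior P(A) keeps the posterior P(A|E) modest: the essence of the inversion.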

Page 8: Rimini discussion

The 21st Bayesian Century

Introduction

The Bayesian framework

Bayesian approach

The impact of treating x as a fixed constant is to increase statistical power as an artefact

Templeton, Molec. Ecol., 2009

New perspective

◮ Uncertainty on the parameters θ of a model is modelled through a probability distribution π on Θ, called the prior distribution

◮ Inference is based on the distribution of θ conditional on x, π(θ|x), called the posterior distribution

π(θ|x) = f(x|θ)π(θ) / ∫ f(x|θ)π(θ) dθ
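A minimal numerical sketch of the posterior formula (not from the slides), assuming a N(θ, 1) likelihood and a N(0, 1) prior so the normalising integral can be approximated on a grid; the conjugate closed form, posterior mean x/2, serves as a check:

```python
import math

def grid_posterior(x, lo=-10.0, hi=10.0, m=20001):
    """Approximate pi(theta|x) on a grid: f(x|theta)*pi(theta), renormalised."""
    step = (hi - lo) / (m - 1)
    thetas = [lo + i * step for i in range(m)]
    # unnormalised posterior: N(x; theta, 1) * N(theta; 0, 1)
    w = [math.exp(-0.5 * (x - t) ** 2) * math.exp(-0.5 * t * t) for t in thetas]
    z = sum(w) * step  # Riemann approximation of the normalising integral
    return thetas, [wi / z for wi in w]

thetas, post = grid_posterior(1.0)
step = thetas[1] - thetas[0]
mean = sum(t * p for t, p in zip(thetas, post)) * step
print(mean)  # close to the conjugate posterior mean x/2 = 0.5
```

The denominator never needs a closed form: any quadrature (or simulation, as later slides show) that normalises f(x|θ)π(θ) yields the posterior.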

Page 9: Rimini discussion

The 21st Bayesian Century

Introduction

The Bayesian framework

[Nonphilosophical] justifications

Ignoring the sampling error of x undermines the statistical validity of all inferences made by the method

Templeton, Molec. Ecol., 2009

◮ Semantic drift from unknown to random

◮ Actualization of the information on θ by extracting the information on θ contained in the observation x

◮ Allows incorporation of imperfect information in the decision process

◮ Unique mathematical way to condition upon the observations (conditional perspective)

◮ Unique way to give meaning to statements like P(θ > 0)

Page 10: Rimini discussion

The 21st Bayesian Century

Introduction

The Bayesian framework

Posterior distribution

Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience

Gelman, BA, 2008

π(θ|x) central to Bayesian inference

◮ Operates conditional upon the observations

◮ Incorporates the requirement of the Likelihood Principle

◮ Avoids averaging over the unobserved values of x

◮ Coherent updating of the information available on θ

◮ Provides a complete inferential machinery

Page 11: Rimini discussion

The 21st Bayesian Century

Introduction

Improper prior distributions

Improper distributions

If we take P(dσ) ∝ dσ as a statement that σ may have any value between 0 and ∞ (...), we must use ∞ instead of 1 to denote certainty.

Jeffreys, ToP, 1939

Necessary extension from a prior distribution to a prior σ-finite measure π such that

∫Θ π(θ) dθ = +∞

Improper prior distribution
[Weird? Inappropriate?? report!!]

Page 12: Rimini discussion

The 21st Bayesian Century

Introduction

Improper prior distributions

Justifications

If the parameter may have any value from −∞ to +∞, its prior probability should be taken as uniformly distributed

Jeffreys, ToP, 1939

Automated prior determination often leads to improper priors

1. Similar performances of estimators derived from these generalized distributions

2. Improper priors as limits of proper distributions in many [mathematical] senses

Page 13: Rimini discussion

The 21st Bayesian Century

Introduction

Improper prior distributions

More justifications

There is no good objective principle for choosing a noninformative prior(even if that concept were mathematically defined, which it is not)

Gelman, BA, 2008

4. Robust answer against possible misspecifications of the prior

5. Frequentist justifications, such as:

(i) minimaxity
(ii) admissibility
(iii) invariance (Haar measure)

6. Improper priors [much] preferred to vague proper priors like N(0, 10⁶)

Page 14: Rimini discussion

The 21st Bayesian Century

Introduction

Improper prior distributions

Validation

The mistake is to think of them as representing ignorance
Lindley, JASA, 1990

Extension of the posterior distribution π(θ|x) associated with an improper prior π, as given by Bayes’s formula

π(θ|x) = f(x|θ)π(θ) / ∫Θ f(x|θ)π(θ) dθ,

when

∫Θ f(x|θ)π(θ) dθ < ∞

Delete all emotional names

Page 15: Rimini discussion

The 21st Bayesian Century

Introduction

Noninformative prior distributions

Noninformative priors

...cannot be expected to represent exactly total ignorance about the problem, but should rather be taken as reference priors, upon which everyone could fall back when the prior information is missing.

Kass and Wasserman, JASA, 1996

What if all we know is that we know “nothing”?!
In the absence of prior information, prior distributions are solely derived from the sample distribution f(x|θ).
Difficulty with uniform priors, which lack invariance properties.

Page 16: Rimini discussion

The 21st Bayesian Century

Introduction

Noninformative prior distributions

Jeffreys’ prior

If we took the prior density for the parameters to be proportional to |I(θ)|^(1/2), it could be stated for any law that is differentiable with respect to all parameters that the total probability in any region of the θi would be equal to the total probability in the corresponding region of the θi′

Jeffreys, ToP, 1939

Based on the Fisher information

I(θ) = Eθ[ (∂ℓ/∂θᵀ)(∂ℓ/∂θ) ]

Jeffreys’ prior distribution is

π*(θ) ∝ |I(θ)|^(1/2)

Page 17: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Tests and model choice

The Jeffreys-subjective synthesis betrays a much more dangerous confusion than the Neyman-Pearson-Fisher synthesis as regards hypothesis tests

Senn, BA, 2008

Introduction

Tests and model choice
 Bayesian tests
 Bayes factors
 Opposition to classical tests
 Model choice
 Compatible priors
 Variable selection

Bayesian Calculations

Page 18: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Bayesian tests

Construction of Bayes tests

What is almost never used, however, is the Jeffreys significance test.
Senn, BA, 2008

Definition (Test)

Given a hypothesis H0 : θ ∈ Θ0 on the parameter θ ∈ Θ of a statistical model, a test is a statistical procedure that takes its values in {0, 1}.

Example (Normal mean)

For x ∼ N (θ, 1), decide whether or not θ ≤ 0.

Page 19: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Bayesian tests

Decision-theoretic perspective

Loss functions [are] not relevant to statistical inference
Gelman, BA, 2008

Theorem (Optimal Bayes decision)

Under the 0–1 loss function

L(θ, d) =
  0   if d = IΘ0(θ)
  a0  if d = 1 and θ ∉ Θ0
  a1  if d = 0 and θ ∈ Θ0

the Bayes procedure is

δπ(x) =
  1 if Prπ(θ ∈ Θ0|x) ≥ a0/(a0 + a1)
  0 otherwise
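The optimal rule is a one-line threshold once the posterior probability of H0 is available. A hedged sketch (not from the slides), with made-up losses a0 and a1:

```python
def bayes_test(post_prob_H0, a0=1.0, a1=1.0):
    """Bayes procedure under 0-1 loss: accept H0 (return 1) iff
    Pr(theta in Theta0 | x) >= a0 / (a0 + a1)."""
    return 1 if post_prob_H0 >= a0 / (a0 + a1) else 0

# symmetric losses: accept H0 iff its posterior probability is at least 1/2
print(bayes_test(0.6))  # 1
print(bayes_test(0.3))  # 0
# a heavier penalty a1 on wrongly rejecting H0 lowers the acceptance threshold
print(bayes_test(0.3, a0=1.0, a1=9.0))  # 1  (threshold a0/(a0+a1) = 0.1)
```

The loss ratio a0/(a0 + a1) is the whole decision-theoretic content of the test: changing the relative cost of the two errors moves the threshold, nothing else.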

Page 20: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Bayes factors

A function of posterior probabilities

The method posits two or more alternative hypotheses and tests their relative fits to some observed statistics

Templeton, Mol. Ecol., 2009

Definition (Bayes factors)

For hypotheses H0 : θ ∈ Θ0 vs. Ha : θ ∉ Θ0,

B01 = [π(Θ0|x)/π(Θ0ᶜ|x)] / [π(Θ0)/π(Θ0ᶜ)] = ∫Θ0 f(x|θ)π0(θ) dθ / ∫Θ0ᶜ f(x|θ)π1(θ) dθ

[Good, 1958 & Jeffreys, 1961]


Page 21: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Bayes factors

Self-contained concept

Having a high relative probability does not mean that a hypothesis is true or supported by the data

Templeton, Mol. Ecol., 2009

Non-decision-theoretic:

◮ eliminates choice of π(Θ0)

◮ Bayesian/marginal equivalent to the likelihood ratio

◮ Jeffreys’ scale of evidence:
  ◮ if log10(Bπ10) is between 0 and 0.5, evidence against H0 weak,
  ◮ if log10(Bπ10) is between 0.5 and 1, evidence substantial,
  ◮ if log10(Bπ10) is between 1 and 2, evidence strong, and
  ◮ if log10(Bπ10) is above 2, evidence decisive
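The scale is just a step function of log10 B10; a minimal sketch (labels as on the slide, function name hypothetical):

```python
def jeffreys_evidence(log10_B10):
    """Map log10 of the Bayes factor B10 onto Jeffreys' scale of evidence
    against H0."""
    if log10_B10 < 0:
        return "supports H0"      # B10 < 1: data favour the null
    if log10_B10 <= 0.5:
        return "weak"
    if log10_B10 <= 1:
        return "substantial"
    if log10_B10 <= 2:
        return "strong"
    return "decisive"

print(jeffreys_evidence(0.3))   # weak
print(jeffreys_evidence(1.5))   # strong
print(jeffreys_evidence(2.4))   # decisive
```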

Page 22: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Bayes factors

A major modification

Considering whether a location parameter α is 0. The prior is uniform and we should have to take f(α) = 0 and B10 would always be infinite

Jeffreys, ToP, 1939

When the null hypothesis is supported by a set of measure 0, π(Θ0) = 0 and thus π(Θ0|x) = 0.

[End of the story?!]

Page 23: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Bayes factors

Changing the prior to fit the hypotheses

Requirement

Defined prior distributions under both assumptions,

π0(θ) ∝ π(θ)IΘ0(θ),  π1(θ) ∝ π(θ)IΘ1(θ)

(under the standard dominating measures on Θ0 and Θ1)

Using the prior probabilities π(Θ0) = ρ0 and π(Θ1) = 1 − ρ0,

π(θ) = ρ0 π0(θ) + (1 − ρ0) π1(θ)

Page 24: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Bayes factors

Point null hypotheses

I have no patience for statistical methods that assign positive probability to point hypotheses of the θ = 0 type that can never actually be true

Gelman, BA, 2008

Take ρ0 = Prπ(θ = θ0) and let g1 be the prior density under Ha. Then

π(Θ0|x) = f(x|θ0)ρ0 / ∫ f(x|θ)π(θ) dθ = f(x|θ0)ρ0 / [f(x|θ0)ρ0 + (1 − ρ0)m1(x)]

and the Bayes factor is

Bπ01(x) = [f(x|θ0)ρ0 / m1(x)(1 − ρ0)] / [ρ0/(1 − ρ0)] = f(x|θ0) / m1(x)

Page 25: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Bayes factors

Point null hypotheses (cont’d)

Example (Normal mean)

Test of H0 : θ = 0 when x ∼ N(θ, 1): we take π1 as N(0, τ²), so that

m1(x)/f(x|0) = √(σ²/(σ² + τ²)) exp{τ²x²/(2σ²(σ² + τ²))}

and the posterior probability is

τ\x    0      0.68   1.28   1.96
1      0.586  0.557  0.484  0.351
10     0.768  0.729  0.612  0.366

Page 26: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Opposition to classical tests

Comparison with classical tests

The 95 percent frequentist intervals will live up to their advertised coverage claims

Wasserman, BA, 2008

Standard answer

Definition (p-value)

The p-value p(x) associated with a test is the largest significance level for which H0 is rejected

Page 27: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Opposition to classical tests

Problems with p-values

The use of P implies that a hypothesis that may be true may be rejected because it had not predicted observable results that have not occurred

Jeffreys, ToP, 1939

◮ Evaluation of the wrong quantity, namely the probability of exceeding the observed quantity (wrong conditioning)

◮ Evaluation only under the null hypothesis

◮ Huge numerical difference with the Bayesian range of answers

Page 28: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Opposition to classical tests

Bayesian lower bounds

If the Bayes estimator has good frequency behavior then we might as well use the frequentist method. If it has bad frequency behavior then we shouldn’t use it.

Wasserman, BA, 2008

The least favourable Bayesian answer is

B(x, GA) = inf_{g∈GA} f(x|θ0) / ∫Θ f(x|θ)g(θ) dθ,

i.e., if there exists an mle θ̂(x) for θ,

B(x, GA) = f(x|θ0) / f(x|θ̂(x))

Page 29: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Opposition to classical tests

Illustration

Example (Normal case)

When x ∼ N(θ, 1) and H0 : θ = 0, the lower bounds are

B(x, GA) = e^(−x²/2)  and  P(x, GA) = (1 + e^(x²/2))⁻¹,

i.e.

p-value  0.10   0.05   0.01   0.001
P        0.205  0.128  0.035  0.004
B        0.256  0.146  0.036  0.004

[Quite different!]
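These bounds follow directly from the two closed forms above, evaluated at the two-sided normal quantiles matching each p-value; a sketch (quantiles hard-coded, so the last digit can differ from the table by rounding):

```python
import math

def lower_bounds(x):
    """Least favourable bounds for H0: theta = 0 when x ~ N(theta, 1)."""
    B = math.exp(-x * x / 2)            # bound on the Bayes factor
    P = 1 / (1 + math.exp(x * x / 2))   # bound on the posterior probability
    return B, P

# two-sided normal quantiles corresponding to the p-values in the table
for p, x in [(0.10, 1.6449), (0.05, 1.9600), (0.01, 2.5758), (0.001, 3.2905)]:
    B, P = lower_bounds(x)
    print(p, round(P, 3), round(B, 3))
```

Even the most H0-hostile prior in GA leaves far more posterior support for the null than the p-value suggests, which is the point of the comparison.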

Page 30: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Model choice

Model choice and model comparison

There is no null hypothesis, which complicates the computation of sampling error

Templeton, Mol. Ecol., 2009

Choice among models
Several models are available for the same observation(s)

Mi : x ∼ fi(x|θi), i ∈ I

where I can be finite or infinite

Page 31: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Model choice

Bayesian resolution

The posterior probabilities are constructed by using a numerator that is a function of the observation for a particular model, then divided by a denominator that ensures that the “probabilities” sum to one

Templeton, Mol. Ecol., 2009

Probabilise the entire model/parameter space

◮ allocate probabilities pi to all models Mi

◮ define priors πi(θi) for each parameter space Θi

◮ compute

π(Mi|x) = pi ∫Θi fi(x|θi)πi(θi) dθi / Σj pj ∫Θj fj(x|θj)πj(θj) dθj

Page 32: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Model choice

Bayesian resolution(2)

The numerators are not co-measurable across hypotheses, and the denominators are sums of non-co-measurable entities. This means that it is mathematically impossible for them to be probabilities.

Templeton, Mol. Ecol., 2009

◮ take the largest π(Mi|x) to determine the “best” model, or use the averaged predictive

Σj π(Mj|x) ∫Θj fj(x′|θj)πj(θj|x) dθj

Page 33: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Model choice

Natural Ockham’s razor

Pluralitas non est ponenda sine neccesitate

Variation is random until the contrary is shown; and new parameters in laws, when they are suggested, must be tested one at a time, unless there is specific reason to the contrary.

Jeffreys, ToP, 1939

The Bayesian approach naturally weights differently models with different parameter dimensions (BIC).

Page 34: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Compatible priors

Compatibility principle

Further complicating dimensionality of test statistics is the fact that the models are often not nested, and one model may contain parameters that do not have analogues in the other models and vice versa

Templeton, Mol. Ecol., 2009

Difficulty of finding priors simultaneously on a collection of models.
Easier to start from a single prior on a “big” [encompassing] model and to derive the others from a coherence principle.

[Dawid & Lauritzen, 2000]

Page 35: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Compatible priors

An illustration for linear regression

In the case where M1 and M2 are two nested Gaussian linear regression models with Zellner’s g-priors and the same variance σ² ∼ π(σ²):

◮ M1 : y|β1, σ² ∼ N(X1β1, σ²In) with β1|σ² ∼ N(s1, σ²n1(X1ᵀX1)⁻¹), where X1 is an (n × k1) matrix of rank k1 ≤ n

◮ M2 : y|β2, σ² ∼ N(X2β2, σ²In) with β2|σ² ∼ N(s2, σ²n2(X2ᵀX2)⁻¹), where X2 is an (n × k2) matrix with span(X2) ⊆ span(X1)

[ c©Marin & Robert, Bayesian Core]

Page 36: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Compatible priors

Compatible g-priors

I don’t see any role for squared error loss, minimax, or the rest of what is sometimes called statistical decision theory

Gelman, BA, 2008

Since σ² is a nuisance parameter, minimize the Kullback-Leibler divergence between both marginal distributions conditional on σ², m1(y|σ²; s1, n1) and m2(y|σ²; s2, n2), with solution

β2|X2, σ² ∼ N(s2*, σ²n2*(X2ᵀX2)⁻¹)

with

s2* = (X2ᵀX2)⁻¹X2ᵀX1s1,  n2* = n1

Page 37: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Variable selection

Variable selection

Regression setup where y is regressed on a set {x1, . . . , xp} of p potential explanatory regressors (plus intercept)

Corresponding 2^p submodels Mγ, where γ ∈ Γ = {0, 1}^p indicates inclusion/exclusion of variables by a binary representation, e.g. γ = 101001011 means that x1, x3, x6, x8 and x9 are included.
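The binary representation is convenient to manipulate programmatically; a minimal sketch (function name hypothetical), decoding a γ string into the included regressor indices:

```python
def included_variables(gamma):
    """Decode a 0/1 string gamma into 1-based indices of included regressors."""
    return [i + 1 for i, bit in enumerate(gamma) if bit == "1"]

gamma = "101001011"
print(included_variables(gamma))  # [1, 3, 6, 8, 9]
print(2 ** len(gamma))            # 512 submodels when p = 9
```

The exponential growth of Γ is what makes exhaustive posterior computation over all submodels feasible only for moderate p.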

Page 38: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Variable selection

Notations

For model Mγ ,

◮ qγ variables included

◮ t1(γ) = {t1,1(γ), . . . , t1,qγ(γ)} indices of the variables included, and t0(γ) indices of the variables not included

◮ For β ∈ R^(p+1),

βt1(γ) = [β0, βt1,1(γ), . . . , βt1,qγ(γ)]
Xt1(γ) = [1n | xt1,1(γ) | . . . | xt1,qγ(γ)]

Submodel Mγ is thus

y|β, γ, σ² ∼ N(Xt1(γ)βt1(γ), σ²In)

Page 39: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Variable selection

Global and compatible priors

Use Zellner’s g-prior, i.e. a normal prior for β conditional on σ²,

β|σ² ∼ N(β̃, cσ²(XᵀX)⁻¹)

and a Jeffreys prior for σ²,

π(σ²) ∝ σ⁻²

Resulting compatible prior

βt1(γ) ∼ N( (Xt1(γ)ᵀXt1(γ))⁻¹Xt1(γ)ᵀXβ̃, cσ²(Xt1(γ)ᵀXt1(γ))⁻¹ )

Page 40: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Variable selection

Posterior model probability

Can be obtained in closed form:

π(γ|y) ∝ (c + 1)^(−(qγ+1)/2) [ yᵀy − c yᵀP1y/(c + 1) + β̃ᵀXᵀP1Xβ̃/(c + 1) − 2yᵀP1Xβ̃/(c + 1) ]^(−n/2)

Conditionally on γ, the posterior distributions of β and σ² are

βt1(γ)|σ², y, γ ∼ N[ c/(c + 1) (U1y + U1Xβ̃/c), σ²c/(c + 1) (Xt1(γ)ᵀXt1(γ))⁻¹ ],

σ²|y, γ ∼ IG[ n/2, yᵀy/2 − c yᵀP1y/2(c + 1) + β̃ᵀXᵀP1Xβ̃/2(c + 1) − yᵀP1Xβ̃/(c + 1) ].

Page 41: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Variable selection

Noninformative case

Use the same compatible informative g-prior distribution with β̃ = 0p+1 and a hierarchical diffuse prior distribution on c,

π(c) ∝ c⁻¹ IN*(c)  or  π(c) ∝ c⁻¹ I(c > 0)

The choice of this hierarchical diffuse prior distribution on c is due to the sensitivity of the model posterior to large values of c: taking β̃ = 0p+1 and c large does not work.

Page 42: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Variable selection

Processionary caterpillar

Influence of some forest settlement characteristics on the development of caterpillar colonies

Response y: log-transform of the average number of nests of caterpillars per tree on an area of 500 square meters (n = 33 areas)

[ c©Marin & Robert, Bayesian Core]

Page 43: Rimini discussion

[Figure: panels of plots indexed by x1, . . . , x9]

The 21st Bayesian Century

Tests and model choice

Variable selection

Processionary caterpillar (cont’d)

Potential explanatory variables

x1 altitude (in meters), x2 slope (in degrees),

x3 number of pines in the square,

x4 height (in meters) of the tree at the center of the square,

x5 diameter of the tree at the center of the square,

x6 index of the settlement density,

x7 orientation of the square (from 1 if southbound to 2 otherwise),

x8 height (in meters) of the dominant tree,

x9 number of vegetation strata,

x10 mix settlement index (from 1 if not mixed to 2 if mixed).

Page 44: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Variable selection

Bayesian regression output

              Estimate   BF       log10(BF)
(Intercept)    9.2714   26.334     1.4205 (***)
X1            -0.0037    7.0839    0.8502 (**)
X2            -0.0454    3.6850    0.5664 (**)
X3             0.0573    0.4356   -0.3609
X4            -1.0905    2.8314    0.4520 (*)
X5             0.1953    2.5157    0.4007 (*)
X6            -0.3008    0.3621   -0.4412
X7            -0.2002    0.3627   -0.4404
X8             0.1526    0.4589   -0.3383
X9            -1.0835    0.9069   -0.0424
X10           -0.3651    0.4132   -0.3838

evidence against H0: (****) decisive, (***) strong, (**) substantial, (*) poor

Page 45: Rimini discussion

The 21st Bayesian Century

Tests and model choice

Variable selection

Bayesian variable selection

t1(γ)              π(γ|y, X)
0,1,2,4,5          0.0929
0,1,2,4,5,9        0.0325
0,1,2,4,5,10       0.0295
0,1,2,4,5,7        0.0231
0,1,2,4,5,8        0.0228
0,1,2,4,5,6        0.0228
0,1,2,3,4,5        0.0224
0,1,2,3,4,5,9      0.0167
0,1,2,4,5,6,9      0.0167
0,1,2,4,5,8,9      0.0137


Page 46: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Bayesian Calculations

Bayesian methods seem to quickly move to elaborate computation
Gelman, BA, 2008

Introduction

Tests and model choice

Bayesian Calculations
 Implementation difficulties
 Bayes factor approximation
 ABC model choice

A Defense of the Bayesian Choice

Page 47: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Implementation difficulties

B Implementation difficulties

◮ Computing the posterior distribution

π(θ|x) ∝ π(θ)f(x|θ)

◮ Resolution of

arg min_δ ∫Θ L(θ, δ) π(θ)f(x|θ) dθ

◮ Maximisation of the marginal posterior

arg max_{θ1} ∫ π(θ|x) dθ−1

Page 48: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Implementation difficulties

B Implementation further difficulties

A statistical test returns a probability value, but rarely is the probability value per se the reason for an investigator performing the test

Templeton, Mol. Ecol., 2009

◮ Computing posterior quantities

δπ(x) = ∫Θ h(θ) π(θ|x) dθ = ∫Θ h(θ) π(θ)f(x|θ) dθ / ∫Θ π(θ)f(x|θ) dθ

◮ Resolution (in k) of

P(π(θ|x) ≥ k | x) = α

Page 49: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Implementation difficulties

Monte Carlo methods

Bayesian simulation seems stuck in an infinite regress of inferential uncertainty

Gelman, BA, 2008

Approximation of

I = ∫Θ g(θ) f(x|θ) π(θ) dθ

takes advantage of the fact that f(x|θ)π(θ) is proportional to a density: if the θi’s are simulated from π(θ),

(1/m) Σ_{i=1}^m g(θi) f(x|θi)

converges (almost surely) to I
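A minimal sketch of this estimator (not from the slides), assuming the conjugate toy model x ∼ N(θ, 1), θ ∼ N(0, 1) and g(θ) = θ, so that I = m(x)·E[θ|x] is available in closed form as a check:

```python
import math, random

random.seed(42)
x, m = 1.0, 200_000
f = lambda t: math.exp(-0.5 * (x - t) ** 2) / math.sqrt(2 * math.pi)  # f(x|theta)

# theta_i ~ pi = N(0, 1); plain Monte Carlo estimate of I with g(theta) = theta
draws = [random.gauss(0.0, 1.0) for _ in range(m)]
I_hat = sum(t * f(t) for t in draws) / m

# closed form here: m(x) * E[theta|x] with m(x) = N(x; 0, 2) and E[theta|x] = x/2
I_true = math.exp(-x * x / 4) / math.sqrt(4 * math.pi) * (x / 2)
print(I_hat, I_true)
```

The estimator needs only simulations from the prior; its accuracy degrades when f(x|θ) is concentrated far from where π puts its mass, which motivates the importance-function variant on the next slide.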

Page 50: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Implementation difficulties

Importance function

A simulation method of inference hides unrealistic assumptions
Templeton, Mol. Ecol., 2009

No need to simulate from π(·|x) or from π: if h is a probability density,

∫Θ g(θ) f(x|θ) π(θ) dθ = ∫ [g(θ) f(x|θ) π(θ)/h(θ)] h(θ) dθ

and, for θ1, . . . , θm simulated from h,

Σ_{i=1}^m g(θi) ω(θi) / Σ_{i=1}^m ω(θi),  with ω(θi) = f(x|θi) π(θi) / h(θi),

approximates Eπ[g(θ)|x]
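A self-normalised importance sampling sketch for the same toy model as before (an assumption of this example, not from the slides): posterior mean of θ under x ∼ N(θ, 1), θ ∼ N(0, 1), with an importance function h deliberately wider than the posterior:

```python
import math, random

random.seed(0)
x, m = 1.0, 200_000
prior = lambda t: math.exp(-0.5 * t * t) / math.sqrt(2 * math.pi)       # N(0,1)
lik   = lambda t: math.exp(-0.5 * (x - t) ** 2) / math.sqrt(2 * math.pi)

# importance function h = N(0, 2^2), overdispersed relative to the posterior
draws = [random.gauss(0.0, 2.0) for _ in range(m)]
h = lambda t: math.exp(-t * t / 8) / math.sqrt(8 * math.pi)
w = [lik(t) * prior(t) / h(t) for t in draws]                     # omega(theta_i)
post_mean = sum(t * wi for t, wi in zip(draws, w)) / sum(w)       # g(theta) = theta
print(post_mean)  # close to the conjugate posterior mean x/2 = 0.5
```

Because the weights are renormalised, the normalising constant of the posterior cancels: only the product f(x|θ)π(θ) is ever evaluated.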

Page 51: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Bayes factor approximation

Bayes factor approximation

When approximating the Bayes factor

B12 = ∫Θ1 f1(x|θ1)π1(θ1) dθ1 / ∫Θ2 f2(x|θ2)π2(θ2) dθ2 = Z1/Z2

use importance functions ϖ1 and ϖ2 and

B̂12 = [ n1⁻¹ Σ_{i=1}^{n1} f1(x|θ1^i)π1(θ1^i)/ϖ1(θ1^i) ] / [ n2⁻¹ Σ_{i=1}^{n2} f2(x|θ2^i)π2(θ2^i)/ϖ2(θ2^i) ],  θj^i ∼ ϖj(θ)

[Chopin & Robert, 2007]

Page 52: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Bayes factor approximation

Bridge sampling

Special case: if the unnormalised posteriors

π̃1(θ1|x) ∝ π1(θ1|x),  π̃2(θ2|x) ∝ π2(θ2|x)

live on the same space (Θ1 = Θ2), then

B12 ≈ (1/n) Σ_{i=1}^n π̃1(θi|x)/π̃2(θi|x),  θi ∼ π2(θ|x)

[Gelman & Meng, 1998; Chen, Shao & Ibrahim, 2000]

Page 53: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Bayes factor approximation

(Further) bridge sampling

In addition,

B12 = ∫ π̃1(θ|x) α(θ) π2(θ|x) dθ / ∫ π̃2(θ|x) α(θ) π1(θ|x) dθ  for every function α(·),

so that

B̂12 = [ (1/n2) Σ_{i=1}^{n2} π̃1(θ2i|x) α(θ2i) ] / [ (1/n1) Σ_{i=1}^{n1} π̃2(θ1i|x) α(θ1i) ],  θji ∼ πj(θ|x)

Page 54: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Bayes factor approximation

Optimal bridge sampling

The optimal choice of auxiliary function is

α⋆(θ) = (n1 + n2) / (n1 π1(θ|x) + n2 π2(θ|x))

leading to

B̂12 = [ (1/n2) Σ_{i=1}^{n2} π̃1(θ2i|x)/(n1 π1(θ2i|x) + n2 π2(θ2i|x)) ] / [ (1/n1) Σ_{i=1}^{n1} π̃2(θ1i|x)/(n1 π1(θ1i|x) + n2 π2(θ1i|x)) ]

Back later!
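A sketch of the bridge estimator on a toy pair of unnormalised densities whose constants Z1, Z2 are in fact known, so the estimate can be checked; the normalised densities used inside α⋆ are available here as a toy luxury (in practice α⋆ is computed iteratively from the current estimate of B12):

```python
import math, random

random.seed(1)
n1 = n2 = 50_000
# unnormalised densities whose normalising constants we pretend not to know
q1 = lambda t: math.exp(-t * t / 2)            # Z1 = sqrt(2*pi), i.e. N(0, 1)
q2 = lambda t: math.exp(-(t - 1) ** 2 / 4)     # Z2 = sqrt(4*pi), i.e. N(1, 2)
p1 = lambda t: q1(t) / math.sqrt(2 * math.pi)  # normalised versions (toy luxury)
p2 = lambda t: q2(t) / math.sqrt(4 * math.pi)
th1 = [random.gauss(0.0, 1.0) for _ in range(n1)]            # draws from pi1
th2 = [random.gauss(1.0, math.sqrt(2.0)) for _ in range(n2)] # draws from pi2

alpha = lambda t: (n1 + n2) / (n1 * p1(t) + n2 * p2(t))      # optimal bridge
num = sum(q1(t) * alpha(t) for t in th2) / n2
den = sum(q2(t) * alpha(t) for t in th1) / n1
print(num / den)  # estimates Z1/Z2 = 1/sqrt(2) ≈ 0.707
```

The bridge function α⋆ downweights points where either density is negligible, which is why this estimator tolerates limited overlap between π1 and π2 far better than naive importance sampling.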

Page 55: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Bayes factor approximation

Approximating Zk from a posterior sample

Use of the [harmonic mean] identity

Eπk[ ϕ(θk) / πk(θk)Lk(θk) | x ] = ∫ [ϕ(θk)/πk(θk)Lk(θk)] [πk(θk)Lk(θk)/Zk] dθk = 1/Zk,

no matter what the proposal ϕ(·) is.

[Gelfand & Dey, 1994; Bartolucci et al., 2006]

Direct exploitation of the MCMC output

Page 56: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Bayes factor approximation

Comparison with regular importance sampling

Harmonic mean: a constraint opposed to the usual importance sampling constraints: ϕ(θ) must have lighter (rather than fatter) tails than πk(θk)Lk(θk) for the approximation

Ẑ1k = 1 / [ (1/T) Σ_{t=1}^T ϕ(θk^(t)) / πk(θk^(t))Lk(θk^(t)) ]

to have a finite variance.
E.g., use finite-support kernels (like Epanechnikov’s kernel) for ϕ
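A sketch of the Gelfand-Dey estimator on the conjugate toy model x ∼ N(θ, 1), θ ∼ N(0, 1) (an assumption of this example): the posterior is N(x/2, 1/2), used here as a stand-in for MCMC output, and ϕ is a normal with smaller variance than the posterior, hence lighter tails:

```python
import math, random

random.seed(3)
x, T = 1.0, 200_000
norm = lambda t, mu, s2: math.exp(-(t - mu) ** 2 / (2 * s2)) / math.sqrt(2 * math.pi * s2)
target = lambda t: norm(t, 0, 1) * norm(x, t, 1)  # pi_k(theta) * L_k(theta)

# stand-in for MCMC output: the posterior is N(x/2, 1/2) in this conjugate toy
draws = [random.gauss(x / 2, math.sqrt(0.5)) for _ in range(T)]
# phi with *lighter* tails than the posterior: N(x/2, 0.25)
phi = lambda t: norm(t, x / 2, 0.25)
Z_hat = 1.0 / (sum(phi(t) / target(t) for t in draws) / T)

Z_true = norm(x, 0, 2)  # marginal m(x) available in closed form here
print(Z_hat, Z_true)
```

With a ϕ whose tails exceed those of πkLk (e.g. the raw harmonic mean, ϕ = πk), the same formula still converges but with possibly infinite variance, which is the well-known failure mode this constraint avoids.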

Page 57: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Bayes factor approximation

Approximating Z using a mixture representation

Bridge sampling redux

Design a specific mixture for simulation [importance sampling] purposes, with density

ϕ̃k(θk) ∝ ω1 πk(θk)Lk(θk) + ϕ(θk),

where ϕ(·) is arbitrary (but normalised)
Note: ω1 is not a probability weight

Page 58: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Bayes factor approximation

Approximating Z using a mixture representation (cont’d)

Corresponding MCMC (=Gibbs) sampler

At iteration t:

1. Take δ(t) = 1 with probability

ω1 πk(θk^(t−1)) Lk(θk^(t−1)) / [ ω1 πk(θk^(t−1)) Lk(θk^(t−1)) + ϕ(θk^(t−1)) ]

and δ(t) = 2 otherwise;

2. If δ(t) = 1, generate θk^(t) ∼ MCMC(θk^(t−1), θk), where MCMC(θk, θk′) denotes an arbitrary MCMC kernel associated with the posterior πk(θk|x) ∝ πk(θk)Lk(θk);

3. If δ(t) = 2, generate θk^(t) ∼ ϕ(θk) independently

Page 59: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Bayes factor approximation

Evidence approximation by mixtures

Rao-Blackwellised estimate

ξ̂ = (1/T) Σ_{t=1}^T ω1 πk(θk^(t)) Lk(θk^(t)) / [ ω1 πk(θk^(t)) Lk(θk^(t)) + ϕ(θk^(t)) ],

which converges to ω1 Zk / (ω1 Zk + 1).
Deduce Ẑ3k from ω1 Ẑ3k/(ω1 Ẑ3k + 1) = ξ̂, i.e.

Ẑ3k = [ Σ_{t=1}^T ω1 πk(θk^(t)) Lk(θk^(t)) / (ω1 πk(θk^(t)) Lk(θk^(t)) + ϕ(θk^(t))) ] / [ Σ_{t=1}^T ϕ(θk^(t)) / (ω1 πk(θk^(t)) Lk(θk^(t)) + ϕ(θk^(t))) ]

[Bridge sampler]

Page 60: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Bayes factor approximation

Chib’s representation

Direct application of Bayes’ theorem: given x ∼ fk(x|θk) and θk ∼ πk(θk),

Zk = mk(x) = fk(x|θk) πk(θk) / πk(θk|x)

Use of an approximation to the posterior:

Ẑk = m̂k(x) = fk(x|θk*) πk(θk*) / π̂k(θk*|x)

Page 61: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

Bayes factor approximation

Case of latent variables

For missing variable z, as in mixture models, the natural Rao-Blackwell estimate is

π̂k(θk*|x) = (1/T) Σ_{t=1}^T πk(θk*|x, zk^(t)),

where the zk^(t)’s are Gibbs sampled latent variables

Page 62: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

ABC model choice

Approximate Bayesian Computation

Simulation target is π(θ)f(x|θ), with the likelihood f(x|θ) not available in closed form.
Likelihood-free rejection technique:

ABC algorithm

For an observation y ∼ f(y|θ), under the prior π(θ), keep jointly simulating

θ′ ∼ π(θ),  x ∼ f(x|θ′),

until the auxiliary variable x is equal to the observed value, x = y.

[Pritchard et al., 1999]
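A sketch of the exact-match rejection sampler on a discrete toy model (an assumption of this example, not from the slides): x ∼ Binomial(10, θ) with θ ∼ U(0, 1), where the accepted θ's are exactly distributed from the Beta(y+1, n−y+1) posterior:

```python
import random

random.seed(7)
n, y_obs = 10, 7
accepted = []
while len(accepted) < 5_000:
    theta = random.random()                             # theta' ~ U(0,1) prior
    x = sum(random.random() < theta for _ in range(n))  # x ~ Binomial(n, theta')
    if x == y_obs:                                      # keep exact matches only
        accepted.append(theta)

mean = sum(accepted) / len(accepted)
print(mean)  # ≈ Beta(8, 4) posterior mean = 2/3
```

The algorithm never evaluates the likelihood, only simulates from it; the price is the rejection rate, which becomes prohibitive as the sample space grows, hence the tolerance version on the next slide.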

Page 63: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

ABC model choice

A as approximative

When y is a continuous random variable, the equality x = y is replaced with a tolerance condition,

ρ(x, y) ≤ ǫ

where ρ is a distance between summary statistics.
Output distributed from

π(θ) Pθ{ρ(x, y) < ǫ} ∝ π(θ | ρ(x, y) < ǫ)

Page 64: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

ABC model choice

Gibbs random fields

Gibbs distribution

The r.v. y = (y1, . . . , yn) is a Gibbs random field associated with the graph G if

f(y) = (1/Z) exp{ −Σ_{c∈C} Vc(yc) },

where Z is the normalising constant, C is the set of cliques of G, and Vc is any function, also called a potential;
U(y) = Σ_{c∈C} Vc(yc) is the energy function

© Z is usually unavailable in closed form

Page 65: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

ABC model choice

Potts model

Potts model

Vc(y) is of the form

Vc(y) = θS(y) = θ Σ_{l∼i} δ_{yl = yi},

where l∼i denotes a neighbourhood structure.

In most realistic settings, the summation

Zθ = Σ_{x∈X} exp{θᵀS(x)}

involves too many terms to be manageable, and numerical approximations cannot always be trusted

[Cucala, Marin, CPR & Titterington, JASA, 2009]

Page 66: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

ABC model choice

Neighbourhood relations

Choice to be made between M neighbourhood relations

i ∼m i′  (0 ≤ m ≤ M − 1)

with

Sm(x) = Σ_{i ∼m i′} I{xi = xi′},

driven by the posterior probabilities of the models.

Page 67: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

ABC model choice

Model index

Formalisation via a model index M, a new parameter with prior distribution π(M = m) and π(θ|M = m) = πm(θm).
Computational target:

P(M = m|x) ∝ ∫Θm fm(x|θm) πm(θm) dθm  π(M = m)

Page 68: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

ABC model choice

Sufficient statistics

If S(x) is a sufficient statistic for the joint parameters (M, θ0, . . . , θM−1), then

P(M = m|x) = P(M = m|S(x)).

For each model m, its own sufficient statistic Sm(·) makes S(·) = (S0(·), . . . , SM−1(·)) also sufficient.
For Gibbs random fields,

x|M = m ∼ fm(x|θm) = f¹m(x|S(x)) f²m(S(x)|θm) = [1/n(S(x))] f²m(S(x)|θm)

where

n(S(x)) = ♯{x̃ ∈ X : S(x̃) = S(x)}

© S(x) is also sufficient for the joint parameters
[Specific to Gibbs random fields!]

Page 69: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

ABC model choice

ABC model choice Algorithm

ABC-MC

◮ Generate m∗ from the prior π(M = m).

◮ Generate θ∗m∗ from the prior πm∗(·).

◮ Generate x∗ from the model fm∗(·|θ∗m∗).

◮ Compute the distance ρ(S(x0), S(x∗)).

◮ Accept (θ∗m∗ , m∗) if ρ(S(x0), S(x∗)) < ǫ.

[Cornuet, Grelaud, Marin & Robert, BA, 2008]

Note: when ǫ = 0, the algorithm is exact

Page 70: Rimini discussion

The 21st Bayesian Century

Bayesian Calculations

ABC model choice

Toy example

iid Bernoulli model versus two-state first-order Markov chain, i.e.

f0(x|θ0) = exp( θ0 Σ_{i=1}^n I{xi = 1} ) / {1 + exp(θ0)}^n,

versus

f1(x|θ1) = (1/2) exp( θ1 Σ_{i=2}^n I{xi = xi−1} ) / {1 + exp(θ1)}^(n−1),

with priors θ0 ∼ U(−5, 5) and θ1 ∼ U(0, 6) (inspired by “phase transition” boundaries).
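A sketch of ABC-MC on this toy pair, with ǫ = 0 on the joint sufficient statistic S(x) = (Σ I{xi = 1}, Σ I{xi = xi−1}); the observed series x0 below is a hypothetical example, not data from the talk:

```python
import math, random

random.seed(11)
n = 10

def sim_model0(t0):
    p = math.exp(t0) / (1 + math.exp(t0))
    return [int(random.random() < p) for _ in range(n)]    # iid Bernoulli(p)

def sim_model1(t1):
    p = math.exp(t1) / (1 + math.exp(t1))                  # P(x_i = x_{i-1})
    x = [int(random.random() < 0.5)]                       # uniform initial state
    for _ in range(n - 1):
        x.append(x[-1] if random.random() < p else 1 - x[-1])
    return x

S = lambda x: (sum(x), sum(x[i] == x[i - 1] for i in range(1, n)))

x0 = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]      # hypothetical observed series
counts = [0, 0]
for _ in range(200_000):
    m = random.randint(0, 1)              # model index from uniform prior
    theta = random.uniform(-5, 5) if m == 0 else random.uniform(0, 6)
    if S(sim_model0(theta) if m == 0 else sim_model1(theta)) == S(x0):
        counts[m] += 1                    # epsilon = 0: exact match in S

total = counts[0] + counts[1]
print(counts, [c / total for c in counts])  # ABC estimate of P(M = m | x0)
```

Because S is sufficient for the joint (M, θ0, θ1) in this Gibbs-random-field-like setting, the accepted model frequencies are (up to Monte Carlo noise) the exact posterior model probabilities.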

Toy example (2)

(left) Comparison of the true BF_{m0/m1}(x0) with its ABC approximation (in logs) over 2,000 simulations and 4×10⁶ proposals from the prior. (right) Same comparison when using the tolerance ε corresponding to the 1% quantile on the distances.

A Defense of the Bayesian Choice

Given the advances in practical Bayesian methods in the past two decades, anti-Bayesianism is no longer a serious option

Gelman, BA, 2009

Bayesians are of course their own worst enemies. They make non-Bayesians accuse them of religious fervour, and an unwillingness to see another point of view.

Davidson, 2009

1. Choosing a probabilistic representation

Bayesian statistics is about making probability statements

Gelman, BA, 2009

Bayesian Statistics appears as the calculus of uncertainty

Reminder: a probabilistic model is nothing but an interpretation of a given phenomenon

What is the meaning of RD’s t test example?!

1. Choosing a probabilistic representation (2)

Inference is impossible.

Davidson, 2009

The Bahadur–Savage problem stems from the inability to make choices about the shape of a statistical model, not from an impossibility to draw [Bayesian] inference.

Further, a probability distribution is more than the sum of its moments. Ill-posed problems thus highlight issues with the model, not with the inference.

2. Conditioning on the data

Bayesian data analysis is a method for summarizing uncertainty and making estimates and predictions using probability statements conditional on observed data and an assumed model

Gelman, BA, 2009

At the basis of statistical inference lies an inversion process between cause and effect. Using a prior distribution brings a necessary balance between observations and parameters, and enables one to operate conditional upon x.

What is the data in RD’s t test example?! The U’s? The Y’s?

3. Exhibiting the true likelihood

Frequentist statistics is an approach for evaluating statistical procedures conditional on some family of posited probability models

Gelman, BA, 2009

The Bayesian approach provides a complete quantitative inference on the parameters and predictives, one that points out inadequacies of frequentist statistics, while implementing the Likelihood Principle.

There needs to be a true likelihood, including in non-parametric settings

[Rousseau, Van der Vaart]

4. Using priors as tools and summaries

Bayesian techniques allow prior beliefs to be tested and discarded as appropriate

Gelman, BA, 2009

The choice of a prior distribution π does not require any kind of belief in this distribution: rather, consider it as a tool that summarizes the available prior information and the uncertainty surrounding this information.

Non-identifiability is an issue in that the prior may strongly impact inference about the identifiable components.

4. Using priors as tools and summaries (2)

No uninformative prior exists for such models.

Davidson, 2009

Reference priors can be deduced from the sampling distribution by an automated procedure, based on a minimal information principle that maximises the information brought by the data.

There is an important literature on prior modelling for non-parametric problems, including smoothness constraints.

5. Accepting the subjective basis of knowledge

Knowledge is a critical confrontation between a prioris and experiments. Ignoring these a prioris impoverishes the analysis.

We have, for one thing, to use a language, and our language is entirely made of preconceived ideas and has to be so. However, these are unconscious preconceived ideas, which are a million times more dangerous than the other ones. Were we to include other preconceived ideas, consciously stated, it might be asserted that we would only aggravate the evil! I do not believe so: I rather maintain that they would balance one another.

Henri Poincaré, 1902

6. Choosing a coherent system of inference

Bayesian data analysis has three stages: formulating a model, fitting the model to data, and checking the model fit.

The second step—inference—gets most of the attention, but the procedure as a whole is not automatic

Gelman, BA, 2009

Forcing inference into a decision-theoretic mold allows for a clarification of the way inferential tools should be evaluated, and therefore implies a conscious (although subjective) choice of the retained optimality.

Logical inference process: start with the requested properties, i.e., loss function and prior distribution, then derive the best solution satisfying these properties.

6. Choosing a coherent system of inference (2)

Asymptopia annoys Bayesians.

Davidson, 2009

Asymptotics [for inference] stands as a proxy for not specifying the model completely, and thus for using another model, while asymptotics [for simulation] is quite acceptable. Bayesian inference does not escape asymptotic difficulties, see e.g. mixtures.

The non-parametric bootstrap aims at inference with no[t enough] modelling, while the parametric Bayesian bootstrap essentially uses the Bayesian predictive.

7. Looking for optimal frequentist procedures

At intermediate levels of a Bayesian model, frequency properties typically take care of themselves. It is typically only at the top level of unreplicated parameters that we have to worry

Gelman, BA, 2009

Bayesian inference widely intersects with the three notions of minimaxity, admissibility, and equivariance (Haar). Looking for an optimal estimator most often ends up finding a Bayes estimator.

Optimality is easier to attain through the Bayes “filter”

8. Solving the actual problem

Frequentist methods have coverage guarantees; Bayesian methods don’t. In science, coverage matters

Wasserman, BA, 2009

Frequentist methods are justified on a long-term basis, i.e., from the statistician’s viewpoint. From a decision-maker’s point of view, only the problem at hand matters! That is, he/she calls for an inference conditional on x.

9. Providing a universal system of inference

Bayesian methods are presented as an automatic inference engine

Gelman, BA, 2009

Given the three factors

(X , f(x|θ)) ,  (Θ, π(θ)) ,  (D, L(θ, d)) ,

the Bayesian approach validates one and only one inferential procedure.

10. Computing procedures as a minimization problem

The discussion of computational issues should not be allowed to obscure the need for further analysis of inferential questions

Bernardo, BA, 2009

Bayesian procedures are easier to compute than procedures from alternative theories, in the sense that there exists a universal method for the computation of Bayes estimators.
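As an illustration of this universal method (simulate from the posterior, then minimise the posterior expected loss; under squared-error loss the solution is the posterior mean), here is a minimal random-walk Metropolis sketch for a normal mean with a flat prior. The data and the tuning are illustrative assumptions:

```python
import random, math

random.seed(3)

x = [1.2, 0.7, 1.9, 1.1, 0.8]   # toy data, assumed x_i ~ N(theta, 1)

def log_post(theta):
    # Flat prior on theta: log-posterior equals the log-likelihood
    # up to an additive constant
    return -0.5 * sum((xi - theta) ** 2 for xi in x)

# Random-walk Metropolis draws from the posterior
theta, draws = 0.0, []
for _ in range(20000):
    prop = theta + random.gauss(0.0, 1.0)
    if math.log(random.random()) < log_post(prop) - log_post(theta):
        theta = prop
    draws.append(theta)

# Under squared-error loss the Bayes estimator is the posterior mean;
# with a flat prior it coincides with the sample mean, here 1.14
burn = draws[2000:]
bayes_est = sum(burn) / len(burn)
print(bayes_est)
```

The same machinery applies regardless of the loss: replace the final average by the minimiser of the Monte Carlo estimate of the posterior expected loss.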

Convergence assessment is an issue, but recent developments in adaptive MCMC allow for more confidence in the output.