bayes and robust bayes scott ferson, [email protected] 9 october 2007, stony brook university, mar...

Bayes and robust Bayes

Scott Ferson, [email protected], 9 October 2007, Stony Brook University, MAR 550, Challenger 165

Outline

• Introduction
• Bayes’ rule
• Distributions
  – Getting priors and specifying likelihoods
  – Conjugate pairs
• Model uncertainty
• Updating
• Multinomial sampling
• Advantages and disadvantages
• Robust Bayes
  – Bounds on cumulatives, density-ratio bounds
• Imprecise Dirichlet model (IDM)
  – Walley’s marbles, Laplace’s sunrises, (fault trees)
• Conclusions and synopsis
• Exercises

Bayesians are like snowflakes

• Huge diversity among practitioners
• Most, but not all, are subjectivists
• Most, but not all, regard all statistical problems as part of decision theory
• Most, but not all, believe that precise probabilities should and can be elicited
• Most but, amazingly, not all employ Bayes’ rule to obtain results

Bayesian domains and concerns

• Statistics
• Decision analysis
• Risk analysis
• Science itself

• Updating and aggregation, rarely convolutions
• Disparate lines of evidence, data, judgments
• Little or no data

Derivation of Bayes’ rule

P(A & B) = P(A|B) P(B) = P(B|A) P(A)

P(A|B) = P(A) P(B|A) / P(B)

[Figure: Venn diagram of events A and B with their overlap A&B]

Disease example

Prevalence in the general population is 0.01%
Tests of diseased people are positive 99.9% of the time
Tests of healthy people are negative 99.99% of the time

Given that you’ve tested positive, what’s the chance that you actually have the disease?

Almost all doctors say 99% or greater, but the true answer is only about 50%.

Apply Bayes’ rule

P(disease | positive) = P(disease) P(positive | disease) / P(positive)

where the normalization factor P(positive) is computed as the sum

P(disease) P(positive | disease) + (1 − P(disease)) (1 − P(negative | healthy))

prevalence = 0.01% = P(disease)
sensitivity = 99.9% = P(positive | disease)
specificity = 99.99% = P(negative | healthy)

prevalence × sensitivity / (prevalence × sensitivity + (1 − prevalence) × (1 − specificity))
= 0.01% × 99.9% / (0.01% × 99.9% + (1 − 0.01%) × (1 − 99.99%))
= 0.4998
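The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `posterior_given_positive` is made up for illustration, and the inputs are the prevalence, sensitivity and specificity quoted on the slide, written as fractions.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """Bayes' rule for P(disease | positive test)."""
    true_positive = prevalence * sensitivity                   # P(disease) P(positive | disease)
    false_positive = (1.0 - prevalence) * (1.0 - specificity)  # P(healthy) P(positive | healthy)
    return true_positive / (true_positive + false_positive)


# Slide values: 0.01% prevalence, 99.9% sensitivity, 99.99% specificity
p_disease = posterior_given_positive(0.0001, 0.999, 0.9999)
print(round(p_disease, 4))  # 0.4998
```

The denominator is P(positive), so the function returns the fraction of all positive tests that are true positives.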

To see why this is so…

• Imagine 10,000 people without any risk factors

• The prevalence of 0.01% suggests about one person from this group has the disease

• The sensitivity of the test (99.9%) says that this person will test positive almost surely

• The specificity of 99.99% suggests that, of the 9,999 people who do not have the disease, another will also test positive

• Thus, we’d expect two people to test positive for the disease, but only one of them actually has it

Paternity probability from blood tests

• Mother has type O blood; girl has type B blood (she got the B allele from her father)
• An alleged father has type AB so could have been the donor of the B allele

P(B | F) = probability the girl would inherit the B allele if the man were her father
P(B | not F) = probability she could inherit the B allele from another man
P(F) = prior probability the man is the father of the girl
P(F | B) = probability the man is her father given that they share the B allele

• Genetics says P(B | F) = ½
• Background frequency of the B allele is 0.09, so P(B | not F) = 0.09
• Prior unknown, so set P(F) = ½

P(F | B) = P(F) P(B | F) / (P(F) P(B | F) + (1 − P(F)) P(B | not F))
= ½ × ½ / (½ × ½ + (1 − ½) × 0.09) = 0.847

So is he the babydaddy?

Waiting for a bus

• You got to the bus stop early, where the bus is now 10 minutes late. It might not come at all, and the next one is in two hours. You might need to walk.

• 90% of buses do their rounds, and 80% of those that do are less than 10 min late
• 10% chance that bus won’t show up at all
• Given that it’s already 10 minutes late, what’s the probability the bus will come?

Event B: bus comes; Event W: you’re walking
Prior probabilities: P(B) = 90%, P(W) = 10%
Likelihoods given it’s late: P(10 min | B) = 20%, P(10 min | W) = 100%

P(B | 10 min) = P(B) P(10 min | B) / (P(B) P(10 min | B) + P(W) P(10 min | W))
= 90% × 20% / (90% × 20% + 10% × 100%) = 18 / 28 ≈ 0.643

A decision (which is where Bayesians really want to be) about whether to continue to wait or start walking would consider the costs of missing your appointment and how tired you are.
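The bus calculation is an instance of a general recipe: multiply each hypothesis's prior by its likelihood, then normalize. The sketch below assumes the two hypotheses are mutually exclusive and exhaustive. The helper name `posterior` and the dictionary layout are illustrative choices, not something from the slides.

```python
def posterior(priors, likelihoods):
    """Normalize prior x likelihood over mutually exclusive, exhaustive hypotheses."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())  # the normalization factor
    return {h: joint[h] / total for h in joint}


# B: bus comes, W: you're walking; the evidence is "already 10 minutes late"
bus = posterior(priors={"B": 0.9, "W": 0.1}, likelihoods={"B": 0.2, "W": 1.0})
print(round(bus["B"], 3))  # 0.643
```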

P(X | Y) versus P(Y | X)

• Not interchangeable
  – No matter how convenient that’d be
• Error makers
  – Stupid people
  – Smart people (Laplace, who invented Bayesian inference; de Morgan, who wrote the first treatise on probability in English)
• Bayes’ rule converts one to the other

Bayes’ rule on distributions

[Figure: three panels over −5 to 20 showing the prior, the likelihood, and the normalized posterior]

posterior ∝ prior × likelihood

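Example

• One observation x = 13 drawn from normal(θ, √2)
• Prior for θ is normal(6, 1):
  $$P(\theta) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(\theta - 6)^2}{2}\right)$$
• Likelihood of normal(θ, √2):
  $$P(x \mid \theta) = \frac{1}{\sqrt{4\pi}} \exp\left(-\frac{(x - \theta)^2}{4}\right)$$
• Multiply:
  $$P(\theta \mid x) \propto P(\theta)\,P(x \mid \theta) \propto \exp\left(-\frac{3\theta^2 - 2(12 + x)\theta}{4}\right)$$
• Posterior ∝ exp((−3θ² + 50θ)/4), which is normal(8⅓, √⅔)

Density of normal(μ, σ) is
$$P(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right)$$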
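The posterior in this example can be reproduced with the standard precision-weighted update for a normal prior and a normal likelihood with known spread. The sketch assumes, as the exponents in the densities imply, that the prior has variance 1 and the observation has variance 2 (standard deviation √2). The function name `normal_update` is illustrative.

```python
import math


def normal_update(prior_mean, prior_sd, x, data_sd):
    """Posterior mean and standard deviation for one normal observation."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / data_sd ** 2
    post_precision = prior_precision + data_precision
    # Precision-weighted average of the prior mean and the datum
    post_mean = (prior_mean * prior_precision + x * data_precision) / post_precision
    return post_mean, math.sqrt(1.0 / post_precision)


post_mean, post_sd = normal_update(prior_mean=6.0, prior_sd=1.0, x=13.0, data_sd=math.sqrt(2.0))
print(round(post_mean, 3), round(post_sd ** 2, 3))  # 8.333 0.667
```

The posterior mean 8⅓ lies between the prior mean 6 and the datum 13, and the posterior variance ⅔ is smaller than either input variance.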

Answer

[Figure: density and cumulative distribution plotted over 0 to 20]

Interpretation of the posterior

• Estimates value of θ (a distribution mean)
  – Shrinkage (pull towards the prior’s mean)
• Uncertainty expressed as distributions
• Peaky density (steeper CDF) means surer
• Posterior is more certain than the prior
  – Even though the datum is surprising to the prior

Credible intervals

[Figure: density and cumulative distribution plotted over 0 to 20, marking the 90% credible interval (Bayesian confidence interval)]

Bayesian confidence intervals

• Express probability the value is in an interval
  – No complex Neyman-Pearson interpretation
• No guarantee about performance
  – Performance “90% of the time…” doesn’t apply
  – Don’t call them “confidence intervals”
• Asymmetric intervals
  – Narrowest (mode)
  – Symmetric (median)
  – Mean-centered

Prosecutor’s fallacy

Where does the likelihood come from?

• Mechanism that generated the data

• ‘Inverse’ probability
  – Density: probability of x as a function of x, given θ
  – Likelihood: probability of x as a function of θ, given x

• Not a distribution (doesn’t integrate to one)
  – Equivalence class of functions, can scale however we want

• Appropriateness of a probability model that justifies a particular likelihood function is always at best a matter of opinion (Edwards 1972, page 51)

“Libby is Billy spelled sideways, sorta”

Where do priors come from?

• Whatever You say
• Maximum entropy
• EDFs or empirical Bayes (Newman 2000)
  – Data used to estimate prior and also likelihood
• Uninformative or reference priors
  – No prior can represent ignorance (Bernardo 1979)
  – Just a way to minimize the prior’s influence to discover what the data themselves are saying about the posterior
• Conjugate pairs

Conjugate pairings

Likelihood | Prior | Posterior
bernoulli(θ) | uniform(0, 1) | beta(1 + Σxi, 1 + n − Σxi)
bernoulli(θ) | beta(α, β) | beta(α + Σxi, β + n − Σxi)
poisson(θ) | gamma(α, β) | gamma(α + Σxi, β + n)
normal(θ, s) | normal(μ, σ) | normal((μ/σ² + Σxi/s²)/v, 1/√v) = normal((μs² + σ²Σxi)/(s² + nσ²), 1/√(1/σ² + n/s²))
exponential(θ) | gamma(α, β) | gamma(α + n, β + Σxi)
binomial(k, θ) | beta(α, β) | beta
uniform(0, θ) | pareto(b, c) | pareto
negbinomial | beta(α, β) | beta
normal(, s) | gamma(α, β) | gamma
exponential(θ) | invgamma(α, β) | inversegamma
multinomial(k, θj) | dirichlet(s, tj) | dirichlet(n + s, (xj + s tj)/(n + s))

where v = 1/σ² + n/s², and n = Σxj, j ∈ {1, ..., k} for the multinomial row.
Likelihood parameters other than θ are assumed to be known
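As a concrete reading of the bernoulli rows of the table, the sketch below updates a beta prior with 0/1 observations. The data are hypothetical, invented only to exercise the formula, and the function name `beta_bernoulli_update` is not from the slides.

```python
def beta_bernoulli_update(alpha, beta, observations):
    """beta(alpha, beta) prior + bernoulli data -> beta(alpha + sum x_i, beta + n - sum x_i)."""
    successes = sum(observations)
    failures = len(observations) - successes
    return alpha + successes, beta + failures


data = [1, 0, 1, 1, 0]  # hypothetical 0/1 outcomes, n = 5
a_post, b_post = beta_bernoulli_update(1, 1, data)  # uniform(0, 1) is beta(1, 1)
posterior_mean = a_post / (a_post + b_post)
print(a_post, b_post)  # 4 3
```

With a conjugate pair the update is just bookkeeping on the hyperparameters, which is why such pairs were popular before numerical methods became cheap.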

Model uncertainty

Bayesian model averaging

• Similar to the probabilistic mixture

• Updates prior probabilities to get weights

• Takes account of available data

(Draper 1995)


• Assume it’s actually the first model
• Compute probability distribution under that model
• Read off probability density of observed data
  – Product if multiple data; it’s the likelihood for that model
• Repeat above steps for each model
• Compute posterior ∝ prior × likelihood
  – This gives the Bayes’ factors
• Use the posteriors as weights for the mixture

Example

Either
f(A,B) = fPlus(A,B) = A + B
or
f(A,B) = fTimes(A,B) = A × B
where A ~ normal(0, 1), B ~ normal(5, 1)

Single datum: f(A,B) = 7.59

Priors: P(fPlus) = 0.6, P(fTimes) = 0.4

Likelihoods
fPlus(A,B) ~ A + B ~ normal(5, √2), LPlus(7.59) = 0.05273
fTimes(A,B) ~ A × B ~ normal(0, √26), LTimes(7.59) = 0.02584

[Figure: probability densities of fPlus and fTimes over f(A,B) from −10 to 20, with the datum 7.59 marked]

R: dnorm(7.59, 5, sqrt(2))
Excel: =NORMDIST(7.59, 5, SQRT(2), FALSE)

Bayes’ factors

Posterior probabilities for the two models: posterior ∝ prior × likelihood

Plus: 0.6 × 0.05273 / (0.6 × 0.05273 + 0.4 × 0.02584) = 0.7538
Times: 0.4 × 0.02584 / (0.6 × 0.05273 + 0.4 × 0.02584) = 0.2462

The denominator is the normalization factor. These are the weights for the mixture distribution
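The model-averaging weights can be recomputed with Python's standard library. This sketch takes the slide's normal distributions for the two models at face value, with standard deviations √2 and √26 (the R call on the slide uses sqrt(2)). The dictionary names and the helper `mixture_pdf` are illustrative.

```python
import math
from statistics import NormalDist

datum = 7.59
priors = {"fPlus": 0.6, "fTimes": 0.4}
# Distribution of f(A, B) under each model, as stated on the slide
models = {
    "fPlus": NormalDist(5.0, math.sqrt(2.0)),
    "fTimes": NormalDist(0.0, math.sqrt(26.0)),
}

# Likelihood of each model = its density at the observed datum
likelihoods = {name: dist.pdf(datum) for name, dist in models.items()}
evidence = sum(priors[name] * likelihoods[name] for name in priors)
weights = {name: priors[name] * likelihoods[name] / evidence for name in priors}


def mixture_pdf(y):
    """Density of the model-averaged (mixture) distribution at y."""
    return sum(weights[name] * models[name].pdf(y) for name in models)


print({name: round(w, 4) for name, w in weights.items()})
```

The weights should come out near 0.7538 and 0.2462, matching the hand calculation.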

[Figure: probability density and cumulative probability of f(A,B) over −20 to 25, with curves labeled fPlus and fTimes]


• Produces single distribution as answer
• Can account for differential prior credibility
• Takes account of available data

• Must enumerate all possible models
• May be computationally challenging
• Confounds variability and incertitude
• Averages together incompatible theories
• Underestimates tail risks

Updating from relational constraints

Updating with only constraints

Suppose W × H = A

W ∈ [23, 33]
H ∈ [112, 150]
A ∈ [2000, 3200]

Any triples with W × H ≠ A get excluded with a likelihood of zero
Mass concentrated onto manifold of feasible combinations

Likelihood: $L(A = W \times H \mid W, H, A) = \delta(A - W \times H)$

Prior: $\Pr(W, H, A) = \frac{I(W \in [23, 33])}{33 - 23} \times \frac{I(H \in [112, 150])}{150 - 112} \times \frac{I(A \in [2000, 3200])}{3200 - 2000}$

Posterior: $f(W, H, A \mid A = W \times H) \propto \delta(A - W \times H) \times \Pr(W, H, A)$

δ(·) is the Dirac delta function; I(·) is the indicator function

[Figure: CDFs of W, H and A comparing the interval constraint analysis with the original interval ranges]
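The ranges left after the update can be approximated by simple interval constraint propagation. This gives only the bounds of the feasible set, not the posterior distributions in the figure. The sketch assumes all intervals are strictly positive, so products and quotients are monotone. The helper names are illustrative.

```python
def intersect(a, b):
    """Intersection of two closed intervals given as (lo, hi) tuples."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if lo > hi:
        raise ValueError("constraints are inconsistent")
    return (lo, hi)


def mul(a, b):
    """Product of two strictly positive intervals."""
    return (a[0] * b[0], a[1] * b[1])


def div(a, b):
    """Quotient a / b of two strictly positive intervals."""
    return (a[0] / b[1], a[1] / b[0])


def propagate(W, H, A, sweeps=10):
    """Repeatedly tighten W, H, A under the constraint W * H = A."""
    for _ in range(sweeps):
        A = intersect(A, mul(W, H))
        W = intersect(W, div(A, H))
        H = intersect(H, div(A, W))
    return W, H, A


W, H, A = propagate((23.0, 33.0), (112.0, 150.0), (2000.0, 3200.0))
print(W, H, A)
```

For these inputs the sweeps settle immediately: A cannot be below 23 × 112 = 2576, W cannot exceed 3200/112, and H cannot exceed 3200/23.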

Multinomial sampling


• Bag of marbles colored red, blue, or green
• Bag is shaken and marbles are randomly drawn
• Each marble’s color is recorded and it’s put back in the bag
• What’s the chance the next marble is red?
  – If we knew how many marbles of each color are in the bag, we could compute the probability
  – If we haven’t seen inside the bag, what can we say given the colors of marbles drawn so far?


• Consider N random observations (sampled with replacement)
• As each marble is drawn, we record its color j ∈ {red, blue, green}
• This is just the standard multinomial with P(j) = θj, where j = 1, 2, 3, 0 ≤ θj, and Σθj = 1

Bayesian analysis

• Given nj = number of marbles colored j, where Σnj = N, the likelihood function is
  $$\prod_{j=1}^{k} \theta_j^{n_j}$$
• A Dirichlet prior is conjugate
  $$\prod_{j=1}^{k} \theta_j^{s t_j - 1} \qquad \text{Dirichlet}(s, t_j)$$
  where 0 < s, 0 < tj < 1, Σtj = 1 (tj is mean for θj)

Solution

• Bayes-Laplace prior is uniform Dirichlet(s=3, tj=1/3)

• Posterior for this prior is Dirichlet(N+3, (nj+1)/(N+3))

• The vector of means for a Dirichlet(s, tj) is tj, so the posterior expected value of θj is (nj+1)/(N+3)

• Suppose we had 7 draws: blue, green, blue, blue, red, green, red

• Predictive probability the next marble is red is (2+1)/(7+3) = 0.3

• Compare to the observed frequency of red marbles 2/7 ≈ 0.2857

• Before any data, N=0, the probability is (0+1)/(0+3) = 0.333

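In general, the posterior density is proportional to
$$\prod_{j=1}^{k} \theta_j^{n_j + s t_j - 1} \qquad \text{Dirichlet}(N + s,\ (n_j + s t_j)/(N + s))$$

The posterior mean (nj + s tj)/(N + s) is pulled from the observed frequency nj/N toward the prior mean tj (shrinkage)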
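The predictive probabilities for the marble example follow directly from the posterior mean (nj + s tj)/(N + s). A minimal sketch, using the Bayes-Laplace prior s = 3, tj = 1/3 and the seven draws listed above; the function name `predictive` is illustrative.

```python
from collections import Counter


def predictive(draws, colors, s, t):
    """Posterior mean (n_j + s*t_j) / (N + s) for each color j."""
    counts = Counter(draws)
    total = len(draws)
    return {c: (counts[c] + s * t[c]) / (total + s) for c in colors}


colors = ["red", "blue", "green"]
t = {c: 1.0 / 3.0 for c in colors}  # uniform prior means
draws = ["blue", "green", "blue", "blue", "red", "green", "red"]

after_data = predictive(draws, colors, s=3, t=t)
before_data = predictive([], colors, s=3, t=t)
print(round(after_data["red"], 4), round(before_data["red"], 4))  # 0.3 0.3333
```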

Beyond colored marbles

• Multinomial sampling occurs in many risk-analytic and statistical problems with random (iid) trials with finitely many possible outcomes

• Analysis, and solutions, for all such problems are just the same as for the marbles

Advantages of Bayesian statistics

• Ascendant
  – Fashionable and iconoclastic
  – Answers easier since computer revolution
• Decision making
  – Integrated into Bayesian framework
  – Frequentists don’t even balance costs of Type II errors
• Rationality
  – Maximizes expected utility, avoids sure loss (Dutch book)
  – Rational agents seeing same info eventually agree (Dawid)
• Subjective information
  – Legitimizes and takes account of prior beliefs
• Working without any empirical data
• Naturalness
• Data mining

(Berry 1996; 1997)

Naturalness

Bayesian
• Any question (OJ’s guilt, God)
• Intuitive credibility intervals
• Distributions represent uncertainty
• Can use all information, even if it’s from outside the experiment
• Likelihood calculated only from the data actually seen
• Posterior depends on data only through the likelihood
• Free to stop whenever so can enjoy serendipity, windfalls
• Can use scientific stopping rule

Frequentist
• Can only study repeatable events
• Torturous confidence intervals
• No parameter distributions (fixed)
• Can only take account of what’s observed during the experiment
• Consider probabilities of hypothetical data never observed
• Probability measures depend on the experimental design
• Must follow experimental design to compute p
• Can’t use scientific stopping rule

Data mining

Bayesian
• Perfectly okay
• Doesn’t affect results
• Scientists free to peek at data before and during
• Can have many more parameters than samples
• Can estimate lots of parameters at once

Frequentist
• Scientifically improper (can’t compute p)
• Data contaminates the scientist’s mind
• Bad to create a model with many parameters
• Bad to estimate more than a few parameters at a time

Problems and technical issues

• Controversial
• Needs priors, sampling model/likelihood
• Computational difficulty: It can be very hard to derive answers analytically
• Zero preservation: If either the likelihood or the prior is zero for a given θ, then the posterior is zero
• Subjectivity required: Beliefs needed for priors may be inconsistent with public policy/decision making
• Inadequate model of ignorance: Doesn’t distinguish between incertitude and equiprobability

[Slide side labels: “Important to neophytes” and “Really important”]

Fear of the prior

• Many Bayesians are reluctant to employ subjective priors in scientific work (Yang and Berger 1998)

• Hamming (1991, page 298) wrote bluntly “If the prior distribution, at which I am frankly guessing, has little or no effect on the result, then why bother; and if it has a large effect, then since I do not know what I am doing how would I dare act on the conclusions drawn?” [emphasis in the original]

Everything’s in the prior

[Figure: posterior probability plotted against the prior probability of paternity, both axes from 0 to 1]

The strength of the evidence is reflected in the function’s curvature
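The curve in this figure can be tabulated by sweeping the prior through the paternity calculation. The sketch reuses the blood-test likelihoods from the earlier example, P(B | F) = ½ and P(B | not F) = 0.09. The function name and the chosen grid of priors are illustrative.

```python
def posterior_paternity(prior, p_b_given_f=0.5, p_b_given_not_f=0.09):
    """P(F | B) as a function of the prior P(F)."""
    numerator = prior * p_b_given_f
    return numerator / (numerator + (1.0 - prior) * p_b_given_not_f)


# The posterior runs from 0 to 1 as the prior does, bowed above the diagonal
for prior in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    print(f"prior {prior:.2f} -> posterior {posterior_paternity(prior):.3f}")
```

Because the answer can be anything between 0 and 1 depending on the prior, the choice P(F) = ½ carries much of the weight in the reported 0.847.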

Groups

• Bayesian theory does not work for groups
  – Rationality inconsistent with democratic process
• Some even say “groups can’t make decisions” (Geo. Hazelrigg, NSF)
  – Despite the fact they come to verdicts, conclusions, findings, choices, …
• Must imagine a personalist “You”
  – Teams, agencies, collaborators, companies, clients
  – Reviewers, peers

Is coherence enough?

• Bayesians say their reasoning is rational and their conclusions sound if their beliefs and inferences are coherent (consistent to avoid sure loss)

• Standard Bayesian inferences are coherent, but coherence is only a minimal requirement of rationality (Walley 1991, 396f)

• Beliefs ought to also conform to evidence, which Bayesian inferences rarely do (ibid.)

• Abandoning objectivity forsakes the essential connection to evidence and accountability in the real world

• Evidence comes from measurement and objectivity

Robust Bayes

Robustness

• An answer is robust if it does not depend sensitively on the assumptions and calculation inputs on which it’s based

• Related to important ideas in other areas of statistics such as robust and resistant estimators

• Robust Bayes analysis, also called Bayesian sensitivity analysis, investigates the robustness of answers

Bayes’ rule made safe for uncertainty

prevalence × sensitivity / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence))

= 1 / (1 + ((1/prevalence − 1) × (1 − specificity)) / sensitivity)

The two forms are algebraically equivalent, but the second mentions each uncertain quantity only once, so evaluating it with interval arithmetic does not inflate the bounds.

sensitivity = 99.9%
specificity = 99.99%
prevalence = [0.005, 0.015]%

[0.005, 0.015]% × 99.9% / ([0.005, 0.015]% × 99.9% + (1 − [0.005, 0.015]%) × (1 − 99.99%))

= [19.99, 99.95]%

1 / (1 + ((1/[0.005, 0.015]% − 1) × (1 − 99.99%)) / 99.9%)

= [33.31, 59.98]%
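The gap between the two interval answers comes from the repeated appearance of prevalence in the first formula. The sketch below hand-codes both evaluations for an interval prevalence with point-valued sensitivity and specificity. It uses plain floating point without outward rounding, so the last digit may differ slightly from the values quoted above. The function names are illustrative.

```python
def naive_interval(prevalence, sensitivity, specificity):
    """Evaluate p*sens / (p*sens + (1-p)*(1-spec)) treating each p as independent."""
    lo, hi = prevalence
    numerator = (lo * sensitivity, hi * sensitivity)
    denominator = (
        lo * sensitivity + (1.0 - hi) * (1.0 - specificity),
        hi * sensitivity + (1.0 - lo) * (1.0 - specificity),
    )
    return (numerator[0] / denominator[1], numerator[1] / denominator[0])


def single_use_interval(prevalence, sensitivity, specificity):
    """Evaluate 1 / (1 + (1/p - 1)(1-spec)/sens), which is increasing in p."""
    def f(p):
        return 1.0 / (1.0 + ((1.0 / p - 1.0) * (1.0 - specificity)) / sensitivity)
    return (f(prevalence[0]), f(prevalence[1]))


prevalence = (0.00005, 0.00015)  # [0.005, 0.015]%
wide = naive_interval(prevalence, 0.999, 0.9999)
tight = single_use_interval(prevalence, 0.999, 0.9999)
print([round(100 * v, 2) for v in wide], [round(100 * v, 2) for v in tight])
```

The naive interval is much wider than the single-use interval and contains it, which is the usual symptom of repeated variables in interval arithmetic.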

Uncertainty about the prior

class of prior distributions → class of posteriors

[Figure: three panels over −5 to 20 showing a class of priors, a single likelihood, and the resulting posteriors]

Uncertainty about the likelihood

class of likelihood functions → class of posteriors

[Figure: three panels over −5 to 20 showing a single prior, a class of likelihoods, and the resulting posteriors]

Uncertainty about both

[Figure: three panels over −5 to 20 showing classes of priors and likelihoods and the resulting posteriors]

Moments of the normals

Prior: normal(μ, σ)    Likelihood: normal(x, s)

μ = [1, 6]    x = [5, 13]
σ = [2.5, 6]    s = [4, 5]

Posterior: normal(θ1, v1)

mean = (μ/σ² + x/s²)/(1/σ² + 1/s²) = [2.28, 10.35]

stdev = 1/√(1/σ² + 1/s²) = [1.24, 1.66]

Conjugacy facilitates the calculation, but note that σ² and s² are repeated so be careful evaluating the mean

Robust Bayes can make a p-box

class of priors, class of likelihoods → class of posteriors

[Figure: priors, likelihoods, posteriors and the posterior p-box, plotted over −5 to 20]

Uncertainty about decisions

class of probability models → class of decisions
class of utility functions → class of decisions

If you end up with a single decision, you’re in like flint.

If the class of decisions is large and diverse, then you should be somewhat tentative about making any decision.

Bayesian dogma of ideal precision

• Robust Bayes is plainly inconsistent with the Bayesian idea that uncertainty should be measured by a single additive probability measure and values should always be measured by a precise utility function.

• Some Bayesians justify it as a convenience

• Others suggest it accounts for uncertainty beyond probability theory

Desirable properties of the class

• Easy to understand and elicit
• Easy to compute with
• Sufficiently big to reflect one’s uncertainty
• Generalization to multiple dimensions
• Near ignorance (vacuous, but not in all ways)

(Berger 1994)

Defining classes of priors

• Parametric conjugate families

• Parametric but non-conjugate families

• Density-ratio (bounded density distributions)

• ε-contamination, mixture, quantile classes, etc.

• Bounds on cumulatives (trivializes results)

Parametric class

Parameters can be chosen to make the class wide, but it will often still be too sparse in the sense that the distributions it admits are too similar to reflect what we think might be possible.

[Figure: CDFs of the distributions in a parametric class, cumulative probability from 0 to 1]

Bounded cumulative distributions

• Shouldering: Bounds on CDFs (p-boxes) admit density distributions that may be zero anywhere.
• Renormalization: Posteriors have unit area.
• Zero preservation: Posterior’s range is inside the intersection of the ranges of the prior and likelihood.
• Triviality: Together, these imply that all you can ever conclude about the posterior is its range

Shouldering and zero preservation

[Figure: cumulative probability plot of a prior and a likelihood, annotated “Thus posterior is zero everywhere”]

A shoulder on a CDF corresponds to zero density. Because the posterior can be zero almost everywhere, all the mass could be pushed to any point.

Result is trivial

[Figure: CDFs from 0 to 1 for the prior, the likelihood and the posterior]

Bounded densities

Area below prior’s lower bound is smaller than 1; area below upper bound is larger than 1. There are no restrictions on the likelihood bounds.

[Figure: bounds on the prior densities and on the likelihoods; probability density from 0 to 0.6 plotted over 0 to 35]

Un-normalized posterior bounds

How can we normalize?

[Figure: un-normalized posterior density bounds; probability density from 0 to 0.6 plotted over 0 to 35]

Normalized posterior bounds

[Figure: normalized posterior density bounds on the same axes]

Normalization trick: divide the lower bound by the area of the upper and vice versa

Bounds on posterior densities

• Imply bounds on cumulative tail risks

• Imply bounds on moments

• Imply bounds on the mode

[Figure: posterior density bounds with the implied bounds on the mode marked]

Take-home lessons

• Many ways depending on uncertainty

• Parametric uncertainty can be too sparse

• Yet too much generality trivializes the result

• Often not so easy to do the calculations

Probability of an event

• Imagine a gamble that pays one dollar if an event occurs (but nothing otherwise)
  – How much would you pay to buy this gamble?
  – How much would you be willing to sell it for?
• Probability theory requires the same price for both
  – By asserting the probability of the event, you agree to buy any such gamble offered for this amount or less, and to sell the same gamble for any amount greater than or equal to this ‘fair’ price…and to do so for every event!
• Imprecise probability (IP) just says, sometimes, your highest buying price might be smaller than your lowest selling price

Imprecise Dirichlet model

Walley’s (1996) bag of marbles

• What’s the probability of drawing a red marble if you’ve never seen inside the bag?

• First, consider a model with N random (iid) observations (sampled with replacement). As each of the N marbles is drawn, we can see its color j = {1, …, k}

• This is just the standard multinomial with P(j) = θj, where j = 1, …, k, 0 ≤ θj, and Σθj = 1

Bayesian analysis

• Given nj = number of marbles that were colored j, Σnj = N, the likelihood function is proportional to
  $$\prod_{j=1}^{k} \theta_j^{n_j}$$
• A Dirichlet prior is convenient (it’s a multivariate generalization of a beta), with density proportional to
  $$\prod_{j=1}^{k} \theta_j^{s t_j - 1} \qquad \text{Dirichlet}(s, t_j)$$
  where 0 < s, 0 < tj < 1, Σtj = 1 (tj is mean for θj)

The answer seems easy

• Posterior is also Dirichlet, in density proportional to
  $$\prod_{j=1}^{k} \theta_j^{n_j + s t_j - 1} \qquad \text{Dirichlet}(N + s,\ (n_j + s t_j)/(N + s))$$
• But what if you don’t even know how many colors k there are in the bag?
  – Letting Ω = {red, not red} gives a different answer from letting Ω = {blue, green, yellow, red}
  – Lots of ways to form Ω, lots of different answers
• Peeking (using the data to form Ω) is incoherent

Walley’s “Imprecise Dirichlet model”

• Likelihood is multinomial, as before

• Prior is the class of all Dirichlet distributions
  $\mathcal{M}_0^s$ = {F : F = Dirichlet(s, tj)} for a fixed s > 0 that does not depend on Ω

• The corresponding posterior class is
  $\mathcal{M}_N^s$ = {F : F = Dirichlet(N+s, (nj + s tj)/(N+s))}

Predictive probabilities

• Under this posterior, the predictive probability that the next observation will be color j is the posterior mean of θj, which is (nj + s tj)/(N + s)

• By extremizing this quantity on tj, we get posterior upper and lower probabilities

  lower probability: $\underline{P}$ = nj / (N + s)   (tj → 0)

  upper probability: $\overline{P}$ = (nj + s) / (N + s)   (tj → 1)

• Easily generalizes to other events
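The lower and upper predictive probabilities are simple enough to compute directly. The sketch below applies them to the earlier seven-marble sample (2 reds in 7 draws) for s = 1 and s = 2. The function name `idm_bounds` is illustrative, and the choice of sample is only for continuity with the earlier slides.

```python
def idm_bounds(n_j, N, s=1.0):
    """Lower and upper predictive probabilities under the imprecise Dirichlet model."""
    if s <= 0:
        raise ValueError("s must be positive")
    return n_j / (N + s), (n_j + s) / (N + s)


print(idm_bounds(0, 0))        # (0.0, 1.0): vacuous before any data
print(idm_bounds(2, 7, s=1))   # (0.25, 0.375)
lower2, upper2 = idm_bounds(2, 7, s=2)
print(round(lower2, 4), round(upper2, 4))  # 0.2222 0.4444
```

The interval for s = 1 sits inside the interval for s = 2, and both contain the observed frequency 2/7.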

What if there’s no or little data?

• Before any observations, P = [$\underline{P}$, $\overline{P}$] = [0, 1].

• This is appropriately vacuous (duh!)

• The width of the interval decreases as sample size N increases, but doesn’t depend on k or the sample space at all

• In the limit N → ∞, the Walley, Bayesian and frequentist answers all converge to nj /N

What should s be?

• Determines the prior’s influence on posterior
  – If s → 0, interval → point (overconfident)
  – If s → ∞, interval → width 1 (ineducable)
  – Using s = 1 or 2 might be reasonable
• Letting it depend on N would be incoherent
• Letting it vary would be incoherent
• Different values of s give consistent inferences that are nested (unlike Bayesian inconsistency)

Bayesian inconsistency

Different priors
• Bayes: (nj + m) / (N + k)
• Jeffreys (1946, 1961): (nj + ½m) / (N + ½k)
• Perks (1947): (nj + m/k) / (N + 1)
• Haldane (1948): nj / N

Different models for IDM
• Bayes: [0, 1], s = ∞
• Jeffreys (1946, 1961): [0, 1], s = ∞
• Perks (1947): [nj / (N + 1), (nj + 1) / (N + 1)], s = 1
• Haldane (1948): nj / N, s = 0

s 2

What if the event has never happened?

• If nj = 0, then $\underline{P}$ = 0 and $\overline{P}$ = s/(N+s)
  – No reason to bet on such an event at any odds
  – But should be willing to bet against it at shorter and shorter odds as N increases

• Contrast this to Laplace’s famous calculation of the probability the sun wouldn’t rise the next day via his “rule of succession” (nj+1)/(N+2) = 1/1826215
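The contrast can be made concrete with exact fractions. The sketch assumes N = 1826213 past sunrises, a value inferred here from the quoted result 1/1826215 = 1/(N + 2) rather than stated on the slide, and uses s = 1 for the imprecise Dirichlet model. The function names are illustrative.

```python
from fractions import Fraction


def rule_of_succession(n_j, N):
    """Laplace's precise predictive probability (n_j + 1) / (N + 2)."""
    return Fraction(n_j + 1, N + 2)


def idm_interval(n_j, N, s):
    """IDM lower and upper predictive probabilities for integer s."""
    return Fraction(n_j, N + s), Fraction(n_j + s, N + s)


N_DAYS = 1826213  # assumption: inferred from the slide's 1/1826215
laplace = rule_of_succession(0, N_DAYS)
lower, upper = idm_interval(0, N_DAYS, s=1)
print(laplace)       # 1/1826215
print(lower, upper)  # 0 1/1826214
```

Laplace's rule commits to a single small probability for the never-observed event, while the IDM only bounds it between zero and s/(N + s).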

Conclusions and synopsis

Bayesian analysis

How?

decide on your “prior” belief, P(model)

compute likelihood function, P(data | model), as a function of the model

multiply to get posterior belief, P(model | data)

today’s posterior is tomorrow’s prior

Why?

“rational”

acknowledges inescapability of subjectivity

updating rule supports the coherent accumulation of knowledge

Why not?

tolerates (requires) subjectivity

Bayesian rationality does not extend to group decisions

inadequate model of ignorance (confounds with equiprobability)

Robust Bayes (sensitivity analysis)

How?

use a class of distributions to represent uncertainty about the prior

use a class of functions to represent uncertainty about the likelihood

get class of posteriors by applying Bayes’ rule to every possible combination

Why?

accounts for the analyst’s doubts about required inputs

can be cheaper and easier than insisting on precise inputs

consistent with robust statistics (doesn’t require analyst’s omniscience)

expresses the reasonableness of the posterior

Why not?

can be computationally difficult

does not obey Bayesian dogma of ideal precision

still has zero preservation problem

References

• Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society 53: 370-418. Reprinted in 1958 in Biometrika 45: 293-315.
• Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York.
• Bernardo, J.M. (1979). Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society, Series B 41: 113-147 (with discussion).
• Berry, D.A. (1996). Statistics: A Bayesian Perspective. Duxbury Press, Belmont, California.
• Berry, D.A. (1997). Bayesian Statistics. Institute of Statistics and Decision Sciences, Duke University, http://www.pnl.gov/Bayesian/Berry/.
• Dawid, A.P. (1982). Intersubjective statistical models. In Exchangeability in Probability and Statistics, edited by G. Koch and F. Spizzichino, North-Holland, Amsterdam.
• Draper, D. (1995). Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society, Series B 57: 45-97.
• Edwards, A.W.F. (1972). Likelihood. Cambridge University Press.
• Insua, D.R. and F. Ruggeri (eds.) (2000). Robust Bayesian Analysis. Lecture Notes in Statistics, Volume 152. Springer-Verlag, New York.
• Laplace, Marquis de, P.S. (1820). Théorie analytique des probabilités (edition troisième). Courcier, Paris. The introduction (Essai philosophique sur les probabilités) is available in an English translation in A Philosophical Essay on Probabilities (1951), Dover Publications, New York.
• Lavine, M. (1991). Sensitivity in Bayesian statistics, the prior and the likelihood. Journal of the American Statistical Association 86 (414): 396-399.
• Pericchi, L.R. (2000). Sets of prior probabilities and Bayesian robustness. Imprecise Probability Project. http://ippserv.rug.ac.be/documentation/robust/robust.html. http://www.sipta.org/documentation/robust/pericchi.pdf.
• Walley, P. (1991). Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, London.
• Yang, R. and J.O. Berger (1998). A catalog of noninformative priors. Parexel International and Duke University, http://www.stat.missouri.edu/~bayes/catalog.ps

Exercises

1. Check the math in the example on slide 13.
2. Sketch [$\underline{P}$, $\overline{P}$] for {-,B,G,B,B,G,R,G,…,(35/100)R,…(341/1000)R}.
3. What can be said about the posterior if the prior is uniform(a,b) and the likelihood function is a non-zero constant for values of θ between c and d and zero elsewhere? Sketch the answer for a = 2, b = 14, c = 5, d = 21. What is the posterior if a ∈ [1,3], b ∈ [11,17], c ∈ [4,6], d ∈ [20,22]?
4. How could you do Bayesian backcalculation?
5. What is the statistical ensemble corresponding to a prior probability distribution?

Missing failure data

Failure data is often incomplete

• We know how many times C failed, but don’t know whether it was A or B or both that made it fail

• We know from component testing how often A failed, but not whether B would have too

[Figure: fault tree in which C is the “or” of A and B]

• Traditional Bayesian approach (Vesely, et al.)
  – Makes assumptions about missing data
  – Requires independence
• IDM can relax these assumptions
  – Computes chance of C, A or B failing at next test
  – Uses only failure data nC, NC, nA, NA, nB, NB
  – And hardly any assumptions
