
Chapter 3

Prior elicitation

3.1 Introduction

In this chapter we will think about how to construct a suitable prior distribution π(θ) for our parameter of interest θ. For example, why did we use a Be(77, 5) distribution for θ in the music expert example (page 31)? Why did we use a Be(2.5, 12) distribution for θ in the example about the video game pirate (page 34)? And why did we assume a Ga(10, 4000) distribution for θ in the earthquake example (page 37)? Prior elicitation – the process by which we attempt to construct the most suitable prior distribution for θ – is a huge area of research in Bayesian Statistics; the aim in this course is to give a brief (and relatively simple) overview.

We will consider the case of substantial prior knowledge, where expert opinion, for example, gives us good reason to believe that some values in a permissible range for θ are more likely to occur than others. In particular, we will re-visit the examples about the music expert, the video game pirate and earthquakes in Chapter 2.

Of course, sometimes it might be very difficult to properly elicit a prior distribution for θ. For example, there may be no expert available to help guide your choice of distribution. In this case, the chosen prior distribution π(θ) might be one which keeps the mathematics simple when operating Bayes’ Theorem, whilst also assuming a large or “infinite” variance for θ (vague prior knowledge). Alternatively, a prior which assumes all values of θ are equally likely could be used to represent complete prior ignorance, as in the example about the possibly biased coin on page 28.

In Chapter 4 we will consider the construction of priors for θ under certain parameter constraints; we also consider the construction, and use, of mixture prior distributions in more complex scenarios.


3.2 Substantial prior knowledge

In this section, we will consider the situation where we have substantial prior knowledge. In Examples 3.3 and 3.4 we will make use of some online software developed by researchers in Bayesian Statistics at Sheffield University: Professor Tony O’Hagan and Dr. Jeremy Oakley. We will use this software to implement the trial roulette and bisection methods of prior elicitation. This software is very simple to use and will be demonstrated in lectures; you will also be expected to use this in our next computer practical class.

Example 3.1 (Using suggested prior summaries)

Let us return to Example 2.4 of Chapter 2. Recall that we were given some data on the “waiting times”, in days, between 21 earthquakes, and we discussed why an exponential distribution Exp(θ) might be appropriate to model the waiting times. Further, we were told that an expert on earthquakes has prior beliefs about the rate θ, described by a Ga(10, 4000) distribution; a plot of this prior is shown in Figure 2.8. Where did this prior distribution come from?

Suppose the expert tells us that earthquakes in the region we are interested in usually occur less than once per year; in fact, they occur on average once every 400 days. This gives us a rate of occurrence of about 1/400 = 0.0025 per day (to match the “daily” units given above). Further, he is fairly certain about this and specifies a very small variance of 6.25 × 10⁻⁷.

A Ga(a, b) distribution seems sensible, since we can’t observe a negative daily earthquake rate and the Gamma distribution is specified over positive values only. Using the information provided by the expert, verify our use of a = 10 and b = 4000.

✎...Solution to Example 3.1...
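A minimal sketch of the arithmetic in R, assuming we simply match the expert’s summaries to the mean a/b and variance a/b² of a Ga(a, b) distribution:

# Moment-matching: a Ga(a, b) distribution has mean a/b and variance a/b^2
m <- 0.0025     # expert's prior mean (earthquakes per day)
v <- 6.25e-7    # expert's prior variance

b <- m / v      # rate:  0.0025 / 6.25e-7 = 4000
a <- m * b      # shape: 0.0025 * 4000    = 10   (equivalently m^2 / v)
c(a = a, b = b)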


Example 3.2 (Using suggested prior summaries)

Now let us return to Example 2.2 of Chapter 2. We considered an experiment to determine how good a music expert is at distinguishing between pages from Haydn and Mozart scores; when presented with a score from each composer, the expert makes the correct choice with probability θ.

Before conducting the experiment, we were told that the expert is very competent; in fact, we were told that θ should have a prior distribution peaking at around 0.95 and for which Pr(θ < 0.8) is very small. To achieve this, we assumed that θ ∼ Be(77, 5), with density given in Figure 2.4. How did we know a beta distribution would be appropriate? And how did we figure out the parameters of this distribution, i.e. a = 77 and b = 5?

By now, you should understand why we might work with a beta distribution: in this example, θ is a probability and so must lie in the interval [0, 1], and a beta distribution is defined over this range. But how did we know that a = 77 and b = 5 would give the desired properties for θ?

We are told that the mode of the distribution should be around 0.95; using the formulae on page 23, we can thus write

\[
\begin{aligned}
\frac{a-1}{a+b-2} &= 0.95 && (3.1)\\
\Rightarrow\quad a-1 &= 0.95(a+b-2)\\
\Rightarrow\quad a-0.95a &= 0.95b-1.9+1\\
\Rightarrow\quad 0.05a &= 0.95b-0.9\\
\Rightarrow\quad a &= 19b-18. && (3.2)
\end{aligned}
\]

We are also told that Pr(θ < 0.8) must be small. In fact, suppose we are told that θ < 0.8 might occur with probability 0.0001. This means that if we integrate the probability density function for our beta distribution between 0 and 0.8, we would get 0.0001; from Equation (2.1) on page 23, we can write this as

\[
\int_{0}^{0.8} \frac{\theta^{a-1}(1-\theta)^{b-1}}{B(a,b)}\,d\theta = 0.0001,
\quad\text{i.e.}\quad
\int_{0}^{0.8} \frac{\theta^{(19b-18)-1}(1-\theta)^{b-1}}{B(19b-18,\,b)}\,d\theta = 0.0001, \qquad (3.3)
\]

i.e. we set the cumulative distribution function for a Be(19b − 18, b), evaluated at 0.8, equal to 0.0001 and solve for b. Although this would be rather tricky to do by hand, we can do this quite easily in R. Recall that the R command dbeta(x,a,b) evaluates the density of the Be(a, b) distribution at the point x (see page 24); the command pbeta(x,a,b) evaluates the corresponding cumulative distribution function at x. First of all, we re-write (3.3) to set it equal to zero:

\[
\int_{0}^{0.8} \frac{\theta^{(19b-18)-1}(1-\theta)^{b-1}}{B(19b-18,\,b)}\,d\theta - 0.0001 = 0. \qquad (3.4)
\]

We then write a function in R which computes the left-hand side of Equation (3.4):


f <- function(b){
  # left-hand side of Equation (3.4): Pr(theta < 0.8) under Be(19b - 18, b), minus 0.0001
  answer <- pbeta(0.8, (19 * b) - 18, b) - 0.0001
  return(answer)
}

The trick now is to use a numerical procedure to find the root of answer in the R function above, i.e. find the value b which equates answer to zero (as is required in Equation (3.4)). We can do this using the R function uniroot(f, lower=, upper=), which uses a numerical search algorithm to find the root of the expression provided by the function f, having been given a lower bound and an upper bound to search within. We know from the formulae on page 23 that a, b > 1 when using expression (3.1) for the mode, so we can search for a root over some specified domain > 1: for example, we might use lower=1 and upper=100, giving:

> uniroot(f,lower=1,upper=100)
$root
[1] 5.06513

$f.root
[1] 6.008134e-09

$iter
[1] 14

$estim.prec
[1] 6.103516e-05

Thus, the solution to Equation (3.3) is b = 5.06513. For simplicity, rounding down to b = 5 and then substituting into (3.2) gives

a = 19 × 5 − 18 = 77,

hence the use of θ ∼ Be(77, 5) in Example 2.2 in Chapter 2.
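As a quick sanity check (not part of the original derivation), we can confirm in R that the rounded values a = 77 and b = 5 still give a mode of 0.95 and keep Pr(θ < 0.8) very small:

a <- 77; b <- 5
(a - 1) / (a + b - 2)   # mode of Be(77, 5): exactly 0.95
pbeta(0.8, a, b)        # Pr(theta < 0.8): still very small after rounding b down to 5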

In the next two examples, we will use the MATCH Uncertainty Elicitation Tool, developed by Professor Tony O’Hagan and Dr. Jeremy Oakley at Sheffield, to demonstrate two techniques of prior elicitation: the trial roulette method and the bisection method.


Example 3.3 (Trial roulette method)

We now return to Example 2.3 in Chapter 2. Recall that Max is a video game pirate, and he is trying to identify the proportion θ of potential customers who might be interested in buying Call of Duty: Elite next month. Why did we use θ ∼ Be(2.5, 12)?

For each month over the last two years Max knows the proportion of his customers who have bought similar games; these proportions are shown below in Table 3.1.

0.32 0.25 0.28 0.15 0.33 0.12 0.14 0.18 0.12 0.05 0.25 0.08
0.07 0.16 0.24 0.38 0.18 0.15 0.22 0.05 0.01 0.19 0.08 0.15

Table 3.1: Proportion of Max’s customers, each month, who have bought computer games in the same genre as Call of Duty: Elite.

We will use these data to form a prior for θ using the trial roulette method. Here, we divide the sample space for θ into m bins, and we distribute n chips amongst the bins, with the proportion of chips allocated to a particular bin representing the probability of θ lying in that bin. If this is done graphically, we can see the shape of our distribution forming as we allocate the chips. Once the allocation is done, we can then fit a parametric distribution to match the shape of the distribution of chips. Actually, we will do this using the MATCH Uncertainty Elicitation Tool. You can enter MATCH directly by typing the following URL into any web browser:

http://optics.eee.nottingham.ac.uk/match/uncertainty.php

Doing so gives the screenshot shown in Figure 3.1 overleaf. Notice the default Lower Limit and Upper Limit are set at 100 and 200, respectively; in this example, θ is a proportion, and so these need to be changed to 0 and 1 respectively. Doing so, and then selecting the Roulette option, gives what can be seen in the screenshot in Figure 3.2. The Number of Bins and Height of Grid can be changed, if necessary. We can now place chips in the bins according to the prior information given in Table 3.1. For example, the first chip will go in between 0.3 and 0.4 as the first proportion is 0.32. Completing the grid gives the distribution of chips as shown in Figure 3.3.

Finally, selecting the Fitting and Feedback button gives the screenshot shown in Figure 3.4. Here, you can see that MATCH automatically chooses the best-fitting distribution – in this case, a Scaled Beta distribution (in this example just the beta distribution that you should be familiar with by now); the program also returns a picture of the fitted density function, showing the parameters of this distribution underneath. Here, MATCH suggests using a ≈ 2.5 and b ≈ 12 (given as α and β on the screen); hence, the use of θ ∼ Be(2.5, 12) in Example 2.3 in Chapter 2.
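MATCH does this fitting for us, but we can get a rough idea of where its answer comes from with a simple method-of-moments fit to the proportions in Table 3.1. The sketch below is only illustrative (MATCH may use a different fitting criterion), but it should land close to the a ≈ 2.5 and b ≈ 12 reported above:

# Monthly proportions from Table 3.1
p <- c(0.32, 0.25, 0.28, 0.15, 0.33, 0.12, 0.14, 0.18, 0.12, 0.05, 0.25, 0.08,
       0.07, 0.16, 0.24, 0.38, 0.18, 0.15, 0.22, 0.05, 0.01, 0.19, 0.08, 0.15)

m <- mean(p); v <- var(p)
ab <- m * (1 - m) / v - 1    # method-of-moments estimate of a + b
a <- m * ab                  # roughly 2.5
b <- (1 - m) * ab            # roughly 12
c(a = a, b = b)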

Of course, instead of using past data to complete the roulette grid, we could ask an expert to place chips in the bins according to her beliefs about likely values of θ; twice as many chips in one bin compared to another would mean that the expert believes that θ is twice as likely to take on values in that bin as in the other. Note the Feedback Percentiles in Figure 3.4; we will discuss their use in the next example.


Figure 3.1: The homepage of the MATCH Uncertainty Elicitation Tool.

Figure 3.2: The Roulette option in MATCH.


Figure 3.3: Placing chips in bins using the Roulette option in MATCH.

Figure 3.4: Fitting and Feedback in MATCH.


Example 3.4 (Bisection method)

The bisection method is another method of prior elicitation. This method requires the expert to be able to split the range of allowable values for θ into four smaller, equally likely, chunks.

First of all, the range is bisected into two intervals of equal probability, from which we elicit the expert’s median, m; we then ask the expert to further bisect each of these intervals to elicit the lower and upper quartiles (l and u, respectively). From these elicited quartiles, we can then obtain a parametric distribution for θ; for this we will again make use of MATCH. Finally, a stage of feedback and refinement is considered before we settle on a distribution for θ; this stage of feedback and refinement can (and should!) be undertaken in all situations where we attempt to elicit a prior distribution from information given to us by an expert. We will now demonstrate the bisection method through an example.

Over the past 15 years there has been considerable scientific interest in the rate of retreat, θ (feet per year), of glaciers in Greenland (as discussed in the recent Frozen Planet series shown on the BBC); indeed, this has often been used as an indicator of global warming. We are interested in eliciting a suitable prior distribution for θ for the Zachariae Isstrøm glacier in Greenland.

Records from an expert glaciologist show that glaciers in Greenland have been retreating at a rate of between 0 and 70 feet per year since 1995. We will use these values as the lower and upper limits for θ, respectively. We now attempt to elicit the median and quartiles for θ from the glaciologist.

1. Elicit the expert’s median

Statistician: “Can you give us a value m such that, for the Zachariae Isstrøm glacier, we can expect the rate of retreat in 2012 to have an equal chance of lying in [0, m] feet and [m, 70] feet?”

Glaciologist: “Hmmm... the Zachariae Isstrøm glacier lies in quite a northerly region of glacial activity in Greenland, one that has not been severely affected by glacial retreat. For this glacier this year, there is a good chance that the rate of retreat will be lower than most in Greenland, perhaps around 24 feet.”

Statistician: “So for this glacier, it is just as likely that the rate of retreat will be somewhere in [0, 24] and [24, 70]?”

Glaciologist: “Yes, I think that sounds reasonable. The value which bisects the whole range, in terms of how likely certain values of θ are to occur, should be closer to the lower bound than the upper.”

In MATCH, we enter the Lower Limit and Upper Limit as 0 and 70, respectively; after selecting the Quartile method of elicitation as the Input Mode, we then move the Median slider to 24: this can be seen in the screenshot in Figure 3.5. Notice that the green area of the median bar represents the lower 50% of the distribution for θ and the blue area represents the upper 50% of the distribution for θ. Notice also that the lower and upper quartiles are still at their default values – one quarter and three quarters of the range, respectively.


Figure 3.5: Entering the lower and upper limits, and the elicited median, in MATCH.

2. Elicit the expert’s lower quartile

Statistician: “So, you think there is an evens chance that the Zachariae Isstrøm glacier will retreat between [0, 24] feet and [24, 70] feet this year. Can you split the lower interval [0, 24] into two halves of equal probability also?”

Glaciologist: “Erm, not sure, that’s a bit more difficult...”

Statistician: “OK, you said you expect the glacier to retreat by about 24 feet this year. How certain are you of this value? Do you think it could be considerably lower than this?”

Glaciologist: “Well, obviously, I can’t be certain... but I doubt it would be much lower than this, for this particular glacier...”

Statistician: “Do you think [0, 12] or [12, 24] is more likely?”

Glaciologist: “Definitely, a rate of retreat somewhere between 12 feet and 24 feet is much more likely than between 0 and 12 feet. There are areas of Greenland further North where the glaciers have much slower rates of retreat... only the most northerly glaciers have zero retreat.”

Statistician: “OK. So [12, 24] is more likely than [0, 12]. Is there a value between 12 and 24 that you’d be prepared to go down to for the rate of retreat for this glacier?”

Glaciologist: “Probably a bit more than half-way. Maybe 19 feet?”


3. Elicit the expert’s upper quartile

Statistician: “Thank you. In a similar way, for this glacier, could you split the upper interval [24, 70] into two halves of equal probability?”

Glaciologist: “I’m sure that the rate of retreat for this glacier will be closer to my specified value [24] than half way between 24 and 70... really, only the fastest retreating glaciers in more southerly regions have a rate of retreat more than about 40 feet in one year... I think a value of about 30 feet would split this upper interval quite nicely here.”

Statistician: “Thank you.”

We now update the MATCH screen with the suggested lower and upper quartiles (19 and 30, respectively), before clicking Fitting and Feedback. Figure 3.6 shows a screenshot of this.

Figure 3.6: Entering the lower and upper quartiles to obtain a fitted distribution for θ in MATCH.


4. Reflection

Statistician: “So, would you consider the following four intervals equally likely?”

[0, 19], [19, 24], [24, 30], [30, 70]

Glaciologist: “This seems reasonable to me, I think...”

5. Fit a parametric distribution to these judgements

Notice that the screenshot from MATCH, shown in Figure 3.6, indicates that a Ga(9, 0.36) distribution might be appropriate for θ. Notice from Figure 3.6 that MATCH also gives Feedback percentiles – we will consider these in the feedback and refinement stage. (A short R check of this fitted distribution is given at the end of this example.)

6. Feedback and refinement

We now show the glaciologist our distribution for the rate of glacial retreat, i.e. θ ∼ Ga(9, 0.36).

Statistician: “We have obtained a probability distribution for the rate of retreat for the Zachariae Isstrøm glacier. A plot of this distribution is shown in Figure 3.6. Do you think this reasonably specifies your beliefs about the retreat of the Zachariae Isstrøm glacier this year?”

Glaciologist: “I think this looks OK. The plot peaks close to my suggested value of about 24 feet; it allows for all values between 0 and 70 feet, but gives diminishing probabilities as we move towards these extremes.”

The Statistician should now consider the tails: we can do this by setting the Feedback Percentiles in MATCH to 1% and 99% (see Figure 3.7).

Statistician: “What would have to happen for the rate of retreat at this glacier to be as much as 48 feet this year?”

Glaciologist: “Something pretty spectacular for a glacier at this longitude! Although it’s not impossible, just really, really unlikely.”

Statistician: “Our probability distribution gives this event, or anything more extreme, a probability of 0.01 – i.e. once in a hundred years – does this seem small enough?”

Glaciologist: “Yes, I suppose this is imaginable.”

Statistician: “At the other end of the scale, we give a rate of retreat of just 10 feet this year the same small probability. Are you happy with this?”

Glaciologist: “Yes, that looks fine to me.”
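The R check promised in step 5: a short sketch (outside MATCH) confirming that a Ga(9, 0.36) distribution reproduces the elicited quartiles and the tail values used in the feedback stage.

# Quartiles of the fitted Ga(9, 0.36) prior: should be roughly 19, 24 and 30 feet
qgamma(c(0.25, 0.5, 0.75), shape = 9, rate = 0.36)

# Tail feedback: Pr(retreat > 48 feet) and Pr(retreat < 10 feet), both about 0.01
pgamma(48, shape = 9, rate = 0.36, lower.tail = FALSE)
pgamma(10, shape = 9, rate = 0.36)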


Figure 3.7: Obtaining tail probabilities in MATCH: here, we ask for the 1st percentile and the 99th percentile.

Example 3.5

Let Y be the retreat, in feet, of the Zachariae Isstrøm glacier. A Pareto distribution with rate θ is often used to model such geophysical activity, with probability density function

\[
f(y\,|\,\kappa,\theta) = \theta\,\kappa^{\theta}\,y^{-(\theta+1)}, \qquad \theta, \kappa > 0 \text{ and } y > \kappa.
\]

(a) Obtain the likelihood function for θ given the parameter κ and some observed data y1, y2, . . . , yn (independent).

Solution ✎...Solution to Example 3.5(a)...
The likelihood function is simply the product of the probability density function evaluated at each observation yi, (i = 1, . . . , n), i.e.

\[
L(\theta\,|\,\kappa, y) = \theta\kappa^{\theta}y_1^{-(\theta+1)} \times \cdots \times \theta\kappa^{\theta}y_n^{-(\theta+1)}
= \theta^{n}\kappa^{n\theta}\prod_{i=1}^{n} y_i^{-(\theta+1)}. \qquad (3.5)
\]


(b) Suppose we observe a retreat of 20 feet at the Zachariae Isstrøm glacier in 2012. Write down the likelihood function for θ.

Solution ✎...Solution to Example 3.5(b)...
We simply substitute n = 1 and y1 = 20 into Equation (3.5), giving

\[
L(\theta\,|\,\kappa, y_1 = 20) = \theta\,\kappa^{\theta}\,20^{-(\theta+1)}.
\]

(c) Using the elicited prior for the rate of retreat we obtained from the expert glaciologist in Example 3.4, and assuming κ is known to be 12, obtain the posterior distribution π(θ|y1 = 20).

Solution ✎...Solution to Example 3.5(c)...
Using Bayes’ Theorem, and following the examples in Chapter 2, we know that

\[
\pi(\theta\,|\,y_1 = 20) \propto \pi(\theta)\times L(\theta\,|\,y_1 = 20).
\]

Recall from Example 3.4 that our elicited prior for θ is Ga(9, 0.36), which has density

\[
\pi(\theta) = \frac{0.36^{9}\,\theta^{8}\,e^{-0.36\theta}}{\Gamma(9)}.
\]

Combining this with the likelihood above (and using κ = 12) gives

\[
\begin{aligned}
\pi(\theta\,|\,y_1=20) &\propto \frac{0.36^{9}\,\theta^{8}\,e^{-0.36\theta}}{\Gamma(9)}\times \theta\, 12^{\theta}\, 20^{-(\theta+1)}\\
&\propto \theta^{9}e^{-0.36\theta}\, 12^{\theta}\, 20^{-(\theta+1)}\\
&\propto \theta^{9}e^{-0.36\theta}\, 12^{\theta}\, 20^{-\theta}. && (3.6)
\end{aligned}
\]


Solution ✎...Solution to Example 3.5(c) continued...
Now consider the term 12^θ 20^{−θ}. Taking logs, we get

\[
\theta\ln 12 - \theta\ln 20 = (\ln 12 - \ln 20)\,\theta;
\]

exponentiating to ‘re-balance’, you should see that

\[
12^{\theta}\,20^{-\theta} = e^{(\ln 12 - \ln 20)\theta}.
\]

Substituting back into (3.6) gives

\[
\begin{aligned}
\pi(\theta\,|\,y_1=20) &\propto \theta^{9}e^{-0.36\theta}e^{(\ln 12-\ln 20)\theta}, \quad\text{i.e.}\\
&\propto \theta^{9}e^{-0.36\theta+(\ln 12-\ln 20)\theta}\\
&\propto \theta^{9}e^{-0.87\theta}.
\end{aligned}
\]

Referring to our definition of the gamma distribution on page 23 of these notes, you should recognise this as a Ga(10, 0.87) distribution. Figure 3.8 compares the prior and posterior densities.

Figure 3.8: Prior (dashed) and posterior (solid) densities for the rate of glacial retreat at the Zachariae Isstrøm glacier.
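A plot like Figure 3.8 can be reproduced directly from the two gamma densities; a minimal sketch in R:

# Prior Ga(9, 0.36) (dashed) and posterior Ga(10, 0.87) (solid) for the rate of retreat
theta <- seq(0, 100, length.out = 500)
plot(theta, dgamma(theta, shape = 10, rate = 0.87), type = "l",
     xlab = expression(theta), ylab = "density")
lines(theta, dgamma(theta, shape = 9, rate = 0.36), lty = 2)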


Definition 3.1

We have substantial prior information for θ when the prior distribution dominates the posterior distribution, that is π(θ|x) ∼ π(θ). An example of substantial prior knowledge was given in Example 2.2 where a music expert was trying to distinguish between pages from Mozart and Haydn scores. Figure 3.9 shows the prior and posterior distributions for θ, the probability that the expert makes the correct choice. Notice the similarity between the prior and posterior distributions. Observing the data has not altered our beliefs about θ very much.

Figure 3.9: Prior (dashed) and posterior (solid) densities for θ.

When we have substantial prior information there can be some difficulties:

1. the intractability of the mathematics in deriving the posterior distribution – though with modern computing facilities this is less of a problem,

2. the practical formulation of the prior distribution – coherently specifying prior beliefs in the form of a probability distribution is far from straightforward although, as we have seen, this can be attempted using computer software.


3.3 Limited prior knowledge

When prior information about θ is more limited, the pragmatic approach is to choose a distribution which makes the Bayes updating from prior to posterior mathematically straightforward, and use what prior information is available to determine the parameters of this distribution. Recall some previous examples:

1. Binomial random sample, Beta prior distribution −→ Beta posterior distribution

2. Normal random sample (known variance), Normal prior distribution −→ Normal posterior distribution

3. Exponential random sample, Gamma prior distribution −→ Gamma posterior distribution

In all these examples, the prior distribution and the posterior distribution come from the same family. This leads us to the following definition.

Definition 3.2

Suppose that data x are to be observed with distribution f(x|θ). A family F of prior distributions for θ is said to be conjugate to f(x|θ) if for every prior distribution π(θ) ∈ F, the posterior distribution π(θ|x) is also in F.

Notice that the conjugate family depends crucially on the model chosen for the data x. For example, the only family conjugate to the model “random sample from an exponential distribution” is the Gamma family.
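To see Definition 3.2 in action, the sketch below uses simulated exponential data and an arbitrary Ga(2, 4) prior (both chosen purely for illustration) and checks numerically that the prior-times-likelihood curve matches the analytic Ga(g + n, h + nx̄) posterior:

set.seed(1)
x <- rexp(20, rate = 0.5)                  # simulated exponential data (illustrative)
n <- length(x)
g <- 2; h <- 4                             # an arbitrary Ga(2, 4) prior

theta <- seq(0.01, 2, length.out = 500)
unnorm <- dgamma(theta, g, h) * theta^n * exp(-sum(x) * theta)   # prior x likelihood
post <- unnorm / (sum(unnorm) * diff(theta)[1])                  # normalise on the grid

plot(theta, post, type = "l", xlab = expression(theta), ylab = "density")
lines(theta, dgamma(theta, g + n, h + sum(x)), lty = 2)          # analytic Ga(g+n, h+n*xbar)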

3.4 Vague Prior Knowledge/Prior Ignorance

If we have very little or no prior information about the model parameters θ, we must still choose a prior distribution in order to operate Bayes’ Theorem. Obviously, it would be sensible to choose a prior distribution which is not concentrated about any particular value, that is, one with a very large variance. In particular, most of the information about θ will be passed through to the posterior distribution via the data, and so we have π(θ|x) ∼ f(x|θ).

An example of vague prior knowledge was given in Example 2.1 where a possibly biased coin was assessed. Figure 3.10 shows the prior and posterior distributions for θ = Pr(Head). Notice that the prior and posterior distributions look very different. In fact, in this example, the posterior distribution is simply a scaled version of the likelihood function – likelihood functions are not usually proper probability (density) functions and so scaling is required to ensure that it integrates to one. Most of our beliefs about θ have come from observing the data.


Figure 3.10: Prior (dashed) and posterior (solid) densities for θ.

Vague Prior Knowledge

We represent vague prior knowledge by using a prior distribution which is conjugate to the model for x and which has “infinite” variance.

Example 3.6

Suppose we have a random sample from a N(µ, 1/τ) distribution (with τ known). Determine the posterior distribution assuming a vague prior for µ.

✎...Solution to Example 3.6...
The conjugate prior distribution is a normal distribution. We have already seen that if the prior is µ ∼ N(b, 1/d) then the posterior distribution is µ|x ∼ N(B, 1/D), where

\[
B = \frac{db + n\tau\bar{x}}{d + n\tau} \quad\text{and}\quad D = d + n\tau.
\]

If we now make our prior knowledge vague about µ by letting the prior variance tend to infinity (d → 0), we obtain

\[
B \to \bar{x} \quad\text{and}\quad D \to n\tau.
\]

Therefore, assuming vague prior knowledge for µ results in a N(x̄, 1/(nτ)) posterior distribution. Notice that the posterior mean is the sample mean (the likelihood mode) and that the posterior variance 1/D → 0 as n → ∞.
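A small numerical illustration of this limit (the data, prior mean and precisions below are made up purely for illustration): as the prior precision d shrinks towards zero, the posterior mean B moves to the sample mean x̄.

set.seed(2)
tau <- 1; n <- 30
x <- rnorm(n, mean = 5, sd = 1 / sqrt(tau))    # simulated data (illustrative)
b <- 0                                         # arbitrary prior mean

for (d in c(10, 1, 0.01, 0.0001)) {            # decreasing prior precision d
  B <- (d * b + n * tau * mean(x)) / (d + n * tau)
  cat("d =", d, "  posterior mean B =", round(B, 4), "\n")
}
mean(x)                                        # the limit of B as d -> 0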


Example 3.7

Suppose we have a random sample from an exponential distribution, that is, Xi|θ ∼ Exp(θ), i = 1, 2, . . . , n (independent). Determine the posterior distribution assuming a vague prior for θ.

Solution ✎...Solution to Example 3.7...
The conjugate prior distribution is a Gamma distribution. Recall that a Ga(g, h) distribution has mean m = g/h and variance v = g/h². Rearranging these formulae we obtain

\[
g = \frac{m^2}{v} \quad\text{and}\quad h = \frac{m}{v}.
\]

Clearly g → 0 and h → 0 as v → ∞ (for fixed m). We have seen how taking a Ga(g, h) prior distribution results in a Ga(g + n, h + nx̄) posterior distribution. Therefore, taking a vague prior distribution will give a Ga(n, nx̄) posterior distribution. Note that the posterior mean is 1/x̄ (the likelihood mode) and that the posterior variance 1/(nx̄²) → 0 as n → ∞.

Prior Ignorance

We could represent ignorance by the concept “all values of θ are equally likely”. If θ were discrete with m possible values then we could assign each value the same probability 1/m. However, if θ is continuous, we need some limiting argument (from the discrete case). Suppose that θ can take values between a and b, where −∞ < a < b < ∞. Letting all (permitted) values of θ be equally likely results in taking a uniform U(a, b) distribution as our prior distribution for θ. However, if the parameter space is not finite then we cannot do this: there is no such thing as a U(−∞, ∞) distribution. Convention suggests that we should use the “improper” uniform prior distribution

\[
\pi(\theta) = \text{constant}.
\]


This distribution is improper because \(\int_{-\infty}^{\infty}\pi(\theta)\,d\theta\) is not a convergent integral, let alone equal to one. We have a similar problem if θ takes positive values – we cannot use a U(0, ∞) prior distribution. Now if θ ∈ (0, ∞) then φ = log θ ∈ (−∞, ∞), and so we could use an “improper” uniform prior for φ: π(φ) = constant. In turn, this induces a distribution on θ. Recall the result from Distribution Theory:

Suppose that X is a random variable with probability density function fX(x). If g is a bijective (1–1) function then the random variable Y = g(X) has probability density function

\[
f_Y(y) = f_X\!\left(g^{-1}(y)\right)\left|\frac{d}{dy}g^{-1}(y)\right|. \qquad (3.7)
\]

Applying this result to θ = e^φ gives

\[
\pi_\theta(\theta) = \pi_\phi(\log\theta)\left|\frac{d}{d\theta}\log\theta\right|
= \text{constant}\times\left|\frac{1}{\theta}\right|
\propto \frac{1}{\theta}, \qquad \theta > 0.
\]

This too is an improper distribution.

There is a drawback of using uniform or improper priors to represent prior ignorance: if we are “ignorant” about θ then we are also “ignorant” about any function of θ, for example, about φ₁ = θ³, φ₂ = e^θ, φ₃ = 1/θ, . . . . Is it possible to choose a distribution where we are ignorant about all these functions of θ? If not, on which function of θ should we place the uniform/improper prior distribution? After a little thought, it should be clear that there is no distribution which can represent ignorance for all functions of θ. The above example shows that assigning a uniform/ignorance prior to θ means that we do not have a uniform/ignorance prior for e^θ.

A solution to problems of this type was suggested by Sir Harold Jeffreys. His suggestion was specified in terms of Fisher’s Information:

\[
I(\theta) = E_{X|\theta}\!\left[-\frac{\partial^2}{\partial\theta^2}\log f(X|\theta)\right]. \qquad (3.8)
\]

He recommended that we represent prior ignorance by the prior distribution

\[
\pi(\theta) \propto \sqrt{I(\theta)}. \qquad (3.9)
\]

Such a prior distribution is known as a Jeffreys prior distribution.


Example 3.8

Suppose we have a random sample from a distribution with probability density function

\[
f(x|\theta) = \frac{2x\,e^{-x^2/\theta}}{\theta}, \qquad x > 0,\; \theta > 0.
\]

Determine the Jeffreys prior for this model.

Solution

✎...Solution to Example 3.8...
The likelihood function is

\[
L(\theta|x) = f_X(x|\theta) = \prod_{i=1}^{n}\frac{2x_i\,e^{-x_i^2/\theta}}{\theta}
= \frac{2^n}{\theta^n}\left(\prod_{i=1}^{n}x_i\right)\exp\!\left\{-\frac{1}{\theta}\sum_{i=1}^{n}x_i^2\right\},
\]


Solution ✎...Solution to Example 3.8 continued...
and therefore

\[
\log L(\theta|x) = n\log 2 - n\log\theta + \sum_{i=1}^{n}\log x_i - \frac{1}{\theta}\sum_{i=1}^{n}x_i^2
\]
\[
\Rightarrow\quad \frac{\partial}{\partial\theta}\log L(\theta|x) = -\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^{n}x_i^2
\]
\[
\Rightarrow\quad \frac{\partial^2}{\partial\theta^2}\log L(\theta|x) = \frac{n}{\theta^2} - \frac{2}{\theta^3}\sum_{i=1}^{n}x_i^2
\]
\[
\Rightarrow\quad I(\theta) = E_{X|\theta}\!\left[-\frac{\partial^2}{\partial\theta^2}\log L(\theta|X)\right]
= -\frac{n}{\theta^2} + \frac{2}{\theta^3}E_{X|\theta}\!\left[\sum_{i=1}^{n}X_i^2\right]
= -\frac{n}{\theta^2} + \frac{2n}{\theta^3}E_{X|\theta}\!\left[X^2\right]
\]

since the Xi are identically distributed. Now

\[
E_{X|\theta}\!\left[X^2\right] = \int_{0}^{\infty} x^2\left(\frac{2x\,e^{-x^2/\theta}}{\theta}\right)dx
= \theta\int_{0}^{\infty} y\,e^{-y}\,dy \quad\text{(using the substitution } y = x^2/\theta\text{)}
= \theta\times 1 = \theta.
\]

Therefore

\[
I(\theta) = -\frac{n}{\theta^2} + \left(\frac{2n}{\theta^3}\times\theta\right) = \frac{n}{\theta^2}.
\]

Hence, the Jeffreys prior for this model is

\[
\pi(\theta) \propto I(\theta)^{1/2} \propto \frac{\sqrt{n}}{\theta} \propto \frac{1}{\theta}, \qquad \theta > 0.
\]

Notice that this distribution is improper since \(\int_{0}^{\infty} d\theta/\theta\) is a divergent integral, and so we cannot find a constant which ensures that the density function integrates to one.
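The key step above is E[X²] = θ. A quick Monte Carlo check of this in R (a sketch only, using the fact that if Y ∼ Exp with mean θ then X = √Y has exactly the density f(x|θ) above):

set.seed(3)
theta <- 2.5                              # an arbitrary value of theta
x <- sqrt(rexp(1e6, rate = 1 / theta))    # draws from f(x|theta) = 2x exp(-x^2/theta)/theta
mean(x^2)                                 # should be close to theta, i.e. about 2.5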


Example 3.9

Suppose we have a random sample from an exponential distribution, that is, Xi|θ ∼ Exp(θ), i = 1, 2, . . . , n (independent). Determine the Jeffreys prior for this model.

Solution ✎...Solution to Example 3.9...
Recall that L(θ|x) = f_X(x|θ) = θⁿ e^{−nx̄θ}, and therefore

\[
\log L(\theta|x) = n\log\theta - n\bar{x}\theta
\]
\[
\Rightarrow\quad \frac{\partial}{\partial\theta}\log L(\theta|x) = \frac{n}{\theta} - n\bar{x}
\]
\[
\Rightarrow\quad \frac{\partial^2}{\partial\theta^2}\log L(\theta|x) = -\frac{n}{\theta^2}
\]
\[
\Rightarrow\quad I(\theta) = E_{X|\theta}\!\left[-\frac{\partial^2}{\partial\theta^2}\log L(\theta|X)\right] = \frac{n}{\theta^2}.
\]

Hence, the Jeffreys prior for this model is

\[
\pi(\theta) \propto I(\theta)^{1/2} \propto \frac{\sqrt{n}}{\theta} \propto \frac{1}{\theta}, \qquad \theta > 0.
\]

Notice that this distribution is improper since \(\int_{0}^{\infty} d\theta/\theta\) is a divergent integral, and so we cannot find a constant which ensures that the density function integrates to one.

Notice also that this density is, in fact, a limiting form of a Ga(g, h) density (ignoring the integration constant) since

\[
\frac{h^g\,\theta^{g-1}e^{-h\theta}}{\Gamma(g)} \propto \theta^{g-1}e^{-h\theta} \to \frac{1}{\theta}, \quad\text{as } g \to 0,\; h \to 0.
\]

Therefore, we obtain the same posterior distribution whether we adopt the Jeffreys prior or vague prior knowledge.


Example 3.10

Suppose we have a random sample from a N(µ, 1/τ) distribution (with τ known). Determine the Jeffreys prior for this model.

Solution ✎...Solution to Example 3.10...
Recall that

\[
L(\mu|x) = f_X(x|\mu) = \left(\frac{\tau}{2\pi}\right)^{n/2}\exp\!\left\{-\frac{\tau}{2}\sum_{i=1}^{n}(x_i-\mu)^2\right\},
\]

and therefore

\[
\log L(\mu|x) = \frac{n}{2}\log(\tau) - \frac{n}{2}\log(2\pi) - \frac{\tau}{2}\sum_{i=1}^{n}(x_i-\mu)^2
\]
\[
\Rightarrow\quad \frac{\partial}{\partial\mu}\log L(\mu|x) = -\frac{\tau}{2}\sum_{i=1}^{n}-2(x_i-\mu)
= \tau\sum_{i=1}^{n}(x_i-\mu) = n\tau(\bar{x}-\mu)
\]
\[
\Rightarrow\quad \frac{\partial^2}{\partial\mu^2}\log L(\mu|x) = -n\tau
\]
\[
\Rightarrow\quad I(\mu) = E_{X|\mu}\!\left[-\frac{\partial^2}{\partial\mu^2}\log L(\mu|X)\right] = n\tau.
\]

Hence, the Jeffreys prior for this model is

\[
\pi(\mu) \propto I(\mu)^{1/2} \propto \sqrt{n\tau} = \text{constant}, \qquad -\infty < \mu < \infty.
\]


Notice that this distribution is improper since \(\int_{-\infty}^{\infty} d\mu\) is a divergent integral, and so we cannot find a constant which ensures that the density function integrates to one.

Also it is a limiting form of a N(b, 1/d) density (ignoring the integration constant) since

\[
\left(\frac{d}{2\pi}\right)^{1/2}\exp\!\left\{-\frac{d}{2}(\mu-b)^2\right\}
\propto \exp\!\left\{-\frac{d}{2}(\mu-b)^2\right\} \to 1, \quad\text{as } d \to 0.
\]

Therefore, we obtain the same posterior distribution whether we adopt the Jeffreys prior or vague prior knowledge.

3.5 Asymptotic posterior distribution

There are many limiting results in Statistics. The one you will probably remember is the Central Limit Theorem. This concerns the distribution of X̄ₙ, the mean of n independent and identically distributed random variables (each with known mean µ and known variance σ²), as the sample size n → ∞. It is easy to show that E(X̄ₙ) = µ and Var(X̄ₙ) = σ²/n, and so

\[
E\!\left[\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}\right] = 0 \quad\text{and}\quad
Var\!\left[\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma}\right] = 1.
\]

These two equations are true for all values of n. The important part of the Central Limit Theorem is the description of the distribution of \(\sqrt{n}(\bar{X}_n-\mu)/\sigma\) as n → ∞:

\[
\frac{\sqrt{n}(\bar{X}_n-\mu)}{\sigma} \xrightarrow{D} N(0,1) \quad\text{as } n\to\infty.
\]

The following theorem gives a similar result for the posterior distribution.

Theorem

Suppose we have a statistical model f(x|θ) for data x = (x₁, x₂, . . . , xₙ)ᵀ, together with a prior distribution π(θ) for θ. Then

\[
\sqrt{J(\hat\theta)}\,(\theta - \hat\theta)\,\big|\,x \xrightarrow{D} N(0,1) \quad\text{as } n\to\infty,
\]

where θ̂ is the likelihood mode and J(θ) is the observed information

\[
J(\theta) = -\frac{\partial^2}{\partial\theta^2}\log f(x|\theta).
\]


Proof

Using Bayes’ Theorem, the posterior distribution for θ is

\[
\pi(\theta|x) \propto \pi(\theta)\,f(x|\theta).
\]

Let \(\psi = \sqrt{n}(\theta - \hat\theta)\) and

\[
\ell_n(\theta) = \frac{1}{n}\log f(x|\theta)
\]

be the average log-likelihood per observation, in which case \(f(x|\theta) = e^{n\ell_n(\theta)}\). Using (3.7), the posterior distribution of ψ is

\[
\pi_\psi(\psi|x) = \pi_\theta\!\left(\hat\theta + \frac{\psi}{\sqrt{n}}\,\Big|\,x\right)\times\frac{1}{\sqrt{n}}
\propto \pi_\theta\!\left(\hat\theta + \frac{\psi}{\sqrt{n}}\right)\exp\!\left\{n\,\ell_n\!\left(\hat\theta + \frac{\psi}{\sqrt{n}}\right)\right\}.
\]

Now taking Taylor series expansions about ψ = 0 gives

\[
\pi_\theta\!\left(\hat\theta + \frac{\psi}{\sqrt{n}}\right)
= \pi_\theta(\hat\theta) + \pi'_\theta(\hat\theta)\frac{\psi}{\sqrt{n}} + O\!\left(\frac{\psi^2}{n}\right)
= \pi_\theta(\hat\theta)\left\{1 + O\!\left(\frac{\psi}{\sqrt{n}}\right)\right\}
\]
\[
n\,\ell_n\!\left(\hat\theta + \frac{\psi}{\sqrt{n}}\right)
= n\left\{\ell_n(\hat\theta) + \ell'_n(\hat\theta)\frac{\psi}{\sqrt{n}} + \frac{1}{2}\ell''_n(\hat\theta)\frac{\psi^2}{n} + O\!\left(\frac{\psi^3}{n^{3/2}}\right)\right\}
= n\,\ell_n(\hat\theta) + \frac{1}{2}\ell''_n(\hat\theta)\psi^2 + O\!\left(\frac{\psi^3}{\sqrt{n}}\right)
\]

since \(\ell'_n(\hat\theta) = 0\) by definition of the maximum likelihood estimate. Therefore, retaining only terms in ψ, we have

\[
\begin{aligned}
\pi_\psi(\psi|x) &\propto \pi_\theta(\hat\theta)\left\{1 + O\!\left(\frac{\psi}{\sqrt{n}}\right)\right\}\exp\!\left\{n\,\ell_n(\hat\theta) + \frac{1}{2}\ell''_n(\hat\theta)\psi^2 + O\!\left(\frac{\psi^3}{\sqrt{n}}\right)\right\}\\
&\propto \exp\{n\,\ell_n(\hat\theta)\}\exp\!\left\{\frac{1}{2}\ell''_n(\hat\theta)\psi^2\right\}\exp\!\left\{O\!\left(\frac{\psi^3}{\sqrt{n}}\right)\right\}\left\{1 + O\!\left(\frac{\psi}{\sqrt{n}}\right)\right\}\\
&\propto \exp\!\left\{\frac{1}{2}\ell''_n(\hat\theta)\psi^2\right\}\left\{1 + O\!\left(\frac{\psi^3}{\sqrt{n}}\right)\right\}\left\{1 + O\!\left(\frac{\psi}{\sqrt{n}}\right)\right\}\\
&\propto \exp\!\left\{\frac{1}{2}\ell''_n(\hat\theta)\psi^2\right\}\left\{1 + O\!\left(\frac{\psi}{\sqrt{n}}\right)\right\}\\
&\propto \exp\!\left\{-\frac{\psi^2}{2[-\ell''_n(\hat\theta)]^{-1}}\right\}\left\{1 + O\!\left(\frac{\psi}{\sqrt{n}}\right)\right\}.
\end{aligned}
\]


Hence

\[
\pi_\psi(\psi|x) \to k\exp\!\left\{-\frac{\psi^2}{2[-\ell''_n(\hat\theta)]^{-1}}\right\} \quad\text{as } n\to\infty.
\]

Thus, the limiting form of the posterior density for ψ is that of a \(N\!\left(0,\,[-\ell''_n(\hat\theta)]^{-1}\right)\) distribution. Hence

\[
\sqrt{n}(\theta - \hat\theta)\,\big|\,x \xrightarrow{D} N\!\left(0,\,[-\ell''_n(\hat\theta)]^{-1}\right) \quad\text{as } n\to\infty,
\]

or, equivalently, since \(n\,\ell_n(\theta) = \log f(x|\theta)\),

\[
\sqrt{J(\hat\theta)}\,(\theta - \hat\theta)\,\big|\,x \xrightarrow{D} N(0,1) \quad\text{as } n\to\infty,
\]

as required.

Comments

1. This asymptotic result can give us a useful approximation to the posterior distribution for θ when n is large:

\[
\theta|x \sim N\!\left(\hat\theta,\,J(\hat\theta)^{-1}\right) \quad\text{approximately.}
\]

2. The observed information is similar to Fisher’s information (3.8). In fact, Fisher’s information is the expected value of the observed information, where the expectation is taken over the distribution of X|θ, that is, I(θ) = E_{X|θ}[J(θ)].

3. This limiting result is similar to one for the maximum likelihood estimator in Frequentist Statistics:

\[
\sqrt{I(\theta)}\,(\hat\theta - \theta) \xrightarrow{D} N(0,1) \quad\text{as } n\to\infty. \qquad (3.10)
\]

Note that (3.10) is a statement about the distribution of θ̂ for fixed (unknown) θ, whereas the Theorem is a statement about the distribution of θ for fixed (known) θ̂.


Example 3.11

Suppose we have a random sample from a distribution with probability density function

\[
f(x|\theta) = \frac{2x\,e^{-x^2/\theta}}{\theta}, \qquad x > 0,\; \theta > 0.
\]

Determine the asymptotic posterior distribution for θ. Note that from Example 3.8 we have

\[
\frac{\partial}{\partial\theta}\log f(x|\theta) = -\frac{n}{\theta} + \frac{1}{\theta^2}\sum_{i=1}^{n}x_i^2,
\qquad
J(\theta) = -\frac{\partial^2}{\partial\theta^2}\log f(x|\theta)
= -\frac{n}{\theta^2} + \frac{2}{\theta^3}\sum_{i=1}^{n}x_i^2
= \frac{n}{\theta^3}\left(-\theta + \frac{2}{n}\sum_{i=1}^{n}x_i^2\right).
\]

Solution ✎...Solution to Example 3.11...
Write \(\overline{x^2} = \frac{1}{n}\sum_{i=1}^{n}x_i^2\); then we have

\[
\frac{\partial}{\partial\theta}\log f(x|\theta) = 0
\;\Longrightarrow\; \hat\theta = \overline{x^2}
\;\Longrightarrow\; J(\hat\theta) = \frac{n}{\left(\overline{x^2}\right)^2}
\;\Longrightarrow\; J(\hat\theta)^{-1} = \frac{\left(\overline{x^2}\right)^2}{n}.
\]

Therefore, for large n, the (approximate) posterior distribution for θ is

\[
\theta|x \sim N\!\left(\overline{x^2},\,\frac{\left(\overline{x^2}\right)^2}{n}\right).
\]
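A quick numerical illustration (the value of θ and the data are simulated, so purely illustrative): draw a sample from this model, compute θ̂ as the mean of the xᵢ², and form the approximate normal posterior.

set.seed(4)
theta <- 2.5; n <- 200
x <- sqrt(rexp(n, rate = 1 / theta))   # sample from f(x|theta), as in the earlier check
theta_hat <- mean(x^2)                 # likelihood mode
J_inv <- theta_hat^2 / n               # approximate posterior variance J(theta_hat)^{-1}
c(mean = theta_hat, var = J_inv)       # theta | x ~ N(theta_hat, J_inv), approximately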


Example 3.12

Suppose we have a random sample from an exponential distribution, that is, Xi|θ ∼ Exp(θ), i = 1, 2, . . . , n (independent). Determine the asymptotic posterior distribution for θ. Note that from Example 3.9 we have

\[
\frac{\partial}{\partial\theta}\log f(x|\theta) = \frac{n}{\theta} - n\bar{x},
\qquad
J(\theta) = -\frac{\partial^2}{\partial\theta^2}\log f(x|\theta) = \frac{n}{\theta^2}.
\]

Solution ✎...Solution to Example 3.12...
We have

\[
\frac{\partial}{\partial\theta}\log f(x|\theta) = 0
\;\Longrightarrow\; \hat\theta = \frac{1}{\bar{x}}
\;\Longrightarrow\; J(\hat\theta) = n\bar{x}^2
\;\Longrightarrow\; J(\hat\theta)^{-1} = \frac{1}{n\bar{x}^2}.
\]

Therefore, for large n, the (approximate) posterior distribution for θ is

\[
\theta|x \sim N\!\left(\frac{1}{\bar{x}},\,\frac{1}{n\bar{x}^2}\right).
\]

Recall that, assuming a vague prior distribution, the posterior distribution is a Ga(n, nx̄) distribution, with mean 1/x̄ and variance 1/(nx̄²). The Central Limit Theorem tells us that, for large n, the gamma distribution tends to a normal distribution, matched, of course, for mean and variance. Therefore, we have shown that, for large n, the asymptotic posterior distribution is the same as the posterior distribution under vague prior knowledge. Not a surprising result!
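We can see this agreement directly in R with simulated data (all values below are illustrative): under a vague prior the posterior is Ga(n, nx̄), and the asymptotic approximation is the normal distribution just derived.

set.seed(5)
theta <- 0.4; n <- 100
x <- rexp(n, rate = theta)       # simulated exponential data (illustrative)
xbar <- mean(x)

grid <- seq(0.2, 0.7, length.out = 400)
plot(grid, dgamma(grid, shape = n, rate = n * xbar), type = "l",
     xlab = expression(theta), ylab = "density")                    # Ga(n, n*xbar) posterior
lines(grid, dnorm(grid, mean = 1 / xbar, sd = 1 / (sqrt(n) * xbar)),
      lty = 2)                                                      # asymptotic normal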


Example 3.13

Suppose we have a random sample from a N(µ, 1/τ) distribution (with τ known). Determine the asymptotic posterior distribution for µ. Note that from Example 3.10 we have

\[
\frac{\partial}{\partial\mu}\log f(x|\mu) = n\tau(\bar{x}-\mu),
\qquad
J(\mu) = -\frac{\partial^2}{\partial\mu^2}\log f(x|\mu) = n\tau.
\]

Solution ✎...Solution to Example 3.13...
We have

\[
\frac{\partial}{\partial\mu}\log f(x|\mu) = 0
\;\Longrightarrow\; \hat\mu = \bar{x}
\;\Longrightarrow\; J(\hat\mu) = n\tau
\;\Longrightarrow\; J(\hat\mu)^{-1} = \frac{1}{n\tau}.
\]

Therefore, for large n, the (approximate) posterior distribution for µ is

\[
\mu|x \sim N\!\left(\bar{x},\,\frac{1}{n\tau}\right).
\]

Again, we have shown that the asymptotic posterior distribution is the same as the posterior distribution under vague prior knowledge.
