bayesian methods: a social and behavioral sciences...

Hong Min Park & Jeff Gill

Bayesian Methods: ASocial and BehavioralSciences Approach,ANSWER KEY

SECOND EDITION

September 2010

CRC PRESS

Boca Raton Ann Arbor London Tokyo

Contents

vii

List of Tables

ix

List of Figures

xi

1

Background and Introduction

Exercises

1. Restate the three general steps of Bayesian inference from page 5 in your

own words.

I Answer

• Use your prior knowledge of the system or set of parameters you

are interested in and incorporate this into the modeling process.

This should be done by using probability distributions and may be

very specific or fairly vague or uncertain.

• Use a parametric model (i.e. assume its form) to describe what you

think your data looks like. Then combine what your data says and

what you thought the system/parameters behaves like to uncover

more about the system/parameters you are investigating.

• Investigate how good your model is and to what extent the results

were dependent on your initial ideas you might have had about the

system or set of parameters you are interested in.

3. Rewrite Bayes’ Law when the two events are independent. How do you

interpret this?

I Answer

If A and B are independent then the joint probability of A and B is

p(A,B) = p(A)p(B). Again according to the conditional probability,

1

2Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY

p(A|B) =p(A,B)

p(B)= p(A)

and

p(B|A) =p(A,B)

p(A)= p(B)

Hence Bayes’ Law is trivial here if A and B are independent because

p(A|B) =p(B|A)p(A)

p(B)

= p(B)p(A)

p(B)

= p(A).

Incorporating B into the model will not help or obtain any further in-

formation about A.

5. Suppose f(θ|X) is the posterior distribution of θ given the data X.

Describe the shape of this distribution when the mode, argmaxθ

f(θ|X),

is equal to the mean,∫θθf(θ|X)dθ.

I Answer

The distribution f(θ|X) is symmetric.

7. Using R run the Gibbs sampling function given on page 32. What effect

do you seen in varying the B parameter? What is the effect of producing

200 sampled values instead of 5000?

I Answer

From Figure 1.1 below, we know that, when we change the B parameter,

the horizontal axis of the graphs (both in x and y graphs) changes and

the distributions become more skewed and flat to the right in B = 10

than in B = 5 and in B = 1. When we change the m value (the number

of sampling) from 200 to 5000, the density graphs become smoother.

< R.Code >

Background and Introduction 3

< Marginal Distribution of x >

B=1, m=5000

Marginal Distribution of x

Den

sity

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.4

0.8

1.2

B=5, m=5000


Den

sity

0 1 2 3 4 5

0.0

0.2

0.4

0.6

0.8

B=10, m=5000


Den

sity

0 2 4 6 8 10

0.0

0.2

0.4

B=1, m=200


Den

sity

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.4

0.8

1.2

B=5, m=200


Den

sity

0 1 2 3 4 5

0.0

0.2

0.4

0.6

0.8

B=10, m=200


Den

sity

0 2 4 6 8 10

0.0

0.2

0.4

< Marginal Distribution of y >

B=1, m=5000

Marginal Distribution of y

Den

sity

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.4

0.8

1.2

B=5, m=5000


Den

sity

0 1 2 3 4 5

0.0

0.2

0.4

0.6

0.8

B=10, m=5000


Den

sity

0 2 4 6 8 10

0.0

0.2

0.4

B=1, m=200


Den

sity

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.5

1.0

1.5

B=5, m=200


Den

sity

0 1 2 3 4 5

0.0

0.2

0.4

0.6

0.8

B=10, m=200


Den

sity

0 2 4 6 8 10

0.0

0.2

0.4

FIGURE 1.1: Plots of marginal distributions for different values of B and m.


# Define the Gibbs sampling algorithm

gibbs <- function(B, m) B <- B; k <- 15; m <- m; x <- NULL; y <- NULL

while (length(x) < m) x.val <- c(runif(1,0,B),rep((B+1),length=k))

y.val <- c(runif(1,0,B),rep((B+1),length=k))

for (j in 2:(k+1)) while(x.val[j] > B) x.val[j] <- rexp(1,y.val[j-1])

while(y.val[j] > B) y.val[j] <- rexp(1,x.val[j])

x <- c(x,x.val[(k+1)])

y <- c(y,y.val[(k+1)])

return(list("x"=x, "y"=y))

# Try different values

gibbs1 <- gibbs(B=1, m=5000)






# Draw graphs

par(mfrow=c(2,3), mgp=c(1.75,.25,0), mar=c(3,3,3,0), tcl=-0.2)

hist(gibbs1[["x"]], freq=FALSE, border="white",

col="gray", ylab="Density", main="B=1, m=5000",

xlab="Marginal Distribution of x")

lines(density(gibbs1[["x"]]))






















hist(gibbs1[["y"]], freq=FALSE, border="white",

col="gray", ylab="Density", main="B=1, m=5000"

xlab="Marginal Distribution of y")

lines(density(gibbs1[["y"]]))





















9. Rerun the Metropolis algorithm on section 1.8.2 in R but replacing the

uniform generation of candidate values in cand.gen with a normal trun-

cated to fit in the appropriate range. What differences do you observe?


Mar. dist. of x when using unif. for cand.gen

x

Density

0 2 4 6 8

0.0

0.1

0.2

0.3

0.4

Mar. dist. of y when using unif. for cand.gen

y

Density

0 2 4 6 8

0.00

0.05

0.10

0.15

Mar. dist. of x when using tnorm. for cand.gen

x

Density

0 2 4 6 8

0.0

0.1

0.2

0.3

0.4

Mar. dist. of y when using tnorm. for cand.gen

y

Density

0 2 4 6 8

0.00

0.05

0.10

0.15

0.20

FIGURE 1.2: Plots of the marginal distributions of x and y for different

candidate generating distributions.

I Answer

From Figure 1.2 below, we know that a normal truncated distribution

does not make a big difference.

< R.Code >

# Setup

require(msm) # for the truncated normal distribution

m <- 5000; x <- 0.5; y <- 0.5 # initial values

L1 <- 0.5; L2 <- 0.1; L <- 0.01; B <- 8 # initial values

# Define the bivariate exponential distribution

biv.exp <- function (x, y, L1, L2, L) exp ( -(L1+L)*x - (L2+L)*y - L*max(c(x,y)))


# Define the M-H algorithm

mh <- function (A) if (A == 0) cand.gen <- function (max.x, max.y) c (runif (1, 0, max.x), runif (1, 0, max.y))

else cand.gen <- function (max.x, max.y) c (rtnorm (1, 0, 4, lower=0, upper=max.x),

rtnorm (1, 0, 10, lower=0,upper=max.y))

for (i in 1:m) cand.val <- cand.gen (B, B)

a <- biv.exp (cand.val[1], cand.val[2], L1, L2, L) /

biv.exp (x[i], y[i], L1, L2, L)

if (a > runif(1)) x <- c (x, cand.val[1])

y <- c (y, cand.val[2])

else x <- c (x, x[i])

y <- c (y, y[i])

return (list ("x" = x, "y" = y))

# Case with uniform cand.gen distribution is mh1

# Case with truncated normal cand.gen distribution is mh2

mh1 <- mh (0)

mh2 <- mh (1)

par(mfrow = c (2, 2))

hist (mh1[["x"]], freq=FALSE, xlim=c(0,8), xlab="x",

main="Mar. dist. of x when using unif. for cand.gen")

hist (mh1[["y"]], freq=FALSE, xlim=c(0,8), xlab="y",

main="Mar. dist. of y when using unif. for cand.gen")

hist (mh2[["x"]], freq=FALSE, xlim=c(0,8), xlab="x",


main="Mar. dist. of x when using tnorm. for cand.gen")

hist (mh2[["y"]], freq=FALSE, xlim=c(0,8), xlab="y",

main="Mar. dist. of y when using tnorm. for cand.gen")

11. If p(D|θ) = 0.5, and p(D) = 1, calculate the value of p(θ|D) for priors

p(θ), [0.001, 0.01, 0.1, 0.9].

I Answer

According to Bayes’ Law:

p(θ|D) =p(θ)p(D|θ)p(D)

Therefore, when priors p(θ) equals [0.001, 0.01, 0.1, 0.9], the value of

p(θ|D) are [0.0005, 0.005, 0.05, 0.45], respectively.

13. Sometimes Bayesian results are given as posterior odds ratios, which for

two possible alternative hypotheses is expressed as:

odds(θ1, θ2) =p(θ1|D)

p(θ2|D).

If the prior probabilities for θ1 and θ2 are identical, how can this be

re-expressed using Bayes’ Law?

I Answer

odds(θ1, θ2) =

p(θ1)p(D|θ1)p(D)

p(θ2)p(D|θ2)p(D)

=p(D|θ1)

p(D|θ2)

15. Suppose we had data, D, distributed p(D|θ) = θe−θD as in Section 1.3,

but now p(θ) = 1/θ, for θ ∈ (0:∞). Calculate the posterior mean.

I Answer

∫ ∞0

1

θ× θe−θDdθ =

∫ ∞0

e−θDdθ =

[−e−θD

D

]∞0

=1

D


17. Since the posterior distribution is a compromise between prior informa-

tion and the information provided by the new data, then it is interesting

to compare relative strengths. Perform an experiment where you flip a

coin 10 times, recording the data as zeros and ones. Produce the pos-

terior expected value (mean) for two priors on p (the probability of a

heads): a uniform distribution between zero and one, and a beta distri-

bution (Appendix B) with parameters [10, 1]. Which prior appears to

influence the posterior mean more than the other?

I Answer

Suppose that the data you got from an experiment is [1, 1, 0, 1, 0, 0,

1, 1, 0, 1]. Consider the beta-binomial model (refer to section 2.3.3 for

the complete discussion). Then, we have:

Xi ∼ BR(p) → y =

10∑i=1

Xi ∼ BN (10, 6)

Now, we have two options for the prior specification of p, which produces

the Table 1.1. Therefore, the beta distribution (BE(10, 1)) influences the

posterior mean more than the uniform distribution (U[0, 1]) does.

Table 1.1: Posteriors from Different Priors

prior of p prior mean posterior of p posterior mean

U[0, 1] = BE(1, 1) 0.5 BE(7, 5) 7/12 = 0.58

BE(10, 1) 10/11 = 0.91 BE(16, 5) 16/21 = 0.76

19. The Stopping Rule Principle states that if a sequence of experiments

(trials) is governed by a stopping rule, η, that dictates cessation, then

inference about the unknown parameters of interest must depend on

η only through the collected sample. Obvious stopping rules include


setting the number of trials in advance (i.e. reaching n is the stopping

rule), and stopping once a certain number of successes are achieved.

Consider the following experiment and stopping rule. Standard normal

distributed data are collected until the absolute value of the mean of the

data exceeds 1.96/√n. Explain why this fails as a non-Bayesian stopping

rule for testing the hypothesis that µ = 0 (the underlying population is

zero), but is perfectly acceptable for Bayesian inference.

I Answer

In a non-Bayesain setting, we are not allowed to draw parameter values

from its distribution for the purpose of hypothesis testing. We are to

replicate the sampling distribution several times and to check how often

those replications produce an interval (usually an interval, although not

neccesarry to be an interval with a particular fixed width) that contains

the true value of the parameter.

The stopping rule here involves an assumption in the distributional na-

ture of the parameter value, which is not allowed in a non-Bayesian

setting as explained. However, the Bayesian setup is to assume that

the parameter of our interest has a probability distribution, from which

we can draw values. Therefore, the stopping rule works under Bayesian

inference.

2

Specifying Bayesian Models

Exercises

1. Suppose that 25 out of 30 firms develop new marketing plans during

the next year. Using the beta-binomial model from Section 2.3.3, apply

a BE(0.5, 0.5) (Jeffreys prior) and then specify normal prior centered

at zero and truncated to fit on [0:1] as prior distributions and plot the

respective posterior densities. What differences do you observe?

I Answer

We know that there are n = 30 firms, y = 25 develop new marketing

plans. Suppose θ is the unknown parameter for the probability of an

firm to develop a new market plan.

Applying a Jefferys prior BE(A = 0.5, B = 0.5), we get:

p(θ|y) ∝ p(θ)p(y|θ)

∝ BE(θ|0.5, 0.5)BN (25|30, θ)

∝ BE(θ|25 + 0.5, 30− 25 + 0.5)

Applying a normal prior N (0, 1), we get:

p(θ|y) ∝ p(θ)p(y|θ)

= N (θ|0, 1)BN (25|30, θ)

=1√2π

exp

(−θ2

2

)(30

25

)θ25(1− θ)5

∝ exp

(−θ2

2

)θ25(1− θ)5

11


0.0 0.2 0.4 0.6 0.8 1.0

Posterior Density of p((θθ|y)) w/ Different Priors

w/ Jeffreys Prior Beta(0.5,0.5)w/ Normal Prior N(0,1)

FIGURE 2.1: The Posterior Density of p(θ|y) with Different Priors

The posterior density of p(θ|y) using Jeffreys prior BE(0.5, 0.5) and using

the normal prior N (0,1) differ in their posterior mean. Applying the

normal prior drags the data mean slightly leftward than applying the

Jeffreys prior.

< R.Code >

# Setup

n <- 30; y <- 25; A <- 0.5; B <- 0.5

par(mar=c(1,1,3,1),mgp=c(1.5,0.25,0),

tcl=-0.2, cex.axis=0.8, cex.main=1)

# Jeffrey’s prior Beta(0.5,0.5)

curve(dbeta(x, y+A, n-y+B),

from=0, to=1, n=1000, # n=1000 to smooth the curv

ylim=c(0,6.2),

xaxs="i", yaxs="i", yaxt="n", bty="n",

ylab="", xlab="",

main=expression("Posterior Density of "

*p(theta*"|"*y)*" w/ Different Priors"))

# Normal prior N(0,1)

# Intentionally used the log posterior first, and then make it

# normalized just for the computational purpose

unnormalized.post <- function(x) exp(-x^2/2*1 + 25*log(x) + 5*log((1-x)))

Specifying Bayesian Models 13

k <- integrate(unnormalized.post, 0, 1)$value

posterior <- function(x) unnormalized.post(x)/k

# insert k back in to normalized posterior

curve(posterior(x), from=0, to=1, n=1000, lty=2, add=T)

legend(locator(1), lty=c(1,2), lwd=c(1,1),

legend=c("w/ Jeffreys Prior Beta(0.5,0.5)",

"w/ Normal Prior N(0,1)"), cex=0.8, box.lty=0)

3. Prove that the gamma distribution,

f(µ|α, β) =1

Γ(α)βαµα−1e−βµ, µ, α, β > 0,

is the conjugate prior distribution for µ in a Poisson likelihood function,

f(y|µ) =

(n∏i=1

yi!

)−1

exp

[log(µ)

n∑i=1

yi

]exp[−nµ],

that is, calculate a form for the posterior distribution of µ and show

that it is also gamma distributed.

I Answer

The posterior distribution of µ is:

p(µ|y) ∝ p(µ|α, β)p(y|µ)

=1

Γ(α)βα︸︷︷︸

constant

µα−1 exp(−βµ)

×

(n∏i=1

yi!

)−1

︸︷︷︸constant

exp

[log(µ)

n∑i=1

yi

]exp[−nµ]

∝ µα−1 exp(−βµ) exp

[log(µ)

n∑i=1

yi

]exp[−nµ]

= µα−1µny exp(−(β + n)µ)

= µ(α+ny)−1e−(β+n)µ

So, µ|y ∝ G(α+ ny, β + n). (Q.E.D.)


5. Use the gamma-Poisson conjugate specification developed in the last

question to analyze the following count data on worker strikes in Ar-

gentina over the period 1984 to 1993, from McGuire (1996). Assign

your best guess as to reasonable values for the two parameters of the

gamma distribution: α and β. Produce the posterior distribution for µ

and describe it with quantiles and graphs using empirically simulated

values according to the following procedure:

Economic Sector Number of Strikes

Public Administrators 496

Teachers 421

Metalworkers 199

Municipal Workers 186

Private Hospital Workers 181

Bank Employees 133

Court Clerks 128

Bus Drivers 113

Construction Workers 92

Doctors 83

Nationalized Industries 77

Railway Workers 76

Maritime Workers 57

Meat Packers 56

Paper Industry Workers 55

Sugar Industry Workers 50

Public Services 47

University Staff Employees 43

Telephone Workers 39

Textile Workers 37

State Petroleum Workers 32

Food Industry Workers 28

Post Office Workers 26

Locomotive Drivers 25

Light and Power Workers 21

Total 2701

• The posterior distribution for µ is gamma(δ1, δ2) according to some

parameters δ1 and δ2 that you derived above which of course de-

pends on your choice of the gamma parameters.

• Generate a large number of values from this distribution in R, say

10,000 or so, using the command:

posterior.sample <- rgamma(10000,d1,d2)


• Produce posterior quantiles, such as the interquartile range, ac-

cording to:

iqr.posterior <- c(sort(posterior.sample)[2500],

sort(posterior.sample)[7500])

Note: the IQR function in R gives a single value for the difference,

which is not as useful.

• Graph the posterior in different ways, such as with a smoother

like lowess (a local-neighborhood smoother, see Cleveland [1979,

1981]):

post.hist <- hist(posterior.sample,plot=F,breaks=100)

plot(lowess(post.hist$mids,post.hist$intensities),

type="l")

I Answer

Since the Gamma distribution has α/β as its mean and α/β2 as its

variance, I subjectively assign a gamma prior G(100, 1). And we know

there are n = 25 sectors. The number of the total strikes is nx =

2701. Thus the posterior distribution for µ is G(100 + 2701, 1 + 25) =

G(2801, 26). And, the posterior quantile is [106.38, 109.06]:

84 86 88 90 92 94 96

0.00

0.05

0.10

0.15

Posterior Distribution of µµ

µµ

Den

sity

FIGURE 2.2: Posterior distribution of µ.


< R.Code >

A <- 100; B <- 1; nx <- 2701; n <- 25

d1 <- A + nx; d2 <- B + n

post.sample <- rgamma(10000,d1,d2)

iqr.post <- c(sort(post.sample)[2500],

sort(post.sample)[7500])

round(iqr.post,2)

post.hist <- hist(post.sample, plot=F, breaks=100)

plot(lowess(post.hist$mids, post.hist$intensities),

type="l", xlab=expression(mu), ylab="Density",

main=expression("Posterior Distribution of"*~~mu))

7. In his original essay (1763, 376) Bayes offers the following question:

Given the number of times in which an unknown event has

happened and failed: Required the chance that the probability

of its happening in a single trial lies somewhere between any

two degrees of probability that can be named.

Provide an analytical expression for this quantity using an appropriate

uniform prior (Bayes argued reluctantly for the use of the uniform as a

“no information” prior: Bayes postulate).

I Answer

θ is defined for the unknown event of our interest, and D is defined

for the set of observations in which the event has happened and failed.

Then, the question can be rephrased as: ”Given that we know p(D|θ),what is p(θ|D)?”

Bayesian way of answering this question involves the specification of

prior guess on θ: p(θ). Since the probability of an event happening or not

is bounded between zero and one, and since we have no prior knowledge

on p(θ), the general solution is the uniform distribution: p(θ) = U[0, 1].

Therefore, we can describe some details on p(θ|D) from:

p(θ|D) ∝ p(θ)p(D|θ).


9. Suppose we have two urns containing marbles; the first contains 6 red

marbles and 4 green marbles, and the second contains 9 red marbles and

1 green marble. Now we take one marble from the first urn (without

looking at it) and put it in the second urn. Subsequently, we take one

marble from the second urn (again without looking at it) and put it in

the first urn. Give the full probabilistic statement of the probability of

now drawing a red marble from the first urn, and calculate its value.

I Answer

Define the events as follows:

R1: draw a red marble from the first urn

G1: draw a green marble from the first urn

R2: draw a red marble from the second urn

G2: draw a green marble from the second urn

Rf : draw a red marble at the final step

Then, we have:

Pr(Rf) = Pr(R1)× [Pr(R2|R1)× Pr(Rf |R1, R2)

+ Pr(G2|R1)× Pr(Rf |R1, G2)]

+ Pr(G1)× [Pr(R2|G1)× Pr(Rf |G1, R2)

+ Pr(G2|R1)× Pr(Rf |G1, G2)]

=6

10×[

10

11× 6

10+

1

11× 5

10

]+

4

10×[

9

11× 7

10+

2

11× 4

10

]=

674

1100= 0.6127

11. This is the famous envelope problem. You and another contestant are

each given one sealed envelope containing some quantity of money with

equal probability of receiving either envelope. All you know at the mo-

ment is that one envelope contains twice the cash as the other. So if you

open your envelope and observe $10, then the other envelope contains


either $5 or $20 with equal probability. You are now given the opportu-

nity to trade with the other contestant. Should you? The expected value

of the unseen envelope is E[other] = 0.5(5) + 0.5(20) = 12.50, meaning

that you have a higher expected value by trading. Interestingly, so does

the other player for analogous reasons. Now suppose you are offered the

opportunity to trade again before you open the newly traded envelope.

Should you? What is the expected value of doing so? Explain how this

game leads to infinite cycling. There is a Bayesian solution. Define M

as the known maximum value in either envelope, stipulate a probability

distribution, and identify a suitable prior.

I Answer

Even after deciding to switch envelope, the new expected value calcu-

lation for switching again is exactly same as before: If I had a value of

V, the expected gain from switching would be always 14V, regardless of

how many times I’ve already switched the envelope. This leads to the

infinite cycling. However, this is NOT true if I consider the conditional

probabilities.

Some notations:

V: the value that I saw before switching envelope

M: the known maximum value in either envelope

Pr(V): the probability that I saw V

Pr(M): the probability that I got M

Then, the question becomes what is Pr(M|V)? Several things to note:

Pr(V) is equal to one, by definition. Whichever envelope I choose, I am

guaranteed to see the value, which implies that my having M and seeing

V are independent with each other. Since I am not informed about M,

I have an uninformed prior on M: Pr(M) = Pr(∼M) = 12 .

Applying the Bayes rule, we have:

Pr(M|V) =Pr(V|M)Pr(M)

Pr(V)=

Pr(V|M)Pr(M)

1


=

Pr(V, M)Pr(M) × Pr(M)

1= Pr(V, M)

= Pr(V)Pr(M) = 1× Pr(M) =1

2

Pr(∼M|V) = 1− Pr(M|V) =1

2

Expected values of switching and not switching are:

E(switching) = Pr(M|V)× M

2+ Pr(∼M|V)×M

=1

2× M

2+

1

2×M =

3

4M

E(not switching) = Pr(M|V)×M + Pr(∼M|V)× M

2

=1

2×M +

1

2× M

2=

3

4M

Therefore, switching the envelope does NOT change my fortune. No

more paradox. No more infinite cycling.

13. If the posterior distribution of θ is N (1, 3), then calculate a 99% HPD

region for θ.

I Answer

[-2.98, 5.01]

< R.Code >

post <- rnorm(10000, mean=1, sd=sqrt(3))

hpd.post <- c(sort(post)[100], sort(post)[9900])

round(hpd.post, 2)

15. Assume that the data [1,1,1,1,1,1,0,0,0,1,1,1,0,1,0,0,1,1,1,1] are produced

from iid Bernoulli trials. Produce a 1−α confidence set for the unknown

value of p using a uniform prior distribution.

I Answer

Since n = 20 and y = 14, we have:


Xi ∼ BR(p) → y =

20∑i=1

Xi ∼ BN (20, 14)

Applying an uniform prior, U[0, 1] = BE(1, 1), we get the posterior as:

p|y ∼ BE(14 + 1, 20− 14 + 1) = BE(15, 7)

Based on the α values, we then have different HPDs:

α HPD

1 % (0.43, 0.88)

5 % (0.52, 0.83)

10 % (0.55, 0.80)

Table 2.1: HPD w/ Different Values

< R.Code >

post.beta <- rbeta(10000, 15, 7)

hpd1 <- c(sort(post.beta)[100], sort(post.beta)[9900])



round(hpd1, 2); round(hpd5, 2); round(hpd10, 2)

17. The beta distribution, f(x|α, β) = Γ(α+β)Γ(α)Γ(β)x

α−1(1 − x)β−1, 0 < x <

1, α > 0, β > 0, is often used to model the probability parameter in a

binomial setup. If you were very unsure about the prior distribution of

p, what values would you assign to α and β to make it relatively “flat”?

I Answer

BE(1, 1), which is idential to U[0, 1].

19. Laplace (1774, 28) derives Bayes’ Law for uniform priors. His claim is


. . . je me propose de determiner la probabilite des causes par

les evenements matiere neuve a bien de egards et qui merit

d’autant plus d’etre cultivee que c’est principalement sous ce

point de vue que la science des hasards peut etre utile la vie

civile.

He starts with two events: E1 and E2 and n causes: A1, A2, . . . , An. The

assumptions are: (1) Ei are conditionally independent given Ai, and

(2) Ai are equally probable. Derive the Laplace’s inverse probability

relation:

p(Ai|E) =p(E|Ai)∑j p(E|Aj)

.

I Answer

Since Ai are equally probable, ∀i, Pr(Ai) = a (constant). Therefore, we

have:

(LHS) = p(Ai|E) =p(E|Ai)p(Ai)

p(E)

=p(E|Ai)p(Ai)∑nj=1 p(E|Aj)p(Aj)

=a× p(E|Ai)

a×∑nj=1 p(E|Aj)

=p(E|Ai)∑nj=1 p(E|Aj)

= (RHS) (Q.E.D.)

3

The Normal and Student’s-t Models

Exercises

1. The most important case of a two-parameter exponential family is when

the second parameter is a scale parameter. Designate ψ as such a scale

parameter, then the exponential family form expression of a PDF or

PMF is rewritten:

f(y|θ) = exp

[yθ − b(θ)a(ψ)

+ c(y, ψ)

].

Rewrite the normal PDF in exponential family form.

I Answer

f (y|θ) =(2πσ2

)− 12 exp

[− 1

2σ2(y − µ)

2

]= exp

[y2 − 2yµ+ µ2

2σ2+

(−1

2

)ln(2πσ2

)]

= exp

yµ− 12µ

2

σ2− y2

2σ2− 1

2ln(2πσ2

)︸︷︷︸

c(y,ψ)

= exp

[yθ − b (θ)

a (ψ)− c (y, ψ)

],

where θ = µ, b(θ) = 12µ

2 = 12θ

2, and a(ψ) = σ2.

3. Suppose the random variable X is distributed N (µ, σ2). Prove that the

random variable Y = (X − µ)/σ is N (0, 1): standard normal.

I Answer

23


Since X is distributed N (µ, σ2), we have:

p(X|µ, σ2) =1√2πσ

exp

(− 1

2σ2(X − µ)2

)

And, we know Y = (X − µ)/σ ⇒ X = µ+ σY , which gives:

|J| =∣∣∣∣∂X∂Y

∣∣∣∣ = σ

Therefore, we get:

p(Y |µ, σ2, X) = |J| × p(X|µ, σ2)

= σ × 1√2πσ

exp

(− 1

2σ2(µ+ σY − µ)2

)=

1√2π

exp

(−1

2Y 2

)= N (Y |0, 1) (Q.E.D.)

5. Missing Data. Suppose we have an iid sample of collected data: X1, X2, . . . , Xk, Yk+1, . . . , Yn ∼N (µ, 1), where the Yi values represent data that has gone missing. Spec-

ify a joint posterior for µ and the missing data with a N (0, 1/n) prior.

I Answer

p(µ,y|x) ∝ p(x|µ,y)p(y|µ)p(µ)

= N (x|µ, 1)N (y|µ, 1)N(µ

∣∣∣∣0, 1

n

)=

[k∏i=1

(2π)−12 exp

(−1

2(xi − µ)2

)]

×

[n∏

i=k+1

(2π)−12 exp

(−1

2(yi − µ)2

)]

×

[(2π

n

)− 12

exp

(−1

2nµ2

)]

The Normal and Student’s-t Models 25

∝ exp

(−1

2

k∑i=1

(xi − µ)2

)exp

(−1

2

n∑i=k+1

(yi − µ)2

)exp

(−1

2nµ2

)

= exp

[−1

2

(k∑i=1

x2i − 2µ

k∑i=1

xi + kµ2 +

n∑k+1

(yi − µ)2 + nµ2

)]

= exp

−1

2

k∑i=1

x2i − 2µ(kx) + kµ2 + (n− k)×Var(y)︸︷︷︸

=1

+nµ2

= exp

[−1

2

((k + n)µ2 − 2kxµ+

k∑i=1

x2i + n− k

)]

Although not neccessary here, we can also consider a marginal posterior

distribution of µ as,

p(µ|x) ∝ exp

[−1

2

((k + n)µ2 − 2kxµ

)]∝ exp

[−1

2

(11

k+n

)(µ− kx

k + n

)2]

∼ N(

kx

k + n,

1

k + n

)

7. Suppose X ∼ N (µ, µ), µ > 0. Give an expression for the conjugate

prior for µ.

I Answer

In an usual Bayesian normal model, the conjugate prior for the mean

parameter is the normal distribution and the conjugate prior for the

variance parameter is the inverse gamma distribution. So, I try both

priors in this case ad find that the inverse gamma, not normal, is the

conjugate prior for µ.

Suppose µ ∼ IG(α, β), where µ, α, β > 0. Then, the posterior becomes:

p(µ|X) ∝ p(µ)p(X|µ)

=βα

Γ(α)µ−α−1 exp

[−βµ

](1√2πµ

)nexp

[− 1

2µ

n∑i=1

(Xi − µ)2

]


∝ µ−α−n2−1 exp

[−βµ− 1

2µ× nVar(X)

]= µ−(α+n

2 )−1 exp

[− 1

µ

(β +

n

2Var(X)

)]= µ−(α+n

2 )−1 exp

[− 1

µ

(β +

n

2

(1

n

n∑i=1

X2i − X2

))]

= µ−(α+n2 )−1 exp

[− 1

µ

(β +

1

2

(n∑i=1

X2i − nX2

))]

∼ IG

(α+

n

2, β +

1

2

(n∑i=1

X2i − nX2

)),

which means that my choice of inverse gamma distribution as a prior

works for the conjugate prior for µ.

9. Returning to the example where the normal mean is known and the

posterior distribution for the variance parameter is developed. Plot a

figure that illustrates how the likelihood increases in relative importance

to the prior by performing the following steps:

(a) Plot the IG(5, 5) density over its support.

(b) Specify a posterior distribution for σ2 using the IG(5, 5) prior.

(c) Generate four random vectors of size: 10, 100, 200, and 500 dis-

tributed standard normal. Using these create four different pos-

terior distributions of σ2, and add the new density curves to the

existing plot.

(d) Label all distributions, axes, and legends.

Hint: modify the R code in the Computational Addendum to pro-

vide a posterior for the variance parameter instead of the mean param-

eter.

I Answer

< R.Code >

# specify the prior and the posteriors

prior <- rgamma(10000, shape=5, rate=5)


ig <- function (x) n <- length (x)

xhat <- sum (t(x)%*%x)

a1 <- 5 + n/2

b1 <- 5 + xhat/2

y <- rgamma (10000, shape=a1, rate=b1)

return(list("y"=y, "a1"=a1, "b1"=b1))

x1 <- rnorm(10); x2 <- rnorm(100)

x3 <- rnorm(200); x4 <- rnorm(500)

post1 <- ig (x1); post2 <- ig (x2)

post3 <- ig (x3); post4 <- ig (x4)

# plot prior first, and then posteriors

hist.prior <- hist(prior, xlim=c(0,2), plot=TRUE,

freq=FALSE, breaks=100)

FIGURE 3.1: Relative Importance of Likelihood to Prior

0.0 0.5 1.0 1.5 2.0

01

23

45

67

value

Pro

babi

lity

dens

ity

priorn=10

n=100

n=200

n=500


hist.post1 <- hist(post1[["y"]], xlim=c(0,2), plot=TRUE,








plot(c(0, 2), c(0, 7), type="n", main="",

ylab="Probability density", xlab="value", )

lines(hist.prior$density ~ hist.prior$mids, col=1)

lines(hist.post1$density ~ hist.post1$mids, col=2)




text (0.6, 0.5, "prior", col=1)

text (0.5, 1.5, "n=10", col=2)

text (1.25, 2, "n=100", col=3)

text (1.25, 3.5, "n=200", col=4)

text (1.15, 5.5, "n=500", col=5)

11. Rejection Method. Like the normal, the Cauchy distribution is a uni-

modal, symmetric density of the location-scale family: C(X|θ, σ) =1πσ

1

1+( x−θσ )2 , where −∞ < X, θ < ∞, 0 < σ < ∞. Unlike the nor-

mal, the Cauchy distribution has very heavy tails, heavy enough so that

the Cauchy distribution has no moments and is therefore less easy to

work with. Given a C(0, 3) distribution, find the probability of observing

a point between 3 and 7. To do this, simulate 100,000 values of C(0, 3)

(in R this is done by rcauchy(100000,0,3)), and count the number

of points in the desired range. Graph your results.

I Answer

< R.Code >

# count the number of points in (3,7)

cau <- rcauchy (100000, 0, 3)

range <- cau[cau>=3 & cau<=7]


length(range); length(range)/length(cau)

# hist(cau, c(-10000000,seq(-10,10,1),10000000),

xlim=c(-10,10))

# graph it (w/ normals)

ruler <- seq(-10, 10, length=1000)

dcau <- dcauchy(ruler, 0, 3)

dnorm1 <- dnorm(ruler,0,1); dnorm2 <- dnorm(ruler,0,3)

plot(c(-10, 10), c(0, 0.4), type=’’n’’,

xlab="value", ylab="probability density")

lines(dcau ~ ruler, col=2)

lines(dnorm1 ~ ruler, col=1); lines(dnorm2 ~ ruler, col=4)

for (i in 1:80) a <- 3 + i/20

segments(a, 0, a, dcauchy(a, 0, 3), col=2)

text (2.3, 0.3, "N(0,1)", col=1)

text (4.5, 0.08, "N(0,3)", col=4)

text (8.5, 0.025, "C(0,3)", col=2)

FIGURE 3.2: Cauchy distribution with Normal’s

−10 −5 0 5 10

0.0

0.1

0.2

0.3

0.4

value

prob

abili

ty d

ensi

ty

N(0,1)

N(0,3)

C(0,3)


13. The expressions for the mean and variance of the inverse gamma were

supplied in (3.9). Derive these from the inverse gamma PDF. Show all

steps.

I Answer

The inverse gamma distribution is:

IG(x|α, β) =βα

Γ(α)x−α−1 exp

[−βx

], x, α, β > 0

First, the mean can be calculated as:

E[X] =

∫ ∞0

x× βα


[−βx

]dx

=

∫ ∞0

β

α− 1× βα−1

Γ(α− 1)x−(α−1)−1 exp

[−βx

]dx

=β

α− 1×∫ ∞

0

βα−1

Γ(α− 1)x−(α−1)−1 exp

[−βx

]dx︸︷︷︸

=1

=β

α− 1

Next, the variance can also be calculated as:

V ar[X] = E[X2]− (E[X])2

=

∫ ∞0

x2 × βα


[−βx

]dx−

(β

α− 1

)2

=

∫ ∞0

βα

Γ(α)x−α−1+2 exp

[−βx

]dx−

(β

α− 1

)2

=β2

(α− 1)(α− 2)

∫ ∞0

βα−2

Γ(α− 2)x−(α−2)−1 exp

[−βx

]dx︸︷︷︸

=1

− β2

(α− 1)2

=β2

(α− 1)(α− 2)− β2

(α− 1)2=

β2

(α− 1)2(α− 2)


15. Modify the function biv.norm.post given below so that it provides

posterior samples for multivariate normals, given specified priors and

data. Also modify the function normal.posterior.summary so that

the user can specify any level of density coverage.

I Answer

< R.Code >

require(MASS); require(bayesm)

multi.norm.post <- function(data.mat,alpha,beta,m,n0,

n.reps=1000) n <- nrow(data.mat)

xbar <- apply(data.mat, 2, mean)

S.sq <- (n-1)*var(data.mat)

Beta.inv <- solve(beta)

K <- (n0*n)/(n0+n)*(xbar-m)%*%t(xbar-m)

V.inv <- solve(Beta.inv + S.sq + K)

rep.mat <- NULL

for (i in 1:n.reps) Sigma <- rwishart(alpha+n,V.inv)$IW

mu <- mvrnorm(1, (n0*m + n*xbar)/(n0+n),

Sigma/(n0+n))

out <- c(mu1=mu[1],mu2=mu[2],sig1=Sigma[1,1],

sig2=Sigma[2,2],rho=Sigma[2,1])

rep.mat <- rbind(rep.mat, out)

return(rep.mat)

normal.posterior.summary <- function(reps, alpha)

reps[,5] <- reps[,5]/sqrt(reps[,3]*reps[,4])

n <- nrow(reps)

reps <- apply(reps, 2, sort); bound <- (1-alpha)/2

L <- round(bound*n); H <- round((1-bound)*n)

out.mat <- cbind("mean"= apply(reps,2,mean),

"std.err"= apply(reps,2,sd),

"HPD Lower"= reps[L,],

"HPD Upper"= reps[H,])

round(out.mat,3)


17. Specify a normal mixture model for the following data from Brooks,

Dellaportas & Roberts (1997). Summarize the posterior distributions

of the model parameters with tabulated quantiles: X = 2.3, 3.7, 4.1,

10.9, 11.6, 13.8, 20.1, 21.4, 22.3.

I Answer

The model for the data can be expressed:

X ∼ pN (µ1, 1) + (1− p)N (µ2, 1), µ1 < µ2 (3.1)

where the variances are assumed known and standardized for conve-

nience. There are three parameters to estimate here, (p, µ1, µ2), as well

as a vector, Z, that contains indicator functions for each of the data

values: zi = 1 if xi is currently determined to be from the component

with mean µ1, and zero otherwise. Now assign priors:

p ∼ U(0, 1)

µ1 ∼ N (5, 20)

µ2 ∼ N (20, 20)

zi ∼ random BR(p), i = 1:n. (3.2)

Using a joint normal likelihood function produced from (3.1) and priors

in (3.2), the full conditional posteriors distributions are:

p|Z ∼ BE(

1 +∑

zi, n+ 1−∑

zi

)µ1|X,Z ∼ N

(20X′Z + 5∑

zi,

20

20∑zi + 1

)µ2|X,Z ∼ N

(20X′(1− Z) + 20∑

(1− zi),

20

20∑

(1− zi) + 1

)zi|xi, p ∼ BR

(pfN (xi|µ = µ1)

pfN (xi|µ = µ1) + (1− p)fN (xi|µ = µ2)

)where 1 indicates a vector of 1s of length n, and fN () is normal PDF

with mean as indicated and variance equal to one. At each iteration the

zi values are estimated to be either zero or one.


Note that this answer deals with the bi-modal case. However, the tri-

modal case that the question asks to do could be easily extended by

adding one more “distribution” to the model in (3.1).

< R.Code >

sigmasq.nu <- 20; nu1 <- 5; nu2 <- 20

x <- c(2.3,3.7,4.1,10.9,11.6,13.8,20.1,21.4,22.3)

start <- matrix(c(0.3,7,20,rbinom(length(x),1,0.5)),

1,length(x)+3)

tol <- .Machine$double.eps

# Gibbs sampler for two-component mixture problem

gibbs.bdr <- function(theta.matrix,x,reps) for (i in 2:(reps+1))

P <- theta.matrix[(i-1), 1]

u1 <- theta.matrix[(i-1), 2]

u2 <- theta.matrix[(i-1), 3]

z <- theta.matrix[(i-1), 4:ncol(theta.matrix)]

P <- rbeta(1, 1+sum(z), length(x)+1-sum(z))

u1 <- 1000

while (u1 > u2) u1 <- rnorm(1, (sigmasq.nu*(x%*%z) +nu1)/

(sigmasq.nu*sum(z) + 1),

sigmasq.nu/(sigmasq.nu*sum(z) + 1))

u2 <- -1000

while (u1 > u2) u2 <- rnorm(1, (sigmasq.nu*(x%*%(1-z))+nu2)/

(sigmasq.nu*sum(1-z)+1),

sigmasq.nu/(sigmasq.nu*sum(1-z) + 1))

p <- P * exp(-0.5*(x-u1)^2) /

((P * exp(-0.5*(x-u1)^2)

+ (1-P)*exp(-0.5*(x-u2)^2))

z <- rbinom(length(p),1,p)

theta.matrix <- rbind(theta.matrix,c(P,u1,u2,z))

mix.out <- gibbs.bdr(start,x,500)

apply(mix.out,2,mean)


# Now use a random scan version, blocked for the z’s

gibbs.bdr.rs <- function(theta.matrix,x,reps) for (i in 2:(reps+1))

P <- theta.matrix[(i-1),1];

u1 <- theta.matrix[(i-1),2];

u2 <- theta.matrix[(i-1),3];

z <- theta.matrix[(i-1),4:ncol(theta.matrix)]

scan.order <- sample(4)

for (j in 1:length(scan.order)) if (scan.order[j] == 1)

P <- rbeta(1,1+sum(z),length(x)+1-sum(z))

if (scan.order[j] == 2)

u1 <- 1000

while (u1 > u2) u1 <- rnorm(1,

(sigmasq.nu*(x%*%z) + nu1) /

(sigmasq.nu*sum(z) +1),

sigmasq.nu/(sigmasq.nu*sum(z) + 1))


u2 <- -1000

while (u1 > u2) u2 <- rnorm(1,

(sigmasq.nu*(x%*%(1-z)) + nu2) /

(sigmasq.nu*sum(1-z)+1),

sigmasq.nu/(sigmasq.nu*sum(1-z)+1))


p <- P * exp(-0.5*(x-u1)^2) /

(P * exp(-0.5*(x-u1)^2) +

(1-P) * exp(-0.5*(x-u2)^2))

z <- rbinom(length(p),1,p)

if (i %% 100 == 0) print(i)

theta.matrix <- rbind(theta.matrix,c(P,u1,u2,z))

mix.out <- gibbs.bdr.rs(start,x,12000)


19. Use the following data on the race of convicted murders in the United

States in 1999 to specify a bivariate model. Specify reasonable priors and

produce posterior descriptions of the unknown parameters according to

Section 3.5. (Source: FBI, Crime in the United States 1999.)

Age Group White Black

9 to 12 7 11

13 to 16 218 247

17 to 19 672 976

20 to 24 987 1,285

25 to 29 619 660

30 to 34 493 429

35 to 39 444 303

40 to 44 334 228

45 to 49 236 134

50 to 54 153 73

55 to 59 89 60

60 to 64 55 24

65 to 69 47 17

70 to 74 23 10

75 and up 57 14

I Answer

First, assign priors as:

µ|Σ ∼ N 2

(m,

Σ

n0

), Σ−1 ∼ W(α,β)

where µ = (300, 300), β is a 2 × 2 identity matrix, α = 3 degree of

freedom, and n0 = 0.01 so that priors are weighted less.

So according to Equation (3.24), the posteriors are:

µ|Σ ∼ N k+2

(0.01m + nx

0.01 + n,

Σ

n0 + n

)Σ−1 ∼ W

(3 + n,

(β−1 + S2 +

n0n

n0 + n(x−m)(x−m)′

)−1)

Table 3.1 summarizes the posterior of µ and Σ. And, the definition of

R code here is stated in Exercise 3.8.


Table 3.1: Posterior of the bivariate model

mean std.err 95% lower HPD 95% Upper HPD

white 298.880 62.886 189.182 421.735

black 299.679 83.385 149.624 464.976

σ2white 79232.966 30992.859 43452.824 152118.011

σ2black 141160.434 54178.453 76347.078 278235.760

ρ 0.966 0.017 0.933 0.987

< R.Code >

library(bayesm)

library(MASS)

white <- c(7, 218, 672, 987, 619, 493, 444,

334, 236, 153, 89, 55, 47, 23, 57)

black <- c(11, 247, 976, 1285, 660, 429, 303,

228, 134, 73, 60, 24, 17, 10, 14)

X <- cbind(white, black)

multi.norm.post <- function(data.mat, alpha,beta, m,

n0, n.reps=1000) n <- nrow(data.mat)

xbar <- apply(data.mat, 2, mean)

S.sq <- (n-1)*var(data.mat)

Beta.inv <- solve(beta)

K <- (n0*n)/(n0+n)*(xbar-m)%*%t(xbar-m)

V.inv <- solve(Beta.inv + S.sq + K)

rep.mat <- NULL

for (i in 1:n.reps) Sigma <- rwishart(alpha+n,V.inv)$IW

mu <- mvrnorm(1, (n0*m + n*xbar)/(n0+n), Sigma/(n0+n))

out <- c(mu1=mu[1],mu2=mu[2],sig1=Sigma[1,1],

sig2=Sigma[2,2],rho=Sigma[2,1])

rep.mat <- rbind(rep.mat, out)

return(rep.mat)


normal.posterior.summary <- function(reps, alpha) reps[,5] <- reps[,5]/sqrt(reps[,3]*reps[,4])

n <- nrow(reps)

reps <- apply(reps, 2, sort)

bound <- (1-alpha)/2

L <- round(bound*n)

H <- round((1-bound)*n)

out.mat <- cbind("mean"= apply(reps,2,mean),

"std.err"= apply(reps,2,sd),

"HPD Lower"= reps[L,],

"HPD Upper"= reps[H,])

round(out.mat,3)

out <- multi.norm.post(X, 3, diag(2), m=c(200, 200),

n0=5, n.reps=1000)

normal.posterior.summary(out, 0.95)

4

The Bayesian Prior

Exercises

1. Show that the exponential PDF and the chi-square PDF is a special

case of the gamma PDF.

I Answer

The Gamma PDF has the form:

f(θ|α, β) =βα

Γ(α)θα−1 exp[−βθ], θ, α, β > 0,

First, by substituting α with 1, we get:

f(θ|1, β) =β1

Γ(1)θα−1 exp[−βθ] = β exp[−βθ],

which is an exponential PDF.

Second, by substituting α with ν, and β with 12 , we get:

f(θ|ν2,

1

2) =

( 12 )

ν2

Γ(ν2θν2−1 exp

[−(

1

2)θ

]=θν−22 exp[− θ2 ]

2ν2 Γ(ν2 )

,

which is an chi-square PDF. (Q.E.D.)

3. (Pearson 1920). An event has occurred p times in n trials. What is the

probability that it occurs r times in the next m trials?

I Answer

From the past experience, we know that Pr(Event) = pn . So, the prob-

ability that the even occurs r times from independent m trials would

39


follow the binomial PMF:(m

r

)( pn

)r (1− p

n

)m−r,

The Bayesian approach is a little different. I mainly follow the Beta-

Binomial model that was covered in Section 2.3.3.

Let’s define Xi as an event at ith trial. Xi gets 1 if the event occurs, and

0 if the event doesn’t occur. When Xi follows the Bernoulli distribution

with probability of Pr(Xi) = PR, the new random variable y =∑ni Xi

follows the Binomial distribution, BN (n, PR). By the way, we know

that y ≡ p from the question.

We can assign an uninformative flat prior on PR, such as BE(1, 1).

Then, the posterior for PR becomes π(PR|p) = BE(p + 1, n − p +

1). As a point estimate, we can take the mean ( p+1n+2 ), and use it for

new Pr(Event). Then, the probability of the event occurs r times from

independent m trials would be:(m

r

)(p+ 1

n+ 2

)r (1− p+ 1

n+ 2

)m−r,

which is slightly different from the frequentist solution above.

5. Suppose that you had a prior parameter with the restriction: [0 < η <

1]. If you believed that η had prior mean 0.4 and variance 0.1, and

wanted to specify a beta distribution, what prior parameters would you

assign?

I Answer

E(η) =α

α+ β= 0.4

V ar(η) =αβ

(α+ β)2(α+ β + 1)= 0.1,

which produces α = 1425 and β = 21

25 . Therefore, the prior distribution

would be BE( 1425 ,

2125 ).

The Bayesian Prior 41

7. Derive the Jeffrey’s prior for a normal likelihood model under three

circumstances: (1) σ2 = 1, (2) µ = 1, and (3) both µ and σ2 unknown

(nonindependent).

I Answer

(a) Suppose we have x ∼ N (µ, 1), where −∞ < µ < ∞. So the

likelihood is:

L(µ|x) ∝ exp

[−1

2

(∑x2i − 2nxµ+ nµ2

)]The loglikelihood is therefore:

`(µ|x) ∝ −1

2

(∑x2i − 2nxµ+ nµ2

)The first and the second derivative are:

∂

∂µ`(µ|x) = nx− µ,

∂2

∂µ2`(µ|x) = −n

So, the Jeffreys prior is:

pJ(µ) =

[−Ex|µ

(∂2

∂µ2`(µ|x)

)] 12

=[Ex|µ(n)

] 12

∝ 1 (constant)

(b) Suppose we have x ∼ N (1, σ2), where 0 < σ <∞. So the likelihood

is:

L(σ2|x) ∝ (σ2)−n2 exp

[−1

2σ−2

(∑(xi − 1)2

)]The loglikelihood is therefore:

`(σ2|x) ∝ −n2

log(σ2)− 1

2(σ2)−1

∑(xi − 1)2

The first and the second derivatives are:

∂

∂σ2`(σ2|x) = −n

2(σ2)−1 +

1

2(σ2)−2

∑(xi − 1)2,

∂2

∂(σ2)2`(σ2|x) =

n

2(σ2)−2 − (σ2)−3

∑(xi − 1)2︸︷︷︸=nσ2

=n

2(σ2)−2 − n(σ2)−2 = −n

2(σ2)−2


So the Jeffreys prior is:

pJ(σ2) =

[−Ex|σ2

(∂2

∂(σ2)2`(σ2|x)

)] 12

=[−Ex|σ2

(−n

2(σ2)−2

)] 12

∝ (σ2)−1

(c) Suppose we have x ∼ N (µ, σ2), where where 0 < σ2 < ∞,−∞ <

µ <∞. So the likelihood is:

L(µ, σ2|x) ∝ (σ2)−n2 exp

[−1

2(σ2)−1

∑(xi − µ)2

]The loglikelihood is therefore:

`(µ, σ2|x) = −n2

log(σ2)− 1

2(σ2)−1

∑(xi − µ)2

The first and the second derivatives are:

∂

∂µ`(µ, σ2|x) =

∂

∂µ

[−1

2(σ2)−1

∑(−2µxi + µ2)

]= (σ2)−1

∑xi − n(σ2)−1µ

∂

∂σ2`(µ, σ2|x) = −n

2(σ2)−1 +

1

2(σ2)−2

∑(−2µxi + µ2)

and

∂2

∂µ2`(µ, σ2|x) = −n(σ2)−1

∂2

µσ2`(µ, σ2|x) = −(σ2)−2

∑xi + n(σ2)−2µ

= (nµ−∑

xi)︸︷︷︸=0

(σ2)−2 = 0

∂2

∂(σ2)2`(µ, σ2|x) =

n

2(σ2)−2 − (σ2)−3

∑(xi − µ)2︸︷︷︸=nσ2

=n

2(σ2)−2 − n(σ2)−2 =

n

2(σ2)−2

So, the Jeffreys prior is:

pJ(µ, σ2) =

[−Ex|µ,σ2

(−n(σ2)−1 0

0 −n2 (σ2)−2

)] 12

=

(n(σ2)−

12 0

0 n2 (σ2)−1

),


which is:pJ(µ) ∝ 1 (constant)

pJ(σ2) ∝ (σ2)−1

Note that Jeffreys priors here are equivalent to those obtained from

(a) and (b).

10. (Robert 2001) Calculate the marginal posterior distributions for the

following setups:

• x|σ2 ∼ N (0, σ2), 1/σ2 ∼ G(1, 2).

• x|λ ∼ P(λ), λ ∼ G(2, 1).

• x|p ∼ NB(10, p), p ∼ BE(0.5, 0.5).

I Answer

(a)

π(σ2|x) ∝ f(x|σ2)π(σ2)

∝[(σ2)−

n2 exp

(− 1

2σ2

∑x2i

)][(σ2)−2 exp

(− 2

σ2

)]= (σ2)−(n2 +1)−1 exp

[− 1

σ2

(∑x2i

2+ 2

)]∼ IG

(n

2+ 1,

∑x2i

2+ 2

)(b)

π(λ|x) ∝ f(x|λ)π(λ)

∝[λ∑xi exp(−nλ)

][λ exp(−λ)]

= λ∑xi+1 exp [−(n+ 1)λ)]

∼ G(∑

xi + 2, n+ 1)

(c)

π(p|x) ∝ f(x|p)π(p)

=[p10n(1− p)

∑xi−10n

] [p−

12 (1− p)− 1

2

]= p10n− 1

2 (1− p)∑xi−10n− 1

2

∼ BE(

10n+1

2,∑

xi − 10n+1

2

)


12. Calculate the Jeffreys prior for the form in Exercise 5.5

I Answer

(a)

`(σ2|x) ∝ −n2

log(σ2)− 1

2(σ2)−1

∑x2i

∂`

∂σ2= −n

2(σ2)−1 +

1

2(σ2)−2

∑x2i

∂2`

∂(σ2)2=n

2(σ2)−2 − (σ2)−3

∑x2i

∴ pJ(σ2) =

−Ex|σ2

n2

(σ2)−2 − (σ2)−3∑

x2i︸︷︷︸

=nσ2

12

=[n

2(σ2)−2 − n(σ2)−2

] 12

= (n

2)

12 (σ2)−1

∝ (σ2)−1

(b)

`(λ|x) ∝(∑

xi

)log(λ)− nλ

∂`

∂λ=(∑

xi

)λ−1 − n

∂2`

∂λ2=(∑

xi

)λ−2

∴ pJ(λ) =

−Ex|λ(∑xi

)︸︷︷︸

=nλ

λ−2

12

= nλ−12

∝ λ− 12

(c)

`(p|x) ∝ 10n log(p) +(∑

xi − 10n)

log(1− p)

∂`

∂p= 10np−1 +

(∑xi − 10n

)(1− p)−1(−1)

∂2`

∂p2= −10np−2 +

(∑xi − 10n

)(1− p)−2(−1)


∴ pJ(p) =

−Ex|p−10np−2 − (

∑xi︸︷︷︸

=n10(1−p)

p

−10n)(1− p)−2

12

=

[10np−2 + 10n

(1− pp− 1

)(1− p)−2

] 12

=

[10n

1− p− p2

p2(1− p)2

]− 12

∝ (1− p− p2)12

p(1− p)

14. The Bayesian framework is easily adapted to problems in time-series.

One of the most simple time-series specifications is the AR(1), which

assumes that the previous period’s outcome is important in the current

period estimation.

Given and observed outcome variable vector, Yt measured at time t,

AR(1) model for T periods is:

Yt = Xtβ + εt

εt = ρεt−1 + ut, |ρ| < 1

ut ∼ iid N(0, σ2u). (4.1)

Here Xt is a matrix of explanatory variables at time t, εt and εt−1 are

residuals from period t and t− 1, respectively, ut is an additional zero-

mean error term for the autoregressive relationship, and β, ρ, σ2u are

unknown parameters to be estimated by the model. Backward substitu-

tion through time to arbitrary period s gives εt = ρ2εt−s+∑T−1j=0 ρ

jut−j ,

and since E[εt] = 0, then var[εt] = E[ε2t ] =σ2u

1−ρ2 , and the covariance be-

tween any two errors is: Cov[εt, εt−j ] =ρjσ2

u

1−ρ2 . Assuming asymptotic

normality, this setup leads to a general linear model with the following


tridiagonal T × T weight matrix (Amemiya 1985, 164):

Ω =1

σ2u

1 −ρ−ρ (1 + ρ2) −ρ 0

. . .

0 −ρ (1 + ρ2) −ρ−ρ 1

. (4.2)

Using the following data on worldwide fatalities from terrorism com-

pared to some other causes per 100,000 (source: Falkenrath 2001), de-

velop posteriors for β, ρ, and σu for each of the causes (Y ) as a separate

model using the date (minus 1983) as the explanatory variable (X).

Specify conjugate priors (Berger and Yang [1994] show some difficulties

with nonconjugate priors here), or use a truncated normal for the prior

on ρ (see Chib and Greenberg [1998]). Can you reach some conclusion

about the differences in these four processes?

Year Terrorism Car Accidents Suicide Murder

1983 0.116 14.900 12.100 8.3001984 0.005 15.390 12.420 7.9001985 0.016 15.150 12.380 8.0001986 0.005 15.920 12.870 8.6001987 0.003 19.106 12.710 8.3001988 0.078 19.218 12.440 8.4001989 0.006 18.429 12.250 8.7001990 0.004 18.800 11.500 9.4001991 0.003 17.300 11.400 9.8001992 0.001 16.100 11.100 9.3001993 0.002 16.300 11.300 9.5001994 0.002 16.300 11.200 9.0001995 0.005 16.500 11.200 8.2001996 0.009 16.500 10.800 7.400

I Answer

The joint posterior can be written as:

π(β, ρ, σ2u|X,y) ∝ f(y1, · · · , yT |β, ρ, σ2

u)π(β)π(ρ)π(σ2u)

= f(y1)f(y2|y1) · · · f(yT |yT−1)π(β)π(ρ)π(σ2u)


For f(yt|yt−1), we have:

yt − ρyt−1︸︷︷︸≡yt

= (xt − ρxt−1)︸︷︷︸≡xt

β + et − ρet−1︸︷︷︸≡ut

⇒ yt = xt′β + ut

∴ yt|β, ρ, σ2u ∼ N (xtβ, σ

2u)

For f(y1), we have:

∀t, E[et] = 0, Var[et] =σ2u

1− ρ2

⇒ E[y1] = E[x′1β + e1] = x′1β

Var[y1] = Var[e1] =σ2u

1− ρ2

∴ y1 ∼ N(x′1β,

σ2u

1− ρ2

)For priors, we have:

β ∼ N (βo,Bo) , ρ ∼ N (ρo,Φo) , σ2u ∼ IG(a, b)

Then, the joint posterior becomes:

π(β, ρ, σ2u|X,y) ∝

(1− ρ2

σ2u

) 12

exp

[−1− ρ2

σ2u

(y1 − x′1β)2

]×(

1

σ2u

) 12 (T−1)

exp

[− 1

σ2u

T∑i=2

(yt − xt′β)2

]

×(

1

Bo

) 12

exp

[− 1

2Bo(β − βo)

2

]× (σ2

u)a−1 exp[−bσ2

u

]×(

1

Φo

) 12

exp

[− 1

2Φo(ρ− φo)2

]It would be very demanding to mathematically derive the marginal pos-

terior distributions. But, using BUGS simplifies the computation. For the

introduction of BUGS (including WinBUGS and JAGS ), see the Appendix

C.

< Bugs.Code >


model z[1] ~ dnorm(xb1, v1)

z[1] <- y[1]

xb1 <- b * x[1]

v1 <- pow(smu,2) / (1 - pow(rho,2))

for (i in 2:n) z[i] ~ dnorm(xb2[i], v2)

z[i] <- y[i] - rho * y[i-1]

xb2[i] <- b * (x[i] - rho * x[i-1])

v2 <- pow(smu,2)

b ~ dnorm(0, 0.01)

rho ~ dnorm(0, 0.01)

smu ~ dgamma(0.01, 0.01)

16. Laplace (1774) wonders what the best choice is for a posterior point

estimate. He sets three conditions for the shape of the posterior: sym-

metry, asymptotes, and properness (integrating to one). In addition,

Laplace tacitly uses uniform priors. He proposes two possible criteria

for selecting the estimate:

• La primiere est l’instant tel qu’en egalement probable que le veritable

instant du phenomene tombe avant ou apres; on pourrait appeler

cet instant milieu de probabilite.

Meaning: use the median.

• Le seconde est l’instant tel qu’en le prenant pour milieu, la somme

des erreurs a craindre, multipliees par leur probabilite, soit un mini-

mum; on pourrait l’appeler milieu d’erreur ou milieu astronomique,

comm etant celui auquel les astronomes doivent s’arreter de preference.

Meaning: use the quantity at the “astronomical center of mass”

that minimizes: f(x) =∫|x− V |f(x)dx. In modern terms, this is

equivalent to minimizing the posterior expected loss: E[L(θ, d)|x] =∫ΘL(θ, d)π(θ|x)dθ, which is the average loss defined by the poste-

rior distribution and d is the “decision” to use the posterior esti-

mate of θ (see Berger [1985]).


Prove that these two criteria lead to the same point estimate.

I Answer

When L(θ, θ) = |θ − θ|, we have:

E[L(θ, θ)

]=

∫Θ

|θ − θ|π(θ|x)dθ

=

∫ θ

−∞(θ − θ)π(θ|x)dθ +

∫ ∞θ

(θ − θ)π(θ|x)dθ

To minimize E[L(θ, θ)

], we consider (F.O.C) w.r.t. θ:

∂E[L(θ, θ)

]∂θ

=

∫ θ

−∞

∂

∂θ

[(θ − θ)π(θ|x)

]dθ

+

∫ ∞θ

∂

∂θ

[(θ − θ)π(θ|x)

]dθ

=

∫ θ

−∞π(θ|x)dθ −

∫ ∞θ

π(θ|x)dθ ≡ 0

⇒∫ θ

−∞π(θ|x)dθ =

∫ ∞θ

π(θ|x)dθ

Therefore, by definition of median, θ is the median. (Q.E.D.)

18. Review one body of literature in your area of interest and develop a

detailed plan for creating elicited priors.

I Answer

One example would be Jeff Gill and Lee D. Walker, 2005, “Elicited

Priors for Bayesian Model Specification in Political Science Research,”

Journal of Politics 67(3): 841-872.

20. Test a Bayesian count model for the number of times that capital pun-

ishment is implemented on a state level in the United States for the

year 1997. Included in the data below (source: United States Census

Bureau, United States Department of Justice) are explanatory variables

for: median per capita income in dollars, the percent of the population


classified as living in poverty, the percent of Black citizens in the popu-

lation, the rate of violent crimes per 100,000 residents for the year before

(1996), a dummy variable to indicate whether the state is in the South,

and the proportion of the population with a college degree of some kind.

Exe- Median Percent Percent Violent Prop.State cutions Income Poverty Black Crime South Degrees

Texas 37 34453 16.7 12.2 644 1 0.16Virginia 9 41534 12.5 20.0 351 1 0.27Missouri 6 35802 10.6 11.2 591 0 0.21Arkansas 4 26954 18.4 16.1 524 1 0.16Alabama 3 31468 14.8 25.9 565 1 0.19Arizona 2 32552 18.8 3.5 632 0 0.25Illinois 2 40873 11.6 15.3 886 0 0.25S. Carolina 2 34861 13.1 30.1 997 1 0.21Colorado 1 42562 9.4 4.3 405 0 0.31Florida 1 31900 14.3 15.4 1051 1 0.24Indiana 1 37421 8.2 8.2 537 0 0.19Kentucky 1 33305 16.4 7.2 321 0 0.16Louisiana 1 32108 18.4 32.1 929 1 0.18Maryland 1 45844 9.3 27.4 931 0 0.29Nebraska 1 34743 10.0 4.0 435 0 0.24Oklahoma 1 29709 15.2 7.7 597 0 0.21Oregon 1 36777 11.7 1.8 463 0 0.25

EXE INC POV BLK CRI SOU DEG

In 1997, executions were carried out in 17 states with a national total

of 74. The model should be developed from the Poisson link function,

θ = log(µ), with the objective of finding the best β vector in:

g−1(θ)︸︷︷︸17×1

= exp [Xβ] = exp

[1β0 + INCβ1 + POVβ2

+ BLKβ3 + CRIβ4 + SOUβ5 + DEGβ6

].

Specify a suitable prior with assigned prior parameters, then summarize

the resulting posterior distribution.

I Answer

It would be mathematically demanding to derive marginal posterior

distribution for the Poisson model. But, using BUGS simplifies the


computation. For the introduction of BUGS (including WinBUGS and

JAGS ), see the Appendix C.

< Bugs.Code >

model for (i in 1:n)

EXE[i] ~ dpois(theta[i])

log(theta[i]) <- b0 + b1*INC[i] + b2*POV[i] +

b3*BLK[i] + b4*CRI[i] + b5*SOU[i] + b6*DEG[i]

b0 ~ dnorm(0,.0001)

b1 ~ dnorm(0,.0001)

b2 ~ dnorm(0,.0001)

b3 ~ dnorm(0,.0001)

b4 ~ dnorm(0,.0001)

b5 ~ dnorm(0,.0001)

b6 ~ dnorm(0,.0001)

Table 4.1: Summary Statistics of β’s

Mean S.E. 2.5% 25% 50% 75% 97.5%

b0 -0.2 0.3 -0.9 -0.4 -0.2 0.0 0.3

b1 2.7 0.5 1.7 2.4 2.7 3.0 3.8

b2 0.5 0.5 -0.6 0.1 0.5 0.8 1.6

b3 -1.9 0.5 -2.8 -2.2 -1.9 -1.5 -0.9

b4 0.0 0.4 -0.7 -0.2 0.0 0.3 0.8

b5 2.4 0.4 1.6 2.1 2.4 2.7 3.3

b6 -1.9 0.4 -2.7 -2.2 -1.9 -1.6 -1.1

5

The Bayesian Linear Model

Exercises

1. Derive the posterior marginal for β in (4.17) from the joint distribution

given by (4.16).

I Answer

From (4.16), we have:

π(β, σ2|X,y) ∝ σ−n−Aexp

[−

1

2σ2

(s+ (β + β)′(Σ−1 + X′X)(β + β)

)]

Now integrate with respect to σ2 to get the marginal for β:

π(β|X,y) ∝∫ ∞

0

σ−n−A exp

[−

1

2σ2

(s+ (β + β)′(Σ−1 + X′X)(β + β)

)]︸︷︷︸

inverse gamma PDF kernel

dσ2

We know that, ∫ ∞0

qp

Γ(p)x−(p+1) exp

[−qx

]︸︷︷︸

IG(x|p,q)

dx = 1

⇒∫ ∞

0

x−(p+1) exp

[−qx

]dx =

Γ(p)

qp

By setting p = n+a2 −1, q = 1

2

(s+ (β + β)′(Σ−1 + X′X)(β + β)

), and

x = σ2, we have:

π(β|X,y) ∝ q−p ∝[s+ (β + β)′(Σ−1 + X′X)(β + β)

]−(n+a2 )+1

,

which is (4.17). (Q.E.D.)

53


3. For uninformed priors, the joint posterior distribution of the regression

coefficients was shown to be multivariate-t (page 108), with covariance

matrix: R = (n−k)σ2(X′X)−1

n−k−2 . Under what conditions is this matrix

positive definite (a requirement for valid inferences here).

I Answer

R is positive definite.

⇐⇒ (X′X)−1 is positive definite and (n− k − 2) is positive.

(X′X)−1 is positive definite.

⇐⇒ X′X is positive definite.

⇐⇒ X is non-singular matrix (∵ X′X is symmetric).

⇐⇒ Columns of X are independent.

⇐⇒ rank(X) = k.

Note that, if rank(X) = k, then rank(X′X) = k, which guarantees that

the (k × k) matrix X′X is invertible.

Therefore, the condition is rank(X) = k and (n − k − 2) > 0. Since, k

is the number of explanatory variables, this condition also means that

there is no perfect multinolinearity in the data.

5. Under standard analysis of linear models, the hat matrix is given by

y = Hy where H is X(X′X)−1X′ where the diagonal values of this

matrix indicate leverage, which is obviously a function of X only. Can

the stipulation of strongly informed priors change data-point leverage?

Influence in linear model theory depends on both hat matrix diagonal

values and yi. Calculate the influence on the Technology variable of each

datapoint in the Palm Beach County model with uninformed priors by

jackknifing out these values one-at-a-time and re-running the analysis.

Which precinct has the greatest influence?

I Answer

< R.Code >

# data and setup

require(mvtnorm)

The Bayesian Linear Model 55

pbc.vote <- as.matrix(read.table("http://artsci.wustl.edu/

~stats/BMSBSA/Chapter06/pbc.dat"))

dimnames(pbc.vote)[[2]] <- c("badballots", "tech", "new",

"turnout", "rep", "whi")

X <- cbind(1, pbc.vote[,-1])

Y <- pbc.vote[,1]

# define posteriors

uninform.lm <- function(x, y) n <- nrow(x); k <- ncol(x)

H.inv <- solve(t(x)%*%x)

bhat <- H.inv%*%t(x)%*%y

s2 <- t(y- x%*%bhat)%*%(y- x%*%bhat)/(n-k)

R <- as.real((n-k)*s2/(n-k-2))*H.inv

alpha <- (n-k-1)/2

beta <- 0.5*s2*(n-k)

coef.draw <- rmvt(100000, sigma=R, df=n-k)

post.coef <- apply(coef.draw, 2, mean)

post.sd <- apply(coef.draw, 2, sd)

post.sigma <- mean(sqrt(1/rgamma(10000,alpha,beta)))

out <- list("coef"=post.coef, "se"=post.sd,

"sigma"=post.sigma)

# define jackknife procedure

jacknife <- function(X, Y, A=NULL, B=NULL) N <- nrow(X)

K <- nrow(X)

post.coef <- matrix(NA, N, K)

post.sigma <- rep(NA, N)

for (i in 1:N) x <- X[-i,]; y <- Y[-i]

n <- nrow(x); k <- ncol(x)

out <- uninform.lm(x, y)

post.coef[i,] <- out[["coef"]]

post.sigma[i] <- out[["sigma"]]

OUT <- list("coefs"=post.coef, "sigma"=post.sigma)


0 100 200 300 400 50021.2

21.4

21.6

21.8

22.0

Unit Index

Pos

terio

r M

ean

for

σσ

78

222

398

FIGURE 5.1: Plot of posterior σ using jackknifing

# make graph

post.sigma <- jacknife(X, Y)[["sigma"]]

par(mgp=c(2,0.25,0), mar=c(3,3,1,1), tcl=-0.2)

plot(1:516, post.sigma,

ylab=expression("Posterior Mean for "*sigma),

xlab="Unit Index")

true.sigma <- uninform.lm(X, Y)[["sigma"]]

abline(h=true.sigma, col=2, lty=2)

#identify(1:516, post.sigma, labels=1:516)

7. Meier and Keiser (1996) used a linear pooled model to examine the

impact on several federal laws on state-level child-support collection

policies. Calculate a Bayesian linear model and plot the posterior dis-

tribution of the parameter vector, β, as well as σ2, specifying an inverse

gamma prior with your selection of prior parameters, and the uninfor-

mative uniform prior: f(σ2) = 1σ . Use a diffuse normal prior for the β,

and identify outliers using the hat matrix method (try the R command

hat).


The data are collected for the 50 states over the period 1982 to 1991,

where the outcome variable, SCCOLL, is the change in child-support

collections. The explanatory variables are: chapters per population

(ACES), policy instability (INSTABIL), policy ambiguity (AAMBIG), the

change in agency staffing (CSTAFF), state divorce rate (ARD), organiza-

tional slack (ASLACK), and state-level expenditures (AEXPEND). These

data can be downloaded on the webpage for this book. For details, see

their original article or Meier and Gill (2000, Chapter 2).

SCCOLL ACES INSTABIL AAMBIG CSTAFF ARD ASLACK AEXPEND

1141.4 1.467351 70431.41 23.57 115.5502 26.2 3.470122 3440.1592667.0 1.754386 6407.35 32.10 39.5922 27.8 4.105816 8554.186307.3 0.421585 63411.55 35.14 106.0642 29.4 5.610657 2544.263840.7 0.000000 16396.95 23.46 50.5777 30.8 3.848957 2366.732482.5 1.184990 78906.07 31.95 -1.1440 20.3 3.700461 4954.157550.0 2.072846 28103.63 17.81 25.7648 22.8 2.742410 2976.427514.1 0.607718 22717.82 22.23 18.3592 14.9 5.063129 4905.780

1352.5 1.470588 6602.02 65.03 -232.2549 19.6 9.178106 5970.5201136.5 0.828500 123240.60 21.44 60.6078 30.8 2.495392 2567.2121575.4 1.660879 87061.77 31.26 86.8236 22.2 2.100618 2589.5311170.6 0.000000 13042.99 37.68 31.7781 19.1 4.491279 4141.2471679.6 0.962464 7197.37 21.72 79.9475 27.0 3.074239 3216.067882.3 1.212856 181247.60 16.97 45.6272 17.4 1.902308 2469.215

1347.5 1.782531 78822.32 11.84 33.7674 26.6 1.762038 1742.7451353.4 1.431127 11549.32 31.10 57.3484 16.6 3.434238 2659.6971363.4 2.404810 13084.74 18.11 72.4447 22.5 2.563138 3296.4651087.9 0.538648 26327.75 23.06 166.6439 21.9 2.649425 3330.793712.8 0.470367 27160.63 25.46 35.4368 15.6 4.030189 3416.769

1865.4 2.429150 9965.54 18.14 90.4288 20.9 3.209765 4109.0071088.4 1.234568 38821.03 40.84 24.8976 14.2 3.488844 5275.8241308.1 0.667111 76348.86 26.11 82.5835 12.7 3.641929 4926.7753485.4 1.174210 197802.90 25.89 80.5744 17.7 1.753225 5419.7841946.4 1.128159 35488.20 26.85 70.3593 14.9 4.764180 5208.2181116.8 0.385803 76166.11 16.39 225.7210 21.8 4.611794 2539.3692047.5 0.580495 54306.30 21.42 121.5814 22.0 2.799651 2754.369893.3 2.475247 7206.16 12.36 103.7574 22.9 3.882953 2083.009

1779.2 0.000000 26995.98 45.80 78.2787 17.6 3.511719 3982.897703.4 1.557632 3533.64 49.19 -32.1768 54.9 6.609568 4082.064695.1 2.714932 6912.47 43.87 32.3424 19.7 6.407141 2998.161

1287.0 0.257732 62029.57 36.51 17.6954 15.0 4.470681 6310.163573.0 0.000000 13153.38 28.29 48.0668 26.0 2.264973 2759.142955.6 0.332263 151116.20 32.31 50.3237 15.0 3.284161 5634.305

1296.1 1.187472 70689.08 23.75 31.7058 20.6 3.450963 2992.5311194.5 1.574803 5132.25 15.61 54.0354 15.4 2.777402 2679.4544299.2 3.565225 169930.40 22.62 136.6210 20.3 2.841627 3072.191896.4 1.259843 66218.98 23.09 35.2008 32.9 3.420866 2381.565815.4 4.106776 53441.92 37.42 -176.8579 23.8 6.527769 4303.525

2262.9 0.752445 121275.40 51.96 18.3963 14.8 2.932871 4130.6571108.8 0.996016 10880.08 38.61 22.0054 16.1 2.421065 3339.3501314.2 1.404494 27217.83 25.78 36.9890 17.4 1.432419 2533.1391268.9 4.207574 2172.79 17.05 23.4408 16.9 2.988671 2366.595952.8 0.807591 94001.98 41.90 33.3545 26.6 1.970768 2031.763763.2 0.403481 121752.90 51.39 53.8607 24.4 2.710884 1620.140

1228.6 0.000000 11480.06 21.33 73.4535 21.8 4.641821 5069.0721023.5 3.527337 3049.59 20.86 84.8460 18.9 4.317273 2983.7821646.3 1.272669 58245.25 19.83 105.2593 17.7 2.980112 3097.7822483.3 1.195695 80936.93 30.97 160.9784 24.7 6.063038 5996.7341045.3 0.555247 15810.82 15.11 78.1052 22.3 3.786695 2222.7283866.9 1.210898 70813.30 24.03 60.6050 15.2 3.335582 4972.4251450.2 0.000000 7217.25 10.33 163.1716 29.0 2.623226 1860.076

I Answer

Table 4.1 and Figure 4.2 summarizes the posteriors.

< R.Code >


constant ACES INSTABIL AAMBIG CSTAFF

Mean -95.5694 119.3877 0.0016 3.6618 1.9330

Std.Dev. 2.5e+02 5.9e+01 1.1e-03 5.2e+00 1.0e+00

ARD ASLACK AEXPEND σ2

Mean -0.9359 -53.4573 0.1242 453696

Std.Dev. 6.5e+00 4.4e+01 5.4e-02 105390

Table 5.1: Posteriors of Meier and Keiser (1996) model

ACES

0 200 400 600 800

0.00

00.

008

INSTABIL

0.000 0.005 0.010 0.015

020

0

AAMBIG

−40 −20 0 20 40

0.00

0.06

CSTAFF

0 5 10 15

0.0

0.3

ARD

−60 −40 −20 0 20 40

0.00

0.05

ASLACK

−400 −300 −200 −100 0 100 200

0.00

00.

008

AEXPEND

0.0 0.2 0.4 0.6 0.8 1.0

04

8

sigma^2

200000 600000 1000000 1400000

0e+

004e

−06

FIGURE 5.2: Posteriors of Meier and Keiser (1996) model


# data and setup

require(arm); require(mvtnorm)

mk <- as.matrix(read.table("http://artsci.wustl.edu/~stats/

BMSBSA/Chapter06/meier.keiser.dat", header=T))

y <- mk[,1]; x <- mk[,-1]

# priors

BBeta <- rep(0, 8); Sigma <- diag(c(5,5,5,5,5,5,5,5))

A <- 1; B <- 1

# posteriors

x <- cbind(1, x); n <- nrow(x); k <- ncol(x)

H.inv <- solve(t(x) %*% x)

bhat <- H.inv %*% t(x) %*% y

s2 <- t(y - x %*%bhat) %*% (y - x %*% bhat) / (n-k)

tB <- solve(solve(Sigma) + t(x) %*% x) %*%

(solve(Sigma) %*% BBeta + t(x) %*% x %*% bhat)

ts <- 2*B + s2*(n-k) + (t(BBeta) - t(tB)) %*%

solve(Sigma) %*% BBeta + t(bhat - tB) %*%

t(x) %*% x %*% bhat

R <- diag(ts / (n+A-k-3)) *

solve(solve(Sigma) + t(x) %*% x)

alpha <- (n-k-1)/2

beta <- s2*(n-k)/2

coef.draw <- mvrnorm(100000, tB, Sigma=R) /

sqrt(rchisq(100000, k-1))

sigma.draw <- 1/rgamma(100000, alpha, beta)

post.coef <- apply(coef.draw, 2, mean)

post.sd <- apply(coef.draw, 2, sd)

post.sigma <- c(mean(sigma.draw), sd(sigma.draw))

output <- list("coef.mean"=post.coef, "coef.sd"=post.sd,

"sigma.mean"=post.sigma[1],

"sigma.sd"=post.sigma[2])

output

# make graphs

par(mfrow=c(4,2))

for (i in 2:8) hist(coef.draw[,i], freq=FALSE, breaks=50,

main=dimnames(x)[[2]][i], xlab="", ylab="")


lines(density(coef.draw[,i]))

hist(sigma.draw, freq=FALSE, breaks=50,

main="sigma^2", xlab="", ylab="")

lines(density(sigma.draw))

9. Returning to the discussion of conjugate priors for the Bayesian linear

regression model starting on page 111, show that substituting (4.14) and

(4.15) into (4.13) produces (4.16).

I Answer

All we have to show is:

σ2(n− k) + (β − b)′X′X(β − b) + 2b+ (β −B)′Σ−1(β −B)

= s+ (β − β)′(Σ−1 + X′X)(β − β),

where β = (Σ−1 + X′X)−1(Σ−1B + X′Xb) and s = 2b + σ2(n − k) +

(B− β)Σ−1B + (b− β)′X′Xb.

(LHS) = σ2(n− k) + 2b︸︷︷︸use definition of s

+(β − b)′X′X(β − b)

+ (β −B)′Σ−1(β −B)

= s− (B− β)′Σ−1B− (b− β)′X′Xb

+ (β − b)′X′X(β − b) + (β −B)′Σ−1(β −B)

= s−B′Σ−1B + β′Σ−1B− b′X′Xb + β

′X′Xb

+ β′X′Xβ − 2β′X′Xb + b′X′Xb

+ β′Σ−1β − 2β′Σ−1B + B′Σ−1B

= s+ (β′− 2β′)Σ−1B + (β

′− 2β′)X′Xb

+ β′X′Xβ + β′Σ−1β

= s+ (β′− 2β′) (Σ−1B + X′Xb)︸︷︷︸

use definition of β

+β′X′Xβ + β′Σ−1β


= s+ (β′− 2β′)(Σ−1 + X′X)β + β′X′Xβ + β′Σ−1β

= s+ β′(Σ−1 + X′X)β − 2β′(Σ−1 + X′X)β + β′(Σ−1 + X′X)β

= s+ (β − β)′(Σ−1 + X′X)(β − β)

= (RHS) (Q.E.D.)

11. For the Bayesian linear regression model, prove that the posterior that

results from conjugate priors is asymptotically equivalent to the poste-

rior that results from uninformative priors. See Table 4.3.

I Answer

In the asymptotic world, data (i.e. likelihood) dominates prior, which

produces:

β → (X′X)−1(X′Xb) = b

s → σ2(n− k) + (b− b)′X′Xb = σ2(n− k)

Therefore, when n → ∞, the posteriors from either conjugate prior or

uninformative prior becomes:

β|X,y ∼ N(b, σ2(X′X)−1

)σ2|X,y ∼ IG(∞,∞)

(Q.E.D.)

13. Develop a Bayesian linear model for the following data that describe the

average weekly household spending on tobacco and alcohol (in pounds)

for the eleven regions of the United Kingdom (Moore and McCabe 1989,

originally from Family Expenditure Survey. Department of Employment,

1981, British Official Statistics). Specify both an informed conjugate

and uninformed prior using the level for alcoholic beverage as the out-

come variable and the level for tobacco products as the explanatory

variable. Do you notice a substantial difference in the resulting priors.

Describe.


Region Alcohol Tobacco

Northern Ireland 4.02 4.56

East Anglia 4.52 2.92

Southwest 4.79 2.71

East Midlands 4.89 3.34

Wales 5.27 3.53

West Midlands 5.63 3.47

Southeast 5.89 3.20

Scotland 6.08 4.51

Yorkshire 6.13 3.76

Northeast 6.19 3.77

North 6.47 4.03

I Answer

The two analyses produces Table 4.2. It turns out that the resulting

posteriors are pretty much different. Since we have only 11 observa-

tions the posteriors are dominated by the prior specification, not by the

likelihood (i.e. data).

From uninformed prior Mean SD 2.5% Quant Median 97.5% Quant

constant 4.351 1.822 0.230 4.351 8.472

tabacco 0.302 0.498 -0.824 0.302 1.427

σ2 0.963 0.281 0.589 0.906 1.681

From conjugate prior Mean SD 2.5% Quant Median 97.5% Quant

constant 1.671 1.505 -1.556 1.671 4.898

tabacco 1.022 0.419 0.122 1.022 1.921

σ2 0.595 0.082 0.460 0.586 0.777

Table 5.2: Posteriors of Alcohol vs. Tobacco model

< R.Code >

# data

x <- c(4.56, 2.92, 2.71, 3.34, 3.53, 3.47, 3.20, 4.51,

3.76, 3.77, 4.03)

X <- cbind(1, x)

y <- c(4.02, 4.52, 4.79, 4.89, 5.27, 5.63, 5.89, 6.08,

6.13, 6.19, 6.47)


# function to return a regression table with quantiles

t.ci.table <- function(coefs, cov.mat, level=0.95,

degrees=Inf, quantiles=c(0.025, 0.500, 0.975)) quantile.mat <- cbind(coefs, sqrt(diag(cov.mat)),

t(qt(quantiles, degrees) %o%

sqrt(diag(cov.mat))) +

matrix(rep(coefs, length(quantiles)),

ncol=length(quantiles)))

quantile.names <- c("Mean", "SD")

for (i in 1:length(quantiles)) quantile.names <- c(quantile.names,

paste(quantiles[i], "Quantile"))

colnames(quantile.mat) <- quantile.names

return(round(quantile.mat, 4))

# uninformed prior analysis

bhat <- solve(t(X) %*% X) %*% t(X) %*% y

s2 <- t(y - X %*% bhat) %*% (y - X %*% bhat) /

(nrow(X) - ncol(X))

R <- solve(t(X) %*% X) * ((nrow(X) - ncol(X)) *

s2 / (nrow(X) - ncol(X) - 2))[[1,1]]

uninformed.table <- t.ci.table(bhat, R,

degrees=nrow(X)-ncol(X))

alpha <- (nrow(X) - ncol(X) -1) / 2

beta <- s2 * (nrow(X) - ncol(X)) / 2

sort.inv.gamma.sample <- sort(1/rgamma(10000,

alpha, beta))

sqrt.sort.inv.gamma.sample <- sqrt(sort.inv.gamma.sample)

uninformed.table <- rbind(uninformed.table,

c(mean(sqrt.sort.inv.gamma.sample),

sqrt(var(sqrt.sort.inv.gamma.sample)),

sqrt.sort.inv.gamma.sample[250],


sqrt.sort.inv.gamma.sample[9750]))

# conjugate prior analysis

A <- 5; B <- 5

BBeta <- rep(0, 2); Sigma <- diag(c(2, 2))


tB <- solve(solve(Sigma) + t(X) %*% X) %*%

(solve(Sigma) %*% BBeta + t(X) %*% X %*% bhat)

ts <- 2*B + s2 * (nrow(X) - ncol(X)) +

(t(BBeta) - t(tB)) %*% solve(Sigma) %*% BBeta +

t(bhat - tB) %*% t(X) %*% X %*% bhat

R <- diag(ts / (nrow(X)+ A - ncol(X) - 3)) *

solve(solve(Sigma) + t(X) %*% X)

conjugate.table <- t.ci.table(tB, R,

degrees=nrow(X)+A-ncol(X))

a <- nrow(X) + A - ncol(X)

b <- s2 * (nrow(X) + A - ncol(X)) / 2

sort.iv.gamma.sample <- sort(1/rgamma(10000, a, b))

sqrt.sort.iv.gamma.sample <- sqrt(sort.iv.gamma.sample)

conjugate.table <- rbind(conjugate.table,






# compare two analyses

uninformed.table

conjugate.table

15. The standard econometric approach to testing parameter restrictions in

the linear model is compare H0 :Rβ − q = 0 with H1 :Rβ − q 6= 0 (eg.

Greene [2003, 95]), where R is a matrix of linear restrictions and q is a

vector of values (typically zeros). Thus for example R =[

1 0 0 00 1 1 00 0 0 1

]and

q = [0, 1, 2]′ indicate β1 = 0, β2+β3 = 1, and β4 = 2. Derive expressions

for the posteriors π(β|X,y) and π(σ|X,y) using both conjugate and

uninformative priors.

I Answer

If we use conjugate priors,

β|σ2 ∼ N (B, σ2)

σ2 ∼ IG(a, b),


then we have posteriors,

β|X ∼ t(n+ a− k)

σ2|X ∼ IG(n+ a− k, 1

2σ2(n+ a− k)

).

And, if we use uniform priors (as an example of uninformative priors),

β ∝ c over [−∞,∞]

σ2 =1

σover [0,∞],

then we have posteriors,

β|X ∼ t(n− k)

σ2|X ∼ IG(

1

2(n− k − 1),

1

2σ2(n− k)

).

The next step is to define the hypotheses:

H0 : Rβ − q = 0

H1 : Rβ − q 6= 0,

where k is the number of parameters.

Consider the gth restriction from the gth row of R and the gth value of

q. Then, what we would be interested in are:

fg(β) =

k∑j=1

Rgjπ(βj |X)− qg,

which is a non-central mixture of t-distributions because R ∈ 0, 1.And, f(σ2) is unchanged by restrictions in test.

17. Using the Palm Beach County electoral data (page 109), calculate the

posterior predictive distribution of the data using a skeptical conjugate

prior (i.e. centered at zero for hypothesized effects and having large

variance).

I Answer

< R.Code >


# data and setup

require(arm); require(mvtnorm)



dimnames(pbc.vote)[[2]] <- c("badballots", "teach", "new",


x <- cbind(1, pbc.vote[,-1])

y <- pbc.vote[,1]


xbar <- x

# posterior predictive distribution

bhat <- solve(t(x)%*%x) %*% t(x)%*%y

s2 <- t(y - x%*%bhat) %*% (y - x%*%bhat) / (n-k)

m <- t(x)%*%x + t(xbar)%*%x

h <- (diag(n) - xbar %*% solve(m) %*% t(xbar)) /

as.numeric(s2)

newb <- xbar %*% bhat

newR <- (n-k) * solve(h) / (n-k-2)

pred <- mvrnorm(10000, newb, Sigma=newR) /

sqrt(rchisq(10000, k-1))

p <- nrow(xbar)

pred.mean <- rep(0, p)

pred.sd <- rep(0, p)

for (i in 1:p) pred.mean[i] <- mean(pred[,i])

pred.sd[i] <- sd(pred[,1])

# actual data vs. predictive distribution

pred.table <- cbind(y, pred.mean, pred.sd)

pred.table

19. Rerun the Ancient China Conflict Model using the code in this chap-

ter’s Computational Addendum using informed prior specifications

of your choice (you should be willing to defend these decisions though).

Calculate the posterior mean for each marginal distribution of the pa-

rameters and create a predicted data values as customarily done in linear

models analysis. Graph the yi against yi and make a statement about


Mean SD 2.5% Quant Median 97.5% Quant

CONSTANT 15.54 2.32 10.927 15.54 20.16

EXTENT 5.39 1.12 3.168 5.39 7.61

DIVERSE 0.37 0.59 -0.802 0.37 1.55

ALLIANCE -1.22 0.76 -2.733 -1.22 0.28

DYADS -3.41 0.75 -4.900 -3.41 -1.91

TEMPOR 0.74 0.32 0.096 0.74 1.38

DURATION 0.15 0.46 -0.767 0.15 1.07

σ2 0.96 0.28 0.586 0.90 1.65

Table 5.3: Posteriors of Ancient China Conflict Model

the quality of fit for your model. Can you improve this fit by dramati-

cally changing the prior specification?

I Answer

The prior for β isN ([0, 0, 0, 0, 0, 0, 0]′, 2I) and the prior for σ2 is IG(2, 1).

Then, the linear model produces the posteriors as in Table 4.3. Figure

4.3 show that the model choice (i.e. linear model with conjugate priors)

does not perform very well. Furthermore, since the data dominates the

posteriors, the change in the prior specification improve the model fit.

< R.Code >

# data and setup

# data downloaded from http://dvn.iq.harvard.edu/dvn/dv/mra

require(foreign)

wars <- read.dta("data_16049.dta")

attach(wars)

X <- cbind(1, V3, V5, V6, V7, V12, V2-V1+1)

colnames(X) <- cbind("CONSTANT", "EXTENT", "DIVERSE",

"ALLIANCE", "DYADS", "TEMPOR",

"DURATION")

y <- 10*V8 + V9

missing <- c(4, 56, 78)

X <- X[-missing,]

y <- y[-missing]

n <- nrow(X); k <- ncol(X)

# function to return a regression table with quantiles


20 22 24 26 28 30

2224

2628

3032

34

observed y value

pred

icte

d y

valu

e

FIGURE 5.3: Model fit for Ancient China Conflict Model

t.ci.table <- function(coefs, cov.mat, level=0.95,

degrees=Inf, quantiles=c(0.025, 0.500, 0.975)) quantile.mat <- cbind(coefs, sqrt(diag(cov.mat)),

t(qt(quantiles, degrees) %o%

sqrt(diag(cov.mat))) +

matrix(rep(coefs, length(quantiles)),

ncol=length(quantiles)))

quantile.names <- c("Mean", "SD")

for (i in 1:length(quantiles)) quantile.names <- c(quantile.names,

paste(quantiles[i], "Quantile"))

colnames(quantile.mat) <- quantile.names

return(round(quantile.mat, 4))

# define priors


A <- 2; B <- 1

BBeta <- rep(0, 7)

Sigma <- diag(c(2,2,2,2,2,2,2))

# calculate posteriors

bhat <- solve(t(X)%*%X) %*% t(X)%*%y

s2 <- t(y - X%*%bhat) %*% (y - X%*%bhat) / (n-k)

tB <- solve(solve(Sigma) + t(X)%*%X) %*%

(solve(Sigma) %*% BBeta + t(X)%*%X%*%bhat)



t(X)%*%X%*%bhat

R <- (ts / (n+A-k))[1,1] *

solve(solve(Sigma) + t(X)%*%X)

conjugate.table <- t.ci.table(tB, R, degrees=n+A-k)

a <- n+A-k

b <- s2*(n+A-k)/2

sort.iv.gamma.sample <- sort(1/rgamma(10000,a,b))

sqrt.sort.iv.gamma.sample <- sqrt(sort.iv.gamma.sample)

conjugate.table <- rbind(conjugate.table,






conjugate.table

# create predicted data vlues and make a graph

beta <- conjugate.table[,1]

beta <- beta[-8]

yhat <- X %*% beta

plot(yhat ~ y, xlab="observed y value",

ylab="predicted y value")

abline(0,1)

6

Assessing Model Quality

Exercises

1. Derive the marginal posterior for β in

π(β|X,y) ∝[s+ (β − β)′(Σ−1 + X′X)(β − β)

]−(n+a2 )+1

,

where:

β = (Σ−1 + X′X)−1(Σ−1B + X′Xb)

s = 2b+ σ2(n− k) + (B− β)′Σ−1B + (b− β)′X′Xb.

from the joint distribution given by

π(β, σ2|X,y) ∝ σ−n−a exp

[−

1

2σ2

(s+ (β − β)′(Σ−1 + X′X)(β − β)

)].

See Chapter 4 (page 112) for terminology details.

I Answer

Let’s integrate the joint distribution with respect to σ2 to get the marginal

for β:

π(β|X,y) ∝∫ ∞

0

σ−n−A exp

[−

1

2σ2

(s+ (β + β)′(Σ−1 + X′X)(β + β)

)]︸︷︷︸

inverse gamma PDF kernel

dσ2

We know that, ∫ ∞0

qp

Γ(p)x−(p+1) exp

[−qx

]︸︷︷︸

IG(x|p,q)

dx = 1

⇒∫ ∞

0

x−(p+1) exp

[−qx

]dx =

Γ(p)

qp

71


By setting p = n+a2 −1, q = 1

2

(s+ (β + β)′(Σ−1 + X′X)(β + β)

), and

x = σ2, we have:

π(β|X,y) ∝ q−p ∝[s+ (β + β)′(Σ−1 + X′X)(β + β)

]−(n+a2 )+1

,

which is the marginal posterior for β above. (Q.E.D.)

3. Derive the form of the posterior predictive distribution for the beta-

binomial model. See Section 6.4.

I Answer

From Equation (2.15) we know that,

p(θ|y) =Γ(n+A+B)

Γ(y +A)Γ(n− y +B)θ(y+A)−1(1− θ)(n−y+B)−1

Thus the posterior predictive distribution for new data y is:

p(y|y) =

∫ 1

0

p(y|θ)p(θ|y)dθ

=

∫ 1

0

[(n

y

)θy(1− θ)n−y

× Γ(n+A+B)

Γ(y +A)Γ(n− y +B)θ(y+A)−1(1− θ)(n−y+B)−1

]dθ

=Γ(n+A+B)

Γ(y +A)Γ(n− y +B)

Γ(n+ 1)

Γ(n− y + 1)Γ(y + 1)

×∫ 1

0

θ(y+y+A)−1(1− θ)(n−y+n−y+B)−1dθ

=Γ(n+A+B)

Γ(y +A)Γ(n− y +B)

Γ(n+ 1)

Γ(n− y + 1)Γ(y + 1)

× Γ(y + y +A)Γ(n− y +B + n− y)

Γ(A+B + n+ n)

To calculate mean and variance of y, I denote A′ = y+A, B′ = n−y+B.

Therefore, we have:


= logβαΓ(γ)

δγΓ(α)

∫ ∞0

βα

Γ(α)xα−1 exp[−xβ]dx︸︷︷︸

=1

+ (α− γ)

∫ ∞0

(log x) · βα

Γ(α)xα−1 exp[−xβ]dx︸︷︷︸

=ψ(α)

− (β − γ)

∫ ∞0

x · βα

Γ(α)xα−1 exp[−xβ]dx︸︷︷︸=αβ

= logβαΓ(γ)

δγΓ(α)+ (α− γ)ψ(α)− (β − γ)

α

β,

where ψ(α) is the digamma function.

7. Calculate the Polasek (6.15) linear model diagnostic for each of the

data points in your model from Exercise 4.4. Graphically summarize

identifying points of concern.

I Answer

From Section 6.3.2.1, the Polasek linear model diagnostic can be calcu-

lated by:∂b(wi)

∂wi= (X′X)−1Xi

[yi −Xiβ

],

where β is the posterior mean for β.

< R.Code >

# data and setup

mk <- as.matrix(read.table("http://artsci.wustl.edu/~stats/

BMSBSA/Chapter06/meier.keiser.dat", header=T))

y <- mk[,1]; x <- mk[,-1]

# priors

BBeta <- rep(0, 8); Sigma <- diag(c(5,5,5,5,5,5,5,5))

A <- 1; B <- 1

# posteriors

x <- cbind(1, x); n <- nrow(x); k <- ncol(x)


H.inv <- solve(t(x) %*% x)

bhat <- H.inv %*% t(x) %*% y

s2 <- t(y - x %*%bhat) %*% (y - x %*% bhat) / (n-k)

tB <- solve(solve(Sigma) + t(x) %*% x) %*%

(solve(Sigma) %*% BBeta + t(x) %*% x %*% bhat)



t(x) %*% x %*% bhat

R <- diag(ts / (n+A-k-3)) *

solve(solve(Sigma) + t(x) %*% x)

0 20 40

−30

0−

200

−10

00

100

intercept

0 20 40

−20

020

4060

80

ACES

0 20 40

−0.

0010

0.00

000.

0010

INSTABIL

0 20 40

−2

−1

01

23

AAMBIG

0 20 40

−1.

0−

0.5

0.0

0.5

1.0

CSTAFF

0 20 40

−3

−2

−1

01

2

ARD

0 20 40

−20

−10

010

2030

40

ASLACK

0 20 40

−0.

02−

0.01

0.00

0.01

0.02

0.03

AEXPEND

FIGURE 6.1: Polasek Diagnostic for Meier and Keiser (1996) model


# Polasek diagnostic

e <- as.vector(y - x %*% tB)

polasek <- NULL

for (i in 1:n) pol <- solve(t(x)%*%x) %*% x[i,] %*% e[i]

polasek <- cbind(polasek, pol)

colnames(polasek) <- obs <- c(1:50)

rownames(polasek) <- indv <- c("intercept", "ACES",

"INSTABIL", "AAMBIG", "CSTAFF",

"ARD", "ASLACK", "AEXPEND")

# graphs

par(mfrow=c(2,4))

for (i in 1:k) plot(polasek[i,] ~ obs, main=indv[i], xlab="", ylab="")

9. In Section 6.2.2, it was shown how to vary the prior variance 50% by ma-

nipulating the beta parameters individually. Perform this same analysis

with the two parameters in the Poisson-gamma conjugate specification.

Produce a graph like Figure 6.4.

I Answer

We know that the prior λ ∼ G(α, β) and the likelihood p(y|λ) ∼ P(λ)

produces the posterior λ|y ∼ G(α+ ny, β + n). By using the data from

Exercise 2.3, we can make Figure 6.2 below.

< R.Code >

A <- c(10, 5, 15)

B <- c(2, 2.828, 1.633)

nx <- 2701; n <- 25

d1 <- A + nx; d2 <- B + n

ruler <- seq(90, 110, length=100)

par(mfrow=c(1,2))

plot(c(90, 110), c(0, 0.25), type="n",

xlab="", ylab="", main="Varying alpha: [5, 10, 15]")


90 95 100 105 110

0.00

0.05

0.10

0.15

0.20

0.25

Varying alpha: [5, 10, 15]

90 95 100 105 110

0.00

0.05

0.10

0.15

0.20

0.25

Varying beta: [2.828, 2, 1.633]

FIGURE 6.2: Exponential Model Sensitivity

for (i in 1:3) lines(dgamma(ruler, d1[i], d2[1]) ~ ruler, col=i)

plot(c(90, 110), c(0, 0.25), type="n",

xlab="", ylab="",

main="Varying beta: [2.828, 2, 1.633]")

for (i in 1:3) lines(dgamma(ruler, d1[1], d2[i]) ~ ruler, col=i)

11. The following data are annual firearm-related deaths in the United

States per 100,000 from 1980 to 1997, by row (source: National Center

for Health Statistics, Health and Aging Chartbook, August 24, 2001).


14.9 14.8 14.3 13.3 13.3 13.3 13.9 13.6 13.9

14.1 14.9 15.2 14.8 15.4 14.8 13.7 12.8 12.1

Specify a normal model for these data with a diffuse normal prior for

µ and a diffuse inverse gamma prior for σ2 (see Section 3.5 for the

more complex multivariate specification that can be simplified). In order

to obtain local robustness of this specification, calculate ||D(q)|| (see

page 213) for the alternative priors: q(µ) = N (14, 1), q(µ) = N (12, 3),

q(σ−2) = G(1, 1), q(σ−2) = G(2, 4).

I Answer

I specify the diffuse priors as µ ∼ N (14, 100) and σ−2 ∼ G(1, 0.1).

The resulting posterior distributions are shown in Figure 6.3. And the

robustness measure ||D(q)|| is:

Alternative Prior for µ ||D(q)||N (14, 1) 0.0667

N (12, 3) 0.1030

Alternative Prior for σ−2 ||D(q)||G(1, 1) 0.2797

G(2, 4) 0.8352

Table 6.1: Robustness measure ||D(q)||

Observe that the posteriors for µ look very similar in Figure 6.3, reflect-

ing the low level of robustness measure ||D(q)|| in Table 6.1. On the

other hand, the posteriors for σ−2 do not look very similar in Figure 3,

which is consistent with the high level of robustness measure ||D(q)|| inTable 6.1.

< R.Code >

# setup

require(runjags)

y <- c(14.9, 14.8, 14.3, 13.3, 13.3, 13.3, 13.9, 13.6,


Posteriors for mu w/ Different Priors

12 13 14 15 16

0.0

0.5

1.0

1.5

2.0

−−− N(14, 100)−−− N(14, 1)−−− N(12, 3)

Posteriors for sigma^(−2) w/ Different Priors

0 1 2 3 4

0.0

0.5

1.0

1.5

−−− G(1, 0.1)

−−− G(1, 1)

−−− G(2, 4)

FIGURE 6.3: Posteriors for µ and σ−2 w/ different priors

13.9, 14.1, 14.9, 15.2, 14.8, 15.4, 14.8, 13.7,

12.8, 12.1)

n <- length(y)

data <- dump.format(list(y=y, n=n))

inits <- dump.format(list(mu=1, sigma=1))

# diffuse priors

diffuse.model <- "model for (i in 1:n) y[i] ~ dnorm(mu, sigma)

mu ~ dnorm(14, 0.01)

sigma ~ dgamma(1, 0.1)

"diffuse <- run.jags(model=diffuse.model, data=data,

monitor=c("mu", "sigma"),

inits=inits, n.chains=1, burnin=1000,

sample=5000, plots=TRUE)

mu.diffuse <- as.vector(diffuse$mcmc[,1][[1]])

sigma.diffuse <- as.vector(diffuse$mcmc[,2][[1]])

# alternative priors

normal1.model <- "model for (i in 1:n)


y[i] ~ dnorm(mu, sigma)

mu ~ dnorm(14, 1)


"normal2.model <- "model for (i in 1:n) y[i] ~ dnorm(mu, sigma)

mu ~ dnorm(12, 3)


"gamma1.model <- "model for (i in 1:n) y[i] ~ dnorm(mu, sigma)

mu ~ dnorm(14, 0.01)

sigma ~ dgamma(1, 1)

"gamma2.model <- "model for (i in 1:n) y[i] ~ dnorm(mu, sigma)

mu ~ dnorm(14, 0.01)

sigma ~ dgamma(2, 4)

"normal1 <- run.jags(model=normal1.model, data=data,




normal2 <- run.jags(model=normal2.model, data=data,




gamma1 <- run.jags(model=gamma1.model, data=data,




gamma2 <- run.jags(model=gamma2.model, data=data,





mu.normal1 <- as.vector(normal1$mcmc[,1][[1]])

mu.normal2 <- as.vector(normal2$mcmc[,1][[1]])

sigma.gamma1 <- as.vector(gamma1$mcmc[,2][[1]])

sigma.gamma2 <- as.vector(gamma2$mcmc[,2][[1]])

# graph

p.mu.diffuse <- hist(mu.diffuse, freq=F,

breaks=seq(12, 16, length=30),

main="Posteriors for mu w/ Different Priors",

xlab="", ylab="", ylim=c(0, 2))

lines(density(mu.diffuse))

lines(density(mu.normal1), col=2)

lines(density(mu.normal2), col=3)

text(15.26, 1.5, "--- N(14, 100)")

text(15.20, 1.4, "--- N(14, 1)", col=2)

text(15.20, 1.3, "--- N(12, 3)", col=3)

p.sigma.diffuse <- hist(sigma.diffuse, freq=F,

breaks=seq(0, 4.5, length=30),

main="Posteriors for sigma^(-2) w/ Different Priors",

xlab="", ylab="", ylim=c(0, 1.5))

lines(density(sigma.diffuse))

lines(density(sigma.gamma1), col=2)

lines(density(sigma.gamma2), col=3)

text(3.05, 1, "--- G(1, 0.1)")

text(3, 0.9, "--- G(1, 1)", col=2)

text(3, 0.8, "--- G(2, 4)", col=3)

# calculate local robustness

p.mu.normal1 <- hist(mu.normal1, freq=F,

breaks=seq(12, 16, length=30))

p.mu.normal2 <- hist(mu.normal2, freq=F,

breaks=seq(12, 16, length=30))

robust.mu1 <- max(p.mu.normal1$density -

p.mu.diffuse$density)

robust.mu2 <- max(p.mu.normal2$density -

p.mu.diffuse$density)

p.sigma.gamma1 <- hist(sigma.gamma1, freq=F,

breaks=seq(0, 4.5, length=30))


p.sigma.gamma2 <- hist(sigma.gamma2, freq=F,

breaks=seq(0, 4.5, length=30))

robust.sigma1 <- max(p.sigma.gamma1$density -

p.sigma.diffuse$density)

robust.sigma2 <- max(p.sigma.gamma2$density -

p.sigma.diffuse$density)

robust <- rbind(robust.mu1, robust.mu2,

robust.sigma1, robust.sigma2)

robust

13. Calculate the posterior predictive distribution of X in Exercise 6 as-

suming that σ20 = 0.75 and using the prior: p(µ) = N (15, 1). Perform

the same analysis using the prior: C(15, 0.675). Describe the difference

graphically and with quantiles.

I Answer

From (6.22), we know that the posterior predictive distribution of X

has “posterior mean of µ” for its mean, and “σ20 + posterior variance of

µ” for its variance. So, it suffices to find the posterior distributions for

µ in order to describe the posterior predictive distribution of X.

mean variance

from Normal prior 14.0965 0.7522

from Cauchy prior 14.1193 0.7523

Table 6.2: Posterior predictive distributions from two priors

By using two different prior specifications, we have two sets of infor-

mation on the posterior predictive distribution, which is shown in Table

6.2. Figure 6.4. is demonstrating two posterior distributions for µ (NOT

posterior predictive distributions of X), which eventually contribute to

the shape of the posterior predictive distributions of X.

As is obvious from both Table 6.2 and Figure 6.4, the two posterior

predictive distributions are very simular.


Den

sity

13.0 13.5 14.0 14.5 15.0

0.0

0.5

1.0

1.5

2.0

−− N(15, 1)

−− Cauchy(15, 0.675)

FIGURE 6.4: Posterior distributions of mu from two priors

< R.Code >

# setup

require(runjags)

y <- c(14.9, 14.8, 14.3, 13.3, 13.3, 13.3, 13.9,

13.6, 13.9, 14.1, 14.9, 15.2, 14.8, 15.4,

14.8, 13.7, 12.8, 12.1)

n <- length(y)

data <- dump.format(list(y=y, n=n))

inits.normal <- dump.format(list(mu=1))

inits.cauchy <- dump.format(list(top=1, bottom=1))

# normal prior

normal.model <- "model for (i in 1:n) y[i] ~ dnorm(mu, pow(0.75, -1))


mu ~ dnorm(15, 1)

"normal <- run.jags(model=normal.model, data=data,

monitor=c("mu"), inits=inits.normal,

n.chains=1, burnin=1000,


mu.normal <- as.vector(normal$mcmc[[1]])

# cauchy prior

cauchy.model <- "model for (i in 1:n) y[i] ~ dnorm(mu, pow(0.75, -1))

mu <- 15 + 0.675 * cauchy # cauchy(a,b) = a+b*cauchy(0,1)

cauchy <- top / bottom # N(0,1) / N(0,1) ~ cauchy(0,1)

top ~ dnorm(0,1)

bottom ~ dnorm(0,1)

"cauchy <- run.jags(model=cauchy.model, data=data,

monitor=c("mu"), inits=inits.cauchy,

n.chains=1, burnin=1000,


mu.cauchy <- as.vector(cauchy$mcmc[[1]])

# compare

compare <- rbind(cbind(mean(mu.normal), var(mu.normal)),

cbind(mean(mu.cauchy), var(mu.cauchy)))

compare <- cbind(compare, 0.75 + compare[,2]/18)

colnames(compare) <- c("mean", "var of mu", "var")

rownames(compare) <- c("normal", "cauchy")

compare

p.mu.normal <- hist(mu.normal, freq=F,

breaks=seq(13, 15, length=20),

ylim=c(0, 2.3), xlab="", main="")

lines(density(mu.normal))

lines(density(mu.cauchy), col=2)

text(13.25, 1.5, "-- N(15, 1)")

text(13.395, 1.38, "-- Cauchy(15, 0.675)", col=2)


15. Calculate the posterior predictive distribution for the Palm Beach County

model (Section 4.1.1) using both the conjugate prior and the uninfor-

mative prior.

I Answer

E[ynew|y,X] = E [E[ynew|β, y,X]|y,X]

= E [Xβ|y,X] = X× (posterior mean of β)

Var[ynew|y,X] = E [Var[ynew|β, y,X]|y,X]

+ Var [E[ynew|β, y,X]|y,X]

= E[σ2|y,X

]+ Var [Xβ|y,X]

= E[σ2|y,X

]+ 0 = (posterior mean of σ2)

(a) Conjugate prior: β ∼ N (0,Σ = I), σ2 ∼ IG(2, 1)

E[ynew|y,X] = Xβ

Var[ynew|y,X] =1

2σ2(n+ 2− k)/(n+ 2− k − 1),

where β = (Σ−1 + X′X)−1(0 + X′Xb), σ2 = (y−Xb)′(y−Xb)n−k , b =

(X′X)−1X′y, n = 516, and k = 6.

(b) Uninformed prior: p(β) ∝ c, p(σ2) ∝ 1σ

E[ynew|y,X] = Xb

Var[ynew|y,X] = σ2(n− k)/(n− k − 3),

where b = (X′X)−1X′y, σ2 = (y−Xb)′(y−Xb)n−k , n = 516, and k = 6.

< R.Code >

# data and setup



dimnames(pbc.vote)[[2]] <- c("badballots", "teach", "new",



x <- cbind(1, pbc.vote[,-1])

y <- pbc.vote[,1]


xbar <- x

# posterior predictive distribution

bh <- solve(t(xbar) %*% xbar) %*% t(xbar) %*% y

bt <- solve(diag(1, k) + t(xbar) %*% xbar) %*%

(t(xbar) %*% xbar %*% bh)

s2 <- (t(y - xbar %*% bh) %*% (y - xbar %*% bh)) / (n - k)

st <- 2 + s2 * (n-k) +

t(bh - bt) %*% t(xbar) %*% xbar %*% bh

con.y.E <- xbar %*% bt

con.y.var <- (s2 * (n + 2 - k)) / (2 * (n + 2 - k - 1))

17. Using the model specification in (6.1), produce a posterior distribution

for appropriate survey data of your choosing. Manipulate the forms

of the priors and indicate the robustness of this specification for your

setting. Summarize the posterior results with tabulated quantiles.

I Answer

Try to utilize any types of survey data.

19. Leamer (1983) proposes reporting extreme bounds for coefficient values

during the specification search as a way to demonstrate the reliability of

effects across model-space. So the researcher would record the maximum

and minimum value that every coefficient took-on as different model

choices were attempted. Explain how this relates to Bayesian model

averaging.

I Answer

What Leamer’s proposal is doing is to record the maximum and the

minimum values that every coefficient takes as different model specifica-

tions are tried. And, the Bayesian model averaging is to 1) try multiple

parametric forms for the model specifications; 2) to weight each of them;


and 3) average them with their weights. Therefore, not just by recording

the extreme values, but by averaging all values (with proper weights),

Bayesian model averaging generalize the idea of Leamer’s proposal.

7

Bayesian Hypothesis Testing and the Bayes

Factor

Exercises

1. Perform a frequentist hypothesis test of H0 :θ = 0 versus H1 :θ = 500,

where X ∼ N (θ, 1) (one-tail at the α = 0.01 level). A single datapoint

is observed, x = 3. What decision do you make? Why is this not

a defensible procedure here? Does a Bayesian alternative make more

sense?

I Answer

The frequentist hypothesis test is:

(a) Under H0, t = 3−01 = 3 > 2.325. ∴ Reject H0.

(b) Under H1, t = 3−5001 = −497 < −2.325. ∴ Reject H1.

Since H0 and H1 are NOT mutually exclusive and exhaustive, the fre-

quentist procedure is a little problematic. However, if we use a Bayesian

alternatives, we can have the exact probability that a certain hypothesis

is correct. Then, we can discuss about how precise the hypotheses are.

3. Akaike (1973) states that models with negative AIC are better than the

saturated model and by extension the model with the largest negative

value is the best choice. Show that if this is true, then the BIC is a

better asymptotic choice for comparison with the saturated model.

I Answer

89


Note that:

BIC = −2l(θ|x) + p log n

AIC = −2l(θ|x) + 2p

Then, BIC has always a smaller negative value then AIC once we have

p log n > 2p (i.e. n > e2). Therefore, BIC would be much more “con-

servative” criterion when comparing with the saturated model. In other

words, BIC is a “better” asymptotic choice in the sense that it is a more

“conservative” choice when n becomes larger.

5. Derive the last line of (7.12) from: p(H0|data) = 1 − p(H1|data) =

1− 1B(x)

[p(data)p(H0) p(H0|data)

]p(H1)p(data) .

I Answer

Let p(H0|data) ≡ A. Then, we have:

A = 1− 1

B(x)·[p(data)

p(H0)·A]· p(H1)

p(data)

= 1− 1

B(x)· p(data)

p(H0)

p(H1)

p(data)·A

= 1− 1

B(x)· p(H1)

p(H0)·A

⇒ A+

[1

B(x)· p(H1)

p(H0)

]A = 1

⇒[1 +

1

B(x)· p(H1)

p(H0)

]A = 1

∴ A =

[1 +

1

B(x)· p(H1)

p(H0)

]−1

= p(H0|data) (Q.E.D.)

7. (Berger 1985). A child takes an IQ test with the result that a score over

100 will be designated as above average, and a score of under 100 will be

designated as below average. The population distribution is distributed

N (100, 225) and the child’s posterior distribution is N (110.39, 69.23).

Test competing designations on a single test, p(θ ≤ 100|x) vs. p(θ >

Bayesian Hypothesis Testing and the Bayes Factor 91

100|x), with a Bayes Factor using equally plausible prior notions (the

population distribution), and normal assumptions about the posterior.

I Answer

Let M1 : θ ≤ 100 and M2 : θ > 100. Then, we have

B(x) =p(M1|x)/p(M1)

p(M2|x)/p(M2)

=p(M1|x)

p(M2|x)(∵ p(M1) = p(M2) =

1

2)

p(M1|x) = 0.1059 can be calculated by pnorm(100, mean=110.39,

sd=sqrt(69.23)) in R . And, p(M2|x) = 1− p(M1|x) = 0.8941. There-

fore, B(x) = 0.10590.8941 = 0.1184, which is substantial evidence against M1.

9. Returning to the beta-binomial model from Chapter 2, set up a Bayes

Factor for: p < 0.5 versus p ≥ 0.5, using a uniform prior, and the data:

[0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]. Perform the integration step using

rejection sampling (Chapter 8).

I Answer

The resulting Bayes Factor is 3.3419, which is a strong evidence for

p < 0.5.

< R.Code >

# beta function

sdbeta <- function(x, alpha, beta) a <- factorial(alpha+beta-1)

b <- factorial(alpha-1) * factorial(beta-1)

c <- x^(alpha-1) * (1-x)^(beta-1)

return(a/b*c)

betamode <- function(alpha, beta)

(alpha - 1) / (alpha + beta - 2)

sdbetamode <- function(alpha, beta) sdbeta(betamode(alpha, beta), alpha, beta)


# rejection sampling for integration

int.beta <- function(n, alpha, beta) xvec <- runif(n, 0, 0.5)

yvec <- runif(n, 0, sdbetamode(alpha, beta))

fvec <- sdbeta(xvec, alpha, beta)

indicator <- ifelse(yvec < fvec, 1, 0)

return(mean(indicator)*sdbetamode(alpha, beta)/2)

# calculate Beta(7,10) and Bayes Facor

z <- int.beta(100000, 7, 10)

z/(1-z)

11. Demonstrate using rejection sampling the following equality from Bayes’

original (1763) paper, using different values of n and p:∫ 1

0

(n

p

)xp(1− x)n−pdx =

1

n.

I Answer

We have: ∫ 1

0

(n

p

)xp(1− x)n−pdx =

(n

p

)∫ 1

0

xp(1− x)n−pdx︸︷︷︸integral part

Since the integral part is bounded, the first thing is to find the maximum

value of xp(1− x)n−p under the bounded region of x (F.O.C.):

∂xp(1− x)n−p

∂x= pxp−1(1− x)n−p − x(n− p)(1− x)n−p−1 ≡ 0

∴ x =p

n

Therefore, the rejection method follows:

(a) Draw a from U [0, 1] and b from U [0, pn ].

(b) If ap(1− a)n−p > b then accept it.


(c) Repeat (a) and (b) many times.

(d) Calculate pn ×

#accepts#trials

From R , we can recognize that, even though we change n and p, the

values are always very similar to 1n .

< R.Code >

trial <- 10000

value <- function(n, p) max <- (p/n)^p * (1-p/n)^(n-p)

accept <- 0

for (i in 1:trial) a <- runif(1, 0, 1)

b <- runif(1, 0, max)

f <- a^p * (1-a)^(n-p)

if (f>b) accept <- accept + 1

else accept <- accept

int <- max*(accept/trial)*factorial(n) /

(factorial(n-p)*factorial(p))

return(int)

try1 <- value(10, 3); try2 <- value(10, 5)



ans <- rbind(c(1/10, try1, try2, try3),

c(1/20, try4, try5, try6))

colnames(ans) <- c("true", "1st try",

"2nd try", "3rd try")

> ans

true 1st try 2nd try 3rd try

[1,] 0.10 0.09104169 0.09026719 0.09132174

[2,] 0.05 0.04766731 0.04805365 0.04823303

13. Calculate the posterior distribution for β in a logit regression model

r(X′b) = p(y = 1|X′b) = 1/[1 + exp(−X′b)]


with a BE(A,B) prior. Perform a formal test of a BE(4, 4) prior versus

a BE(4, 1) prior. Calculate the Kullback-Leibler distance between the

two resulting posterior distributions.

I Answer

Let’s define f(b) as:

f(b) ≡n∏i=1

[(1

1 + exp(−X′b)

)yi ( exp(−X′b)

1 + exp(−X′b)

)1−yi]

The, the posterior distribution is:

π(b|X, y) =BE(b|A,B)f(b)∫ 1

0BE(b|A,B)f(b)db

The Bayes Factor for BE(4, 4) versus BE(4, 1) is:

BF =

∫ 1

0BE(b|4, 4)f(b)db∫ 1

0BE(b|4, 1)f(b)db

= 14×∫ 1

0b3(1− b)3f(b)db∫ 1

0b3f(b)db

And, the Kullback-Leibler distance is:

I (BE(4, 4),BE(4, 1)) =

∫ 1

0

log

[BE(b|4, 4)

BE(b|4, 1)· 1

BF

]BE(b|4, 4)f(b)db

= 70× log14

BF×∫ 1

0

b3(1− b)3f(b)db

15. (Aitkin 1991). Model 1 specifies y ∼ N (µ1, σ2) with µ1 specified and

σ2 known, Model 2 specifies y ∼ N (µ2, σ2) with µ2 unknown and the

same σ2. Assign the improper uniform prior to µ2: p(µ2) = C/2 over

the support [−C :C], where C is large enough value to make this a

reasonably uninformative prior. For a predefined standard significance

test level critical value z, the Bayes Factor for Model 1 versus Model 2

is given by:

B =2Cn1/2φ(z)

σ[Φ(n1/2 y+C

σ

)− Φ

(n1/2 y−C

σ

)] .


Show that B can be made as large as one likes as C → ∞ or n → ∞for any fixed value of z. This is an example of Lindley’s paradox for a

point null hypothesis test (Section 7.3.3.).

I Answer

First, as C →∞,

B =1

σ

∞[Φ(∞)− Φ(−∞)]

=1

σ

∞1− 0

=∞, for any fixed σ, n, z

Second, as n→∞,

B =1

σ

∞[Φ(∞)− Φ(∞)]

=1

σ

∞1− 1

=∞, for any fixed σ,C, z (Q.E.D.)

17. Given a Poisson likelihood function, instead of specifying the conjugate

gamma distribution, stipulate p(µ) = 1/µ2. Derive an expression for

the posterior distribution of µ by first finding the value of µ, which

maximizes the log density of the posterior, and then expanding a Taylor

series around this point (i.e., Laplace approximation). Compare the

resulting distribution with the conjugate result.

I Answer

From π(λ) = 1λ2 and Laplace approximation, we have the posterior

distribution as:

π(λ|x) = π(λo|x) +1

2(λ− λo)2π′′(λo|x),

where λo maximizes log π(λ|x).

In order to find λo, take F.O.C.:

∂ log π(λ|x)

∂λ=∂ log

[1λ2 ·

∏ni=1

λxie−λ

xi!

]∂λ

≡ 0


⇒∂ log

(λ∑xi−2 exp[−nλ]

)∂λ

= 0

⇒ ∂ [(∑xi − 2) log λ− nλ]

∂λ=

∑xi − 2

λ− n = 0

∴ λo =

∑xi − 2

n

Then, the posterior becomes:

π(λ|x) = λΣxi−2o exp[−nλo]

+1

2(λ− λo)2 · (Σxi − 2)(Σxi − 3)λΣxi−4 exp[−nλo]n2

=

(λ2o +

n2

2(λ− λo)2(Σxi − 2)(Σxi − 3)

)λΣxi−4o exp[−nλo]

=

[(Σxi − 2

n

)2

+n2

2

(λ− Σxi − 2

n

)2

(Σxi − 2)(Σxi − 3)

]

×(

Σxi − 2

n

)Σxi−4

exp[−(Σxi − 2)]

∝ n2

2(Σxi − 2)(Σxi − 3)

[λ− Σxi − 2

n

]2

Letting k1 ≡ Σxi−2n and k2 ≡ n2(Σxi−2)(Σxi−3) produces the posterior

distribution as:

π(λ|x) ∝ k2

2[λ− k1]

2

= log

(exp

[1

2/k2(λ− k1)2

])= logN

(k1,

1

k2

)Note that this is a little different from the posterior distribution obtained

from the conjugate prior G(α, β):

π(λ|x) = G(Σxi + α, β + n)

However, if we compare the means and variances from the two posterior

specifications, we can recognize that sufficient number of data would

make the two posterior converge.


19. Using the Palm Beach County voting model in Section 4.1.1. with unin-

formative priors, compare the full model with the nested model, leaving

out the technology variable using a local Bayes Factor. Experiment

with different training samples and compare the subsequent results. Do

you find that the Bayes Factor is sensitive to your selection of training

sample?

I Answer

The below R code utilizes the model specification that was discussed in

Section 4.1.1. Running this code multiple times would inform us how

sensitive the Bayes Factor is to the selection of training sample. For

the training sample below (keep) is designed to be a random sample

from the entire data. It turns out to be not sensitive. Additionally, it

is consistently in favor of keeping the technology variable in the model

(BF > 5× 1025).

< R.Code >

library(BaM); library(mvtnorm)

data(pbc.vote)

norm.like <- function(B,B.HAT,SIGMA.SQ,X)

SIGMA.SQ^(-nrow(X)/2) * exp( -0.5/SIGMA.SQ *

(-diff(dim(X)) * SIGMA.SQ + t(B-B.HAT) %*% t(X) %*%

X %*% (B-B.HAT)) )

# CREATE THE TWO DATASETS, STANDARDIZE FOR STABILITY

# X1 for bigger model, X2 for smaller model

# a for regular data, b for training data

attach(pbc.vote)

X1a <- cbind(new, size, Republican, white, technology)

X1b <- cbind(new, size, Republican, white )

Y1 <- badballots

detach(pbc.vote)

keep <- sample(1:nrow(X1a),200)

X2a <- X1a[keep,]; X2b <- X1b[keep,]; Y2 <- Y1[keep]

X1a <- apply(X1a,2,scale); X2a <- apply(X2a,2,scale)

X1b <- apply(X1b,2,scale); X2b <- apply(X2b,2,scale)

X1a <- as.matrix(cbind(rep(1,nrow(X1a)), X1a))


X1b <- as.matrix(cbind(rep(1,nrow(X1b)), X1b))

X2a <- as.matrix(cbind(rep(1,nrow(X2a)), X2a))

X2b <- as.matrix(cbind(rep(1,nrow(X2b)), X2b))

Y1 <- scale(Y1); Y2 <- scale(Y2)

# UNINFORMED PRIOR -> POSTERIOR PARAMETERS

bhat1a <- solve(t(X1a)%*%X1a)%*%t(X1a)%*%Y1

bhat1b <- solve(t(X1b)%*%X1b)%*%t(X1b)%*%Y1

s21a <- t(Y1- X1a%*%bhat1a)%*%(Y1- X1a%*%bhat1a)/

(nrow(X1a)-ncol(X1a))

s21b <- t(Y1- X1b%*%bhat1b)%*%(Y1- X1b%*%bhat1b)/

(nrow(X1b)-ncol(X1b))

R1a <- solve(t(X1a)%*%X1a)*((nrow(X1a)-ncol(X1a))* s21a/

(nrow(X1a)-ncol(X1a)-2))[1,1]

R1b <- solve(t(X1b)%*%X1b)*((nrow(X1b)-ncol(X1b))* s21b/

(nrow(X1b)-ncol(X1b)-2))[1,1]

bhat2a <- solve(t(X2a)%*%X2a)%*%t(X2a)%*%Y2

bhat2b <- solve(t(X2b)%*%X2b)%*%t(X2b)%*%Y2

s22a <- t(Y2- X2a%*%bhat2a)%*%(Y2- X2a%*%bhat2a)/

(nrow(X2a)-ncol(X2a))

s22b <- t(Y2- X2b%*%bhat2b)%*%(Y2- X2b%*%bhat2b)/

(nrow(X2b)-ncol(X2b))

R2a <- solve(t(X2a)%*%X2a)*((nrow(X2a)-ncol(X2a))* s22a/

(nrow(X2a)-ncol(X2a)-2))[1,1]

R2b <- solve(t(X2b)%*%X2b)*((nrow(X2b)-ncol(X2b))* s22b/

(nrow(X2b)-ncol(X2b)-2))[1,1]

# IMPORTANCE SAMPLING CALCULATION OF MARGINAL LIKELIHOODS

K <- 10000

bhat1 <- bhat1a; bhat2 <- bhat1b

R1 <- R1a; R2 <- R1b; s21 <- s21a; s22 <- s21b

X1 <- X1a; X2 <- X1b

g1 <- h1 <- rmvnorm(K,bhat1,R1)


G1 <- H1 <- rep(NA,K); G2 <- H2 <- rep(NA,K)

for (i in 1:K) G1[i] <- dmvnorm(g1[i,],bhat1,R1)

H1[i] <- norm.like(g1[i,],bhat1,s21,X1)

G2[i] <- dmvnorm(g2[i,],bhat2,R2)



m1 <- K/(sum(G1/H1)); m2 <- K/(sum(G2/H2))

BF.main <- m1/m2

bhat1 <- bhat2a; bhat2 <- bhat2b

R1 <- R2a; R2 <- R2b; s21 <- s22a; s22 <- s22b

X1 <- X2a; X2 <- X2b



G1 <- H1 <- rep(NA,K); G2 <- H2 <- rep(NA,K)

for (i in 1:K) G1[i] <- dmvnorm(g1[i,],bhat1,R1)


G2[i] <- dmvnorm(g2[i,],bhat2,R2)


m1 <- K/(sum(G1/H1)); m2 <- K/(sum(G2/H2))

BF.training <- m2/m1

(LBF <- BF.training*BF.main)

8

Bayesian Decision Theory

Exercises

To be provided.

101

9

Monte Carlo Methods

Exercises

1. Use the trapezoidal rule with n = 8 to evaluate the definite integral:

I(g) =

∫ π

0

excos(x)dx.

The correct answer is I(g) = −12.070346; can you explain why your

answer differs slightly?

I Answer

I(g) =π

8×[e0 cos(0) + eπ/8 cos(π/8)

]× 1

2

+π

8×[eπ/8 cos(π/8) + e2π/8 cos(2π/8)

]× 1

2

+ · · ·+ π

8×[e7π/8 cos(7π/8) + eπ cos(π)

]× 1

2

≈ −12.38216

Because n is quite small, the Trapezoid answer is a little different from

the true value. However, when n becomes larger, the Trapezoid answer

becomes much closer to the true value.

< R.Code >

trapezoid <- function(a, b, n) x.values <- seq(a, b, length=(n+1))

height <- NULL

area <- 0

for (i in 1:(n+1))

103


height[i] <- exp(x.values[i]) * cos(x.values[i])

for (i in 1:n) area <- area +

((b-a)/n) * (height[i] + height[i+1]) * 0.5

return(area)

> trapezoid(0, pi, 8)

[1] -12.38216


[1] -12.10213


[1] -12.07829

3. Use Simpson’s rule with n = 8 to evaluate the definite integral in Exer-

cise 8.1. Is your answer better?

I Answer

I(g) =π

8×

1

6×[e0 cos(0) + eπ/8 cos(π/8) + 4× eπ/16 cos(π/16)

]+π

8×

1

6×[eπ/8 cos(π/8) + e2π/8 cos(2π/8) + 4× e3π/16 cos(3π/16)

]· · ·+

π

8×

1

6×[e7π/8 cos(7π/8) + eπ cos(π) + 4× e15π/16 cos(15π/16)

]≈ −12.06995

Note that the Simpson’s Rule answer is much closer to the true value

even when n remains the same. Of course, as before, when n becomes

larger, the Simpson’s Rule answer becomes much closer to the true value.

< R.Code >

simpson <- function(a, b, n) x.value <- seq(a, b, length=(n+1))

height.point <- NULL

mid <- NULL

height.mid <- NULL

Monte Carlo Methods 105

area <- 0

for (i in 1:(n+1)) height.point[i] <- exp(x.value[i]) * cos(x.value[i])

for (i in 1:n) mid[i] <- (x.value[i] + x.value[i+1])/2

height.mid[i] <- exp(mid[i]) * cos(mid[i])

for (i in 1:n) area <- area + ((b-a)/n) * (1/6) * (height.point[i] +

height.point[i+1] + 4 * height.mid[i])

return(area)

> simpson(0, pi, 8)

[1] -12.06995

> simpson(0, pi, 25)

[1] -12.07034

> simpson(0, pi, 50)

[1] -12.07035

5. It is well known that the normal is the limiting distribution of the bino-

mial (i.e., that as the number of binomial experiments gets very large the

histogram of the binomial results increasing looks like a normal). Using

R , generate 25 BN (10, 0.5) experiments. Treat this as a posterior for

p = 0.5 even though you know the true value and improve your estimate

using importance sampling with a normal approximation density.

I Answer

The algorithm is:

(a) Draw 25 values from N (5, 25): x1, x2, · · · , x25.

(b) Round them because the binomial distribution allows only natural

numbers.

(c) Calculate the weights

wi =dbinom(xi, size=10, prob=0.5)

dnorm(xi, mean=5, var=2.5)


(d) Get

Mean =

∑25i=1 xiwi∑25i=1 wi

Variance =

∑25i=1 x

2iwi∑25

i=1 wi− [ Mean ]2

The importance sampling produces Mean = 4.8463 and Variance =

2.3802, which are not exactly same as the true values but pretty much

similar to those.

Just for the reference, the Monte Carlo analysis (about how the impor-

tance sampling is doing better as compared to the direct sampling from

the binomial distribution) is done. It turns out that, the importance

sampling is doing a great job when producing the mean. But, it is not

doing well when producing the variance (as the number of draws become

larger).

< R.Code >

app <- round(rnorm(25, mean=5, sd=sqrt(2.5)), 0)

weight <- dbinom(app, size=10, prob=0.5) /

dnorm(app, mean=5, sd=sqrt(2.5))

average <- sum(app*weight) / sum(weight)

variance <- sum(app^2 * weight)/ sum(weight) - average^2

imp <- cbind(average, variance)

> imp

average variance

[1,] 4.846275 2.380189

# compare with direct sampling from BINOM

# Monte Carlo analysis (w/ simulation number = 10000)

t <- c(25, 100, 1000, 5000, 10000)

improve <- NULL; estimate <- NULL

for (j in 1:5) m <- 0; v <- 0

for (i in 1:10000) y <- rbinom(t[j], size=10, prob=0.5)

m1 <- mean(y); v1 <- var(y)

x <- round(rnorm(t[j], mean=5, sd=sqrt(2.5)), 0)


w <- dbinom(x, size=10, prob=0.5) /

dnorm(x, mean=5, sd=sqrt(2.5))

m2 <- sum(x*w) / sum(w)

v2 <- sum(x^2 *w) / sum(w) - m2^2

if (abs(m2-5) < abs(m1-5)) m <- m + 1

else m <- m

if (abs(v2-2.5) < abs(v1-2.5)) v <- v + 1

else v <- v

improve <- rbind(improve, c(m, v))

estimate <- rbind(estimate, c(m1, m2, v1, v2))

result <- cbind(estimate[,1:2], improve[,1]/100,

estimate[,3:4], improve[,2]/100)

colnames(result) <- c("Mean-Binom.", "Mean-Imp.",

"Better(%)?",

"Var-Binom.", "Var-Imp.",

"Better(%)?")

rownames(result) <- c("N=25", "N=100", "N=1000",

"N=5000", "N=10000")

> result

Mean-Binom. Mean-Imp. Better (%)?

N=25 4.6800 5.010181 49.01

N=100 4.9700 5.080935 50.72

N=1000 4.9450 4.992402 49.98

N=5000 4.9734 5.029711 49.34

N=10000 5.0176 5.001492 49.10

Var-Binom. Var-Imp. Better (%)?

N=25 2.143333 2.459887 51.21

N=100 2.332424 2.537924 50.46

N=1000 2.530506 2.569648 43.46

N=5000 2.560004 2.650143 23.55

N=10000 2.550345 2.563561 11.35

7. Casella and Berger (2001, 638) give the famous temperature and failure

data on space shuttle launches before the Challenger disaster, where

failure is dichotomized. Actually several o-ring failure events were mul-

tiple events. Treat the number of multiple events as missing data and


write an EM algorithm to estimate the Poisson parameter given the

missing data. Here are the dichotomized failure data in R form:

temp <- c(53,57,58,63,66,67,67,68,69,70,70,70,70,70,72,73,

75,75,76,76,78,79,81),

fail <- c(1,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0).

I Answer

Set all the values of 1 as missing. Then, the EM algorithm is:

(a) Start with an arbitrary value for λ: λ(0) = 20

(b) Draw m values for ymis from f(ymis|yobs, λ(0)),

where f(ymis|yobs, λ(0)) = f(ymis|λ(0)) (∵ data are i.i.d), and

f(ymis|λ(0)) = P(ymis|λ(0),ymis ≥ 1) (∵ missing data ≥ 1).

(c) For the ith draw of ymis, calculate

`(i)(λ|yobs,y(i)mis) =

23∑j=1

yj

log(λ)− 23λ

(d) Take the mean

¯=

∑mi=1

[(∑23j=1 yj

)log(λ)− 23λ

]m

(e) Find λ that maximizes the mean above

(f) Repeat the procedure until λ(t) and λ(t+1) becomes very close (say

0.001)

From R , we know that λ is estimated to be 0.3652. The number 1’s

should be mostly 1 and some other numbers greater than 1.

< R.Code >

m <- 100

y <- c(NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,

NA, NA, 0, 0, 0, NA, 0, 0, 0, 0, 0)

fr <- function(lambda, y.draw) y.draw <- as.matrix(y.draw)

log.lik <- NULL


for (j in 1:m) y.new <- c(as.vector(y.draw[j, 1:4]),

as.vector(y[5:12]),

as.vector(y.draw[j, 5:6]),

as.vector(y[15:17]),

as.vector(y.draw[j, 7]),

as.vector(y[19:23]))

log.lik[j] <- sum(y.new)*log(lambda) - 23*lambda

return(- sum(log.lik) / m)

y.draw <- matrix(rep(NA, 7*m), m)

lambda <- cbind(0, 20); c <- 2

while(abs(lambda[c] - lambda[c-1]) > 0.001) for (i in 1:7) count <- 1

while (count <= m) draw <- rpois(1, lambda[c])

if (draw >= 1) y.draw[count, i] <- draw

count <- count + 1

else count <- count

lambda.hat <- optimize(f=fr, y.draw=y.draw,

lower=0, upper=100)

lambda <- cbind(lambda, lambda.hat$minimum)

c <- c + 1

> print(lambda[c])

[1] 0.3652197

> y.draw[1,]

[1] 2 1 3 1 1 1 1

9. Replicate the calculation of the Kullback-Leibler distance between the

two beta distributions in Section 8.3.1 using rejection sampling. In


the cultural anthropology example given in Section 2.3.3 the beta priors

BE(15, 2) and BE(1, 1) produced beta posteriors BE(32, 9) and BE(18, 8)

with the beta-binomial specification. Are these two posterior distribu-

tions closer or farther apart in Kullback-Leibler distance than their re-

spective priors? What does this say about the effect of the data in this

example?

I Answer

The first thing that we have to do is to calculate the two integrals by

using the rejection method. For each integral, the algorithm is:

(a) Find the minimum value (min) of the function (h) that is inside

the integral.

(b) Randomly draw x from U [0, 1], and y from U [min, 0].

(c) If h(x) ≤ y, accept it. Otherwise, reject it.

(d) Calculate | min | × # accept# draws .

Then, the resulting two Kullback-Leibler distances are 1.5345 for two

priors and 0.4816 for two posteriors. So, we can say that two distribu-

tions becomes closer in posteriors than in priors. This is because the

data reduces the impact of the prior specification.

< R.Code >

m <- 1000 # number of draws

fr.1 <- function(p, A, B) return(log(p) * p^(A-1) * (1-p)^(B-1))

fr.2 <- function(p, A, B) return(log(1-p) * p^(A-1) * (1-p)^(B-1))

kl.distance <- function(a, b, c, d)

min.1 <- optimize(f=fr.1, A=a, B=b,

lower=0.5, upper=0.9)

min.2 <- optimize(f=fr.2, A=a, B=b,

lower=0.5, upper=0.9)

x <- runif(m, 0, 1)

y1 <- runif(m, min.1$objective, 0)


y2 <- runif(m, min.2$objective, 0)

accept1 <- accept2 <- 0

for (i in 1:m) if (y1[i] < fr.1(x[i], a, b)) accept1 <- accept1

else accept1 <- accept1 + 1

if (y2[i] < fr.2(x[i], a, b)) accept2 <- accept2

else accept2 <- accept2 + 1

f.1 <- - min.1$objective * (accept1 / m)

f.2 <- - min.2$objective * (accept2 / m)

I <- log((gamma(a+b)*gamma(c)*gamma(d))/

(gamma(c+d)*gamma(a)*gamma(b))) -

(a-c)*(gamma(a+b)/(gamma(a)*gamma(b)))*f.1 -

(b-d)*(gamma(a+b)/(gamma(a)*gamma(b)))*f.2

return(I)

compare <- cbind(kl.distance(15, 2, 1, 1),

kl.distance(32, 9, 18, 8))

colnames(compare) <- c("distance b/w priors",

"distance b/w posterior")

> print(compare)

distance b/w priors distance b/w posterior

[1,] 1.534457 0.4815835

11. The Cauchy distribution (Appendix B) is a unimodal, symmetric density

like the normal but has much heavier tails. In fact the tails are suffi-

ciently heavy that the Cauchy distribution does not have finite moments.

To understand this characteristic, perform the following experiment at

least 10 times:

• Generate a sample of size 100 from C(x|θ, σ), where you choose

θ, σ.

• Perform Monte Carlo integration to attempt to calculate E[X].

• Compare your estimate with the mode (θ).

What do you conclude about the moments of the Cauchy?

I Answer


From the experiment in R , we observe that mean varies a lot (variance

of 203.4884) from around -80 to around 30. So, we cannot say about (or

define) the moments of the Cauchy distribution.

< R.Code >

cau.mean <- NULL

for (i in 1:50) cau <- rcauchy(100, location=0, scale=1)

mean <- sum(cau)/100

cau.mean <- c(cau.mean, mean)

cau.mean

result <- cbind(mean(cau.mean), var(cau.mean),

min(cau.mean), max(cau.mean))

colnames(result) <- c("mean of mean", "var of mean",

"min of mean", "max of mean")

> print(result)

mean of mean var of mean min of mean max of mean

[1,] -1.664704 203.4884 -80.29538 30.29719

13. Generate 100 standard normal random variables in R and use these

values to perform rejection sampling to calculate the integral of the nor-

mal PDF from 2 to ∞. Repeat this experiment 100 times and calculate

the empirical mean and variance from your replications. Now generate

10,000 Cauchy values and use this as an approximation distribution in

importance sampling to obtain an estimate of the same normal interval.

Which approach is more accurate?

I Answer

Using either algorithm produces no different result. The slight difference

in variance comes from the difference in number of draws (100 normal

draws for the rejection sampling vs. 10000 cauchy draws for the impor-

tance sampling). Once we adjust that difference, we have no different

result.

< R.Code >


reject <- function(T, n) int <- NULL

for (i in 1:T) draw <- rnorm(n, 0, 1)

accept <- draw[draw >= 2]

int[i] <- length(accept)/n

result <- cbind(mean(int), var(int))

colnames(result) <- c("mean", "var")

return(result)

imp <- function(T, n) int <- NULL

for (i in 1:T) p <- rcauchy(n, 0, 1)

w <- dnorm(p, 0, 1) / dcauchy(p, 0, 1)

index <- ifelse (p >= 2, 1, 0)

int[i] <- mean(w*index)

result <- cbind(mean(int), var(int))

colnames(result) <- c("mean", "var")

return(result)

compare <- rbind(reject(100, 100), imp(100, 10000),

imp(100, 100))

rownames(compare) <- c("rejection (100 draws)",

"importance (10000 draws)",

"importance (100 draws)")

> print(compare)

mean var

rejection (100 draws) 0.02380000 1.995556e-04

importance (10000 draws) 0.02286258 1.041832e-06

importance (100 draws) 0.02213381 1.197089e-04

15. The time to failure of two groups are observed, which are assumed to

be distributed exponential with unknown θ. The time to failure for the

first group, X1, X2, . . . , Xm, are fully observed, but the second group is

censored at time t > 0 because the study grant ran out, therefore the


Y1, Y2, . . . , Yn values are either t or some value less than t, which can be

given an indicator function: Oi = 1 if Yi = t and zero otherwise. The

expected value of the sufficient statistic is:

E

[n∑i=1

(Yi|Oi)

]=

(n∑i=1

Oi

)(t+ θ−1)

+

(n−

n∑i=1

Oi

)(θ−1 − t[1 + exp(t/θ)]−1),

and the complete data log likelihood is:

`(θ) = −m∑i=1

(log(θ−1) + xiθ)−n∑i=1

(log(θ−1) + yiθ).

Substitute the value of the sufficient statistic into the log likelihood

to get the Q(θ(k+1)|θ(k)) expression and run the EM algorithm for the

following data:

Xi 1 9 4 3 11 7 2 2 - -

Yi 2 3 2 4 4 2 4 4 3 1

I Answer

The estimated θ is 0.1456, which implies that the censored group (Yi)

should have 6.8697 as its mean. So, the number 4’s in Yi should be far

bigger than 4 for most of the time.

< R.Code >

x <- c(1,9,4,3,11,7,2,2)

y <- c(2,3,2,4,4,2,4,4,3,1)

n <- length(y)

t <- max(y)

on <- length(y[y==t])

fr <- function(theta, x, y.stat) theta <- as.numeric(theta)

x <- as.vector(x)

y.stat <- as.numeric(y.stat)

ll <- -sum(log(1/theta) + x*theta) -


n*log(1/theta) - theta*y.stat

return(-ll)

theta <- cbind(0,1)

c <- 2

while (abs(theta[c] - theta[c-1]) > 0.0001) new.theta <- theta[c]

y.stat <- on*(t + 1/new.theta) +

(n - on)*(1/new.theta - t/(1 + exp(t/new.theta)))

new.theta <- optimize(f=fr, x=x, y.stat=y.stat,

lower=0, upper=10)

theta <- cbind(theta, new.theta$minimum)

c <- c + 1

result <- cbind(theta[c], 1/theta[c])

colnames(result) <- c("est. theta", "est. mean of y")

> print(result)

est. theta est. mean of y

[1,] 0.1455665 6.869713

17. One of the easier, but very slow, ways to generate beta random variables

is Johnk’s method (Jonhk 1964; Kennedy and Gentle 1980), based on

the fact that order statistics from iid random uniforms are distributed

beta. The algorithm to deliver a single BE(A,B) random variable is

given by:

(a) Generate independent:

u1 ∼ U(0, 1), u2 ∼ U(0, 1)

(b) Transform:

v1 = (u1)1/A, v2 = (u2)1/B

(c) Calculate the sum:

w = v1 + v2

and return w if w ≤ 1, otherwise return to step 1.


Code this algorithm in R and produce 100 random variables each ac-

cording to: BE(15, 2), BE(1, 1), BE(2, 9), and BE(8, 8). For each of

these beta distributions, also produce 100 random variables with the

rbeta function in R and compare with the corresponding forms pro-

duced with Johnk’s method using qqplot. Graph these four com-

parisons in the same window by setting up the windowing command

par(mfrow=c(2,2)).

I Answer

Figure 8.1 shows that Johnk’s method produces the beta distribution

fairly well. As for the time, the first three cases do not take long. But,

BE(8, 8) case takes a little longer.

< R.Code >

draw.beta <- function(n, A, B) draw <- NULL

counter <- 0

while (counter < 100) u1 <- runif(1, 0, 1)

u2 <- runif(1, 0, 1)

v1 <- u1^(1/A)

v2 <- u2^(1/B)

w <- v1 + v2

if (w <=1) counter <- counter + 1

draw <- cbind(draw, v1/w)

return(draw)

par(mfrow=c(2,2))

qqplot(draw.beta(100, 15, 2), rbeta(100, 15, 2),

xlab="Johnk’s method", ylab="rbeta",

xlim=c(0,1), ylim=c(0,1),

main="BE(15, 2)")

abline(0, 1, lty=2)




0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

BE(15, 2)

Johnk's method

rbet

a

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

BE(1, 1)

Johnk's method

rbet

a

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

BE(2, 9)

Johnk's method

rbet

a

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

BE(8, 8)

Johnk's method

rbet

a

FIGURE 9.1: Drawing Beta Distribution: Johnk’s Method vs. rbeta


main="BE(1, 1)")

abline(0, 1, lty=2)




main="BE(2, 9)")

abline(0, 1, lty=2)





main="BE(8, 8)")

abline(0, 1, lty=2)

19. Hartley (1958, 182) fits a Poisson model to the following “drastically

censored” grouped data:

Number of Events 0 1 2 3+ Total

Group Frequency 11 37 64 128 240

Develop an EM algorithm to estimate the Poisson intensity parameter.

I Answer

Set all the values of 3 as missing. Then, the EM algorithm is:

(a) Start with an arbitrary value for λ: λ(0) = 20

(b) Draw m values for ymis from f(ymis|yobs, λ(0)),

where f(ymis|yobs, λ(0)) = f(ymis|λ(0)) (∵ data are i.i.d), and

f(ymis|λ(0)) = P(ymis|λ(0),ymis ≥ 3) (∵ missing data ≥ 3).

(c) For the ith draw of ymis, calculate

`(i)(λ|yobs,y(i)mis) =

240∑j=1

yj

log(λ)− 240λ

(d) Take the mean

¯=

∑mi=1

[(∑240j=1 yj

)log(λ)− 240λ

]m

(e) Find λ that maximizes the mean above

(f) Repeat the procedure until λ(t) and λ(t+1) becomes very close (say

0.001)

From R , we know that λ is estimated to be 2.8707, which is a little

bigger than the original λ value (2.2875). The λ was biased downwardly

because the data was censored. Put it differently, the number 3’s in the

data should be mostly 3 and some other numbers greater than 3.

< R.Code >


m <- 100

y.orginal <- c(rep(0,11),rep(1,37),rep(2,64),rep(3,128))

y <- c(rep(0,11),rep(1,37),rep(2,64),rep(NA,128))

fr <- function(lambda, y.draw) y.draw <- as.matrix(y.draw)

ll <- NULL

for (j in 1:m) y.new <- c(y[1:112], y.draw[j,])

log.lik[j] <- sum(y.new)*log(lambda) - 240*lambda

return(- sum(log.lik) / n)

y.draw <- matrix(rep(NA, 128*m), m)

lambda <- cbind(0, 20); c <- 2

while(abs(lambda[c] - lambda[c-1]) > 0.001) for (k in 1:128) count <- 1

while (count <= m) draw <- rpois(1, lambda[c])

if (draw >= 3) y.draw[count, k] <- draw

count <- count + 1

else count <- count

lambda.hat <- optimize(f=fr, y.draw=y.draw,

lower=0, upper=100)

lambda <- cbind(lambda, lambda.hat$minimum)

c <- c + 1

compare <- cbind(mean(y.orginal), l[c])

colnames(compare) <- c("Before EM", "After EM")

rownames(compare) <- "lambda"

> compare

Before EM After EM

lambda 2.2875 2.870666

> y.draw[1,]


[1] 3 4 3 4 3 3 5 3 7 4 3 6 7 5 3 5 8 5 3 3 5 4 4 4 5

[26] 4 4 5 4 3 3 3 4 3 3 3 3 5 3 3 5 7 4 3 6 3 3 5 3 3

[42] 5 3 3 6 3 3 5 3 3 5 3 3 6 3 3 5 3 3 3 4 3 4 4 3 5

[67] 3 3 3 4 6 4 5 4 3 3 5 4 8 6 6 4 4 4 4 3 3 3 5 7 4

[92] 4 4 5 4 3 3 4 3 4 3 6 3 5 3 3 3 4 4 5 3 4 3 3 6 3

[117] 3 4 4 3 5 3 3 5 7 5 3 4

10

Basics of Markov Chain Monte Carlo

Exercises

1. Find the values of α1, α2, and α3 that make the following a transition

matrix: 0.4 α1 0.0

0.3 α2 0.6

0.0 α3 0.4

. (10.1)

I Answer

0.4 + α1 + 0.0 = 1

0.3 + α2 + 0.6 = 1

0.0 + α3 + 0.4 = 1

⇒

α1 = 0.6

α2 = 0.1

α3 = 0.6

3. Using the transition matrix from Section 9.1.2, run the Markov chain

mechanically (step by step) in R using matrix multiplication. Start with

at least two very different initial states. Run the chains for at least ten

iterations.

I Answer

< R.Code >

transition <- matrix(c(.8, .6, .2, .4), ncol=2)

markov <- function(n, A, B) init <- matrix(c(A, B), 1)

out <- NULL

121


for (i in 1:n) init <- init %*% transition

out <- rbind(out, init)

return(out)

try1 <- markov(10, .9, .1)

try2 <- markov(10, .1, .9)

compare <- cbind(try1, try2)

colnames(compare) <- c("Pr(A)_try1", "Pr(B)_try1",

"Pr(A)_try2", "Pr(B)_try2")

rownames(compare) <- c("interation1", "interation2",

"interation3", "interation4",



"interation9", "interation10")

> print(compare)

Pr(A)_try1 Pr(B)_try1 Pr(A)_try2 Pr(B)_try2

interation1 0.7800000 0.2200000 0.6200000 0.3800000

interation2 0.7560000 0.2440000 0.7240000 0.2760000

interation3 0.7512000 0.2488000 0.7448000 0.2552000

interation4 0.7502400 0.2497600 0.7489600 0.2510400

interation5 0.7500480 0.2499520 0.7497920 0.2502080

interation6 0.7500096 0.2499904 0.7499584 0.2500416

interation7 0.7500019 0.2499981 0.7499917 0.2500083

interation8 0.7500004 0.2499996 0.7499983 0.2500017

interation9 0.7500001 0.2499999 0.7499997 0.2500003

interation10 0.7500000 0.2500000 0.7499999 0.2500001

Notice that, even with n=10, the state is almost approaching toward

(0.75, 0.25), which is the steady state of the given transition matrix.

5. For the following transition matrix, which classes are closed?0.50 0.50 0.00 0.00 0.00

0.00 0.50 0.00 0.50 0.00

0.00 0.00 0.50 0.50 0.00

0.00 0.00 0.75 0.25 0.00

0.50 0.00 0.00 0.00 0.50

Basics of Markov Chain Monte Carlo 123

I Answer

(a) State 1 can go to 1 and 2 at the first move, go to 4 through 2, and

go to 3 through (2 and 4). So, state 1 cannot go to state 5.

(b) State 2 can go to 2 and 4 at the first move, and go to 3 through 4.

So, state 2 cannot go to state 1 and state 5.

(c) State 3 can go to 3 and 4 at the first move, and nowhere else. So,

state 3 cannot go to state 1, state 2, and state 5.

(d) State 4 can go to 3 and 4 at the first move, and nowhere else. So,

state 4 cannot go to state 1, state 2, and state 5.

(e) State 5 can go to 1 and 5 at the first move, go to 2 through 1, go

to 4 through (1 and 2), and go to 3 through (1, 2, and 4). So, state

5 can go everywhere.

Therefore, according to the definition of closedness (if P(A, B) = 0, then

A is closed with respect to B),

(a) state 1, 2, 3, and 4 are closed with respect to state 5;

(b) state 2, 3, and 4 are closed with respect to state 1; and

(c) state 3 and 4 are closed with respect to state 2.

7. From the following transition matrix, calculate the stationary distribu-

tion for proportions: 0.0 0.4 0.6

0.1 0.0 0.9

0.5 0.5 0.0

.What is the substantive conclusion from the zeros on the diagonal of

this matrix?

I Answer

[a b c

]0.0 0.4 0.6

0.1 0.0 0.9

0.5 0.5 0.0

=[a b c

]

124Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEYb+ 5c = 10a

4a+ 5c = 10b

6a+ 9b = 10c

⇒

a = 11

14b

c = 4835b

Since a + b + c = 1, a = 55221 , b = 70

221 , c = 96221 , which produces the

stationary distribution as:[55

221

70

221

96

221

]As, R result shows, the substantive conclusion from zeros on the diagonal

of the transition matrix is that it converges slowly.

< R.Code >

transition <- matrix(c(0, .1, .5, .4, 0, .5, .6, .9, 0), ncol=3)

markov <- function(n, A, B, C) init <- matrix(c(A, B, C), 1)

out <- NULL

for (i in 1:n) init <- init %*% transition

out <- rbind(out, init)

return(out)

try <- markov(50, .8, .1, .1)

> print(try)

[,1] [,2] [,3]

[1,] 0.0600000 0.3700000 0.5700000

[2,] 0.3220000 0.3090000 0.3690000

[3,] 0.2154000 0.3133000 0.4713000

:

:

[36,] 0.2488689 0.3167422 0.4343889

[37,] 0.2488687 0.3167420 0.4343893

[38,] 0.2488689 0.3167421 0.4343890

[39,] 0.2488687 0.3167421 0.4343892

[40,] 0.2488688 0.3167421 0.4343891

[41,] 0.2488688 0.3167421 0.4343892

[42,] 0.2488688 0.3167421 0.4343891

[43,] 0.2488688 0.3167421 0.4343892


[44,] 0.2488688 0.3167421 0.4343891

[45,] 0.2488688 0.3167421 0.4343891

[46,] 0.2488688 0.3167421 0.4343891

[47,] 0.2488688 0.3167421 0.4343891

[48,] 0.2488688 0.3167421 0.4343891

[49,] 0.2488688 0.3167421 0.4343891

[50,] 0.2488688 0.3167421 0.4343891

9. Develop a Bayesian specification using BUGS to model the following

counts of the number of social contacts for children in a daycare center

with the objective of estimating λi, the contact rate:

Person 1 2 3 4 5 6 7 8 9 10

Age (ai) 11 3 2 2 7 5 6 9 7 4

Social contacts (xi) 22 4 2 2 9 3 4 5 2 6

Use the following specification:

λi ∼ G(α, β)

α ∼ G(1, 1)

β ∼ G(0.1, 1)

to produce a posterior summary for λi.

I Answer

< R.Code >

require(runjags)

posterior.summary <- function(posterior) # this function is widely used throughout the text

output <- NULL

n <- ncol(posterior)

for (i in 1:n) summary <- c(mean(posterior[,i]),

sd(posterior[,i]),

quantile(posterior[,i], .025),


median(posterior[,i]),

quantile(posterior[,i], .975))

output <- rbind(output, summary)

rownames(output) <- colnames(posterior)

colnames(output) <- c("mean", "sd", "2.5%", "25%",

"50%", "75%", "97.5%")

return(round(output, 4))

x <- c(22, 4, 2, 2, 9, 3, 4, 5, 2, 6)

a <- c(11, 3, 2, 2, 7, 5, 6, 9, 7, 4)

n <- length(x)

data <- dump.format(list(x=x, a=a, n=n))

inits <- dump.format(list(alpha=1, beta=1,

lambda=rep(1,n)))

daycare.model <- "model for (i in 1:n) x[i] ~ dpois(theta[i])

theta[i] <- lambda[i]*a[i]

lambda[i] ~ dgamma(alpha, beta)

alpha ~ dgamma(1, 1)

beta ~ dgamma(0.1, 1)

"daycare <- run.jags(model=daycare.model,

data=data, inits=inits, monitor=c("alpha", "beta",

"lambda"), n.chain=1, burnin=1000, sample=5000,

plot=TRUE)

> posterior.summary(daycare$mcmc[[1]])

mean sd 2.5% 50% 97.5%

alpha 1.7584 0.7708 0.6501 1.6356 3.6209

beta 1.5432 0.7726 0.4134 1.4192 3.3878

lambda[1] 1.8980 0.3964 1.1860 1.8685 2.7581

lambda[2] 1.2866 0.5585 0.4359 1.2067 2.6165

lambda[3] 1.0768 0.5739 0.2556 0.9807 2.4177

lambda[4] 1.0708 0.5674 0.2575 0.9807 2.4223

lambda[5] 1.2588 0.3874 0.6394 1.2195 2.1506

lambda[6] 0.7312 0.3456 0.2160 0.6834 1.5303


lambda[7] 0.7550 0.3202 0.2590 0.7137 1.4928

lambda[8] 0.6361 0.2504 0.2478 0.6052 1.2199

lambda[9] 0.4364 0.2295 0.0988 0.3982 0.9694

lambda[10] 1.4091 0.5212 0.5835 1.3411 2.6206

11. The table below provides the results of nine postwar elections in Italy

by proportion per political party. The listed parties are: Democrazia

Cristiana (DC), Partito Comunista Italiano (PCI), Partito Socialista

Italiano (PSI), Partito Socialista Democratico Italiano (PSDI), Partito

Repubblicano Italiano (PRI), and Partito Liberale Italiano (PLI). The

“Others” category is a collapsing of smaller parties: Partito Radicale

(PR), Democrazia Proletaria (DP), Partito di Unita Proletaria per il

Comunismo (PdUP), Movimento Sociale Italiano (MSI), South Tyrol

Peoples Party (SVP), Sardinian Action Party (PSA), Valdotaine Union

(UV), the Monarchists (Mon), and the Socialist Party of Proletarian

Unity (PSIUP). In two cases parties presented joint election lists and

the returns are split across the two parties here. The compositional data

suggest a sense of stability for postwar Italian elections even though Italy

has averaged more than one government per year since 1945.

Party 1948 1953 1958 1963 1968 1972 1976 1979 1983

DC 0.485 0.401 0.424 0.383 0.3910 0.388 0.387 0.383 0.329

PCI 0.155 0.226 0.227 0.253 0.2690 0.272 0.344 0.304 0.299

PSI 0.155 0.128 0.142 0.138 0.0725 0.096 0.096 0.098 0.114

PSDI 0.071 0.045 0.045 0.061 0.0725 0.051 0.034 0.038 0.041

PRI 0.025 0.016 0.014 0.014 0.0200 0.029 0.031 0.030 0.051

PLI 0.038 0.030 0.035 0.070 0.0580 0.039 0.013 0.019 0.029

Others 0.071 0.154 0.113 0.081 0.1170 0.125 0.095 0.128 0.137

Source: Instituto Centrale di Statistica, Italia

Develop a multinomial-logistic model for pij as the proportion received

by party i and election j with BUGS , using time as a explanatory variable


and specifying uninformative priors.

I Answer

The model is specified as:

pj ∼MN (1, pr1j , pr2j , · · · , pr7j)

prij =eα+βj∑9j=1 e

α+βj

α ∼ N (0, 100)

β ∼ N (0, 100)

< R.Code >

require(runjags)

p <- matrix(c(.485, .155, .155, .071, .025, .038, .071,

.401, .226, .128, .045, .016, .030, .154,

.424, .227, .142, .045, .014, .035, .113,

.383, .253, .138, .061, .014, .070, .081,

.391, .269, .0725, .0725, .02, .058, .117,

.388, .272, .096, .051, .029, .039, .125,

.387, .344, .096, .034, .031, .013, .095,

.383, .304, .098, .038, .030, .019, .128,

.329, .299, .114, .041, .051, .029, .137),

nrow=7)

m <- nrow(p); t <- ncol(p)

italy.data <- dump.format(list(p=p, m=m, t=t))

italy.inits <- dump.format(list(alpha=1, beta=1))

italy.model <- "model for (j in 1:t) p[1:m, j] ~ dmulti(pr[1:m, j], 1)

for (i in 1:m) pr[i,j] <- mu[i,j] / sum(mu[,j])

log(mu[i,j]) <- alpha + beta * j

alpha ~ dnorm(0, 0.01)

beta ~ dnorm(0, 0.01)

"italy <- run.jags(model=italy.model, data=italy.data,


inits=italy.inits, n.chain=1,

monitor=c("alpha", "beta"),

burnin=1000, sample=5000, plot=TRUE)

# the function posterior.summary was defined in Q9.5.

> posterior.summary(italy$mcmc[[1]])

mean sd 2.5% 50% 97.5%

alpha 0.0375 10.0505 -19.1669 -0.1279 20.5815

beta 0.4247 10.0329 -19.4075 0.2927 19.8428

13. Blom, Holst, and Sandell (1994) define a “homesick” Markov chain as

one where the probability of returning to the starting state after 2m

(m > 1) iterations is at least as large as moving to any other state:

p2m(x0, x¬0) ≤ p2m(x0, x0). Does the Markov chain defined by the

following transition matrix have homesickness?

P =

0.5 0.5 0.0 0.0

0.5 0.0 0.5 0.0

0.0 0.5 0.0 0.5

0.0 0.0 0.5 0.5

Do Markov chains becomes less homesick over time?

I Answer

Since the stationary state is [0.25, 0.25, 0.25, 0.25], the Markov chain can-

not overcome “homesickness.” As shown below, this can be confirmed

in R , too.

< R.Code >

P <- matrix(c(.5, .5, 0, 0, .5, 0, .5, 0,

0, .5, 0, .5, 0, 0, .5, .5),

nrow=4)

out <- list(state1=c(1,0,0,0), state2=c(0,1,0,0),

state3=c(0,0,1,0), state4=c(0,0,0,1))

for (m in 1:20) state <- list(state1=c(1,0,0,0), state2=c(0,1,0,0),

state3=c(0,0,1,0), state4=c(0,0,0,1))


for (i in 1:4) for (j in 1:m) state[[i]] <- state[[i]] %*% P %*% P

out[[i]] <- rbind(out[[i]], state[[i]])

> print(out)

$state1

[,1] [,2] [,3] [,4]

[1,] 1.0000000 0.00 0.00 0.0000000

[2,] 0.5000000 0.25 0.25 0.0000000

[3,] 0.3750000 0.25 0.25 0.1250000

[4,] 0.3125000 0.25 0.25 0.1875000

[5,] 0.2812500 0.25 0.25 0.2187500

[6,] 0.2656250 0.25 0.25 0.2343750

[7,] 0.2578125 0.25 0.25 0.2421875

[8,] 0.2539062 0.25 0.25 0.2460938

[9,] 0.2519531 0.25 0.25 0.2480469

[10,] 0.2509766 0.25 0.25 0.2490234

[11,] 0.2504883 0.25 0.25 0.2495117

[12,] 0.2502441 0.25 0.25 0.2497559

[13,] 0.2501221 0.25 0.25 0.2498779

[14,] 0.2500610 0.25 0.25 0.2499390

[15,] 0.2500305 0.25 0.25 0.2499695

[16,] 0.2500153 0.25 0.25 0.2499847

[17,] 0.2500076 0.25 0.25 0.2499924

[18,] 0.2500038 0.25 0.25 0.2499962

[19,] 0.2500019 0.25 0.25 0.2499981

[20,] 0.2500010 0.25 0.25 0.2499990

[21,] 0.2500005 0.25 0.25 0.2499995

$state2

[,1] [,2] [,3] [,4]

[1,] 0.00 1.0000000 0.0000000 0.00

[2,] 0.25 0.5000000 0.0000000 0.25

[3,] 0.25 0.3750000 0.1250000 0.25

[4,] 0.25 0.3125000 0.1875000 0.25

[5,] 0.25 0.2812500 0.2187500 0.25


[6,] 0.25 0.2656250 0.2343750 0.25

[7,] 0.25 0.2578125 0.2421875 0.25

[8,] 0.25 0.2539062 0.2460938 0.25

[9,] 0.25 0.2519531 0.2480469 0.25

[10,] 0.25 0.2509766 0.2490234 0.25

[11,] 0.25 0.2504883 0.2495117 0.25

[12,] 0.25 0.2502441 0.2497559 0.25

[13,] 0.25 0.2501221 0.2498779 0.25

[14,] 0.25 0.2500610 0.2499390 0.25

[15,] 0.25 0.2500305 0.2499695 0.25

[16,] 0.25 0.2500153 0.2499847 0.25

[17,] 0.25 0.2500076 0.2499924 0.25

[18,] 0.25 0.2500038 0.2499962 0.25

[19,] 0.25 0.2500019 0.2499981 0.25

[20,] 0.25 0.2500010 0.2499990 0.25

[21,] 0.25 0.2500005 0.2499995 0.25

$state3

[,1] [,2] [,3] [,4]

[1,] 0.00 0.0000000 1.0000000 0.00

[2,] 0.25 0.0000000 0.5000000 0.25

[3,] 0.25 0.1250000 0.3750000 0.25

[4,] 0.25 0.1875000 0.3125000 0.25

[5,] 0.25 0.2187500 0.2812500 0.25

[6,] 0.25 0.2343750 0.2656250 0.25

[7,] 0.25 0.2421875 0.2578125 0.25

[8,] 0.25 0.2460938 0.2539062 0.25

[9,] 0.25 0.2480469 0.2519531 0.25

[10,] 0.25 0.2490234 0.2509766 0.25

[11,] 0.25 0.2495117 0.2504883 0.25

[12,] 0.25 0.2497559 0.2502441 0.25

[13,] 0.25 0.2498779 0.2501221 0.25

[14,] 0.25 0.2499390 0.2500610 0.25

[15,] 0.25 0.2499695 0.2500305 0.25

[16,] 0.25 0.2499847 0.2500153 0.25

[17,] 0.25 0.2499924 0.2500076 0.25

[18,] 0.25 0.2499962 0.2500038 0.25

[19,] 0.25 0.2499981 0.2500019 0.25

[20,] 0.25 0.2499990 0.2500010 0.25

[21,] 0.25 0.2499995 0.2500005 0.25


$state4

[,1] [,2] [,3] [,4]

[1,] 0.0000000 0.00 0.00 1.0000000

[2,] 0.0000000 0.25 0.25 0.5000000

[3,] 0.1250000 0.25 0.25 0.3750000

[4,] 0.1875000 0.25 0.25 0.3125000

[5,] 0.2187500 0.25 0.25 0.2812500

[6,] 0.2343750 0.25 0.25 0.2656250

[7,] 0.2421875 0.25 0.25 0.2578125

[8,] 0.2460938 0.25 0.25 0.2539062

[9,] 0.2480469 0.25 0.25 0.2519531

[10,] 0.2490234 0.25 0.25 0.2509766

[11,] 0.2495117 0.25 0.25 0.2504883

[12,] 0.2497559 0.25 0.25 0.2502441

[13,] 0.2498779 0.25 0.25 0.2501221

[14,] 0.2499390 0.25 0.25 0.2500610

[15,] 0.2499695 0.25 0.25 0.2500305

[16,] 0.2499847 0.25 0.25 0.2500153

[17,] 0.2499924 0.25 0.25 0.2500076

[18,] 0.2499962 0.25 0.25 0.2500038

[19,] 0.2499981 0.25 0.25 0.2500019

[20,] 0.2499990 0.25 0.25 0.2500010

[21,] 0.2499995 0.25 0.25 0.2500005

15. Koppel (1999) studies political control in hybrid organizations (semi-

governmental) through an analysis of government purchase of a spe-

cific type of venture capital funds: investment funds sponsored by the

Overseas Private Investment Corporation (OPIC). The following data

provide three variables as of January 1999.

Develop a model using BUGS where the size of the fund is modeled by the

age of the fund and its investment status according to the specification:

Yi ∼ N (mi, τ)

mi = β0 + β1X1i + β2X2i + εi

ε ∼ N (0, k),

where: Y is the size of the fund, X1 is the age of the fund and X2 is a

dichotomous explanatory variable equal to one if the fund is investing


and zero otherwise. Set a value for the constant k and an appropriate

prior distribution for τ . Summarize the posterior distributions of the

unknown parameters of interest.

Fund Age Status Size ($M)

AIG Brunswick Millennium 3 Investing 300

Aqua International Partners 2 Investing 300

Newbridge Andean Capital Partners 4 Investing 250

PBO Property 1 Investing 240

First NIS Regional 5 Investing 200

South America Private Equity Growth 4 Investing 180

Russia Partners 5 Investing 155

South Asia Capital 3 Investing 150

Modern Africa Growth and Investment 2 Investing 150

India Private Equity 4 Investing 140

New Africa Opportunity 3 Investing 120

Global Environmental Emerging II 2 Investing 120

Bancroft Eastern Europe 3 Investing 100

Agribusiness Partners International 4 Investing 95

Caucus 1 Raising 92

Asia Pacific Growth 7 Divesting 75

Global Environmental Emerging I 5 Invested 70

Poland Partners 5 Invested 64

Emerging Europe 3 Investing 60

West Bank/Gaza and Jordan 2 Raising 60

Draper International India 3 Investing 55

EnterArab Investment 3 Investing 45

Israel Growth 5 Investing 40

Africa Growth 8 Divesting 25

Allied Capital Small Business 4 Divesting 20

I Answer

I set the constant k = 1, and assign the gamma distribution, G(.01.01)

to the prior for τ .

< R.Code >

require(runjags)

size <- c(300, 300, 250, 240, 200, 180, 155, 150,

150, 140, 120, 120, 100, 95, 92, 75,

70, 64, 60, 60, 55, 45, 40, 25, 20)

age <- c(3, 2, 4, 1, 5, 4, 5, 3, 2, 4, 3, 2, 3,


4, 1, 7, 5, 5, 3, 2, 3, 3, 5, 8, 4)

status <- c(rep(1, 14), 0,0,0,0,1,0,1,1,1,0,0)

n <- length(size)

e <- rnorm(n, 0, 1)

opic.data <- dump.format(list(size=size, age=age,

status=status,

n=n, e=e))

opic.inits <- dump.format(list(beta0=1, beta1=1,

beta2=1, pres=1))

opic.model <- "model for (i in 1:n) size[i] ~ dnorm(mu[i], pres)

mu[i] <- beta0+beta1*age[i]+beta2*status[i]+e[i]

beta0 ~ dnorm(0, .01)



pres ~ dgamma(.01, .01)

"opic <- run.jags(model=opic.model, data=opic.data,

inits=opic.inits, n.chain=1,

monitor=c("beta0", "beta1",

"beta2", "pres"),



> posterior.summary(opic$mcmc[[1]])

mean sd 2.5% 50% 97.5%

beta0 10.0548 9.8456 -9.4405 9.9184 29.2703

beta1 16.8120 5.1294 6.5871 16.9374 26.6733

beta2 12.4065 9.8219 -6.7362 12.4348 31.2947

pres 0.0001 0.0000 0.0000 0.0001 0.0002

17. (Norris 1997) A Markov chain is reversible if the distribution of θn|θn+1 =

t is the same as the θn|θn−1 = t. This means that direction of time does

not alter the properties of the chain. Show that the following irreducible


matrix does not define a reversible Markov chain. 0 p 1− p1− p 0 p

p 1− p 0

.See Besag et al. (1995) for details on reversibility.

I Answer

According to the definition of “reversibility,” there exist π = (π1, π2, π3)

s.t. πipij = πjpji,∀i, j.π1p12 = π2p21

π1p13 = π3p31

π2p23 = π3p32

π1 + π2 + π3 = 1

⇒

π1p = π2(1− p)π1(1− p) = π3p

π2p = π3(1− p)π1 + π2 + π3 = 1

∴ π1 = π2 = π3 =1

3, p =

1

2

Therefore, the transition matrix that is provided in the question defines

a reversible Markov chain ONLY when p = 12 , and therefore not in

general.

19. (Grimmett and Stirzaker 1992) A random walk is recurrent if the mean

size of the jumps is zero. Define a random walk on the integers by the

transition from integer i to either integer i+2 or i−1 with probabilities:

p(i, i+ 2) = p, p(i, i− 1) = 1− p.

A random walk is recurrent if the mean recurrence time,∑nfii(n), is

finite, otherwise it is transient. What values of p make this random walk

recurrent?

I Answer

Let fii(n) be the probability that the state i is first visited at time n,

given X0 = i.

The state i can be first visited only after the third transition, meaning

fii(1) = fii(2) = 0. With the same logic, fii(4) = fii(5) = 0, and fii(n)

for all n that is not divisible by 3.


At the third transition, the state i is first reached through three ways:

1) (i) → (i − 1) → (i − 2) → (i); 2) (i) → (i − 1) → (i + 1) → (i); and

3) (i)→ (i+ 2)→ (i+ 1)→ (i). This gives us:

fii(3) = p(1− p)2 + p(1− p)2 + p(1− p)2 = 3p(1− p)2.

So, we can eliminate the three cases that we counted. With the same

procedure, we know that the state i is first visited through (i−3) at the

third transition AND through (i+ 3) at the third transition. This gives

us:

fii(6) = 3p2(1− p)4 + 3p2(1− p)4 = 6p2(1− p)4.

With the same logic, we can proceed more to get fii(9), fii(12), · · · .

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.5

1.0

1.5

2.0

2.5

value of p

mea

n re

curr

ence

tim

e

FIGURE 10.1: Mean recurrence time with respect to p


Therefore, the mean recurrence time is:∑nfii(n) = 3× 3p(1− p)2 + 6× 6p2(1− p)4 + · · ·

= 32(12x+ 22x2 + 32x3 + 42x4 + · · ·

)=

32x(1 + x)

(1− x)3

where x = p(1− p)2.

We can make a graph for the mean recurrence time with respect to the

value of p. As we can observe in Figure 9.1, the mean recurrence time

is always finite within the region of 0 ≤ p ≤ 1. Therefore, the random

walk specified in the question is recurrent for all 0 ≤ p ≤ 1.

11

Implementing Bayesian Models with Markov

Chain Monte Carlo

Exercises

To be provided.

139

12

Bayesian Hierarchical Models

Exercises

1. Show that inserting (10.2) into (10.1) produces (10.3).

I Answer

yij = βj0 + βj1Xij + εij

= (γ00 + γ10Zj0 + uj0) + (γ01 + γ11Zj1 + uj1)Xij + εij

= γ00 + γ01Xij + γ10Zj0 + γ11Zj1Xij + uj1Xij + uj0 + εij ,

which is the equation (10.3).

3. For the following simple hierarchical form,

Xi|θ ∼ N (θ, 1)

θ|σ2 ∼ N (0, σ2)

σ2 ∼ 1/k, 0 < σ ≤ k,

express the full joint distribution p(θ, σ,X).

I Answer

p(θ, σ,X) ∝ p(X|θ)p(θ|σ2)p(σ2)

= N (θ, 1) · N (0, σ2) · 1

k

141


=

(1√2π

)nexp

[n∑i=1

−1

2(Xi − θ)2

]1√

2πσ2exp

[− 1

2σ2θ2

]1

k

∝ 1

σexp

[−1

2

n∑i=1

(Xi − θ)2 − 1

2σ2θ2

]

=1

σexp

[−1

2

n∑i=1

(Xi − θ)2+

(θ

σ

)2]

5. Using the data below on firearm-related deaths per 100,000 in the United

States from 1980 to 1998 (Source: National Center for Health Statistics.

Health, United States, 1999, With Health and Aging Chartbook, Hy-

attsville, Maryland: 1999.), test the hypothesis that there is a period of

higher rates by specifying a normal mixture model and comparing the

posterior distributions of λ1 and λ2 in the model:

xi ∼ N (λki |τ)

ki ∼ Categorical2(K)

τ ∼ G(εg1 , εg2)

λi ∼ N (0, εli)

K ∼ D(1, 1),

where K is the proportion of the observations distributed N (λk2 |τ) and

1−K is the proportion of the observations distributed N (λk1 |τ). Spec-

ify a BUGS model choosing values of the ε.. parameters that specify

somewhat diffuse hyperpriors.

14.9 14.8 14.3 13.3 13.3 13.3 13.9 13.6 13.9

14.1 14.9 15.2 14.8 15.4 14.8 13.7 12.8 12.1

I Answer

< R.Code >

require(runjags)

x <- c(14.9, 14.8, 14.3, 13.3, 13.3, 13.3,

13.9, 13.6, 13.9, 14.1, 14.9, 15.2,

Bayesian Hierarchical Models 143

14.8, 15.4, 14.8, 13.7, 12.8, 12.1)

n <- length(x)

d <- c(1, 1)

firearm.data <- dump.format(list(x=x, n=n, d=d))

firearm.inits <- dump.format(list(tau=1,

lambda=c(1,1),

K=c(.5,.5)))

firearm.model <- "model for (i in 1:n) x[i] ~ dnorm(mu[i], tau)

mu[i] <- lambda[k[i]]

k[i] ~ dcat(K[])

tau ~ dgamma(.01, .01)

lambda[1] ~ dnorm(0, .01)


K[1:2] ~ ddirch(d[])

"firearm <- run.jags(model=firearm.model,

data=firearm.data,

inits=firearm.inits, n.chain=1,

monitor=c("tau", "lambda", "K"),



> posterior.summary(firearm$mcmc[[1]])

mean sd 2.5% 50% 97.5%

tau 1.4672 0.8043 0.5649 1.2882 3.6936

lambda[1] 10.5694 7.7897 -12.7068 13.9496 15.2927

lambda[2] 8.5250 9.2400 -14.9768 13.6290 15.5635

K[1] 0.5768 0.3896 0.0063 0.7252 0.9965

K[2] 0.4232 0.3896 0.0035 0.2748 0.9937

Even though there is no strictly strong evidence, the hypothesis of a

higher rate period could be reasonable: λ1 and λ2 are a little different

with fairly equal proportion of K[1] and K[2].


7. (Carlin and Louis 2001). Given the hierarchical model:

Yi|λi ∼ P(λiti)

λi ∼ G(α, β)

β ∼ IG(A,B), α known

where the hyperpriors A and B are fixed. Express the full conditional

distributions for all of the unknown parameters.

I Answer

p(Yi, λi, β) ∝ p(Yi|λi)p(λi|β)p(β)

=

n∏i=1

(λiti)yi

yi!exp[−λiti]×

βα

Γ(α)λα−1i exp[−λiti]

× BA

Γ(A)β−(A+1) exp[−B/β]

∴ π(λi|β, Yi) ∝ λ∑yi+α−1

i exp [−(ti + β)λi]

= G(λi|∑

yi + α, ti + beta)

π(β|λi, Yi) ∝ βα−A−1 exp [−λiβ −B/β]

9. Hobert and Casella (1998) specify the following one-way random effects

model:

yij ∼ N (β + ui, σ2ε )

ui ∼ N (0, σ2)

p(β) ∝ 1

p(σ2ε ) ∼ 1/σ2

ε ), 0 < σ2ε <∞

p(σ2) ∼ 1/σ2), 0 < σ2 <∞,

where i = 1, . . . ,K and j = 1, . . . , J . Using the R function rnorm() gen-

erate contrived data for yij with K = 50, J = 5, and estimate the model

in BUGS . When improper priors are specified in hierarchical models, it

is important to make sure that the resulting posterior is not itself then


improper. Non-detection of posterior impropriety from looking at chain

values was a problem in the early MCMC literature. This model gives

improper posteriors although it is not possible to observe this problem

without running the chain for quite some time. Compare the standard

diagnostics for a short run with a very long run. What do you see? Now

specify proper priors and contrast the results.

I Answer

< R.Code >

# generating data

u <- matrix(rep(NA, 250), nrow=50)

for (i in 1:50) u[i,1:5] <- rnorm(5, 0, 1)

b <- matrix(rnorm(250, 3, 1), nrow=50)

e <- matrix(rnorm(250, 0, 1), nrow=50)

y <- u + b + e

# modeling

require(runjags)

K <- nrow(y); J <- ncol(y)

casella.data <- dump.format(list(y=y, K=K, J=J))

casella.inits <- dump.format(list(beta=1, sige=1,

sigma=1))

casella.improper.model <- "model for (i in 1:K) for (j in 1:J) y[i,j] ~ dnorm(mu[i,j], sige)

mu[i,j] <- beta + u[i]

u[i] ~ dnorm(0, sigma)

beta ~ dunif(-10000, 10000)

sige ~ dunif(-10000, 10000)

sigma ~ dunif(-10000, 10000)

"casella.improper1 <- run.jags(model=casella.improper.model,

data=casella.data,


inits=casella.inits,

n.chain=1, burnin=1000,

sample=5000, plot=TRUE,

monitor=c("beta", "sige",

"sigma"))

casella.improper2 <- run.jags(model=casella.improper.model,

data=casella.data,





"sigma"))

casella.proper.model <- "model for (i in 1:K) for (j in 1:J) y[i,j] ~ dnorm(mu[i,j], sige)

mu[i,j] <- beta + u[i]

u[i] ~ dnorm(0, sigma)

beta ~ dnorm(2, 1)

sige ~ dgamma(.5, 1)

sigma ~ dgamma(.5, 1)

"casella.proper <- run.jags(model=casella.proper.model,

data=casella.data,





"sigma"))

par(mfrow=c(3,3))

traceplot(casella.improper1$mcmc[[1]][,1],

main="beta (short run)",

ylim=c(2.2, 3.5))


main="beta (long run)",

ylim=c(2.2, 3.5))

traceplot(casella.proper$mcmc[[1]][,1],

main="beta (proper prior, short)",


ylim=c(2.2, 3.5))


main="sigma_e (short run)",

ylim=c(.2, .5))


main="sigma_e (long run)",

ylim=c(.2, .5))


main="sigma_e (proper prior, short run)",

ylim=c(.2, .5))


main="sigma (short run)",

ylim=c(0, 10000))


main="sigma (long run)",

ylim=c(0, 10000))


main="sigma (proper prior, short run)",

ylim=c(0, 15))


> posterior.summary(casella.improper1$mcmc[[1]])

mean sd 2.5% 50% 97.5%

beta 2.8828 0.109 2.6783 2.8826 3.0994

sige 0.3268 0.030 0.2698 0.3262 0.3864

sigma 5011.0088 2910.723 386.2101 5284.9250 9677.6095

> posterior.summary(casella.improper2$mcmc[[1]])

mean sd 2.5% 50% 97.5%

beta 2.8830 0.1105 2.6590 2.8845 3.0979

sige 0.3269 0.0296 0.2717 0.3259 0.3879

sigma 4850.2739 3007.6788 59.9005 4831.7800 9742.1902

> posterior.summary(casella.proper$mcmc[[1]])

mean sd 2.5% 50% 97.5%

beta 2.8689 0.1337 2.6106 2.8684 3.1394

sige 0.3493 0.0329 0.2873 0.3478 0.4166

sigma 3.0508 1.2014 1.3813 2.8223 5.9766

As shown in Figure 10.1, improper priors cause bad convergences, which

requires a very long iterations. However, proper priors produce relatively


FIGURE 12.1: Traceplots for One-Way Random Effects Model

well converged traceplots even with a fairly small number of iterations.

11. To introduce the “near-ignorance prior” Sanso and Pericchi (1994) spec-

ify the hierarchical model starting with a multivariate normal specifica-

tion:

Y|θ ∼MVN (Xθ, σ2I)

θi|λi ∼ N (µi, λ2i )

λ2i ∼ EX (1/τ2

i )


where i = 1, . . . , p and µi, τi, σ2 are known. Show that by substitution

and integration over λ that this model is equivalent to the hierarchical

specification:

Y|θ ∼MVN (Xθ, σ2I)

θi ∼ DE(µi, τi).

I Answer

f(θi) =

∫ ∞0

f(θi|λ2i )f(λ2

i )dλ2i

=

∫ ∞0

(2πλ2

i

)− 12 exp

[− 1

2λ2i

(θi − µi)2

]1

τ2i

exp

[−λ2

i

(1

τ2i

)]dλ2

i

= (2π)1

τ2i

∫ ∞0

(λ2i )− 1

2︸︷︷︸=xν−1

exp

− 1

2λ2i

(θi − µi)2︸︷︷︸=−β/x

−λ2i

(1

τ2i

)︸︷︷︸

=γx

dλ2i ,

where ν = 12 ,β = 1

2 (θi − µi)2, and γ = 1/τ2i .

Then, we have

f(θi) =

∫ ∞0

xν−1 exp

[−β

x− γx

]dx = 2

(β

γ

) ν2

Kν(2√βγ),

where Kν(z) is a modified Bessel function of imaginary argument:

Kν(z) = −πi2

exp[−π

2νi]

H(2)ν

(ze−

12πi)

H(2)ν (z) = ιν(z)− iΥν(z)

ιν(z) =zν

2ν

∞∑k=0

(−1)kz2k

22kk!Γ(ν + k + 1)

Υν(z) =1

sin(νπ)[cos(νπ)ιν(z)− ι−ν(z)]

Since z = 2√βγ = 2

√1

2τ2i

(θi − µi)2 = 2√2τi| θi − µi | and ν = 1

2 , we


have:

ι 12(z) =

(1√2τi| θi − µi |

) 12∞∑k=0

(−1)k(

1√2τi| θi − µi |

)2k

k!Γ(k + 1.5)

ι− 12(z) =

(1√2τi| θi − µi |

)− 12∞∑k=0

(−1)k(

1√2τi| θi − µi |

)2k

k!Γ(k − 0.5)

Υ 12(z) =

1

sin(π2 )

[cos(

π

2)ι 1

2(z)− ι− 1

2(z)]

Plugging these in gives us

f(θi) ∝1

τiexp

[−| θi − µi |

τi

],

which is a canonical form for DE(µi, τi).

13. Clyde, Muller, and Parmigiani (1995) specify the following hierarchical

model for 10 Bernoulli outcomes:

yi|βi, γi ∼ BR(p(β, λ,x))

p(yi = 1|βi, λi,x) = (1 + exp[−βi(x− λi)− log(0.95/0.05)])−1[

log βi

log λi

]∼MVN (µ,Σ)

µ ∼MVN

[ 0

2

],

[25 5

5 25

]−1

Σ ∼ W

10, 10×

[0.44 −0.12

−0.12 0.14

]−1 .

Using BUGS, obtain the posterior distribution for log β and log λ using the

data: y = [1, 0, 1, 1, 1, 1, 0, 1, 1, 0] with x generated in R from N (1, 2).

I Answer

< R.Code >

require(runjags)


y <- c(1, 0, 1, 1, 1, 1, 0, 1, 1, 0)

x <- rnorm(10, 1, 2)

n <- length(y)

mumu <- c(0, 2)

musig <- matrix(c(25, 5, 5, 25), 2)

sigsig <- matrix(c(0.44, -0.12, -0.12, 0.14), 2)

theta.in <- cbind(rep(1,10), rep(1, 10))

clyde.data <- dump.format(list(y=y, x=x, mumu=mumu, n=n,

musig=musig,

sigsig=sigsig))

clyde.inits <- dump.format(list(theta=theta.in))

clyde.model <- "model for (i in 1:n) y[i] ~ dbern(p[i])

p[i] <- 1/(1 + g[i])

log(g[i]) <- - beta[i] * (x[i] - lambda[i])

- log(0.95/0.05)

log(beta[i]) <- theta[i,1]

log(lambda[i]) <- theta[i,2]

theta[i,1:2] ~ dmnorm(mu[1:2], S[1:2,1:2])

mu[1:2] ~ dmnorm(mumu[1:2], musig[1:2,1:2])

S[1:2,1:2] ~ dwish(sigsig[1:2,1:2], 10)

"clyde <- run.jags(model=clyde.model, data=clyde.data,

inits=clyde.inits, n.chain=1,

burnin=1000, sample=5000, plot=TRUE,

monitor=c("theta"))

colnames(clyde$mcmc[[1]]) <- c("beta[1]", "beta[2]",

"beta[3]", "beta[4]",

"beta[5]", "beta[6]",

"beta[7]", "beta[8]",

"beta[9]", "beta[10]",

"lambda[1]", "lambda[2]",




"lambda[9]", "lambda[10]")



> posterior.summary(clyde$mcmc[[1]])

mean sd 2.5% 50% 97.5%

beta[1] -0.6457 0.2816 -1.2149 -0.6394 -0.1038

beta[2] -0.3872 0.3527 -1.0017 -0.4259 0.3786

beta[3] -0.7028 0.2982 -1.2711 -0.7021 -0.1088

beta[4] -0.5088 0.3380 -1.1601 -0.5026 0.1884

beta[5] -0.7409 0.3038 -1.3286 -0.7283 -0.1150

beta[6] -0.5157 0.3470 -1.1704 -0.5212 0.2559

beta[7] -0.4188 0.3648 -1.0883 -0.4360 0.3569

beta[8] -0.6620 0.3076 -1.2663 -0.6573 -0.0916

beta[9] -0.6162 0.3132 -1.3290 -0.6057 -0.0341

beta[10] -0.4134 0.3546 -1.0529 -0.4510 0.3488

lambda[1] 1.8409 0.2021 1.4365 1.8437 2.1969

lambda[2] 1.8685 0.2009 1.4436 1.8926 2.2618

lambda[3] 1.8479 0.2104 1.4033 1.8578 2.2298

lambda[4] 1.8396 0.2090 1.4415 1.8510 2.2186

lambda[5] 1.8548 0.2117 1.4356 1.8546 2.2480

lambda[6] 1.8393 0.2090 1.3686 1.8545 2.2326

lambda[7] 1.8717 0.2021 1.4624 1.8821 2.2402

lambda[8] 1.8404 0.2103 1.4083 1.8441 2.2452

lambda[9] 1.8412 0.2154 1.4007 1.8515 2.2478

lambda[10] 1.8920 0.1991 1.4910 1.9040 2.2714

15. Scollnik (2001) considers the following actuarial claims data for three

groups of insurance policyholders, with missing values.

Group 1 Group 2 Group 3

Year Payroll Claims Payroll Claims Payroll Claims

1 280 9 260 6 NA NA

2 320 7 275 4 145 8

3 265 6 240 2 120 3

4 340 13 265 8 105 4

5 NA NA 285 NA 115 NA


Replicate Scollnik’s hierarchical Poisson model using BUGS :

Yij ∼ P(λij)

λij = Pijθj

θj ∼ G(α, β) Pij ∼ G(γ, δ)

α ∼ G(5, 5) γ ∼ U(0, 100)

β ∼ G(25, 1) δ ∼ U(0, 100),

where i = 1, 2, 3 and j = 1, . . . , 5.

I Answer

< R.Code >

require(runjags)

y <- matrix(c(9, 6, NA, 7, 4, 8, 6, 2, 3,

13, 8, 4, NA, NA, NA), nrow=3)

K <- nrow(y); J <- ncol(y)

scollnik.data <- dump.format(list(y=y, K=K, J=J))

scollnik.inits <- dump.format(list(alpha=1, beta=1,

gamma=1, delta=1))

scollnik.model <- "model for (j in 1:J) for (i in 1:K) y[i,j] ~ dpois(lambda[i,j])

lambda[i,j] <- p[i,j] * theta[j]

p[i,j] ~ dgamma(gamma, delta)

theta[j] ~ dgamma(alpha, beta)

alpha ~ dgamma(5,5)

beta ~ dgamma(25,1)

gamma ~ dunif(0,100)

delta ~ dunif(0,100)

"scollnik <- run.jags(model=scollnik.model,

data=scollnik.data,

inits=scollnik.inits,




monitor=c("alpha", "beta",

"gamma", "delta"))

> posterior.summary(scollnik$mcmc[[1]])

mean sd 2.5% 50% 97.5%

alpha 1.3470 0.4674 0.5819 1.2878 2.4244

beta 24.7008 4.7755 16.3794 24.4065 34.8211

gamma 53.7372 24.5962 14.3150 50.2538 97.7965

delta 0.4334 0.2205 0.1232 0.3869 0.8882

17. Show that de Finetti’s property holds true for a mixture of normals.

Hint: see Freedman and Diaconis (1982).

I Answer

Let’s say

x1[1], x

1[2], · · · , x

1[n] ∼ N 1

x2[1], x

2[2], · · · , x

2[n] ∼ N 2

The exchangeability in N 1 and N 2 means that changing the order of

x1’s and x2’s do NOT matter.

Construct a mixture normal model by w1·N 1+w2·N 2 where w1+w2 = 1.

So, the draws from the mixture normal model are:

w1 · x1[1] + w2 · x2

[1],

w1 · x1[2] + w2 · x2

[2],

· · · ,w1 · x1

[n] + w2 · x2[n]

What we care about is the order of the above draws. And, since the

orders of x1’s and x2’s do NOT matter, the order of(w1 · x1 + w2 · x2

)’s

does NOT matter, either. Therefore, de Finetti’s property holds true

for a mixture of normals.

19. Plug (10.39) in to (10.36) to obtain (10.40), the empirical Bayes esti-

mates of the posterior mean. Using the same logic and (10.37), produce


the empirical Bayes estimate of the posterior variance, assuming that it

is now not know.

I Answer

µi =σ2

0m+ s2Xi

σ20 + s2

=σ2

0

σ20 + s2

m+s2

σ20 + s2

Xi

=σ2

0

σ20 + s2

m+

(1− σ2

0

σ20 + s2

)Xi

= E

[(p− 3)σ2

0∑(Xi − ¯X)2

]¯X +

(1− E

[(p− 3)σ2

0∑(Xi − ¯X)2

])Xi

∴ µEBi =(p− 3)σ2

0∑(Xi − ¯X)2

¯X +

(1− (p− 3)σ2

0∑(Xi − ¯X)2

)Xi,

which is (10.40).

From (10.39), we have:

s2 =σ2

0 − σ20E[

(p−3)σ20∑

(Xi− ¯X)2

]E[

(p−3)σ20∑

(Xi− ¯X)2

] ,

which gives us:

σ2µ = E

[(p− 3)σ2

0∑(Xi − ¯X)2

]×σ2

0 − σ20E[

(p−3)σ20∑

(Xi− ¯X)2

]E[

(p−3)σ20∑

(Xi− ¯X)2

]=

(1− E

[(p− 3)σ2

0∑(Xi − ¯X)2

])σ2

0

∴ σ2EB

µ =

(1− E

[(p− 3)σ2

0∑(Xi − ¯X)2

])σ2

0 ,

which is the empirical Bayes estimate of the posterior variance.

13

Some Markov Chain Monte Carlo Theory

Exercises

1. In Section , three conditions were given for F to be an associated field

of Ω. Show that the first condition could be replaced with ∅ ∈ F using

properties of one of the other two conditions. Similarly, prove that the

Kolmogorov axioms can be stated with respect the probability of the

null set or the probability of the complete set.

I Answer By assumption,

φ ∈ F

If A ∈ F → AC ∈ F

If A ∈ F ,∀i →∞⋃i=1

Ai ∈ F

Since φC = Ω, from the conditions (1) and (2), φ ∈ F → Ω ∈ F .

3. Suppose we have the probability space (Ω,F , P ), sometimes called a

triple, and A1, A2, . . . , Ak ∈ F . Prove the finite sub-additive property

that:

P

(k⋃i=1

Ai

)≤

k∑i=1

p(Ai)

(Boole’s Inequality).

I Answer Start with a basic principle:

p(Aj⋃A`) = p(Aj) + p(A`)− p(Aj

⋂A`).

157


Which is generalized as:

p(

k⋃i=1

Ai) =

k∑i=1

p(Ai)−k∑i 6=1

(A1

⋂Ai)−

k∑i 6=2

(A2

⋂Ai) . . .−

k∑i6=k

(Ak⋂Ai).

Since each of the subtracted terms must be non-negative, we know that

p(k⋃i=1

Ai) ≤k∑i=1

p(Ai).

5. Prove that uniform ergodicity gives a faster rate of convergence than

geometric ergodicity and that geometric ergodicity gives a faster rate of

convergence than ergodicity of degree 2.

I Answer Using notation from Tierny (1994). Consider two Markov

chains P1 and P2 at step n:

||Pn1 (x, ·)− π(·)||TV ≤M(x)rn1

||Pn2 (x, ·)− π(·)||TV ≤ K0rn2

where 0 < r1, r2 < 1, K0 is a positive constant, and || ||TV denotes the

total variation norm. Therefore the first chain is geometrically ergodic

and the second chain is uniformly ergodic. If M(x) is bounded then the

first chain is also uniformly ergodic since we can set the upper bound

as K.

Given a specific n, let Kn =rn2rn1

(K0), then there is a An ⊂ X with

π(An) > 0 such that for all x ∈ An:

||Pn1 (x, ·)− π(·)||TV ≥ Krn1

=

(K0

rn2rn1

)rn1 = K0r

n2

≥ ||Pn2 (x, ·)− π(·)||TV .

7. Construct a stochastic process without the Markovian property and con-

struct deliberately non-homogeneous Markov chain.

I Answer (A) Define a stochastic process, θ[t](t ≥ 0) on the prob-

ability space (Ω,F , P ) as discussed in Section 11.2 where the random

Some Markov Chain Monte Carlo Theory 159

variable θ[t] is conditional on the history of the process in the following

manner:

• E|θ[t]| < inf

• E(θ[t+1]|X1, X2, . . . , Xt) = θ[t]

• θ[t] = f(n(X1, X2, . . . , Xt)

where fn() often the arithmetic mean. This is called a Martingale.

(B) A deliberately nonhomogeneous Markov chain on I+, with k, c ∈ Iis defined by:

p(θ[t+1]) =∣∣∣f(θ[t] + t(modk) + c

∣∣∣ ,where f() is any positive bounded function. For example,

X <- 50; k <- 10

move <- function(X,t,k) abs( rpois(1,X) + t %% k - 5 )

for (i in 1:5000)

X <- move(X,i,k)

print(X)

9. Give an example of a maximally ψ-irreducible measure and one that it

dominates.

I Answer Define ψ as a postiive, σ-finite measure on (Ω,F) with

arbitrary A ∈ F 3 ψ(A) > 0, where for any other positive σ-finite

measure on (Ω,F) that ψ > ψ′ (maximally).

When F is discrete and A ⊂ F :

x −→ y ∀x ∈ F , y ∈ A,

and the irreducible measure is just the counting measure on A. Thus

for matrices:

x −→ y ∀x, y ∈ F .

Denote this as:

CardB(F) = Card(A ∩B), B ⊂ F ,


for some other subset B.

Now suppose that k is a CardB-irreducible transition matrix, with:

G = y ∈ F : x −→ y for somex ∈ B.

Then the counting measure on G, CardG, is maximally irriducible and

dominates CardF .

11. A single-dimension Markov chain, M(θt, t ≥ 0), with transition matrix

P and unique stationary distribution π(θ) is time-reversible iff:

π(θi)p(θi, θj) = π(θj)p(θj , θi).

for all θi and θj in Ω. Prove that the reversed chain, defined by M(θs, s ≤0) = M(θt,−t ≥ 0), has the identical transition matrix.

I Answer Notate Xn as the nth value from a Markov chain with

stationary distribution π(θ) and transition kernel p(θi, θj) = pij . Now

Yn = X−n is the nth value from a Markov chain with stationary distri-

bution also π(θj), but transition kernel p(θj , θi) = qji. Then:

qij = p(Yn+1 = j|Yn = i)

= p(X−n−1 = j|X−n = i) [set k = −n]

= p(Xk−1 = j|Xk = i)

=p(Xk−1 = j,Xk = i)

p(Xk = i)

=p(Xk−1 = j)

p(Xk = i)p((Xk = i|Xk−1 = ji).

Therefore π(θi)p(θi, θj) = π(θj)p(θj , θi) holds only when qij = pij ,

meaning that the transition kernels are the same but run in opposite

directions with regard to time.

13. Meyn and Tweedie (1993, 73) define an n-step taboo probability as:∑AP

n(x,B) = Px(Φn ∈ B, τA ≥ n), x ∈ X, A,B ∈ B(X),

meaning the probability that the Markov chain, Φ, transitions to set Bin n steps avoiding (not hitting) set A. Here X is a general state space


with countably generated σ-field B(X), and τA is the return time to A(their notation differs slightly from that in this chapter). Show that:∑

APn(x,B) =

∫AcP (x, dy)AP

n−1(y,B),

where Ac denotes the complementary set to A.

I Answer For some y ∈ X as well:∑AP

n(x,B) = p(moves from x to y at step n without hitting A)

× p(reached step n− 1 without hitting A)

= p(Φn 6∈ A|Φn−1 = x)p(Φ1, . . . ,Φn−1 6∈ A)

=

∫AcP (x, dy)AP

n−1(y,B).

15. The Langevin algorithm is a Metropolis-Hastings variant where on each

step a small increment is added to the proposal point in the direction

of higher density. Show that making this this increment an increment

of the log gradient in the positive direction produces an ergodic Markov

chain by preserving the detailed balance equation.

I Answer Define the Langevin Diffusion (LD) for an n-dimensional

posterior distribution, smooth πn, having variance σ2, as the diffusion

process that satisfies the stochastic differential equation:

d∧t = σdBt +σ2

2∇ log [πn(∧t)] dt,

where Bt is n-dimensional standard Brownian motion, and ∇ denotes a

gradient df/dx. A discrete approximation is given by:

d∧t+1 = ∧t + σnZt +σ2n

2∇ log [πn(∧t)] dt,

where Ztiid∼ N(0, 1), t = 1, . . . , n, and σ2

n is a chosen step variance.

This can be transient without a Metropolis step (Roberts and Tweedie

1996), so for producing a candidate from the current position, Xt −→Yt+1, we calculate:

Yt+1 = Xt + σnZt + delformXt,


which is a random where σnZt+1 is random step andσ2n

2 ∇ log [πn(Xt)]

another random step but always towards higher density. Note that in

practice the choice of σ2n is critical to the performance of this chain.

We accept the candidate Yt+1 as move Xt+1 with probabiliity:

αn(Xt, Yt+1) =πn(Yt+1)qn(Yt+1, Xt)

πn(Yt+1)qn(Xt, Yt+1),

where:

qn(x, y) = (2πσ2n)−

n2 exp

[− 1

2σ2n

∥∥∥∥y − x− σ2n

2∇ log [πn(x)]

∥∥∥∥2

2

]≡

n∏i=1

q(xni , yi),

(‖ ‖ denotes the standard L2 norm for vectors), otherwise set Xt+1 =

Xt. We need to show detailed balance holds, π(x)p(x, y) = π(y)p(y, x),

where

p(x, y) = qn(x, y)min

[1,π(y)

π(x)

], x 6= y.

Substituting in the definitions. . .

π(x)(2πσ2n)−

n2 exp

[− 1

2σ2n

∥∥∥∥y − x− σ2n

2∇ log [πn(x)]

∥∥∥∥2

2

]min

[1,π(y)

π(x)

]?= π(y)(2πσ2

n)−n2 exp

[− 1

2σ2n

∥∥∥∥x− y − σ2n

2∇ log [πn(y)]

∥∥∥∥2

2

]min

[1,π(x)

π(y)

]

min [π(y), π(x)] exp

[− 1

2σ2n

∥∥∥∥y − x− σ2n

2∇ log [πn(x)]

∥∥∥∥2

2

]?= min [π(x), π(y)] exp

[− 1

2σ2n

∥∥∥∥x− y − σ2n

2∇ log [πn(y)]

∥∥∥∥2

2

]

exp

[− 1

2σ2n

∥∥∥∥y − x− σ2n

2∇ log [πn(x)]

∥∥∥∥2

2

]?= exp

[− 1

2σ2n

∥∥∥∥x− y − σ2n

2∇ log [πn(y)]

∥∥∥∥2

2

]

n∑i=1

[yi − xi

σ2n

2∇ log [πn(xi)]

]2?=

n∑i=1

[xi − yi

σ2n

2∇ log [πn(yi)]

]2


n∑i=1

[y2i − yixi − yi

σ2n

2∇ log [πn(xi)]− xiyi + x2

i + xiσ2n

2∇ log [πn(xi)]

−σ2n

2∇ log [πn(xi)] yi +

σ2n

2∇ log [πn(xi)]xi +

(σ2n

2∇ log [πn(xi)]

)2]

?=

n∑i=1

[x2i − xiyi − xi

σ2n

2∇ log [πn(yi)]− yixi + y2

i + yiσ2n

2∇ log [πn(yi)]

−σ2n

2∇ log [πn(yi)]xi +

σ2n

2∇ log [πn(yi)] yi +

(σ2n

2∇ log [πn(yi)]

)2]

n∑i=1

[−2y

σ2n

2∇ log [πn(xi)] + 2xi

σ2n

2∇ log [πn(xi)] +

(σ2n

2∇ log [πn(xi)]

)2]

?=

n∑i=1

[−2x

σ2n

2∇ log [πn(yi)] + 2yi

σ2n

2∇ log [πn(yi)] +

(σ2n

2∇ log [πn(yi)]

)2].

Now use the discrete approximation:σ2n

2 ∇ log [πn(xi)] ≈ yi− xi + σnZi:

n∑i=1

[−2y(yi − xi + σnZi) + 2xi(yi − xi + σnZi) + (yi − xi + σnZi)

2]

?=

n∑i=1

[−2x

σ2n


σ2n


(σ2n

2∇ log [πn(yi)]

)2]

n∑i=1

[−2y2

i + 2yixi − 2yiσnZi + 2xiyi − 2x2i + 2xiσnZi + y2

i

−yixi + yiσnZi − xiyi + x2i − xiσnZi + σnZiyi − σnZixiσ2

nZ2i

]?=

n∑i=1

[−2x

σ2n


σ2n


(σ2n

2∇ log [πn(yi)]

)2]

n∑i=1

[−y2

i + 2yixi − x2i + σ2

nZ2i

]?=

n∑i=1

[−2x

σ2n


σ2n


(σ2n

2∇ log [πn(yi)]

)2].

Langevin diffusion is reversible, soσ2n

2 ∇ log [πn(yi)] ≈ xi − yi + σnZi.


Inserting this into the RHS above:

n∑i=1

[−y2


nZ2i

] ?=

n∑i=1

[−2xxi − yi + σnZi + 2yixi − yi + σnZi + (xi − yi + σnZi)

2]

n∑i=1

[−y2


nZ2i

] ?=

n∑i=1

[−2x2

i + 2xiyi − 2xiσnZi + 2yixi − 2y2i + 2yiσnZi + x2

i

−xiyi + xiσnZi − yixi + y2i − yiσnZi + σnZixi = σnZiyi + σ2

nZ2i

]n∑i=1

[−y2


nZ2i

] ?=

n∑i=1

[−y2

i − x2i + xiyi + σ2

nZ2i

].

This last equality is cleary true in expectation since Zi is iid.

17. Consider a stochastic process, θ[t] (t ≥ 0), on the probability space

(Ω,F , P ). If: Ft ⊂ Ft+1, θ[t] is mearsureable on Fn with finite first

moment, and with probability one E[θ[t+1]|Ft] = θ[t], then this process

is called a Martingale (Billingsley 1995, 458). Prove that Martingales

do or do not have the Markovian property.

I Answer

Since E[θ[t+1]|Ft] = θ[t], then E[θ[t+1] − θ[t]|Ft] = 0, by the definition

of conditional expectation (see Norris [1997, p.130] for details). Define

∇t = E[θ[t+1] − θ[t], which indicates that:

θ[t+1] = ∇t + θ[t]

= ∇t +∇t−1 + θ[t−1]

= ∇t +∇t−1 +∇t−2 + θ[t−2]

=

t∑1

∇i.

Note that the ∇i are independent and integrable random variables with

mean zero and finite variance. Further note that for some fixed num-


ber of iterations θ[1], . . . , θ[τ ] and ∇1, . . . ,∇τ generate the same σ-

algebra:

σ(θ[1], . . . , θ[τ ]) = σ(∇1, . . . ,∇τ ) ⊂ Fτ .

So either both are Markovian or neither are Markovian. But since the

random variable defined by a sum of iid random variables cannot be

Markovian, the generation of θ[i] cannot be either.

19. Prove that a Markov chain that is positive Harris recurrent, ∃σ−finite

probability measure, P, on S so that for an irreducible Markov chain,

Xt, at time t, p(Xn ∈ A) = 1, ∀A ∈ S where P(A) > 0, and aperiodic

is also α-mixing,

α(t) = supA,B

∣∣∣∣p(θt ∈ B,θ0 ∈ A)− p(θt ∈ B)p(θ0 ∈ A)

∣∣∣∣ −→t→∞ 0

I Answer This proof follows that of Jones (2004).∗ The coupling

inequality states that if there are two ergodic (implied by the condi-

tions above) chains, one started in the stationarity distribution and one

started at some arbitrary point, then:

‖Pn(X, ·)− π(·)‖TV ≤ p(C > n),

where C is the coupling time of these two Markov chains defined in

Lindvall (1992).† See also Rosenthal (1995, p.560). Therefore, both of

the folloing hold:

|Pn(X,A)− π(A)| ≤ p(C > n)

|Pn(X,B)− π(B)| ≤ p(C > n),

for A and B defined above. Under the assumptions of the problem,

p(C > n) → 0 as n → ∞. By Chen (1999)‡ the integral of the

∗Jones, Galin. (2004). “On the Markov Chain Central Limit Theorem.” Probability Surveys

1, 299320.†Lindvall, T. (1992). Lectures on the Coupling Method. Wiley, New York.‡Chen, X. (1999). Limit Theorems for Functionals of Ergodic Markov Chains With General

State Space. Memoirs of the American Mathematical Society, 139.


posterior distance in A over the set B is also smaller than the coupling

time: ∫B

|Pn(X,A)− π(A)|π(dx) ≤ p(C > n).

With standard regularity conditions and Minkowski’s inequality the ab-

solute value function can be moved outside of the integral, allowing:

p(C > n) ≥∣∣∣∣∫B

[Pn(X,A)− π(A)]π(dx)

∣∣∣∣≥∣∣∣∣∫B

Pn(X,A)π(dx)−∫B

π(A)π(dx)

∣∣∣∣=∣∣Pn(Xn ∈ A,X0 ∈ B)− π(A)π(B)

∣∣ = α(t).

Therefore α(t) also goes to zero as n goes to infinity since the rate

of TVN convergence bounds the rate of α-mixing. For demonstrated

geometric and uniform ergodicity this rate can be shown to be much

faster still.

14

Utilitarian Markov Chain Monte Carlo

Exercises

1. For the hierarchical model of firearm-related deaths in Chapter 10, per-

form the following diagnostic tests from Section 12.2 in CODA or BOA

: Gelman and Rubin, Geweke, Raftery and Lewis, Heidelberger and

Welsh.

I Answer

< R.Code >

require(runjags)

x <- c(14.9, 14.8, 14.3, 13.3, 13.3, 13.3,

13.9, 13.6, 13.9, 14.1, 14.9, 15.2,

14.8, 15.4, 14.8, 13.7, 12.8, 12.1)

n <- length(x)

d <- c(1, 1)

firearm.data <- dump.format(list(x=x, n=n, d=d))

firearm.init1 <- dump.format(list(tau=1, lambda=c(1,20),

K=c(.5,.5)))

firearm.init2 <- dump.format(list(tau=1, lambda=c(20,1),

K=c(.5,.5)))

firearm.model <- "model for (i in 1:n) x[i] ~ dnorm(mu[i], tau)

mu[i] <- lambda[k[i]]

k[i] ~ dcat(K[])

tau ~ dgamma(.01, .01)


167



K[1:2] ~ ddirch(d[])

"firearm <- run.jags(model=firearm.model, n.chain=2,

data=firearm.data, plot=TRUE,

inits=c(firearm.init1,firearm.init2),

burnin=5000, sample=10000,

monitor=c("tau","lambda","K"))

> geweke.diag(firearm$mcmc)

[[1]]

Fraction in 1st window = 0.1

Fraction in 2nd window = 0.5

tau lambda[1] lambda[2] K[1] K[2]

1.7860 -0.1373 -0.1122 -0.1828 0.1828

[[2]]

Fraction in 1st window = 0.1

Fraction in 2nd window = 0.5

tau lambda[1] lambda[2] K[1] K[2]

1.5434 -0.5641 2.3849 -1.5759 1.5759

> gelman.diag(firearm$mcmc)

Potential scale reduction factors:

Point est. 97.5% quantile

tau 1.00 1.01

lambda[1] 1.00 1.00

lambda[2] 1.00 1.01

K[1] 1.00 1.00

K[2] 1.00 1.00

Multivariate psrf

1.00

Utilitarian Markov Chain Monte Carlo 169

> raftery.diag(firearm$mcmc)

[[1]]

Quantile (q) = 0.025

Accuracy (r) = +/- 0.005

Probability (s) = 0.95

Burn-in Total Lower bound Dependence

(M) (N) (Nmin) factor (I)

tau 3 4338 3746 1.16

lambda[1] 6 10809 3746 2.89

lambda[2] 3 4129 3746 1.10

K[1] 16 18252 3746 4.87

K[2] 3 4129 3746 1.10

[[2]]

Quantile (q) = 0.025

Accuracy (r) = +/- 0.005

Probability (s) = 0.95

Burn-in Total Lower bound Dependence

(M) (N) (Nmin) factor (I)

tau 3 4163 3746 1.11

lambda[1] 9 13413 3746 3.58

lambda[2] 4 7346 3746 1.96

K[1] 12 12060 3746 3.22

K[2] 3 4061 3746 1.08

> heidel.diag(firearm$mcmc)

[[1]]

Stationarity start p-value

test iteration

tau passed 1 0.706

lambda[1] passed 1 0.631



K[1] passed 1 0.640

K[2] passed 1 0.640

Halfwidth Mean Halfwidth

test

tau passed 1.507 0.0523

lambda[1] passed 10.785 1.0125

lambda[2] failed 8.959 1.2892

K[1] failed 0.571 0.0841

K[2] failed 0.429 0.0841

[[2]]

Stationarity start p-value

test iteration

tau passed 1 0.252



K[1] passed 1 0.355

K[2] passed 1 0.355

Halfwidth Mean Halfwidth

test

tau passed 1.478 0.0499

lambda[1] passed 10.636 1.0311

lambda[2] failed 8.987 1.2375

K[1] failed 0.557 0.0782

K[2] failed 0.443 0.0782

3. Plot in the same graph a time sequence of 0 to 100 versus the annealing

cooling schedules: logarithmic, geometric, semi-quadratic, and linear.

What are the tradeoffs associated with each with regard to convergence

and processing time?

I Answer

Different cooling schedules are as follows:

• Geometric: Tt = εTt−1, 0 < ε < 1


• Logarithmic: Tt = KT1

log(t)

• Linear: Tt = T1 − at, a > 0

• Semiquadratic: Tt = T1 − t− at2, a > 0

And, as shown below in the figure, the semiquadratic form is the slowest

annealing schedule whereas the geometric form is the fastest annealing

schedule. When annealing is slower, the chain is more likely to converge

to its stationary distribution, but it requires longer processing time. So,

the convergence and the processing time has a trade-off relationship.

Time Sequence

Tem

erat

ure

1 25 50 75 100

010

020

030

040

050

0

logarithmic

geometric

semiquadratic

linear

FIGURE 14.1: Different Annealing Cooling Schedules

< R.Code >

Logarithmic <- function(t, T0) T <- rep(NA, t+1)

T[1] <- T0

for (i in 1:t) T[i+1] <- 0.8*T[1] / (log(i+1)) -


0.8*T[1] / log(t+1)

return(T)

Geometric <- function(t, T0)

T <- rep(NA, t+1)

T[1] <- T0

for (i in 1:t) T[i+1] <- 0.3 * T[i]

return(T)

Semiquadratic <- function(t, T0) T <- rep(NA, t+1)

T[1] <- T0

for (i in 1:t) T[i+1] <- T[1] - i - 0.04*i^2

return(T)

Linear <- function(t, T0) T <- rep(NA, t+1)

T[1] <- T0

for (i in 1:t) T[i+1] <- T[1] - 5*i

return(T)

ruler <- seq(1, 101)

l1 <- Logarithmic(100,500)

l2 <- Geometric(100,500)

l3 <- Semiquadratic(100,500)

l4 <- Linear(100,500)

plot(ruler, l1, type="n",

axes=F, xaxs="i", yaxs="i",

xlim=c(1,100), ylim=c(0,500),

xlab="Time Sequence", ylab="Temerature")

box(col="gray80")

axis(1, at=c(1,25,50,75,100),col="gray80")

axis(2,col="gray80")


lines(l1,lwd=2)

lines(l2)

lines(l3,lty=3, lwd=2)

lines(l4,lty=2,lwd=2)

text(30, 50, "logarithmic")

text(12, 10, "geometric")

text(60, 360, "semiquadratic")

text(45, 250, "linear")

5. Show how the full conditional distributions for Gibbs sampling in the

Bayesian Tobit model on page 466 can be changed into grouped or col-

lapsed Gibbs sampler forms (Section 12.3.2).

I Answer

Since the full conditional distribution of β (k × 1 vector) is “Multivari-

ate Normal” distribution (with k degree of freedom), we can sample

β1, β2, · · · , βk at the same time. Therefore, it is (by itself) a grouped

Gibbs sampler form.

7. BUGS includes a quick diagnostic command diag which implements a

version of Geweke’s (1992) time-series diagnostic for the first 25% of

the data and the last 50% of the data. Rather than calculate spec-

tral densities for the denominator of the G statistic, BUGS divides the

two periods into 25 equal-sized bins and then takes the variance of the

means from within these bins. Run the example in Section 12.2 using

the BUGS code in the Computational Addendum with 10,000 itera-

tions, and load the data into R with the command military.mcmc <-

boa.importBUGS("my.bugs.dir/bugs") . Write a function to calculate

the shortcut for Geweke’s diagnostic for each of the five chains (columns

here). Compare your answer with BUGS .

I Answer

As shown in the table below, Geweke’s diagnostic from BUGS tends to

make the statistics smaller in absolute term. It is especially so when a

chain doesn’t converge. However, the decision on whether the chain is


non-convergent (statistic 2) or not doesn’t seem to be affected by this

change.

From BUGS From math

αµ -5.14526874 -30.69368272

βµ -0.05462386 -0.05198337

ατ -4.25638592 -28.65358373

βτ -2.09015361 -2.37649673

τc -0.74985086 -0.82777295

Table 14.1: Geweke’s Diagnostics from Different Sources

< R.Code >

require(runjags)

require(BaM) # BaM pacakage contains data - military

y <- as.matrix(t(military[,-1]))

x <- as.vector(military[,1])

n <- nrow(y); k <- ncol(y)

military.data <- dump.format(list(x=x, y=y, n=n, k=k))

military.inits <- dump.format(list(alpha.mu=1, beta.mu=1,

alpha.tau=1,

beta.tau=1, tau.c=1))

military.model <- "model for (i in 1:n) for (j in 1:k) mu[i,j] <- alpha[i] + beta[i]*x[j]

y[i,j] ~ dnorm(mu[i,j], tau.c)

alpha[i] ~ dnorm(alpha.mu, alpha.tau)

beta[i] ~ dnorm(beta.mu, beta.tau)

alpha.mu ~ dnorm(1, 0.01)

beta.mu ~ dnorm(1, 0.01)

alpha.tau ~ dgamma(0.01, 0.01)

beta.tau ~ dgamma(0.01, 0.01)

tau.c ~ dgamma(0.01, 0.01)


"military.jags <- run.jags(model=military.model,

data=military.data,

inits=military.inits,

burnin=10000, sample=10000,

monitor=c("alpha.mu",

"beta.mu", "alpha.tau",

"beta.tau", "tau.c"),

n.chain=1, plot=TRUE)

almu <- military.jags$mcmc[[1]][,1]

bemu <- military.jags$mcmc[[1]][,2]

altau <- military.jags$mcmc[[1]][,3]

betau <- military.jags$mcmc[[1]][,4]

tau <- military.jags$mcmc[[1]][,5]

geweke.hand <- function(theta) n <- length(theta)

n1 <- n*0.25

n2 <- n*0.5+1

theta25 <- theta[1:n1]

theta50 <- theta[n2:n]

G <- abs(mean(theta25) - mean(theta50)) /

(sqrt(var(theta25)/n1 + var(theta50)/(n2-1)))

return(G)

geweke.bugs <- geweke.diag(military.jags$mcmc,

frac1=0.25, frac2=0.5)[[1]]$z

geweke.math <- c(geweke.hand(almu), geweke.hand(bemu),

geweke.hand(altau),geweke.hand(betau),

geweke.hand(tau))

geweke.compare <- cbind(geweke.bugs, geweke.math)

> print(geweke.compare)

geweke.bugs geweke.math

alpha.mu -5.14526874 -30.69368272

beta.mu -0.05462386 -0.05198337

alpha.tau -4.25638592 -28.65358373

beta.tau -2.09015361 -2.37649673

tau.c -0.74985086 -0.82777295

9. For the model of marriage rates in Italy given in Section 10.3, calculate


the Zellner and Min diagnostic with the segmentation: θ1 = α, θ2 = β,

for the posteriors defined at the iteration points: π1,000(α, β,λ,X) and

π10,000(α, β,λ,X). Test H0: η = 0 versus H1: η 6= 0 with the statistic

KOA assuming serial independence.

I Answer

First begin with some background on the Zellner and Min diagnos-

tic for two parameters (α and β) whose posterior is described by a

Markov chain. For this diagnostic the Markov chain typically comes

from a Gibbs sampler, so in addition to a statement of the joint poste-

rior, π(α, β|y) = cp(α, β)`(α, β|y), we require the full conditional dis-

tributions π(α|β,y), π(β|α,y). We obtain empirical estimates of the

marginals from MCMC sample vlaues (“smoothed empirical estimates”

in the paper) by:

pN (αi|y) =1

N

N∑j=1

p(αi|β(j),y)

pN (β|y) =1

N

N∑j=1

p(β|α(j),y),

which is smoothed over the period 1, . . . , N for parameters that are

conditioned upon and given for some marginal estimate at the point i

in the chain. Normally two values for i are selected and compared.

Zellner and Min (1995) give three statistics (ARC2, DC2, RC2) and a

set of formal tests. To perform the formal test of convergence hypothe-

ses, we need to first calculate the Difference of Convergence Criterion

statistic, which is based on:

η = p(α|y)p(β|α,y)− p(β|y)p(α|β,y) = 0.

Under the null hypothesis of convergence at the point i in the Markov

chain, we get an empirical statistic from this expression by using the

empirical estimates for the first part of each product and plug-in values

to the conditionals for the second part of each product:

ηi = pN (αi|y)p(βi|αi,y)− pN (βi|y)p(αi|βi,y),


which is approximately zero under H0 :η = 0. So the user selected two

parameters from the model (α and β here in the general sense), and some

point (or points) and obtains a statistic that is normally distributed for

large sample sizes. The Ratio Convergence Criterion simply converts

the difference above to a ratio and provides a posterior odds statistic.

The corresponding variance term for ηi is:

φ2i = [p(αi|βi,y)]2V ar[pN (βi|y)] + [p(βi|αi,y)]2V ar[pN (αi|y)]

− 2p(αi|βi,y)p(βi|αi,y)Cov[pN (αi|y), pN (βi|y)].

So we can use this to create:

zi =ηi − ηi

ˆphiiasym

∼N(0, 1).

Furthermore, the posterior odds for this null of convergence are given

in the paper’s appendix as:

KOA =√Nπ/2(1 + (zi

2/N)) exp[−z2i /2]

(there are different notations for this expression).

Question 12.5 asks for this setup to be used on the Italian marriage rates

example in Section 10.3 for iteration points i = 1, 000 and i = 10, 000.

The three relevant full conditional distributions are:

p(α|β, λ) = BAΓ(A)−1 exp[−αB]αA−1

p(β|α, λ) = DCΓ(C)−1 exp[−βD]αC−1

p(λ|α, β) = (1 + β)∑

yi+αΓ(∑yi + α)−1 exp[−λ(1 + β)]λ∑

yi+α−1.

This test is created for two parameters at a time. It does not make sense

to test α and β together since they are parent nodes in the model, but

we could test either or both against any of the 16 λ parameters. We can

also test two λ parameters together.

The most interesting test is a pair of tests for α and β, against a λ,

and the first one is chosen arbitrarily here. This sets-up the following


starting point from above:

p(α, λ1|β,y) = p(α|β,y)p(λ1|α, β, ]y)

= p(λ1|β,y)p(α|λ1, β, ]y)

p(β, λ1|α,y) = p(β|α,y)p(λ1|α, β, ]y)

= p(λ1|α,y)p(β|λ1, β, ]y).

This sets up two parralel tests as described above. The following BUGS

code creates the MCMC iterations.

< Bugs.Code >

model italy;

for (i in 1:N)

lambda[i] ~ dgamma(alpha,beta);

y[i] ~ dpois(lambda[i]);

alpha ~ dgamma(1,1);

beta ~ dgamma(1,1);

list(y=c(7,9,8,7,7,6,6,5,5,7,9,10,8,8,8,7), N=16)

list(alpha=1, beta=1)

list(alpha=10, beta=10)

Now use the following R code to create the tests.

< R.Code >

library(boa)

italy.mcmc <- boa.importBUGS(

"Users/jgill/Bugs/Italy.Model/italy")

y=c(7,9,8,7,7,6,6,5,5,7,9,10,8,8,8,7)

A <- B <- C <- D <- 1

# SMOOTHED VALUES

p.alpha.1000 <- B^A * gamma(A+1)^(-1) *

exp(-italy.mcmc[1000 ,17]*B) *


italy.mcmc[1000 ,17]^(A-1)

p.alpha.10000 <- B^A * gamma(A+1)^(-1) *

exp(-italy.mcmc[10000,17]*B) *

italy.mcmc[10000,17]^(A-1)

p.beta.1000 <- C^D * gamma(C+1)^(-1) *

exp(-italy.mcmc[1000 ,18]*D) *

italy.mcmc[1000 ,18]^(C-1)

p.beta.10000 <- C^D * gamma(C+1)^(-1) *

exp(-italy.mcmc[10000,18]*D) *

italy.mcmc[10000,18]^(C-1)

p.lambda.1000 <- mean ( (1+italy.mcmc[1:1000,18])^(

sum(y)+italy.mcmc[1:1000,17]) *

gamma(sum(y)+

italy.mcmc[1:1000,17])^(-1) *

exp(-italy.mcmc[1:1000,1]*

(1+italy.mcmc[1:1000,18])) *

italy.mcmc[1000,1]^(

sum(y)+italy.mcmc[1:1000,17]-1) )

p.lambda.10000 <- mean ( (1+italy.mcmc[1:10000,18])^(


gamma(sum(y)+

italy.mcmc[1:10000,17])^(-1) *


(1+italy.mcmc[1:10000,18])) *


sum(y)+italy.mcmc[1:10000,17]-1) )

v.lambda.1000 <- var ( (1+italy.mcmc[1:1000,18])^(


gamma(sum(y)+

italy.mcmc[1:1000,17])^(-1) *


(1+italy.mcmc[1:1000,18])) *


sum(y)+italy.mcmc[1:1000,17]-1) )

v.lambda.10000 <- var ( (1+italy.mcmc[1:10000,1])^(


gamma(sum(y)+

italy.mcmc[1:10000,17])^(-1) *


(1+italy.mcmc[1:10000,18])) *



sum(y)+italy.mcmc[1:10000,17]-1) )

# PLUG-IN VALUES

p.alpha.plug.1000 <- p.alpha.1000

p.alpha.plug.10000 <- p.alpha.10000

p.beta.plug.1000 <- p.beta.1000

p.beta.plug.10000 <- p.beta.10000

p.lambda.plug.1000 <- (1+italy.mcmc[1000,18])^(

sum(y)+italy.mcmc[1000,17]) *

gamma(sum(y)+

italy.mcmc[1000,17])^(-1) *

exp(-italy.mcmc[1000,17]*

(1+italy.mcmc[1000,18])) *


sum(y)+italy.mcmc[1000,17]-1)

p.lambda.plug.10000 <- (1+italy.mcmc[10000,18])^(

sum(y)+italy.mcmc[10000,17]) *

gamma(sum(y)+

italy.mcmc[10000,17])^(-1) *

exp(-italy.mcmc[10000,17]*

(1+italy.mcmc[10000,18])) *


sum(y)+italy.mcmc[10000,17]-1)

# TEST STATISTICS

eta.alpha.lambda.1000 <- p.alpha.1000*p.lambda.plug.1000 -

p.lambda.1000*p.alpha.plug.1000

eta.alpha.lambda.10000 <- p.alpha.10000*p.lambda.plug.10000 -

p.lambda.10000*p.alpha.plug.10000

eta.beta.lambda.1000 <- p.beta.1000*p.lambda.plug.1000 -

p.lambda.1000*p.beta.plug.1000

eta.beta.lambda.10000 <- p.beta.10000*p.lambda.plug.10000 -

p.lambda.10000*p.beta.plug.10000

phi.alpha.lambda.1000 <- p.alpha.plug.1000^2 *

v.lambda.1000

phi.alpha.lambda.10000 <- p.alpha.plug.1000^2 *

v.lambda.10000

phi.beta.lambda.1000 <- p.beta.plug.1000^2 *


v.lambda.1000

phi.beta.lambda.10000 <- p.beta.plug.10000^2 *

v.lambda.10000

( z.alpha.lambda.1000 <- eta.alpha.lambda.1000/

sqrt(phi.alpha.lambda.1000) )

[1] -0.034080

( z.alpha.lambda.10000 <- eta.alpha.lambda.10000/

sqrt(phi.alpha.lambda.10000) )

[1] -1.6812e-110

( z.beta.lambda.1000 <- eta.beta.lambda.1000/

sqrt(phi.beta.lambda.1000) )

[1] -0.034080

( z.beta.lambda.10000 <- eta.beta.lambda.10000/

sqrt(phi.beta.lambda.10000) )

[1] -1.7071e-110

# POSTERIOR ODDS FOR ETA=0 (CONVERGENCE)

( K.0A.alpha.lambda.1000 <- sqrt(pi*1000/2) *

(1+z.alpha.lambda.1000^2/1000) *

exp(-z.alpha.lambda.1000^2/2) )

[1] 39.61

( K.0A.alpha.lambda.10000 <- sqrt(pi*10000/2) *

(1+z.alpha.lambda.10000^2/10000) *

exp(-z.alpha.lambda.10000^2/2) )

[1] 125.33

( K.0A.beta.lambda.1000 <- sqrt(pi*1000/2) *

(1+z.beta.lambda.1000^2/1000) *

exp(-z.beta.lambda.1000^2/2) )

[1] 39.61

( K.0A.beta.lambda.10000 <- sqrt(pi*10000/2) *

(1+z.beta.lambda.10000^2/10000) *

exp(-z.beta.lambda.10000^2/2) )

[1] 125.33

# NOTE: SIMILARITY OCCURS BECAUSE THEY ARE BOTH PARENTAL

NODES TO LAMBDA

11. (Raftery and Lewis 1996). Write a Gibbs sampler in R to produce

a Markov chain to sample for θ1, θ2, and θ3, which are distributed


according to the trivariate normal distribution θ1

θ2

θ3

∼ N0

0

0

, 99 −7 −7

−7 1 0

−7 0 1

by cycling through the normal conditional distributions with variances:

V (θ1|θ2, θ3) = 1, V (θ2|θ1, θ3) = 1/50 and V (θ3|θ1, θ2) = 1/50. Contrast

the Geweke diagnostic with the Gelman and Rubin diagnostic for 1,000

iterations.

I Answer

If we have the joint distribution,

X =

[X1

X2

]∼ N

([µ1

µ2

],

[Σ11 Σ12

Σ21 Σ22

]),

then the conditional distribution is

X1|X2 ∼ N(µ1|2,Σ1|2

),

where µ1|2 = µ1 + Σ12Σ−122 (X2 − µ2) and Σ1|2 = Σ11 −Σ′12Σ

−122 Σ12.

Therefore, the full conditional distributions are:

θ1|θ2, θ3 ∼ N (−7θ2 − 7θ3, 1)

θ2|θ1, θ3 ∼ N(− 7

50θ1 −

49

50θ3,

1

50

)θ3|θ1, θ2 ∼ N

(− 7

50θ1 −

49

50θ2,

1

50

)Since we have all full conditional distribution, we use Gibbs sampler.

And, based on the posterior draws from the Gibbs sampler, we can man-

nually calculate the Geweke’s diagnostic and Gelman-Rubin’s diagnostic

(shown in the table below). It turns out that Gelman-Rubin diagnos-

tic doesn’t report any evidence for non-convergence whereas some of

Geweke’s diagnostic does report evidence for non-convergence (θ2 with

θ[0]2 = 1, and θ3 with θ

[0]3 = 0).

< R.Code >


θ1 θ2 θ3

Geweke(init=0) 0.65881323 0.01406866 2.611673

Geweke(init=1) 0.05630465 2.26367041 1.165261

Gelman-Rubin 1.00126375 0.99950074 1.001659

Table 14.2: Geweke versus Gelman-Rubin

post.draw <- function(bi, sim, inits) n <- bi + sim + 1

chain <- length(inits)

theta1 <- theta2 <- theta3 <- NULL

for (k in 1:chain) draw1 <- draw2 <- draw3 <- NULL

draw1[1] <- draw2[1] <- draw3[1] <- 0

for (i in 1:n) draw1.mean <- - 7*(draw2[i] + draw3[i])

draw1[i+1] <- rnorm(1, draw1.mean, 1)

draw2.mean <- - 0.07*(draw1[i+1] + 7*draw3[i])

draw2[i+1] <- rnorm(1, draw2.mean, 0.02)

draw3.mean <- - 0.07*(draw1[i+1] + 7*draw2[i+1])

draw3[i+1] <- rnorm(1, draw3.mean, 0.02)

theta1 <- cbind(theta1, draw1[(bi+2):n])



output <- list(theta1=theta1, theta2=theta2,

theta3=theta3)

theta1.post <- post.draw(500, 1000, c(0,1))$theta1



geweke.hand <- function(theta) n <- nrow(theta)

k <- ncol(theta)

G <- NULL

for (i in 1:k) n1 <- n*0.25

n2 <- n*0.5+1


theta25 <- theta[1:n1,i]

theta50 <- theta[n2:n,i]

geweke <- abs(mean(theta25) - mean(theta50)) /

(sqrt(var(theta25)/n1 + var(theta50)/(n2-1)))

G <- cbind(G, geweke)

return(G)

gelman.hand <- function(theta) n <- nrow(theta)

k <- ncol(theta)

inmean <- NULL

invariance <- NULL

bwvariance <- 0

for (i in 1:k) inmean[i] <- mean(theta[,i])

allmean <- mean(theta)

for (i in 1:k) invariance[i] <- 0

for (j in 1:n) stat1 <- (theta[j,i] - inmean[i])^2

invariance[i] <- invariance[i] + stat1

stat2 <- (inmean[i] - allmean)^2

bwvariance <- bwvariance + stat2

invar <- sum(invariance) / (k*(n-1))

bwvar <- n*bwvariance / (k-1)

estvar <- (1 - 1/n) * invar + (1/n) * bwvar

R <- sqrt(estvar / invar)

return(R)

diag.compare <- matrix(c(geweke.hand(theta1.post),

gelman.hand(theta1.post),

geweke.hand(theta2.post),

gelman.hand(theta2.post),

geweke.hand(theta3.post),

gelman.hand(theta3.post)),

ncol=3)

colnames(diag.compare) <- c("theta1", "theta2", "theta3")

rownames(diag.compare) <- c("Geweke(init=0)",


"Geweke(init=1)",

"Gelman-Rubin")

> print(diag.compare)

theta1 theta2 theta3

Geweke(init=0) 0.65881323 0.01406866 2.611673

Geweke(init=1) 0.05630465 2.26367041 1.165261

Gelman-Rubin 1.00126375 0.99950074 1.001659

13. Prove that data augmentation is a special case of Gibbs sampling.

I Answer

Data augmentation runs as follows: at the ith iteration,

• generate θ[i] from p[i−1] (θ|Xobs)

• generate m values of Xmis from p(Xmis|θ[i],Xobs

)• p[i] (θ|Xobs) = 1

m

∑mj=1 p

(θ|Xobs,X

[j]mis

)Substitute m = 1,Xmis = δ,Xobs = y.

Then, the algorithm becomes: at the ith iteration,

• generate θ[i] from p(θ|y, δ[i−1]

)• generate δ[i] from p

(δ|y, θ[i−1]

)This is a two-block (θ, δ) Gibbs sampler. (Q.E.D.)

15. For the following simple bivariate normal model[θ1

θ2

]∼ N

([0

0

],

[1 ρ

ρ 1

]),

the Gibbs sampler draws iteratively according to:

θ1|θ2 ∼ N (ρθ2, 1− ρ2)

θ2|θ1 ∼ N (ρθ1, 1− ρ2).

Write a simple implementation of the Gibbs sampler in R , calculate the

unconditional and conditional (Rao-Blackwellized) posterior standard


error, for both ρ = 0.05 and ρ = 0.95. Compare these four values and

comment.

I Answer

The unconditional posterior is from the Gibbs sampler:

at the (g + 1)th iteration,

• draw θ[g+1]1 from N

(ρθ

[g]2 , 1− ρ2

)• draw θ

[g+1]2 from N

(ρθ

[g+1]1 , 1− ρ2

)And, the Rao-Blackwellized posterior is from the Gibbs sampler:

at the (g + 1)th iteration,

• draw k values of θ1 from N(ρθ

[g]2 , 1− ρ2

)• θ[g+1]

1 = mean of k values of θ1

• draw k values of θ2 from N(ρθ

[g+1]1 , 1− ρ2

)• θ[g+1]

2 = mean of k values of θ2

As shown in the table below, standard errors of posterior distributions

becomes significantly smaller in the Rao-Blackwellized case.

θ1 θ2

ρ=0.05, Unconditional 0.899678592 0.896445864

ρ=0.05, Rao-Blackwellized 0.019853010 0.021344061

ρ=0.95, Unconditional 0.079108334 0.081591576

ρ=0.95, Rao-Blackwellized 0.001961144 0.001969048

Table 14.3: Unconditional vs. Rao-Blackwellized Posteriors

< R.Code >

gibbs.uc <- function(rho, bi, sim) n <- bi + sim + 1

theta1 <- theta2 <- NULL

theta1[1] <- theta2[1] <- 0


for (i in 1:n) theta.var <- 1 - rho^2

theta1.mean <- rho*theta2[i]

theta1[i+1] <- rnorm(1, theta1.mean, theta.var)

theta2.mean <- rho*theta1[i+1]

theta2[i+1] <- rnorm(1, theta2.mean, theta.var)

theta1.post <- theta1[(bi+2):n]


theta1.post.mean <- mean(theta1.post)


stat1 <- stat2 <- 0

for (k in 1:sim) stat1 <- stat1+(theta1.post[k]-theta1.post.mean)^2

stat2 <- stat2+(theta2.post[k]-theta2.post.mean)^2

theta1.post.sd <- stat1 / (sim - 1)


post.sd <- c(theta1.post.sd, theta2.post.sd)

return(post.sd)

gibbs.rb <- function(rho, m, bi, sim) n <- bi + sim + 1

theta1 <- theta2 <- NULL

theta1[1] <- theta2[1] <- 0

for (i in 1:n) theta.var <- 1 - rho^2

theta1.mean <- rho*theta2[i]

theta1[i+1] <- mean(rnorm(m, theta1.mean, theta.var))

theta2.mean <- rho*theta1[i+1]

theta2[i+1] <- mean(rnorm(m, theta2.mean, theta.var))





stat1 <- stat2 <- 0

for (k in 1:sim) stat1 <- stat1+(theta1.post[k]-theta1.post.mean)^2

stat2 <- stat2+(theta2.post[k]-theta2.post.mean)^2




post.sd <- c(theta1.post.sd, theta2.post.sd)

return(post.sd)

sd.compare <- rbind(gibbs.uc(0.05, 500, 1000),

gibbs.rb(0.05, 50, 500, 1000),

gibbs.uc(0.95, 500, 1000),

gibbs.rb(0.95, 50, 500, 1000))

colnames(sd.compare) <- c("theta1", "theta2")

rownames(sd.compare) <- c("rho=0.05, Unconditional",

"rho=0.05, Rao-Blackwellized",

"rho=0.95, Unconditional",

"rho=0.95, Rao-Blackwellized")

> print(sd.compare)

theta1 theta2

rho=0.05, Unconditional 0.899678592 0.896445864

rho=0.05, Rao-Blackwellized 0.019853010 0.021344061

rho=0.95, Unconditional 0.079108334 0.081591576

rho=0.95, Rao-Blackwellized 0.001961144 0.001969048

17. Produce the ARE result for the Rao-Blackwellization from (12.33) with

a two-variable Metropolis-Hastings algorithm instead of the Gibbs sam-

pler that was used.

I Answer

To demonstrate the Rao-Blackwell property for Markov chains, consider

the following code that gives a comparison of simulated students-t dis-

tribution values that are simulated from R’s regular function rt() and

a Rao-Blackwellized version based on Dickey’s decomposition (1968),

which is an old idea but conditions on a sufficient statistic using mix-

tures of normals and a chi-square variate.

< R.Code >

nu <- 5


x <- y <- (rgamma(2000,nu/2,nu/2))^(-1)

for (i in 1:length(y)) x[i] <- rnorm(1,0,mean(y))

delta.m.star <- delta.m <- x <- rt(10000,5)

y <- 1/(rgamma(length(x),nu/2,nu/2))

for (j in 1:length(x)) delta.m[j] <- sum(exp(-x[1:j]^2))/j

for (j in 1:length(x)) delta.m.star[j] <- sum(1/sqrt(2*y[1:j]+1))/j

par(mfrow=c(1,1),col="grey80")

plot(1:length(delta.m), delta.m, type="l",

col="darkblue", ylim=c(0.4,0.6))

lines(1:length(delta.m.star), delta.m.star,

col="firebrick")

This code is self-contained and the interested reader should simply cut-

and-paste it into their R window. Notice that the red line, which is the

Rao-Blackwellized simulated estimate, converges much more quickly to

the empirical mean than the blue line.

The outline of this solution is as follows. If we can show that the

Metropolis-Hastings algorithm constitutes an interleaved Markov chain,

then we can use Theorem 9.19 in Robert and Casella to show the desired

ARE property.

Consider a M-H algorithm where the candidate φ at time t is generated

from a conditional distribution according to p(φ|θ[t−1]). That is, θ is our

value of interest from the Markov chain and the candidate is generated

conditionally on the current position (a random walk Metropolis kernel).

Define the the transition function for φ according to:

φ[t+1] =

φ[t] with probability 1− pt+1

θ[t+1] with probability pt+1

where:

pt+1 =π(φ[t+1])/p(φ[t+1]|θ[t])

π(θ[t])/p(θ[t]|φ[t+1])∧ 1.


This is an ergodic Markov chain provided that the support of π and p

are identical. In addition, the Metropolis step indicated by “∧” implies

a second ancillary stochastic process of uniforms. This algorithm meets

the interleaving process since at each step θ and φ are independent of

their previous values conditionally on the current value of the other,

and θ is iid in stationarity (asymptotically guaranteed by the ergodic

property).

Now define the standard generic Monte Carlo estimator from the runs of

the Metropolis-Hastings algorithm for both Standard and Rao-Blackwellized

forms according to:

τS =1

n

n∑t=1

h(θ[t])

τRB =1

n

n∑t=1

h(θ[t]|φ[t])

From Lemma 9.17 and Proposition 9.18 in Robert and Casella (2004,

p.355), we know that for consecutive draws of these interleaved random

variables:

Cov[h(θ[t]), h(θ[t+1])

]= V ar(E[h(θ)|φ]),

and this is positive and decreasing in t to t+ 1, t+ 2, . . . , t+ n. Define

δt+j as the indicating a comparison between the tth iteration of θ in the

Markov chain and the t+ jth of θ iteration to be used in a summation

such that j = 1 means a comparison with the next value and j = n the

last value. Now define the variances of the two estimators with their

summed pairwise covariance calculations:

V ar(τS) =1

n(n− 1)

n∑δt+j ,j=1

Cov(h(θ[t], h(θ[t+j]))

V ar(τRB) =1

n(n− 1)

n∑δt+j ,j=1

Cov(h(θ[t]|φ[t]), h(θ[t+j]|φ[t]))

from the definitions above. Using the Lemma 9.17 and Proposition 9.18


results iteratively, these become nested expectations:

V ar(τS) =1

n(n− 1)

n∑δt+j ,j=1

V ar(E[· · ·E[E[h(θ)|φ] · · · ])

V ar(τRB) =1

n(n− 1)

n∑δt+j ,j=1

V ar(E[· · ·E[E[h(θ)|φ]θ] · · · ])

Since the second variance has an identical number of summed terms all

of which are identical or smaller by the Rao-Blackwell theorem, then

this total for V ar(τRB) can never be larger than this total for V ar(τS).

19. Modify the Metropolis-Hastings R code on Section 9.4.6 to produce the

marginal likelihood with Chib’s method.

I Answer

The marginal likelihood is defined as:

p(x) =L(x|θ′)π(θ′)

π(θ′|x),

where θ′ is an arbitrary point in the sample space (say θ′ = (1, 2)).

For all practical cases, the likelihood function (L(x|θ)) and the prior

distribution (π(θ)) are known to us. We can just plug in a specific value

there (θ = θ′). So, the main task below in R is to show how to produce

π(θ′|x).

We know that the posterior distribution is Bivariate Normal and the

proposal density function is Cauchy (following the example in 9.4.6).

And, π(θ′|x) can be produced by the following formula:

π(θ′|x) =

∫α(θ′, θ)q(θ′|θ)π(θ|x)dθ∫

α(θ, θ′)q(θ|θ′)dθ

=1G

∑Gg=1 α(θ′, θ[g])q(θ′|θ[g])

1G

∑Gm=1 α(θ[m], θ′)

,

where θ[g] are drawn from π(θ|x) and θ[m] are drawn from q(θ|θ′).

< R.Code >


require(BaM)

bi <- 1000; sim <- 5000; n <- bi + sim + 1

start1 <- start2 <- 0

mu <- c(1,2)

s <- matrix(c(1,-.9,-.9,1), ncol=2)

alpha <- function(from1, from2, to1, to2) output <- min(1, dmultinorm(to1, to2, mu, s) /

dmultinorm(from1, from2, mu, s))

return(output)

qvalue <- function(th1, th2) qsum <- 0

for (i in 1:1000) qind <- dmultinorm(th1,th2,mu,s)/sqrt(rchisq(1,5)/5)

qsum <- qsum + qind

return(qsum/1000)

qmean <- qvalue(mu[1],mu[2])

theta1 <- theta2 <- Theta1 <- Theta2 <- NULL

theta1[1] <- start1; theta2[1] <- start2

nom <- denom <- NULL

for (i in 1:(n-1)) cand <- mvrnorm(1, mu, s)/(sqrt(rchisq(2,5)/5))

a <- alpha(theta1[i], theta2[i], cand[1], cand[2])

if (a > runif(1)) theta1[i+1] <- cand[1]

theta2[i+1] <- cand[2]

else

theta1[i+1] <- theta1[i]

theta2[i+1] <- theta2[i]

Theta1[i+1] <- cand[1]

Theta2[i+1] <- cand[2]

nom[i+1] <- alpha(theta1[i+1], theta2[i+1],

mu[1], mu[2]) * qmean

denom[i+1] <- alpha(mu[1], mu[2], Theta1[i+1],

Theta2[i+1])


post.value <- sum(nom[(bi+2):n])/sum(denom[(bi+2):n])

post.sim <- matrix(c(mean(theta1[(bi+2):n]),

sd(theta1[(bi+2):n]),

mean(theta2[(bi+2):n]),

sd(theta2[(bi+2):n])), ncol=2)

colnames(post.sim) <- c("theta1", "theta2")

rownames(post.sim) <- c("mean", "sd")

post.summary <- list(posterior=post.sim,

posterior.value=post.value)

> print(post.summary)

$posterior

theta1 theta2

mean 1.0410134 1.9487204

sd 0.6934439 0.7035153

$posterior.value

[1] 1.511966

15

Markov Chain Monte Carlo Extensions

Exercises

To be provided.

195

A

Generalized Linear Model Review

Exercises

1. Suppose X1, . . . , Xn are iid exponential: f(x|θ) = θe−θx, θ > 0. Find

the maximum likelihood estimate of θ by constructing the joint distri-

bution, express the log likelihood function, take the first derivative with

respect to θ, set this function equal to zero, and solve for θ the maximum

likelihood value.

I Answer

Since Xi ∼iid f(x|θ), the joint distribution is:

f(X|θ) =

n∏i=1

θe−θxi ,

which gives us the log likelihood function as:

` (θ|X) =

n∑i=1

(log θ − θxi)

= n log θ − nxθ.

Hence, set the first derivative with respect to θ equal to zero:

∂`

∂θ=n

θ− nx ≡ 0

Therefore, the maximum likelihood value is:

θ =1

x

197


3. Below are two sets of data each with least square regression lines cal-

culated (y = 6 + 0x). Answer the following questions by looking at the

plots.

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

• •

•

•

•

•

••

•

•

•

•

•

•

•

••

•

•

•

•

•

•

•

•

•

•

•

•

•

••

•

•

•

•

•

••

•

•

•

••

•

•

•

••

•

•

•

•

•

• •

•

•

•

•

••

••

•

•

•

•

•

•

••

•

•

•

•

•

•

•

••••

•••

•••• •

•••

•

•

•

•

••

•

••

•••

•

••

••

•

•

•••

•

•••••

•

•

•

•

•

•

•••

•

•

••

•

••

•

•

••

•

•

•

•

• •••

•

•••

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

•

••

•

(a) Does the construction of the least squares line in panel 1 violate

any of the Gauss-Markov assumptions?

(b) Does the construction of the least squares line in panel 2 violate

any of the Gauss-Markov assumptions?

(c) Does the identified point (identically located in both panels) have

a substantively different interpretation?

I Answer

(a) No.

(b) Yes. The construction of the least squares line in panel 2 appears to

violate the assumption where the error terms must be independent

of one another. That is, the residuals in panel 2 appear to be

correlated with one another.

(c) Yes. The point in panel 2 could be an outlier, while it looks normal

in panel 1.

5. Consider the bivariate normal PDF:

f(x1, x2) =(

2πσ1σ2

√1− ρ2

)−1

×

exp

[− 1

2(1− ρ2)

((x1 − µ1)2

σ21

− 2ρ(x1 − µ1)(x2 − µ2)

σ1σ2+

(x2 − µ2)2

σ22

)].

Generalized Linear Model Review 199

for −∞ < µ1, µ2 <∞, σ1, σ2 > 0, and ρ ∈ [−1 : 1].

For µ1 = 3, µ2 = 2, σ1 = 0.5, σ2 = 1.5, ρ = 0.75, calculate a grid search

using R for the mode of this bivariate distribution on R2. A grid search

bins the parameter space into equal space intervals on each axis and

then systematically evaluates each resulting subspace. First set up a

two dimensional coordinate system stored in a matrix covering 99% of

the support of this bivariate density, then do a systematic analysis of

the density to show the mode without using “for” loops. Hint: see the R

help menu for the outer function. Use the contour function to make a

figure depicting bivariate contours lines at 0.05, 0.1, 0.2, and 0.3 levels.

I Answer

< R.Code >

dmultnorm <- function(x, y, mu1, mu2, sigma.mat) rho <- sigma.mat[1,2]/prod(sqrt(diag(sigma.mat)))

nlizer <- 1/(2*pi*prod(sqrt(diag(sigma.mat)))*

sqrt(1-rho^2))

e.term1 <- (x - mu1)/sqrt(sigma.mat[1,1])

e.term2 <- (y - mu2)/sqrt(sigma.mat[2,2])

like <- exp(- 1/(2*(1-rho^2)) *

(e.term1^2 + e.term2^2 -

2*rho*e.term1*e.term2) )

return(nlizer*like)

par(mfrow=c(1,2), mar=c(12, 5, 12, 0.25))

x.ruler1 <- seq(1.8, 4.2, length=30)

y.ruler1 <- seq(-1.2, 5, length=30)

xy.cov <- matrix(c(0.5^2,0.75*0.5*1.5,0.75*0.5*1.5,

1.5^2),2,2)

xy.grid1 <- outer(x.ruler1, y.ruler1, dmultnorm,

3, 2, xy.cov)

contours1 <- c(0.05, 0.1, 0.2, 0.3)

contour(x.ruler1, y.ruler1, xy.grid1, levels=contours1,

xlab="X",ylab="Y", cex=2)

x.ruler2 <- seq(2.8, 3.2, length=30)

y.ruler2 <- seq(1.5, 2.5, length=30)

xy.grid2 <- outer(x.ruler2, y.ruler2, dmultnorm,


X

Y

0.05

0.1

0.2

0.3

2.0 2.5 3.0 3.5 4.0

−1

01

23

45

X

Y

0.31

0.315

0.32

2.8 2.9 3.0 3.1 3.2

1.6

1.8

2.0

2.2

2.4

FIGURE A.1: Contour Plot of Bivariate Normal Distribution

3, 2, xy.cov)

contours2 <- c(0.31, 0.315, 0.320, 0.3205)

contour(x.ruler2, y.ruler2, xy.grid2,

levels=contours2,

xlab="X",ylab="Y", cex=2)

7. Derive the exponential family form and b(θ) for the Inverse Gaussian

distribution: (2 points.)

f(x|µ, λ) =

(λ

2πx3

)1/2

exp

[− λ

2µ2x(x− µ)2

], x > 0, µ > 0.


Assume λ = 1.

I Answer

f(x|µ, λ) = exp

[−1

2log(2πx3)− 1

2µ2x(x− µ)2

]= exp

[−1

2log(2πx3)− 1

2µ−2x+ µ−1 − 1

2x−1

]

= exp

−1

2µ−2x︸︷︷︸θy

+ µ−1︸︷︷︸−b(θ)

−1

2log(2πx3)− 1

2x−1︸︷︷︸

c(y)

Since θ = − 1

2µ−2, we have µ = (−2θ)−

12 , which also gives us:

b(θ) = − 1

µ= −√−2θ, θ < 0.

9. Show that the Weibull distribution, f(y|γ, β) = γβ y

γ−1 exp(−yγ/β), is

an exponential family form when γ = 2, labeling each part of the final

answer.

I Answer

f(y|γ = 2, β) =2

βy exp(−y2/β)

= exp[log(2)− log(β) + log(y)− y2/β

]= exp

y2(−1/β)︸︷︷︸zθ

− log(β)︸︷︷︸b(θ)

+ log(y) + log(2)︸︷︷︸c(y)

Since θ = − 1

β , we have β = − 1θ , which also gives us:

b(θ) = log(−1

θ), θ < 0 (i.e. β > 0)

11. For normally distributed data a robust estimate of the standard devia-

tion can be calculated by (Devore 1995):

σ =1

0.6745n

n∑i=1

∣∣Xi − X∣∣ .


Histogram of FARM

FARM

Fre

quen

cy

0 20000 40000 60000 80000

02

46

810

1214

FIGURE A.2: Finding Influential Outliers

Write an R function to calculate σ for the FARM variable in Exercise 15

below. Compare this value to the standard deviation. Is there evidence

of influential outliers in this variable?

I Answer

The robust estimate of the standard deviation is 12269.88, which is

smaller than the original standard deviation (13251.51). So, we can say

that there are outliers. From Figure A.2, we can also confirm that we

have one outlier.

< R.Code >

sum <- 0

for (i in 1:length(FARM)) dis <- abs(FARM[i] - mean(FARM))

sum <- sum + dis


robustsd <- (1/(0.6745*length(FARM)))*sum

originalsd <- sd(FARM)

result <- cbind(originalsd, robustsd)

colnames(result) <- c("original s.d.", "robust s.d.")

hist(FARM, breaks=25) # this shows there is one outlier

> print(result)

original s.d. robust s.d.

[1,] 13251.51 12269.88

13. Derive the generalized hat matrix from β produced the final (nth) IWLS

step. Specifically, give the quantities of interest from the nth and (n −1)th step, show how the hat matrix is produced. Secondly, give the form

of Cook’s D appropriate to a GLM and show that it contains hat matrix

values.

I Answer

(a) Derive the generalized hat matrix.

At the nth (last) step of IWLS, we have:

βn = (X′wn−1X)−1X′wn−1zn−1,

where zn−1 and wn−1 are defined as:

zn−1 = ηn−1 +

(∂η

∂µ

∣∣∣∣µn−1

)(y − µn−1),

w−1n−1 =

(∂η

∂µ

∣∣∣∣µn−1

)2(v(µ)

∣∣∣∣µn−1

), and

v(µ) =∂2b(θ)

∂θ2

Then, we have:

X∗ = w1/2n−1X,

and

H∗ =(w

1/2n−1X

)(Xw

1/2n−1X

)1/2 (w

1/2n−1X

)′.


(b) Cook’s distance for a GLM

Consider the Cook’s distance for a linear model:

Di =(β(i) − β)(X′X)(β(i) − β)

p MSE,

where p is the number of parameters in the model, and MSE is the

mean square error of the regression model.

Then, it can be modified by inserting the IWLS weighting:

Di(GLM) =(β(i) − β)(X′wn−1X)(β(i) − β)

pφ,

where φ is the standard GLM dispersion parameter.

And, the relationship to the hat matrix is found in inserting the

equivalence:

β(i) − β =−(X′X)−1Xir

′i

p(1− hii), r′i =

yi − µi√1− hii

,

where hii is the i-th diagonal element of the hat matrix, H∗.

15. The likelihood function for dichotomous choice regression is given by:

L(b,y) =

n∏i=1

[F (xib)]yi [1− F (xib)]

1−yi .

There are several common choices for the F () function:

Logit: Λ(xib) = 11+exp[xib]

Probit: Φ(xib) =∫ xib

−∞1√2π

exp[−t2/2]dt

Cloglog: CLL(xib) = 1− exp (−exp(xib)).


State FDR PRE.DEP POST.DEP FARM

Alabama 1 323 162 4067

Arizona 1 600 321 6100

Arkansas 1 310 157 8134

California 1 991 580 83371

Colorado 1 634 354 10167

Connecticut 0 1024 620 10167

Delaware 0 1032 590 3050

District of Columbia 1 1269 1054 0

Florida 1 518 319 14234

Georgia 1 347 200 10167

Idaho 1 507 274 7117

Illinois 1 948 486 27451

Indiana 1 607 310 11184

Iowa 1 581 297 28468

Kansas 1 532 266 22368

Kentucky 1 393 211 8134

Louisiana 1 414 241 10167

Maine 0 601 377 6100

Maryland 1 768 512 14234

Massachusetts 1 906 613 14234

Michigan 1 790 394 14234

Minnesota 1 599 363 25418

Mississippi 1 286 127 4067

Missouri 1 621 365 15251

Montana 1 592 339 10167

Nebraska 1 596 307 17284

Nevada 1 868 550 4067

New Hampshire 0 686 427 3050

New Jersey 1 918 587 18301

New Mexico 1 410 208 5084

New York 1 1152 1676 38635

North Carolina 1 332 187 8134

North Dakota 1 382 176 14234

Ohio 1 771 400 18301

Oklahoma 1 455 216 9150

Oregon 1 668 379 11184

Pennsylvania 0 772 449 25418

Rhode Island 1 874 575 2033

South Carolina 1 271 159 7117

South Dakota 1 426 189 8134

Tennessee 1 378 198 6100

Texas 1 479 266 33552

Utah 1 551 305 4067

Vermont 0 634 365 5084

Virginia 1 434 284 15251

Washington 1 741 402 14234

West Virginia 1 460 257 6100

Wisconsin 1 673 362 21351

Wyoming 1 675 374 5084

Using the depression era economic and electoral data, calculate a di-

chotomous regression model for using each of these link functions ac-

cording to the following model specification:

FDR ∼ f [(POST.DEP − PRE.DEP ) + FARM ]

where: FDR indicates whether or not Roosevelt carried that state in the

1932 presidential elections, PRE.DEP is the mean per-state income be-

fore the onset of the Great Depression (1929) in dollars, POST.DEP is

the mean per-state income after the onset of the great depression (1932)

in dollars, and FARM is the total farm wage and salary disbursements

in thousands of dollars per state in 1932.


Do you find substantially different results with the three link functions?

Explain. Which one would you use to report results? Why?

I Answer

Table A.2 shows the results from three different models. In general, all

three models lead to the same substantive conclusion (i.e. Z-values from

different models do not differ a lot). While AIC of the logit model is

the lowest, all AIC’s are very similar. So, we cannot tell which model

is better. Nevertheless, if we have to report only one result, it would be

better to report the logit model because it has the lowest AIC.

Table A.1: Three Different Binomial Models

Estimate Std. Error z value Pr(>|z|)

LOGIT

(intercept) 4.9261 1.9459 2.5315 0.0114

(POST.DEP - PRE.DEP) 0.0152 0.0070 2.1728 0.0298

FARM 0.0001 0.0001 1.5229 0.1278

AIC 34.7525

PROBIT

(intercept) 2.8516 1.0065 2.8332 0.0046

(POST.DEP - PRE.DEP) 0.0085 0.0037 2.3094 0.0209

FARM 0.0001 0.0000 1.4493 0.1473

AIC 34.8669

CLOGLOG

(intercept) 2.1827 0.8285 2.6346 0.0084

(POST.DEP - PRE.DEP) 0.0071 0.0033 2.1643 0.0304

FARM 0.0000 0.0000 1.2725 0.2032

AIC 35.1390

< R.Code >

FDR <- c(1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,

0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,

1,1,0,1,1,1,1,1,1,0,1,1,1,1,1)

PREDEP <- c(323, 600, 310, 991, 634, 1024, 1032, 1269,


518, 347, 507, 948, 607, 581, 532, 393,

414, 601, 768, 906, 790, 599, 286, 621,

592, 596, 868, 686, 918, 410, 1152, 332,

382, 771, 455, 668, 772, 874, 271, 426,

378, 479, 551, 634, 434, 741, 460, 673, 675)

POSTDEP <- c(162, 321, 157, 580, 354, 620, 590, 1054,

319, 200, 274, 486, 310, 297, 266, 211,

241, 377, 512, 613, 394, 363, 127, 365,

339, 307, 550, 427, 587, 208, 1676, 187,

176, 400, 216, 379, 449, 575, 159, 189,

198, 266, 305, 365, 284, 402, 257, 362, 374)

FARM <- c(4067, 6100, 8134, 83371, 10167, 10167, 3050,

0, 14234, 10167, 7117, 27451, 11184, 28468,

22368, 8134, 10167, 6100, 14234, 14234,

14234, 25418, 4067, 15251, 10167, 17284,

4067, 3050, 18301, 5084, 38635, 8134, 14234,

18301, 9150, 11184, 25418, 2033, 7117, 8134,

6100, 33552, 4067, 5084, 15251, 14234, 6100,

21351, 5084)

DEP <- POSTDEP - PREDEP

logit <- glm(FDR ~ I(POSTDEP - PREDEP) + FARM,

family=binomial(link=logit))

probit <- glm(FDR ~ I(POSTDEP - PREDEP) + FARM,

family=binomial(link=probit))

cloglog <- glm(FDR ~ I(POSTDEP - PREDEP) + FARM,

family=binomial(link=cloglog))

summary(logit); summary(probit); summary(cloglog)

result <- rbind(summary(logit)$coef,

summary(probit)$coef,

summary(cloglog)$coef)

aic <- c(AIC(logit), AIC(probit), AIC(cloglog))

17. Tobit regression (censored regression) deals with an interval-measured

outcome variable that is censored such that all values that would have

naturally been observed as negative are reported as zero, generalizeable

to other values (Tobin 1958, Amemiya (1985, Chapter 10), Chib (1992).

There can be left censoring and right censoring at any arbitrary value,

single and double censoring, mixed truncating and censoring. If z is a


latent outcome variable in this context with the assumed relation:

z = xβ + η and zi ∼ N (xβ, σ2),

then for left censoring at zero, the observed outcome variable is produced

according to:

yi =

zi if zi > 0

0 if zi ≤ 0.

The resulting likelihood function is:

L(β, σ2|y,x) =∏yi=0

[1−Φ

(xiβ

σ

)] ∏yi>0

(σ−1) exp

[− 1

2σ2(yi − xiβ)2

],

where σ2 is called the scale. Explain the structure of the likelihood

function: what parts accommodate observed values and what parts ac-

commodate unobserved values? Replicate Tobin’s original analysis in

R using the survreg function in the survival package. The data are

obtainable by:

tobin <- read.table(

"http://artsci.wustl.edu/~jgill/data/tobin.dat",header=TRUE)

or in the BaM package.

I Answer < R.Code >

tobin.tob <- survreg(Surv(durable,durable>0,type=’left’) ~ age + quant, dist=’gaussian’,data=tobin)

summary(tobin.tob)

Value Std. Error z p

(Intercept) 15.1449 16.0795 0.942 3.46e-01

age -0.1291 0.2186 -0.590 5.55e-01

quant -0.0455 0.0583 -0.782 4.34e-01

Log(scale) 1.7179 0.3103 5.536 3.10e-08

Scale= 5.57

Gaussian distribution

Loglik(model)= -28.9 Loglik(intercept only)= -29.5


Chisq= 1.1 on 2 degrees of freedom, p= 0.58

Number of Newton-Raphson Iterations: 3

n= 20

19. Sometimes when estimation is problematic, Restricted Maximum Like-

lihood (REML) estimation is helpful (Bartlett 1937). REML uses a

likelihood function calculated from a transformed set of data so that

some parameters have no effect on the estimation on the others. For a

One-Way ANOVA, y|G ∼ N (Xβ + ZG, σ2I), G ∼ N (0, σ2D), such

that V ar[y] = V ar[ZG] + V ar[ε] = σ2ZDZ′ + σ2I. The uncondi-

tional distribution is now: y ∼ N (Xβ, σ2(I + ZDZ′) where D needs

to be estimated. If V = I + ZDZ′, then the likelihood for the data is:

`(β, σ,D|y) = −n2 log(2π) − 12 log |σ2V | − 1

2σ2 (y −Xβ)′V−1(y −Xβ).

The REML steps for this model are: (1) find a linear transformation k

such that k′X = 0, so k′y ∼ N (0, k′σk), (2) run MLE on this model

to get D, which no longer has any fixed effects, (3) then estimate the

fixed effects with ML in the normal way. Code this procedure in R and

run the function on the depression era economic and electoral data in

Exercise ??.

bayesian methods: a social and behavioral sciences...

Documents