bayesian methods: a social and behavioral sciences...
TRANSCRIPT
Hong Min Park & Jeff Gill
Bayesian Methods: ASocial and BehavioralSciences Approach,ANSWER KEY
SECOND EDITION
September 2010
CRC PRESS
Boca Raton Ann Arbor London Tokyo
Contents
vii
List of Tables
ix
List of Figures
xi
1
Background and Introduction
Exercises
1. Restate the three general steps of Bayesian inference from page 5 in your
own words.
I Answer
• Use your prior knowledge of the system or set of parameters you
are interested in and incorporate this into the modeling process.
This should be done by using probability distributions and may be
very specific or fairly vague or uncertain.
• Use a parametric model (i.e. assume its form) to describe what you
think your data looks like. Then combine what your data says and
what you thought the system/parameters behaves like to uncover
more about the system/parameters you are investigating.
• Investigate how good your model is and to what extent the results
were dependent on your initial ideas you might have had about the
system or set of parameters you are interested in.
3. Rewrite Bayes’ Law when the two events are independent. How do you
interpret this?
I Answer
If A and B are independent then the joint probability of A and B is
p(A,B) = p(A)p(B). Again according to the conditional probability,
1
2Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
p(A|B) =p(A,B)
p(B)= p(A)
and
p(B|A) =p(A,B)
p(A)= p(B)
Hence Bayes’ Law is trivial here if A and B are independent because
p(A|B) =p(B|A)p(A)
p(B)
= p(B)p(A)
p(B)
= p(A).
Incorporating B into the model will not help or obtain any further in-
formation about A.
5. Suppose f(θ|X) is the posterior distribution of θ given the data X.
Describe the shape of this distribution when the mode, argmaxθ
f(θ|X),
is equal to the mean,∫θθf(θ|X)dθ.
I Answer
The distribution f(θ|X) is symmetric.
7. Using R run the Gibbs sampling function given on page 32. What effect
do you seen in varying the B parameter? What is the effect of producing
200 sampled values instead of 5000?
I Answer
From Figure 1.1 below, we know that, when we change the B parameter,
the horizontal axis of the graphs (both in x and y graphs) changes and
the distributions become more skewed and flat to the right in B = 10
than in B = 5 and in B = 1. When we change the m value (the number
of sampling) from 200 to 5000, the density graphs become smoother.
< R.Code >
Background and Introduction 3
< Marginal Distribution of x >
B=1, m=5000
Marginal Distribution of x
Den
sity
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.4
0.8
1.2
B=5, m=5000
Marginal Distribution of x
Den
sity
0 1 2 3 4 5
0.0
0.2
0.4
0.6
0.8
B=10, m=5000
Marginal Distribution of x
Den
sity
0 2 4 6 8 10
0.0
0.2
0.4
B=1, m=200
Marginal Distribution of x
Den
sity
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.4
0.8
1.2
B=5, m=200
Marginal Distribution of x
Den
sity
0 1 2 3 4 5
0.0
0.2
0.4
0.6
0.8
B=10, m=200
Marginal Distribution of x
Den
sity
0 2 4 6 8 10
0.0
0.2
0.4
< Marginal Distribution of y >
B=1, m=5000
Marginal Distribution of y
Den
sity
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.4
0.8
1.2
B=5, m=5000
Marginal Distribution of y
Den
sity
0 1 2 3 4 5
0.0
0.2
0.4
0.6
0.8
B=10, m=5000
Marginal Distribution of y
Den
sity
0 2 4 6 8 10
0.0
0.2
0.4
B=1, m=200
Marginal Distribution of y
Den
sity
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.5
1.0
1.5
B=5, m=200
Marginal Distribution of y
Den
sity
0 1 2 3 4 5
0.0
0.2
0.4
0.6
0.8
B=10, m=200
Marginal Distribution of y
Den
sity
0 2 4 6 8 10
0.0
0.2
0.4
FIGURE 1.1: Plots of marginal distributions for different values of B and m.
4Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
# Define the Gibbs sampling algorithm
gibbs <- function(B, m) B <- B; k <- 15; m <- m; x <- NULL; y <- NULL
while (length(x) < m) x.val <- c(runif(1,0,B),rep((B+1),length=k))
y.val <- c(runif(1,0,B),rep((B+1),length=k))
for (j in 2:(k+1)) while(x.val[j] > B) x.val[j] <- rexp(1,y.val[j-1])
while(y.val[j] > B) y.val[j] <- rexp(1,x.val[j])
x <- c(x,x.val[(k+1)])
y <- c(y,y.val[(k+1)])
return(list("x"=x, "y"=y))
# Try different values
gibbs1 <- gibbs(B=1, m=5000)
gibbs2 <- gibbs(B=5, m=5000)
gibbs3 <- gibbs(B=10, m=5000)
gibbs4 <- gibbs(B=1, m=200)
gibbs5 <- gibbs(B=5, m=200)
gibbs6 <- gibbs(B=10, m=200)
# Draw graphs
par(mfrow=c(2,3), mgp=c(1.75,.25,0), mar=c(3,3,3,0), tcl=-0.2)
hist(gibbs1[["x"]], freq=FALSE, border="white",
col="gray", ylab="Density", main="B=1, m=5000",
xlab="Marginal Distribution of x")
lines(density(gibbs1[["x"]]))
hist(gibbs2[["x"]], freq=FALSE, border="white",
col="gray", ylab="Density", main="B=5, m=5000",
xlab="Marginal Distribution of x")
lines(density(gibbs2[["x"]]))
hist(gibbs3[["x"]], freq=FALSE, border="white",
col="gray", ylab="Density", main="B=10, m=5000",
xlab="Marginal Distribution of x")
lines(density(gibbs3[["x"]]))
hist(gibbs4[["x"]], freq=FALSE, border="white",
col="gray", ylab="Density", main="B=1, m=200",
Background and Introduction 5
xlab="Marginal Distribution of x")
lines(density(gibbs4[["x"]]))
hist(gibbs5[["x"]], freq=FALSE, border="white",
col="gray", ylab="Density", main="B=5, m=200",
xlab="Marginal Distribution of x")
lines(density(gibbs5[["x"]]))
hist(gibbs6[["x"]], freq=FALSE, border="white",
col="gray", ylab="Density", main="B=10, m=200",
xlab="Marginal Distribution of x")
lines(density(gibbs6[["x"]]))
hist(gibbs1[["y"]], freq=FALSE, border="white",
col="gray", ylab="Density", main="B=1, m=5000"
xlab="Marginal Distribution of y")
lines(density(gibbs1[["y"]]))
hist(gibbs2[["y"]], freq=FALSE, border="white",
col="gray", ylab="Density", main="B=5, m=5000",
xlab="Marginal Distribution of y")
lines(density(gibbs2[["y"]]))
hist(gibbs3[["y"]], freq=FALSE, border="white",
col="gray", ylab="Density", main="B=10, m=5000",
xlab="Marginal Distribution of y")
lines(density(gibbs3[["y"]]))
hist(gibbs4[["y"]], freq=FALSE, border="white",
col="gray", ylab="Density", main="B=1, m=200",
xlab="Marginal Distribution of y")
lines(density(gibbs4[["y"]]))
hist(gibbs5[["y"]], freq=FALSE, border="white",
col="gray", ylab="Density", main="B=5, m=200",
xlab="Marginal Distribution of y")
lines(density(gibbs5[["y"]]))
hist(gibbs6[["y"]], freq=FALSE, border="white",
col="gray", ylab="Density", main="B=10, m=200",
xlab="Marginal Distribution of y")
lines(density(gibbs6[["y"]]))
9. Rerun the Metropolis algorithm on section 1.8.2 in R but replacing the
uniform generation of candidate values in cand.gen with a normal trun-
cated to fit in the appropriate range. What differences do you observe?
6Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
Mar. dist. of x when using unif. for cand.gen
x
Density
0 2 4 6 8
0.0
0.1
0.2
0.3
0.4
Mar. dist. of y when using unif. for cand.gen
y
Density
0 2 4 6 8
0.00
0.05
0.10
0.15
Mar. dist. of x when using tnorm. for cand.gen
x
Density
0 2 4 6 8
0.0
0.1
0.2
0.3
0.4
Mar. dist. of y when using tnorm. for cand.gen
y
Density
0 2 4 6 8
0.00
0.05
0.10
0.15
0.20
FIGURE 1.2: Plots of the marginal distributions of x and y for different
candidate generating distributions.
I Answer
From Figure 1.2 below, we know that a normal truncated distribution
does not make a big difference.
< R.Code >
# Setup
require(msm) # for the truncated normal distribution
m <- 5000; x <- 0.5; y <- 0.5 # initial values
L1 <- 0.5; L2 <- 0.1; L <- 0.01; B <- 8 # initial values
# Define the bivariate exponential distribution
biv.exp <- function (x, y, L1, L2, L) exp ( -(L1+L)*x - (L2+L)*y - L*max(c(x,y)))
Background and Introduction 7
# Define the M-H algorithm
mh <- function (A) if (A == 0) cand.gen <- function (max.x, max.y) c (runif (1, 0, max.x), runif (1, 0, max.y))
else cand.gen <- function (max.x, max.y) c (rtnorm (1, 0, 4, lower=0, upper=max.x),
rtnorm (1, 0, 10, lower=0,upper=max.y))
for (i in 1:m) cand.val <- cand.gen (B, B)
a <- biv.exp (cand.val[1], cand.val[2], L1, L2, L) /
biv.exp (x[i], y[i], L1, L2, L)
if (a > runif(1)) x <- c (x, cand.val[1])
y <- c (y, cand.val[2])
else x <- c (x, x[i])
y <- c (y, y[i])
return (list ("x" = x, "y" = y))
# Case with uniform cand.gen distribution is mh1
# Case with truncated normal cand.gen distribution is mh2
mh1 <- mh (0)
mh2 <- mh (1)
par(mfrow = c (2, 2))
hist (mh1[["x"]], freq=FALSE, xlim=c(0,8), xlab="x",
main="Mar. dist. of x when using unif. for cand.gen")
hist (mh1[["y"]], freq=FALSE, xlim=c(0,8), xlab="y",
main="Mar. dist. of y when using unif. for cand.gen")
hist (mh2[["x"]], freq=FALSE, xlim=c(0,8), xlab="x",
8Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
main="Mar. dist. of x when using tnorm. for cand.gen")
hist (mh2[["y"]], freq=FALSE, xlim=c(0,8), xlab="y",
main="Mar. dist. of y when using tnorm. for cand.gen")
11. If p(D|θ) = 0.5, and p(D) = 1, calculate the value of p(θ|D) for priors
p(θ), [0.001, 0.01, 0.1, 0.9].
I Answer
According to Bayes’ Law:
p(θ|D) =p(θ)p(D|θ)p(D)
Therefore, when priors p(θ) equals [0.001, 0.01, 0.1, 0.9], the value of
p(θ|D) are [0.0005, 0.005, 0.05, 0.45], respectively.
13. Sometimes Bayesian results are given as posterior odds ratios, which for
two possible alternative hypotheses is expressed as:
odds(θ1, θ2) =p(θ1|D)
p(θ2|D).
If the prior probabilities for θ1 and θ2 are identical, how can this be
re-expressed using Bayes’ Law?
I Answer
odds(θ1, θ2) =
p(θ1)p(D|θ1)p(D)
p(θ2)p(D|θ2)p(D)
=p(D|θ1)
p(D|θ2)
15. Suppose we had data, D, distributed p(D|θ) = θe−θD as in Section 1.3,
but now p(θ) = 1/θ, for θ ∈ (0:∞). Calculate the posterior mean.
I Answer
∫ ∞0
1
θ× θe−θDdθ =
∫ ∞0
e−θDdθ =
[−e−θD
D
]∞0
=1
D
Background and Introduction 9
17. Since the posterior distribution is a compromise between prior informa-
tion and the information provided by the new data, then it is interesting
to compare relative strengths. Perform an experiment where you flip a
coin 10 times, recording the data as zeros and ones. Produce the pos-
terior expected value (mean) for two priors on p (the probability of a
heads): a uniform distribution between zero and one, and a beta distri-
bution (Appendix B) with parameters [10, 1]. Which prior appears to
influence the posterior mean more than the other?
I Answer
Suppose that the data you got from an experiment is [1, 1, 0, 1, 0, 0,
1, 1, 0, 1]. Consider the beta-binomial model (refer to section 2.3.3 for
the complete discussion). Then, we have:
Xi ∼ BR(p) → y =
10∑i=1
Xi ∼ BN (10, 6)
Now, we have two options for the prior specification of p, which produces
the Table 1.1. Therefore, the beta distribution (BE(10, 1)) influences the
posterior mean more than the uniform distribution (U[0, 1]) does.
Table 1.1: Posteriors from Different Priors
prior of p prior mean posterior of p posterior mean
U[0, 1] = BE(1, 1) 0.5 BE(7, 5) 7/12 = 0.58
BE(10, 1) 10/11 = 0.91 BE(16, 5) 16/21 = 0.76
19. The Stopping Rule Principle states that if a sequence of experiments
(trials) is governed by a stopping rule, η, that dictates cessation, then
inference about the unknown parameters of interest must depend on
η only through the collected sample. Obvious stopping rules include
10Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
setting the number of trials in advance (i.e. reaching n is the stopping
rule), and stopping once a certain number of successes are achieved.
Consider the following experiment and stopping rule. Standard normal
distributed data are collected until the absolute value of the mean of the
data exceeds 1.96/√n. Explain why this fails as a non-Bayesian stopping
rule for testing the hypothesis that µ = 0 (the underlying population is
zero), but is perfectly acceptable for Bayesian inference.
I Answer
In a non-Bayesain setting, we are not allowed to draw parameter values
from its distribution for the purpose of hypothesis testing. We are to
replicate the sampling distribution several times and to check how often
those replications produce an interval (usually an interval, although not
neccesarry to be an interval with a particular fixed width) that contains
the true value of the parameter.
The stopping rule here involves an assumption in the distributional na-
ture of the parameter value, which is not allowed in a non-Bayesian
setting as explained. However, the Bayesian setup is to assume that
the parameter of our interest has a probability distribution, from which
we can draw values. Therefore, the stopping rule works under Bayesian
inference.
2
Specifying Bayesian Models
Exercises
1. Suppose that 25 out of 30 firms develop new marketing plans during
the next year. Using the beta-binomial model from Section 2.3.3, apply
a BE(0.5, 0.5) (Jeffreys prior) and then specify normal prior centered
at zero and truncated to fit on [0:1] as prior distributions and plot the
respective posterior densities. What differences do you observe?
I Answer
We know that there are n = 30 firms, y = 25 develop new marketing
plans. Suppose θ is the unknown parameter for the probability of an
firm to develop a new market plan.
Applying a Jefferys prior BE(A = 0.5, B = 0.5), we get:
p(θ|y) ∝ p(θ)p(y|θ)
∝ BE(θ|0.5, 0.5)BN (25|30, θ)
∝ BE(θ|25 + 0.5, 30− 25 + 0.5)
Applying a normal prior N (0, 1), we get:
p(θ|y) ∝ p(θ)p(y|θ)
= N (θ|0, 1)BN (25|30, θ)
=1√2π
exp
(−θ2
2
)(30
25
)θ25(1− θ)5
∝ exp
(−θ2
2
)θ25(1− θ)5
11
12Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
0.0 0.2 0.4 0.6 0.8 1.0
Posterior Density of p((θθ|y)) w/ Different Priors
w/ Jeffreys Prior Beta(0.5,0.5)w/ Normal Prior N(0,1)
FIGURE 2.1: The Posterior Density of p(θ|y) with Different Priors
The posterior density of p(θ|y) using Jeffreys prior BE(0.5, 0.5) and using
the normal prior N (0,1) differ in their posterior mean. Applying the
normal prior drags the data mean slightly leftward than applying the
Jeffreys prior.
< R.Code >
# Setup
n <- 30; y <- 25; A <- 0.5; B <- 0.5
par(mar=c(1,1,3,1),mgp=c(1.5,0.25,0),
tcl=-0.2, cex.axis=0.8, cex.main=1)
# Jeffrey’s prior Beta(0.5,0.5)
curve(dbeta(x, y+A, n-y+B),
from=0, to=1, n=1000, # n=1000 to smooth the curv
ylim=c(0,6.2),
xaxs="i", yaxs="i", yaxt="n", bty="n",
ylab="", xlab="",
main=expression("Posterior Density of "
*p(theta*"|"*y)*" w/ Different Priors"))
# Normal prior N(0,1)
# Intentionally used the log posterior first, and then make it
# normalized just for the computational purpose
unnormalized.post <- function(x) exp(-x^2/2*1 + 25*log(x) + 5*log((1-x)))
Specifying Bayesian Models 13
k <- integrate(unnormalized.post, 0, 1)$value
posterior <- function(x) unnormalized.post(x)/k
# insert k back in to normalized posterior
curve(posterior(x), from=0, to=1, n=1000, lty=2, add=T)
legend(locator(1), lty=c(1,2), lwd=c(1,1),
legend=c("w/ Jeffreys Prior Beta(0.5,0.5)",
"w/ Normal Prior N(0,1)"), cex=0.8, box.lty=0)
3. Prove that the gamma distribution,
f(µ|α, β) =1
Γ(α)βαµα−1e−βµ, µ, α, β > 0,
is the conjugate prior distribution for µ in a Poisson likelihood function,
f(y|µ) =
(n∏i=1
yi!
)−1
exp
[log(µ)
n∑i=1
yi
]exp[−nµ],
that is, calculate a form for the posterior distribution of µ and show
that it is also gamma distributed.
I Answer
The posterior distribution of µ is:
p(µ|y) ∝ p(µ|α, β)p(y|µ)
=1
Γ(α)βα︸ ︷︷ ︸
constant
µα−1 exp(−βµ)
×
(n∏i=1
yi!
)−1
︸ ︷︷ ︸constant
exp
[log(µ)
n∑i=1
yi
]exp[−nµ]
∝ µα−1 exp(−βµ) exp
[log(µ)
n∑i=1
yi
]exp[−nµ]
= µα−1µny exp(−(β + n)µ)
= µ(α+ny)−1e−(β+n)µ
So, µ|y ∝ G(α+ ny, β + n). (Q.E.D.)
14Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
5. Use the gamma-Poisson conjugate specification developed in the last
question to analyze the following count data on worker strikes in Ar-
gentina over the period 1984 to 1993, from McGuire (1996). Assign
your best guess as to reasonable values for the two parameters of the
gamma distribution: α and β. Produce the posterior distribution for µ
and describe it with quantiles and graphs using empirically simulated
values according to the following procedure:
Economic Sector Number of Strikes
Public Administrators 496
Teachers 421
Metalworkers 199
Municipal Workers 186
Private Hospital Workers 181
Bank Employees 133
Court Clerks 128
Bus Drivers 113
Construction Workers 92
Doctors 83
Nationalized Industries 77
Railway Workers 76
Maritime Workers 57
Meat Packers 56
Paper Industry Workers 55
Sugar Industry Workers 50
Public Services 47
University Staff Employees 43
Telephone Workers 39
Textile Workers 37
State Petroleum Workers 32
Food Industry Workers 28
Post Office Workers 26
Locomotive Drivers 25
Light and Power Workers 21
Total 2701
• The posterior distribution for µ is gamma(δ1, δ2) according to some
parameters δ1 and δ2 that you derived above which of course de-
pends on your choice of the gamma parameters.
• Generate a large number of values from this distribution in R, say
10,000 or so, using the command:
posterior.sample <- rgamma(10000,d1,d2)
Specifying Bayesian Models 15
• Produce posterior quantiles, such as the interquartile range, ac-
cording to:
iqr.posterior <- c(sort(posterior.sample)[2500],
sort(posterior.sample)[7500])
Note: the IQR function in R gives a single value for the difference,
which is not as useful.
• Graph the posterior in different ways, such as with a smoother
like lowess (a local-neighborhood smoother, see Cleveland [1979,
1981]):
post.hist <- hist(posterior.sample,plot=F,breaks=100)
plot(lowess(post.hist$mids,post.hist$intensities),
type="l")
I Answer
Since the Gamma distribution has α/β as its mean and α/β2 as its
variance, I subjectively assign a gamma prior G(100, 1). And we know
there are n = 25 sectors. The number of the total strikes is nx =
2701. Thus the posterior distribution for µ is G(100 + 2701, 1 + 25) =
G(2801, 26). And, the posterior quantile is [106.38, 109.06]:
84 86 88 90 92 94 96
0.00
0.05
0.10
0.15
Posterior Distribution of µµ
µµ
Den
sity
FIGURE 2.2: Posterior distribution of µ.
16Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
< R.Code >
A <- 100; B <- 1; nx <- 2701; n <- 25
d1 <- A + nx; d2 <- B + n
post.sample <- rgamma(10000,d1,d2)
iqr.post <- c(sort(post.sample)[2500],
sort(post.sample)[7500])
round(iqr.post,2)
post.hist <- hist(post.sample, plot=F, breaks=100)
plot(lowess(post.hist$mids, post.hist$intensities),
type="l", xlab=expression(mu), ylab="Density",
main=expression("Posterior Distribution of"*~~mu))
7. In his original essay (1763, 376) Bayes offers the following question:
Given the number of times in which an unknown event has
happened and failed: Required the chance that the probability
of its happening in a single trial lies somewhere between any
two degrees of probability that can be named.
Provide an analytical expression for this quantity using an appropriate
uniform prior (Bayes argued reluctantly for the use of the uniform as a
“no information” prior: Bayes postulate).
I Answer
θ is defined for the unknown event of our interest, and D is defined
for the set of observations in which the event has happened and failed.
Then, the question can be rephrased as: ”Given that we know p(D|θ),what is p(θ|D)?”
Bayesian way of answering this question involves the specification of
prior guess on θ: p(θ). Since the probability of an event happening or not
is bounded between zero and one, and since we have no prior knowledge
on p(θ), the general solution is the uniform distribution: p(θ) = U[0, 1].
Therefore, we can describe some details on p(θ|D) from:
p(θ|D) ∝ p(θ)p(D|θ).
Specifying Bayesian Models 17
9. Suppose we have two urns containing marbles; the first contains 6 red
marbles and 4 green marbles, and the second contains 9 red marbles and
1 green marble. Now we take one marble from the first urn (without
looking at it) and put it in the second urn. Subsequently, we take one
marble from the second urn (again without looking at it) and put it in
the first urn. Give the full probabilistic statement of the probability of
now drawing a red marble from the first urn, and calculate its value.
I Answer
Define the events as follows:
R1: draw a red marble from the first urn
G1: draw a green marble from the first urn
R2: draw a red marble from the second urn
G2: draw a green marble from the second urn
Rf : draw a red marble at the final step
Then, we have:
Pr(Rf) = Pr(R1)× [Pr(R2|R1)× Pr(Rf |R1, R2)
+ Pr(G2|R1)× Pr(Rf |R1, G2)]
+ Pr(G1)× [Pr(R2|G1)× Pr(Rf |G1, R2)
+ Pr(G2|R1)× Pr(Rf |G1, G2)]
=6
10×[
10
11× 6
10+
1
11× 5
10
]+
4
10×[
9
11× 7
10+
2
11× 4
10
]=
674
1100= 0.6127
11. This is the famous envelope problem. You and another contestant are
each given one sealed envelope containing some quantity of money with
equal probability of receiving either envelope. All you know at the mo-
ment is that one envelope contains twice the cash as the other. So if you
open your envelope and observe $10, then the other envelope contains
18Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
either $5 or $20 with equal probability. You are now given the opportu-
nity to trade with the other contestant. Should you? The expected value
of the unseen envelope is E[other] = 0.5(5) + 0.5(20) = 12.50, meaning
that you have a higher expected value by trading. Interestingly, so does
the other player for analogous reasons. Now suppose you are offered the
opportunity to trade again before you open the newly traded envelope.
Should you? What is the expected value of doing so? Explain how this
game leads to infinite cycling. There is a Bayesian solution. Define M
as the known maximum value in either envelope, stipulate a probability
distribution, and identify a suitable prior.
I Answer
Even after deciding to switch envelope, the new expected value calcu-
lation for switching again is exactly same as before: If I had a value of
V, the expected gain from switching would be always 14V, regardless of
how many times I’ve already switched the envelope. This leads to the
infinite cycling. However, this is NOT true if I consider the conditional
probabilities.
Some notations:
V: the value that I saw before switching envelope
M: the known maximum value in either envelope
Pr(V): the probability that I saw V
Pr(M): the probability that I got M
Then, the question becomes what is Pr(M|V)? Several things to note:
Pr(V) is equal to one, by definition. Whichever envelope I choose, I am
guaranteed to see the value, which implies that my having M and seeing
V are independent with each other. Since I am not informed about M,
I have an uninformed prior on M: Pr(M) = Pr(∼M) = 12 .
Applying the Bayes rule, we have:
Pr(M|V) =Pr(V|M)Pr(M)
Pr(V)=
Pr(V|M)Pr(M)
1
Specifying Bayesian Models 19
=
Pr(V, M)Pr(M) × Pr(M)
1= Pr(V, M)
= Pr(V)Pr(M) = 1× Pr(M) =1
2
Pr(∼M|V) = 1− Pr(M|V) =1
2
Expected values of switching and not switching are:
E(switching) = Pr(M|V)× M
2+ Pr(∼M|V)×M
=1
2× M
2+
1
2×M =
3
4M
E(not switching) = Pr(M|V)×M + Pr(∼M|V)× M
2
=1
2×M +
1
2× M
2=
3
4M
Therefore, switching the envelope does NOT change my fortune. No
more paradox. No more infinite cycling.
13. If the posterior distribution of θ is N (1, 3), then calculate a 99% HPD
region for θ.
I Answer
[-2.98, 5.01]
< R.Code >
post <- rnorm(10000, mean=1, sd=sqrt(3))
hpd.post <- c(sort(post)[100], sort(post)[9900])
round(hpd.post, 2)
15. Assume that the data [1,1,1,1,1,1,0,0,0,1,1,1,0,1,0,0,1,1,1,1] are produced
from iid Bernoulli trials. Produce a 1−α confidence set for the unknown
value of p using a uniform prior distribution.
I Answer
Since n = 20 and y = 14, we have:
20Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
Xi ∼ BR(p) → y =
20∑i=1
Xi ∼ BN (20, 14)
Applying an uniform prior, U[0, 1] = BE(1, 1), we get the posterior as:
p|y ∼ BE(14 + 1, 20− 14 + 1) = BE(15, 7)
Based on the α values, we then have different HPDs:
α HPD
1 % (0.43, 0.88)
5 % (0.52, 0.83)
10 % (0.55, 0.80)
Table 2.1: HPD w/ Different Values
< R.Code >
post.beta <- rbeta(10000, 15, 7)
hpd1 <- c(sort(post.beta)[100], sort(post.beta)[9900])
hpd5 <- c(sort(post.beta)[500], sort(post.beta)[9500])
hpd10 <- c(sort(post.beta)[1000], sort(post.beta)[9000])
round(hpd1, 2); round(hpd5, 2); round(hpd10, 2)
17. The beta distribution, f(x|α, β) = Γ(α+β)Γ(α)Γ(β)x
α−1(1 − x)β−1, 0 < x <
1, α > 0, β > 0, is often used to model the probability parameter in a
binomial setup. If you were very unsure about the prior distribution of
p, what values would you assign to α and β to make it relatively “flat”?
I Answer
BE(1, 1), which is idential to U[0, 1].
19. Laplace (1774, 28) derives Bayes’ Law for uniform priors. His claim is
Specifying Bayesian Models 21
. . . je me propose de determiner la probabilite des causes par
les evenements matiere neuve a bien de egards et qui merit
d’autant plus d’etre cultivee que c’est principalement sous ce
point de vue que la science des hasards peut etre utile la vie
civile.
He starts with two events: E1 and E2 and n causes: A1, A2, . . . , An. The
assumptions are: (1) Ei are conditionally independent given Ai, and
(2) Ai are equally probable. Derive the Laplace’s inverse probability
relation:
p(Ai|E) =p(E|Ai)∑j p(E|Aj)
.
I Answer
Since Ai are equally probable, ∀i, Pr(Ai) = a (constant). Therefore, we
have:
(LHS) = p(Ai|E) =p(E|Ai)p(Ai)
p(E)
=p(E|Ai)p(Ai)∑nj=1 p(E|Aj)p(Aj)
=a× p(E|Ai)
a×∑nj=1 p(E|Aj)
=p(E|Ai)∑nj=1 p(E|Aj)
= (RHS) (Q.E.D.)
3
The Normal and Student’s-t Models
Exercises
1. The most important case of a two-parameter exponential family is when
the second parameter is a scale parameter. Designate ψ as such a scale
parameter, then the exponential family form expression of a PDF or
PMF is rewritten:
f(y|θ) = exp
[yθ − b(θ)a(ψ)
+ c(y, ψ)
].
Rewrite the normal PDF in exponential family form.
I Answer
f (y|θ) =(2πσ2
)− 12 exp
[− 1
2σ2(y − µ)
2
]= exp
[y2 − 2yµ+ µ2
2σ2+
(−1
2
)ln(2πσ2
)]
= exp
yµ− 12µ
2
σ2− y2
2σ2− 1
2ln(2πσ2
)︸ ︷︷ ︸
c(y,ψ)
= exp
[yθ − b (θ)
a (ψ)− c (y, ψ)
],
where θ = µ, b(θ) = 12µ
2 = 12θ
2, and a(ψ) = σ2.
3. Suppose the random variable X is distributed N (µ, σ2). Prove that the
random variable Y = (X − µ)/σ is N (0, 1): standard normal.
I Answer
23
24Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
Since X is distributed N (µ, σ2), we have:
p(X|µ, σ2) =1√2πσ
exp
(− 1
2σ2(X − µ)2
)
And, we know Y = (X − µ)/σ ⇒ X = µ+ σY , which gives:
|J| =∣∣∣∣∂X∂Y
∣∣∣∣ = σ
Therefore, we get:
p(Y |µ, σ2, X) = |J| × p(X|µ, σ2)
= σ × 1√2πσ
exp
(− 1
2σ2(µ+ σY − µ)2
)=
1√2π
exp
(−1
2Y 2
)= N (Y |0, 1) (Q.E.D.)
5. Missing Data. Suppose we have an iid sample of collected data: X1, X2, . . . , Xk, Yk+1, . . . , Yn ∼N (µ, 1), where the Yi values represent data that has gone missing. Spec-
ify a joint posterior for µ and the missing data with a N (0, 1/n) prior.
I Answer
p(µ,y|x) ∝ p(x|µ,y)p(y|µ)p(µ)
= N (x|µ, 1)N (y|µ, 1)N(µ
∣∣∣∣0, 1
n
)=
[k∏i=1
(2π)−12 exp
(−1
2(xi − µ)2
)]
×
[n∏
i=k+1
(2π)−12 exp
(−1
2(yi − µ)2
)]
×
[(2π
n
)− 12
exp
(−1
2nµ2
)]
The Normal and Student’s-t Models 25
∝ exp
(−1
2
k∑i=1
(xi − µ)2
)exp
(−1
2
n∑i=k+1
(yi − µ)2
)exp
(−1
2nµ2
)
= exp
[−1
2
(k∑i=1
x2i − 2µ
k∑i=1
xi + kµ2 +
n∑k+1
(yi − µ)2 + nµ2
)]
= exp
−1
2
k∑i=1
x2i − 2µ(kx) + kµ2 + (n− k)×Var(y)︸ ︷︷ ︸
=1
+nµ2
= exp
[−1
2
((k + n)µ2 − 2kxµ+
k∑i=1
x2i + n− k
)]
Although not neccessary here, we can also consider a marginal posterior
distribution of µ as,
p(µ|x) ∝ exp
[−1
2
((k + n)µ2 − 2kxµ
)]∝ exp
[−1
2
(11
k+n
)(µ− kx
k + n
)2]
∼ N(
kx
k + n,
1
k + n
)
7. Suppose X ∼ N (µ, µ), µ > 0. Give an expression for the conjugate
prior for µ.
I Answer
In an usual Bayesian normal model, the conjugate prior for the mean
parameter is the normal distribution and the conjugate prior for the
variance parameter is the inverse gamma distribution. So, I try both
priors in this case ad find that the inverse gamma, not normal, is the
conjugate prior for µ.
Suppose µ ∼ IG(α, β), where µ, α, β > 0. Then, the posterior becomes:
p(µ|X) ∝ p(µ)p(X|µ)
=βα
Γ(α)µ−α−1 exp
[−βµ
](1√2πµ
)nexp
[− 1
2µ
n∑i=1
(Xi − µ)2
]
26Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
∝ µ−α−n2−1 exp
[−βµ− 1
2µ× nVar(X)
]= µ−(α+n
2 )−1 exp
[− 1
µ
(β +
n
2Var(X)
)]= µ−(α+n
2 )−1 exp
[− 1
µ
(β +
n
2
(1
n
n∑i=1
X2i − X2
))]
= µ−(α+n2 )−1 exp
[− 1
µ
(β +
1
2
(n∑i=1
X2i − nX2
))]
∼ IG
(α+
n
2, β +
1
2
(n∑i=1
X2i − nX2
)),
which means that my choice of inverse gamma distribution as a prior
works for the conjugate prior for µ.
9. Returning to the example where the normal mean is known and the
posterior distribution for the variance parameter is developed. Plot a
figure that illustrates how the likelihood increases in relative importance
to the prior by performing the following steps:
(a) Plot the IG(5, 5) density over its support.
(b) Specify a posterior distribution for σ2 using the IG(5, 5) prior.
(c) Generate four random vectors of size: 10, 100, 200, and 500 dis-
tributed standard normal. Using these create four different pos-
terior distributions of σ2, and add the new density curves to the
existing plot.
(d) Label all distributions, axes, and legends.
Hint: modify the R code in the Computational Addendum to pro-
vide a posterior for the variance parameter instead of the mean param-
eter.
I Answer
< R.Code >
# specify the prior and the posteriors
prior <- rgamma(10000, shape=5, rate=5)
The Normal and Student’s-t Models 27
ig <- function (x) n <- length (x)
xhat <- sum (t(x)%*%x)
a1 <- 5 + n/2
b1 <- 5 + xhat/2
y <- rgamma (10000, shape=a1, rate=b1)
return(list("y"=y, "a1"=a1, "b1"=b1))
x1 <- rnorm(10); x2 <- rnorm(100)
x3 <- rnorm(200); x4 <- rnorm(500)
post1 <- ig (x1); post2 <- ig (x2)
post3 <- ig (x3); post4 <- ig (x4)
# plot prior first, and then posteriors
hist.prior <- hist(prior, xlim=c(0,2), plot=TRUE,
freq=FALSE, breaks=100)
FIGURE 3.1: Relative Importance of Likelihood to Prior
0.0 0.5 1.0 1.5 2.0
01
23
45
67
value
Pro
babi
lity
dens
ity
priorn=10
n=100
n=200
n=500
28Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
hist.post1 <- hist(post1[["y"]], xlim=c(0,2), plot=TRUE,
freq=FALSE, breaks=100)
hist.post2 <- hist(post2[["y"]], xlim=c(0,2), plot=TRUE,
freq=FALSE, breaks=100)
hist.post3 <- hist(post3[["y"]], xlim=c(0,2), plot=TRUE,
freq=FALSE, breaks=100)
hist.post4 <- hist(post4[["y"]], xlim=c(0,2), plot=TRUE,
freq=FALSE, breaks=100)
plot(c(0, 2), c(0, 7), type="n", main="",
ylab="Probability density", xlab="value", )
lines(hist.prior$density ~ hist.prior$mids, col=1)
lines(hist.post1$density ~ hist.post1$mids, col=2)
lines(hist.post2$density ~ hist.post2$mids, col=3)
lines(hist.post3$density ~ hist.post3$mids, col=4)
lines(hist.post4$density ~ hist.post4$mids, col=5)
text (0.6, 0.5, "prior", col=1)
text (0.5, 1.5, "n=10", col=2)
text (1.25, 2, "n=100", col=3)
text (1.25, 3.5, "n=200", col=4)
text (1.15, 5.5, "n=500", col=5)
11. Rejection Method. Like the normal, the Cauchy distribution is a uni-
modal, symmetric density of the location-scale family: C(X|θ, σ) =1πσ
1
1+( x−θσ )2 , where −∞ < X, θ < ∞, 0 < σ < ∞. Unlike the nor-
mal, the Cauchy distribution has very heavy tails, heavy enough so that
the Cauchy distribution has no moments and is therefore less easy to
work with. Given a C(0, 3) distribution, find the probability of observing
a point between 3 and 7. To do this, simulate 100,000 values of C(0, 3)
(in R this is done by rcauchy(100000,0,3)), and count the number
of points in the desired range. Graph your results.
I Answer
< R.Code >
# count the number of points in (3,7)
cau <- rcauchy (100000, 0, 3)
range <- cau[cau>=3 & cau<=7]
The Normal and Student’s-t Models 29
length(range); length(range)/length(cau)
# hist(cau, c(-10000000,seq(-10,10,1),10000000),
xlim=c(-10,10))
# graph it (w/ normals)
ruler <- seq(-10, 10, length=1000)
dcau <- dcauchy(ruler, 0, 3)
dnorm1 <- dnorm(ruler,0,1); dnorm2 <- dnorm(ruler,0,3)
plot(c(-10, 10), c(0, 0.4), type=’’n’’,
xlab="value", ylab="probability density")
lines(dcau ~ ruler, col=2)
lines(dnorm1 ~ ruler, col=1); lines(dnorm2 ~ ruler, col=4)
for (i in 1:80) a <- 3 + i/20
segments(a, 0, a, dcauchy(a, 0, 3), col=2)
text (2.3, 0.3, "N(0,1)", col=1)
text (4.5, 0.08, "N(0,3)", col=4)
text (8.5, 0.025, "C(0,3)", col=2)
FIGURE 3.2: Cauchy distribution with Normal’s
−10 −5 0 5 10
0.0
0.1
0.2
0.3
0.4
value
prob
abili
ty d
ensi
ty
N(0,1)
N(0,3)
C(0,3)
30Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
13. The expressions for the mean and variance of the inverse gamma were
supplied in (3.9). Derive these from the inverse gamma PDF. Show all
steps.
I Answer
The inverse gamma distribution is:
IG(x|α, β) =βα
Γ(α)x−α−1 exp
[−βx
], x, α, β > 0
First, the mean can be calculated as:
E[X] =
∫ ∞0
x× βα
Γ(α)x−α−1 exp
[−βx
]dx
=
∫ ∞0
β
α− 1× βα−1
Γ(α− 1)x−(α−1)−1 exp
[−βx
]dx
=β
α− 1×∫ ∞
0
βα−1
Γ(α− 1)x−(α−1)−1 exp
[−βx
]dx︸ ︷︷ ︸
=1
=β
α− 1
Next, the variance can also be calculated as:
V ar[X] = E[X2]− (E[X])2
=
∫ ∞0
x2 × βα
Γ(α)x−α−1 exp
[−βx
]dx−
(β
α− 1
)2
=
∫ ∞0
βα
Γ(α)x−α−1+2 exp
[−βx
]dx−
(β
α− 1
)2
=β2
(α− 1)(α− 2)
∫ ∞0
βα−2
Γ(α− 2)x−(α−2)−1 exp
[−βx
]dx︸ ︷︷ ︸
=1
− β2
(α− 1)2
=β2
(α− 1)(α− 2)− β2
(α− 1)2=
β2
(α− 1)2(α− 2)
The Normal and Student’s-t Models 31
15. Modify the function biv.norm.post given below so that it provides
posterior samples for multivariate normals, given specified priors and
data. Also modify the function normal.posterior.summary so that
the user can specify any level of density coverage.
I Answer
< R.Code >
require(MASS); require(bayesm)
multi.norm.post <- function(data.mat,alpha,beta,m,n0,
n.reps=1000) n <- nrow(data.mat)
xbar <- apply(data.mat, 2, mean)
S.sq <- (n-1)*var(data.mat)
Beta.inv <- solve(beta)
K <- (n0*n)/(n0+n)*(xbar-m)%*%t(xbar-m)
V.inv <- solve(Beta.inv + S.sq + K)
rep.mat <- NULL
for (i in 1:n.reps) Sigma <- rwishart(alpha+n,V.inv)$IW
mu <- mvrnorm(1, (n0*m + n*xbar)/(n0+n),
Sigma/(n0+n))
out <- c(mu1=mu[1],mu2=mu[2],sig1=Sigma[1,1],
sig2=Sigma[2,2],rho=Sigma[2,1])
rep.mat <- rbind(rep.mat, out)
return(rep.mat)
normal.posterior.summary <- function(reps, alpha)
reps[,5] <- reps[,5]/sqrt(reps[,3]*reps[,4])
n <- nrow(reps)
reps <- apply(reps, 2, sort); bound <- (1-alpha)/2
L <- round(bound*n); H <- round((1-bound)*n)
out.mat <- cbind("mean"= apply(reps,2,mean),
"std.err"= apply(reps,2,sd),
"HPD Lower"= reps[L,],
"HPD Upper"= reps[H,])
round(out.mat,3)
32Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
17. Specify a normal mixture model for the following data from Brooks,
Dellaportas & Roberts (1997). Summarize the posterior distributions
of the model parameters with tabulated quantiles: X = 2.3, 3.7, 4.1,
10.9, 11.6, 13.8, 20.1, 21.4, 22.3.
I Answer
The model for the data can be expressed:
X ∼ pN (µ1, 1) + (1− p)N (µ2, 1), µ1 < µ2 (3.1)
where the variances are assumed known and standardized for conve-
nience. There are three parameters to estimate here, (p, µ1, µ2), as well
as a vector, Z, that contains indicator functions for each of the data
values: zi = 1 if xi is currently determined to be from the component
with mean µ1, and zero otherwise. Now assign priors:
p ∼ U(0, 1)
µ1 ∼ N (5, 20)
µ2 ∼ N (20, 20)
zi ∼ random BR(p), i = 1:n. (3.2)
Using a joint normal likelihood function produced from (3.1) and priors
in (3.2), the full conditional posteriors distributions are:
p|Z ∼ BE(
1 +∑
zi, n+ 1−∑
zi
)µ1|X,Z ∼ N
(20X′Z + 5∑
zi,
20
20∑zi + 1
)µ2|X,Z ∼ N
(20X′(1− Z) + 20∑
(1− zi),
20
20∑
(1− zi) + 1
)zi|xi, p ∼ BR
(pfN (xi|µ = µ1)
pfN (xi|µ = µ1) + (1− p)fN (xi|µ = µ2)
)where 1 indicates a vector of 1s of length n, and fN () is normal PDF
with mean as indicated and variance equal to one. At each iteration the
zi values are estimated to be either zero or one.
The Normal and Student’s-t Models 33
Note that this answer deals with the bi-modal case. However, the tri-
modal case that the question asks to do could be easily extended by
adding one more “distribution” to the model in (3.1).
< R.Code >
sigmasq.nu <- 20; nu1 <- 5; nu2 <- 20
x <- c(2.3,3.7,4.1,10.9,11.6,13.8,20.1,21.4,22.3)
start <- matrix(c(0.3,7,20,rbinom(length(x),1,0.5)),
1,length(x)+3)
tol <- .Machine$double.eps
# Gibbs sampler for two-component mixture problem
gibbs.bdr <- function(theta.matrix,x,reps) for (i in 2:(reps+1))
P <- theta.matrix[(i-1), 1]
u1 <- theta.matrix[(i-1), 2]
u2 <- theta.matrix[(i-1), 3]
z <- theta.matrix[(i-1), 4:ncol(theta.matrix)]
P <- rbeta(1, 1+sum(z), length(x)+1-sum(z))
u1 <- 1000
while (u1 > u2) u1 <- rnorm(1, (sigmasq.nu*(x%*%z) +nu1)/
(sigmasq.nu*sum(z) + 1),
sigmasq.nu/(sigmasq.nu*sum(z) + 1))
u2 <- -1000
while (u1 > u2) u2 <- rnorm(1, (sigmasq.nu*(x%*%(1-z))+nu2)/
(sigmasq.nu*sum(1-z)+1),
sigmasq.nu/(sigmasq.nu*sum(1-z) + 1))
p <- P * exp(-0.5*(x-u1)^2) /
((P * exp(-0.5*(x-u1)^2)
+ (1-P)*exp(-0.5*(x-u2)^2))
z <- rbinom(length(p),1,p)
theta.matrix <- rbind(theta.matrix,c(P,u1,u2,z))
mix.out <- gibbs.bdr(start,x,500)
apply(mix.out,2,mean)
34Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
# Now use a random scan version, blocked for the z’s
gibbs.bdr.rs <- function(theta.matrix,x,reps) for (i in 2:(reps+1))
P <- theta.matrix[(i-1),1];
u1 <- theta.matrix[(i-1),2];
u2 <- theta.matrix[(i-1),3];
z <- theta.matrix[(i-1),4:ncol(theta.matrix)]
scan.order <- sample(4)
for (j in 1:length(scan.order)) if (scan.order[j] == 1)
P <- rbeta(1,1+sum(z),length(x)+1-sum(z))
if (scan.order[j] == 2)
u1 <- 1000
while (u1 > u2) u1 <- rnorm(1,
(sigmasq.nu*(x%*%z) + nu1) /
(sigmasq.nu*sum(z) +1),
sigmasq.nu/(sigmasq.nu*sum(z) + 1))
if (scan.order[j] == 3)
u2 <- -1000
while (u1 > u2) u2 <- rnorm(1,
(sigmasq.nu*(x%*%(1-z)) + nu2) /
(sigmasq.nu*sum(1-z)+1),
sigmasq.nu/(sigmasq.nu*sum(1-z)+1))
if (scan.order[j] == 4)
p <- P * exp(-0.5*(x-u1)^2) /
(P * exp(-0.5*(x-u1)^2) +
(1-P) * exp(-0.5*(x-u2)^2))
z <- rbinom(length(p),1,p)
if (i %% 100 == 0) print(i)
theta.matrix <- rbind(theta.matrix,c(P,u1,u2,z))
mix.out <- gibbs.bdr.rs(start,x,12000)
The Normal and Student’s-t Models 35
19. Use the following data on the race of convicted murders in the United
States in 1999 to specify a bivariate model. Specify reasonable priors and
produce posterior descriptions of the unknown parameters according to
Section 3.5. (Source: FBI, Crime in the United States 1999.)
Age Group White Black
9 to 12 7 11
13 to 16 218 247
17 to 19 672 976
20 to 24 987 1,285
25 to 29 619 660
30 to 34 493 429
35 to 39 444 303
40 to 44 334 228
45 to 49 236 134
50 to 54 153 73
55 to 59 89 60
60 to 64 55 24
65 to 69 47 17
70 to 74 23 10
75 and up 57 14
I Answer
First, assign priors as:
µ|Σ ∼ N 2
(m,
Σ
n0
), Σ−1 ∼ W(α,β)
where µ = (300, 300), β is a 2 × 2 identity matrix, α = 3 degree of
freedom, and n0 = 0.01 so that priors are weighted less.
So according to Equation (3.24), the posteriors are:
µ|Σ ∼ N k+2
(0.01m + nx
0.01 + n,
Σ
n0 + n
)Σ−1 ∼ W
(3 + n,
(β−1 + S2 +
n0n
n0 + n(x−m)(x−m)′
)−1)
Table 3.1 summarizes the posterior of µ and Σ. And, the definition of
R code here is stated in Exercise 3.8.
36Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
Table 3.1: Posterior of the bivariate model
mean std.err 95% lower HPD 95% Upper HPD
white 298.880 62.886 189.182 421.735
black 299.679 83.385 149.624 464.976
σ2white 79232.966 30992.859 43452.824 152118.011
σ2black 141160.434 54178.453 76347.078 278235.760
ρ 0.966 0.017 0.933 0.987
< R.Code >
library(bayesm)
library(MASS)
white <- c(7, 218, 672, 987, 619, 493, 444,
334, 236, 153, 89, 55, 47, 23, 57)
black <- c(11, 247, 976, 1285, 660, 429, 303,
228, 134, 73, 60, 24, 17, 10, 14)
X <- cbind(white, black)
multi.norm.post <- function(data.mat, alpha,beta, m,
n0, n.reps=1000) n <- nrow(data.mat)
xbar <- apply(data.mat, 2, mean)
S.sq <- (n-1)*var(data.mat)
Beta.inv <- solve(beta)
K <- (n0*n)/(n0+n)*(xbar-m)%*%t(xbar-m)
V.inv <- solve(Beta.inv + S.sq + K)
rep.mat <- NULL
for (i in 1:n.reps) Sigma <- rwishart(alpha+n,V.inv)$IW
mu <- mvrnorm(1, (n0*m + n*xbar)/(n0+n), Sigma/(n0+n))
out <- c(mu1=mu[1],mu2=mu[2],sig1=Sigma[1,1],
sig2=Sigma[2,2],rho=Sigma[2,1])
rep.mat <- rbind(rep.mat, out)
return(rep.mat)
The Normal and Student’s-t Models 37
normal.posterior.summary <- function(reps, alpha) reps[,5] <- reps[,5]/sqrt(reps[,3]*reps[,4])
n <- nrow(reps)
reps <- apply(reps, 2, sort)
bound <- (1-alpha)/2
L <- round(bound*n)
H <- round((1-bound)*n)
out.mat <- cbind("mean"= apply(reps,2,mean),
"std.err"= apply(reps,2,sd),
"HPD Lower"= reps[L,],
"HPD Upper"= reps[H,])
round(out.mat,3)
out <- multi.norm.post(X, 3, diag(2), m=c(200, 200),
n0=5, n.reps=1000)
normal.posterior.summary(out, 0.95)
4
The Bayesian Prior
Exercises
1. Show that the exponential PDF and the chi-square PDF is a special
case of the gamma PDF.
I Answer
The Gamma PDF has the form:
f(θ|α, β) =βα
Γ(α)θα−1 exp[−βθ], θ, α, β > 0,
First, by substituting α with 1, we get:
f(θ|1, β) =β1
Γ(1)θα−1 exp[−βθ] = β exp[−βθ],
which is an exponential PDF.
Second, by substituting α with ν, and β with 12 , we get:
f(θ|ν2,
1
2) =
( 12 )
ν2
Γ(ν2θν2−1 exp
[−(
1
2)θ
]=θν−22 exp[− θ2 ]
2ν2 Γ(ν2 )
,
which is an chi-square PDF. (Q.E.D.)
3. (Pearson 1920). An event has occurred p times in n trials. What is the
probability that it occurs r times in the next m trials?
I Answer
From the past experience, we know that Pr(Event) = pn . So, the prob-
ability that the even occurs r times from independent m trials would
39
40Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
follow the binomial PMF:(m
r
)( pn
)r (1− p
n
)m−r,
The Bayesian approach is a little different. I mainly follow the Beta-
Binomial model that was covered in Section 2.3.3.
Let’s define Xi as an event at ith trial. Xi gets 1 if the event occurs, and
0 if the event doesn’t occur. When Xi follows the Bernoulli distribution
with probability of Pr(Xi) = PR, the new random variable y =∑ni Xi
follows the Binomial distribution, BN (n, PR). By the way, we know
that y ≡ p from the question.
We can assign an uninformative flat prior on PR, such as BE(1, 1).
Then, the posterior for PR becomes π(PR|p) = BE(p + 1, n − p +
1). As a point estimate, we can take the mean ( p+1n+2 ), and use it for
new Pr(Event). Then, the probability of the event occurs r times from
independent m trials would be:(m
r
)(p+ 1
n+ 2
)r (1− p+ 1
n+ 2
)m−r,
which is slightly different from the frequentist solution above.
5. Suppose that you had a prior parameter with the restriction: [0 < η <
1]. If you believed that η had prior mean 0.4 and variance 0.1, and
wanted to specify a beta distribution, what prior parameters would you
assign?
I Answer
E(η) =α
α+ β= 0.4
V ar(η) =αβ
(α+ β)2(α+ β + 1)= 0.1,
which produces α = 1425 and β = 21
25 . Therefore, the prior distribution
would be BE( 1425 ,
2125 ).
The Bayesian Prior 41
7. Derive the Jeffrey’s prior for a normal likelihood model under three
circumstances: (1) σ2 = 1, (2) µ = 1, and (3) both µ and σ2 unknown
(nonindependent).
I Answer
(a) Suppose we have x ∼ N (µ, 1), where −∞ < µ < ∞. So the
likelihood is:
L(µ|x) ∝ exp
[−1
2
(∑x2i − 2nxµ+ nµ2
)]The loglikelihood is therefore:
`(µ|x) ∝ −1
2
(∑x2i − 2nxµ+ nµ2
)The first and the second derivative are:
∂
∂µ`(µ|x) = nx− µ,
∂2
∂µ2`(µ|x) = −n
So, the Jeffreys prior is:
pJ(µ) =
[−Ex|µ
(∂2
∂µ2`(µ|x)
)] 12
=[Ex|µ(n)
] 12
∝ 1 (constant)
(b) Suppose we have x ∼ N (1, σ2), where 0 < σ <∞. So the likelihood
is:
L(σ2|x) ∝ (σ2)−n2 exp
[−1
2σ−2
(∑(xi − 1)2
)]The loglikelihood is therefore:
`(σ2|x) ∝ −n2
log(σ2)− 1
2(σ2)−1
∑(xi − 1)2
The first and the second derivatives are:
∂
∂σ2`(σ2|x) = −n
2(σ2)−1 +
1
2(σ2)−2
∑(xi − 1)2,
∂2
∂(σ2)2`(σ2|x) =
n
2(σ2)−2 − (σ2)−3
∑(xi − 1)2︸ ︷︷ ︸=nσ2
=n
2(σ2)−2 − n(σ2)−2 = −n
2(σ2)−2
42Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
So the Jeffreys prior is:
pJ(σ2) =
[−Ex|σ2
(∂2
∂(σ2)2`(σ2|x)
)] 12
=[−Ex|σ2
(−n
2(σ2)−2
)] 12
∝ (σ2)−1
(c) Suppose we have x ∼ N (µ, σ2), where where 0 < σ2 < ∞,−∞ <
µ <∞. So the likelihood is:
L(µ, σ2|x) ∝ (σ2)−n2 exp
[−1
2(σ2)−1
∑(xi − µ)2
]The loglikelihood is therefore:
`(µ, σ2|x) = −n2
log(σ2)− 1
2(σ2)−1
∑(xi − µ)2
The first and the second derivatives are:
∂
∂µ`(µ, σ2|x) =
∂
∂µ
[−1
2(σ2)−1
∑(−2µxi + µ2)
]= (σ2)−1
∑xi − n(σ2)−1µ
∂
∂σ2`(µ, σ2|x) = −n
2(σ2)−1 +
1
2(σ2)−2
∑(−2µxi + µ2)
and
∂2
∂µ2`(µ, σ2|x) = −n(σ2)−1
∂2
µσ2`(µ, σ2|x) = −(σ2)−2
∑xi + n(σ2)−2µ
= (nµ−∑
xi)︸ ︷︷ ︸=0
(σ2)−2 = 0
∂2
∂(σ2)2`(µ, σ2|x) =
n
2(σ2)−2 − (σ2)−3
∑(xi − µ)2︸ ︷︷ ︸=nσ2
=n
2(σ2)−2 − n(σ2)−2 =
n
2(σ2)−2
So, the Jeffreys prior is:
pJ(µ, σ2) =
[−Ex|µ,σ2
(−n(σ2)−1 0
0 −n2 (σ2)−2
)] 12
=
(n(σ2)−
12 0
0 n2 (σ2)−1
),
The Bayesian Prior 43
which is:pJ(µ) ∝ 1 (constant)
pJ(σ2) ∝ (σ2)−1
Note that Jeffreys priors here are equivalent to those obtained from
(a) and (b).
10. (Robert 2001) Calculate the marginal posterior distributions for the
following setups:
• x|σ2 ∼ N (0, σ2), 1/σ2 ∼ G(1, 2).
• x|λ ∼ P(λ), λ ∼ G(2, 1).
• x|p ∼ NB(10, p), p ∼ BE(0.5, 0.5).
I Answer
(a)
π(σ2|x) ∝ f(x|σ2)π(σ2)
∝[(σ2)−
n2 exp
(− 1
2σ2
∑x2i
)][(σ2)−2 exp
(− 2
σ2
)]= (σ2)−(n2 +1)−1 exp
[− 1
σ2
(∑x2i
2+ 2
)]∼ IG
(n
2+ 1,
∑x2i
2+ 2
)(b)
π(λ|x) ∝ f(x|λ)π(λ)
∝[λ∑xi exp(−nλ)
][λ exp(−λ)]
= λ∑xi+1 exp [−(n+ 1)λ)]
∼ G(∑
xi + 2, n+ 1)
(c)
π(p|x) ∝ f(x|p)π(p)
=[p10n(1− p)
∑xi−10n
] [p−
12 (1− p)− 1
2
]= p10n− 1
2 (1− p)∑xi−10n− 1
2
∼ BE(
10n+1
2,∑
xi − 10n+1
2
)
44Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
12. Calculate the Jeffreys prior for the form in Exercise 5.5
I Answer
(a)
`(σ2|x) ∝ −n2
log(σ2)− 1
2(σ2)−1
∑x2i
∂`
∂σ2= −n
2(σ2)−1 +
1
2(σ2)−2
∑x2i
∂2`
∂(σ2)2=n
2(σ2)−2 − (σ2)−3
∑x2i
∴ pJ(σ2) =
−Ex|σ2
n2
(σ2)−2 − (σ2)−3∑
x2i︸ ︷︷ ︸
=nσ2
12
=[n
2(σ2)−2 − n(σ2)−2
] 12
= (n
2)
12 (σ2)−1
∝ (σ2)−1
(b)
`(λ|x) ∝(∑
xi
)log(λ)− nλ
∂`
∂λ=(∑
xi
)λ−1 − n
∂2`
∂λ2=(∑
xi
)λ−2
∴ pJ(λ) =
−Ex|λ(∑xi
)︸ ︷︷ ︸
=nλ
λ−2
12
= nλ−12
∝ λ− 12
(c)
`(p|x) ∝ 10n log(p) +(∑
xi − 10n)
log(1− p)
∂`
∂p= 10np−1 +
(∑xi − 10n
)(1− p)−1(−1)
∂2`
∂p2= −10np−2 +
(∑xi − 10n
)(1− p)−2(−1)
The Bayesian Prior 45
∴ pJ(p) =
−Ex|p−10np−2 − (
∑xi︸ ︷︷ ︸
=n10(1−p)
p
−10n)(1− p)−2
12
=
[10np−2 + 10n
(1− pp− 1
)(1− p)−2
] 12
=
[10n
1− p− p2
p2(1− p)2
]− 12
∝ (1− p− p2)12
p(1− p)
14. The Bayesian framework is easily adapted to problems in time-series.
One of the most simple time-series specifications is the AR(1), which
assumes that the previous period’s outcome is important in the current
period estimation.
Given and observed outcome variable vector, Yt measured at time t,
AR(1) model for T periods is:
Yt = Xtβ + εt
εt = ρεt−1 + ut, |ρ| < 1
ut ∼ iid N(0, σ2u). (4.1)
Here Xt is a matrix of explanatory variables at time t, εt and εt−1 are
residuals from period t and t− 1, respectively, ut is an additional zero-
mean error term for the autoregressive relationship, and β, ρ, σ2u are
unknown parameters to be estimated by the model. Backward substitu-
tion through time to arbitrary period s gives εt = ρ2εt−s+∑T−1j=0 ρ
jut−j ,
and since E[εt] = 0, then var[εt] = E[ε2t ] =σ2u
1−ρ2 , and the covariance be-
tween any two errors is: Cov[εt, εt−j ] =ρjσ2
u
1−ρ2 . Assuming asymptotic
normality, this setup leads to a general linear model with the following
46Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
tridiagonal T × T weight matrix (Amemiya 1985, 164):
Ω =1
σ2u
1 −ρ−ρ (1 + ρ2) −ρ 0
. . .
0 −ρ (1 + ρ2) −ρ−ρ 1
. (4.2)
Using the following data on worldwide fatalities from terrorism com-
pared to some other causes per 100,000 (source: Falkenrath 2001), de-
velop posteriors for β, ρ, and σu for each of the causes (Y ) as a separate
model using the date (minus 1983) as the explanatory variable (X).
Specify conjugate priors (Berger and Yang [1994] show some difficulties
with nonconjugate priors here), or use a truncated normal for the prior
on ρ (see Chib and Greenberg [1998]). Can you reach some conclusion
about the differences in these four processes?
Year Terrorism Car Accidents Suicide Murder
1983 0.116 14.900 12.100 8.3001984 0.005 15.390 12.420 7.9001985 0.016 15.150 12.380 8.0001986 0.005 15.920 12.870 8.6001987 0.003 19.106 12.710 8.3001988 0.078 19.218 12.440 8.4001989 0.006 18.429 12.250 8.7001990 0.004 18.800 11.500 9.4001991 0.003 17.300 11.400 9.8001992 0.001 16.100 11.100 9.3001993 0.002 16.300 11.300 9.5001994 0.002 16.300 11.200 9.0001995 0.005 16.500 11.200 8.2001996 0.009 16.500 10.800 7.400
I Answer
The joint posterior can be written as:
π(β, ρ, σ2u|X,y) ∝ f(y1, · · · , yT |β, ρ, σ2
u)π(β)π(ρ)π(σ2u)
= f(y1)f(y2|y1) · · · f(yT |yT−1)π(β)π(ρ)π(σ2u)
The Bayesian Prior 47
For f(yt|yt−1), we have:
yt − ρyt−1︸ ︷︷ ︸≡yt
= (xt − ρxt−1)︸ ︷︷ ︸≡xt
β + et − ρet−1︸ ︷︷ ︸≡ut
⇒ yt = xt′β + ut
∴ yt|β, ρ, σ2u ∼ N (xtβ, σ
2u)
For f(y1), we have:
∀t, E[et] = 0, Var[et] =σ2u
1− ρ2
⇒ E[y1] = E[x′1β + e1] = x′1β
Var[y1] = Var[e1] =σ2u
1− ρ2
∴ y1 ∼ N(x′1β,
σ2u
1− ρ2
)For priors, we have:
β ∼ N (βo,Bo) , ρ ∼ N (ρo,Φo) , σ2u ∼ IG(a, b)
Then, the joint posterior becomes:
π(β, ρ, σ2u|X,y) ∝
(1− ρ2
σ2u
) 12
exp
[−1− ρ2
σ2u
(y1 − x′1β)2
]×(
1
σ2u
) 12 (T−1)
exp
[− 1
σ2u
T∑i=2
(yt − xt′β)2
]
×(
1
Bo
) 12
exp
[− 1
2Bo(β − βo)
2
]× (σ2
u)a−1 exp[−bσ2
u
]×(
1
Φo
) 12
exp
[− 1
2Φo(ρ− φo)2
]It would be very demanding to mathematically derive the marginal pos-
terior distributions. But, using BUGS simplifies the computation. For the
introduction of BUGS (including WinBUGS and JAGS ), see the Appendix
C.
< Bugs.Code >
48Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
model z[1] ~ dnorm(xb1, v1)
z[1] <- y[1]
xb1 <- b * x[1]
v1 <- pow(smu,2) / (1 - pow(rho,2))
for (i in 2:n) z[i] ~ dnorm(xb2[i], v2)
z[i] <- y[i] - rho * y[i-1]
xb2[i] <- b * (x[i] - rho * x[i-1])
v2 <- pow(smu,2)
b ~ dnorm(0, 0.01)
rho ~ dnorm(0, 0.01)
smu ~ dgamma(0.01, 0.01)
16. Laplace (1774) wonders what the best choice is for a posterior point
estimate. He sets three conditions for the shape of the posterior: sym-
metry, asymptotes, and properness (integrating to one). In addition,
Laplace tacitly uses uniform priors. He proposes two possible criteria
for selecting the estimate:
• La primiere est l’instant tel qu’en egalement probable que le veritable
instant du phenomene tombe avant ou apres; on pourrait appeler
cet instant milieu de probabilite.
Meaning: use the median.
• Le seconde est l’instant tel qu’en le prenant pour milieu, la somme
des erreurs a craindre, multipliees par leur probabilite, soit un mini-
mum; on pourrait l’appeler milieu d’erreur ou milieu astronomique,
comm etant celui auquel les astronomes doivent s’arreter de preference.
Meaning: use the quantity at the “astronomical center of mass”
that minimizes: f(x) =∫|x− V |f(x)dx. In modern terms, this is
equivalent to minimizing the posterior expected loss: E[L(θ, d)|x] =∫ΘL(θ, d)π(θ|x)dθ, which is the average loss defined by the poste-
rior distribution and d is the “decision” to use the posterior esti-
mate of θ (see Berger [1985]).
The Bayesian Prior 49
Prove that these two criteria lead to the same point estimate.
I Answer
When L(θ, θ) = |θ − θ|, we have:
E[L(θ, θ)
]=
∫Θ
|θ − θ|π(θ|x)dθ
=
∫ θ
−∞(θ − θ)π(θ|x)dθ +
∫ ∞θ
(θ − θ)π(θ|x)dθ
To minimize E[L(θ, θ)
], we consider (F.O.C) w.r.t. θ:
∂E[L(θ, θ)
]∂θ
=
∫ θ
−∞
∂
∂θ
[(θ − θ)π(θ|x)
]dθ
+
∫ ∞θ
∂
∂θ
[(θ − θ)π(θ|x)
]dθ
=
∫ θ
−∞π(θ|x)dθ −
∫ ∞θ
π(θ|x)dθ ≡ 0
⇒∫ θ
−∞π(θ|x)dθ =
∫ ∞θ
π(θ|x)dθ
Therefore, by definition of median, θ is the median. (Q.E.D.)
18. Review one body of literature in your area of interest and develop a
detailed plan for creating elicited priors.
I Answer
One example would be Jeff Gill and Lee D. Walker, 2005, “Elicited
Priors for Bayesian Model Specification in Political Science Research,”
Journal of Politics 67(3): 841-872.
20. Test a Bayesian count model for the number of times that capital pun-
ishment is implemented on a state level in the United States for the
year 1997. Included in the data below (source: United States Census
Bureau, United States Department of Justice) are explanatory variables
for: median per capita income in dollars, the percent of the population
50Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
classified as living in poverty, the percent of Black citizens in the popu-
lation, the rate of violent crimes per 100,000 residents for the year before
(1996), a dummy variable to indicate whether the state is in the South,
and the proportion of the population with a college degree of some kind.
Exe- Median Percent Percent Violent Prop.State cutions Income Poverty Black Crime South Degrees
Texas 37 34453 16.7 12.2 644 1 0.16Virginia 9 41534 12.5 20.0 351 1 0.27Missouri 6 35802 10.6 11.2 591 0 0.21Arkansas 4 26954 18.4 16.1 524 1 0.16Alabama 3 31468 14.8 25.9 565 1 0.19Arizona 2 32552 18.8 3.5 632 0 0.25Illinois 2 40873 11.6 15.3 886 0 0.25S. Carolina 2 34861 13.1 30.1 997 1 0.21Colorado 1 42562 9.4 4.3 405 0 0.31Florida 1 31900 14.3 15.4 1051 1 0.24Indiana 1 37421 8.2 8.2 537 0 0.19Kentucky 1 33305 16.4 7.2 321 0 0.16Louisiana 1 32108 18.4 32.1 929 1 0.18Maryland 1 45844 9.3 27.4 931 0 0.29Nebraska 1 34743 10.0 4.0 435 0 0.24Oklahoma 1 29709 15.2 7.7 597 0 0.21Oregon 1 36777 11.7 1.8 463 0 0.25
EXE INC POV BLK CRI SOU DEG
In 1997, executions were carried out in 17 states with a national total
of 74. The model should be developed from the Poisson link function,
θ = log(µ), with the objective of finding the best β vector in:
g−1(θ)︸ ︷︷ ︸17×1
= exp [Xβ] = exp
[1β0 + INCβ1 + POVβ2
+ BLKβ3 + CRIβ4 + SOUβ5 + DEGβ6
].
Specify a suitable prior with assigned prior parameters, then summarize
the resulting posterior distribution.
I Answer
It would be mathematically demanding to derive marginal posterior
distribution for the Poisson model. But, using BUGS simplifies the
The Bayesian Prior 51
computation. For the introduction of BUGS (including WinBUGS and
JAGS ), see the Appendix C.
< Bugs.Code >
model for (i in 1:n)
EXE[i] ~ dpois(theta[i])
log(theta[i]) <- b0 + b1*INC[i] + b2*POV[i] +
b3*BLK[i] + b4*CRI[i] + b5*SOU[i] + b6*DEG[i]
b0 ~ dnorm(0,.0001)
b1 ~ dnorm(0,.0001)
b2 ~ dnorm(0,.0001)
b3 ~ dnorm(0,.0001)
b4 ~ dnorm(0,.0001)
b5 ~ dnorm(0,.0001)
b6 ~ dnorm(0,.0001)
Table 4.1: Summary Statistics of β’s
Mean S.E. 2.5% 25% 50% 75% 97.5%
b0 -0.2 0.3 -0.9 -0.4 -0.2 0.0 0.3
b1 2.7 0.5 1.7 2.4 2.7 3.0 3.8
b2 0.5 0.5 -0.6 0.1 0.5 0.8 1.6
b3 -1.9 0.5 -2.8 -2.2 -1.9 -1.5 -0.9
b4 0.0 0.4 -0.7 -0.2 0.0 0.3 0.8
b5 2.4 0.4 1.6 2.1 2.4 2.7 3.3
b6 -1.9 0.4 -2.7 -2.2 -1.9 -1.6 -1.1
5
The Bayesian Linear Model
Exercises
1. Derive the posterior marginal for β in (4.17) from the joint distribution
given by (4.16).
I Answer
From (4.16), we have:
π(β, σ2|X,y) ∝ σ−n−Aexp
[−
1
2σ2
(s+ (β + β)′(Σ−1 + X′X)(β + β)
)]
Now integrate with respect to σ2 to get the marginal for β:
π(β|X,y) ∝∫ ∞
0
σ−n−A exp
[−
1
2σ2
(s+ (β + β)′(Σ−1 + X′X)(β + β)
)]︸ ︷︷ ︸
inverse gamma PDF kernel
dσ2
We know that, ∫ ∞0
qp
Γ(p)x−(p+1) exp
[−qx
]︸ ︷︷ ︸
IG(x|p,q)
dx = 1
⇒∫ ∞
0
x−(p+1) exp
[−qx
]dx =
Γ(p)
qp
By setting p = n+a2 −1, q = 1
2
(s+ (β + β)′(Σ−1 + X′X)(β + β)
), and
x = σ2, we have:
π(β|X,y) ∝ q−p ∝[s+ (β + β)′(Σ−1 + X′X)(β + β)
]−(n+a2 )+1
,
which is (4.17). (Q.E.D.)
53
54Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
3. For uninformed priors, the joint posterior distribution of the regression
coefficients was shown to be multivariate-t (page 108), with covariance
matrix: R = (n−k)σ2(X′X)−1
n−k−2 . Under what conditions is this matrix
positive definite (a requirement for valid inferences here).
I Answer
R is positive definite.
⇐⇒ (X′X)−1 is positive definite and (n− k − 2) is positive.
(X′X)−1 is positive definite.
⇐⇒ X′X is positive definite.
⇐⇒ X is non-singular matrix (∵ X′X is symmetric).
⇐⇒ Columns of X are independent.
⇐⇒ rank(X) = k.
Note that, if rank(X) = k, then rank(X′X) = k, which guarantees that
the (k × k) matrix X′X is invertible.
Therefore, the condition is rank(X) = k and (n − k − 2) > 0. Since, k
is the number of explanatory variables, this condition also means that
there is no perfect multinolinearity in the data.
5. Under standard analysis of linear models, the hat matrix is given by
y = Hy where H is X(X′X)−1X′ where the diagonal values of this
matrix indicate leverage, which is obviously a function of X only. Can
the stipulation of strongly informed priors change data-point leverage?
Influence in linear model theory depends on both hat matrix diagonal
values and yi. Calculate the influence on the Technology variable of each
datapoint in the Palm Beach County model with uninformed priors by
jackknifing out these values one-at-a-time and re-running the analysis.
Which precinct has the greatest influence?
I Answer
< R.Code >
# data and setup
require(mvtnorm)
The Bayesian Linear Model 55
pbc.vote <- as.matrix(read.table("http://artsci.wustl.edu/
~stats/BMSBSA/Chapter06/pbc.dat"))
dimnames(pbc.vote)[[2]] <- c("badballots", "tech", "new",
"turnout", "rep", "whi")
X <- cbind(1, pbc.vote[,-1])
Y <- pbc.vote[,1]
# define posteriors
uninform.lm <- function(x, y) n <- nrow(x); k <- ncol(x)
H.inv <- solve(t(x)%*%x)
bhat <- H.inv%*%t(x)%*%y
s2 <- t(y- x%*%bhat)%*%(y- x%*%bhat)/(n-k)
R <- as.real((n-k)*s2/(n-k-2))*H.inv
alpha <- (n-k-1)/2
beta <- 0.5*s2*(n-k)
coef.draw <- rmvt(100000, sigma=R, df=n-k)
post.coef <- apply(coef.draw, 2, mean)
post.sd <- apply(coef.draw, 2, sd)
post.sigma <- mean(sqrt(1/rgamma(10000,alpha,beta)))
out <- list("coef"=post.coef, "se"=post.sd,
"sigma"=post.sigma)
# define jackknife procedure
jacknife <- function(X, Y, A=NULL, B=NULL) N <- nrow(X)
K <- nrow(X)
post.coef <- matrix(NA, N, K)
post.sigma <- rep(NA, N)
for (i in 1:N) x <- X[-i,]; y <- Y[-i]
n <- nrow(x); k <- ncol(x)
out <- uninform.lm(x, y)
post.coef[i,] <- out[["coef"]]
post.sigma[i] <- out[["sigma"]]
OUT <- list("coefs"=post.coef, "sigma"=post.sigma)
56Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
0 100 200 300 400 50021.2
21.4
21.6
21.8
22.0
Unit Index
Pos
terio
r M
ean
for
σσ
78
222
398
FIGURE 5.1: Plot of posterior σ using jackknifing
# make graph
post.sigma <- jacknife(X, Y)[["sigma"]]
par(mgp=c(2,0.25,0), mar=c(3,3,1,1), tcl=-0.2)
plot(1:516, post.sigma,
ylab=expression("Posterior Mean for "*sigma),
xlab="Unit Index")
true.sigma <- uninform.lm(X, Y)[["sigma"]]
abline(h=true.sigma, col=2, lty=2)
#identify(1:516, post.sigma, labels=1:516)
7. Meier and Keiser (1996) used a linear pooled model to examine the
impact on several federal laws on state-level child-support collection
policies. Calculate a Bayesian linear model and plot the posterior dis-
tribution of the parameter vector, β, as well as σ2, specifying an inverse
gamma prior with your selection of prior parameters, and the uninfor-
mative uniform prior: f(σ2) = 1σ . Use a diffuse normal prior for the β,
and identify outliers using the hat matrix method (try the R command
hat).
The Bayesian Linear Model 57
The data are collected for the 50 states over the period 1982 to 1991,
where the outcome variable, SCCOLL, is the change in child-support
collections. The explanatory variables are: chapters per population
(ACES), policy instability (INSTABIL), policy ambiguity (AAMBIG), the
change in agency staffing (CSTAFF), state divorce rate (ARD), organiza-
tional slack (ASLACK), and state-level expenditures (AEXPEND). These
data can be downloaded on the webpage for this book. For details, see
their original article or Meier and Gill (2000, Chapter 2).
SCCOLL ACES INSTABIL AAMBIG CSTAFF ARD ASLACK AEXPEND
1141.4 1.467351 70431.41 23.57 115.5502 26.2 3.470122 3440.1592667.0 1.754386 6407.35 32.10 39.5922 27.8 4.105816 8554.186307.3 0.421585 63411.55 35.14 106.0642 29.4 5.610657 2544.263840.7 0.000000 16396.95 23.46 50.5777 30.8 3.848957 2366.732482.5 1.184990 78906.07 31.95 -1.1440 20.3 3.700461 4954.157550.0 2.072846 28103.63 17.81 25.7648 22.8 2.742410 2976.427514.1 0.607718 22717.82 22.23 18.3592 14.9 5.063129 4905.780
1352.5 1.470588 6602.02 65.03 -232.2549 19.6 9.178106 5970.5201136.5 0.828500 123240.60 21.44 60.6078 30.8 2.495392 2567.2121575.4 1.660879 87061.77 31.26 86.8236 22.2 2.100618 2589.5311170.6 0.000000 13042.99 37.68 31.7781 19.1 4.491279 4141.2471679.6 0.962464 7197.37 21.72 79.9475 27.0 3.074239 3216.067882.3 1.212856 181247.60 16.97 45.6272 17.4 1.902308 2469.215
1347.5 1.782531 78822.32 11.84 33.7674 26.6 1.762038 1742.7451353.4 1.431127 11549.32 31.10 57.3484 16.6 3.434238 2659.6971363.4 2.404810 13084.74 18.11 72.4447 22.5 2.563138 3296.4651087.9 0.538648 26327.75 23.06 166.6439 21.9 2.649425 3330.793712.8 0.470367 27160.63 25.46 35.4368 15.6 4.030189 3416.769
1865.4 2.429150 9965.54 18.14 90.4288 20.9 3.209765 4109.0071088.4 1.234568 38821.03 40.84 24.8976 14.2 3.488844 5275.8241308.1 0.667111 76348.86 26.11 82.5835 12.7 3.641929 4926.7753485.4 1.174210 197802.90 25.89 80.5744 17.7 1.753225 5419.7841946.4 1.128159 35488.20 26.85 70.3593 14.9 4.764180 5208.2181116.8 0.385803 76166.11 16.39 225.7210 21.8 4.611794 2539.3692047.5 0.580495 54306.30 21.42 121.5814 22.0 2.799651 2754.369893.3 2.475247 7206.16 12.36 103.7574 22.9 3.882953 2083.009
1779.2 0.000000 26995.98 45.80 78.2787 17.6 3.511719 3982.897703.4 1.557632 3533.64 49.19 -32.1768 54.9 6.609568 4082.064695.1 2.714932 6912.47 43.87 32.3424 19.7 6.407141 2998.161
1287.0 0.257732 62029.57 36.51 17.6954 15.0 4.470681 6310.163573.0 0.000000 13153.38 28.29 48.0668 26.0 2.264973 2759.142955.6 0.332263 151116.20 32.31 50.3237 15.0 3.284161 5634.305
1296.1 1.187472 70689.08 23.75 31.7058 20.6 3.450963 2992.5311194.5 1.574803 5132.25 15.61 54.0354 15.4 2.777402 2679.4544299.2 3.565225 169930.40 22.62 136.6210 20.3 2.841627 3072.191896.4 1.259843 66218.98 23.09 35.2008 32.9 3.420866 2381.565815.4 4.106776 53441.92 37.42 -176.8579 23.8 6.527769 4303.525
2262.9 0.752445 121275.40 51.96 18.3963 14.8 2.932871 4130.6571108.8 0.996016 10880.08 38.61 22.0054 16.1 2.421065 3339.3501314.2 1.404494 27217.83 25.78 36.9890 17.4 1.432419 2533.1391268.9 4.207574 2172.79 17.05 23.4408 16.9 2.988671 2366.595952.8 0.807591 94001.98 41.90 33.3545 26.6 1.970768 2031.763763.2 0.403481 121752.90 51.39 53.8607 24.4 2.710884 1620.140
1228.6 0.000000 11480.06 21.33 73.4535 21.8 4.641821 5069.0721023.5 3.527337 3049.59 20.86 84.8460 18.9 4.317273 2983.7821646.3 1.272669 58245.25 19.83 105.2593 17.7 2.980112 3097.7822483.3 1.195695 80936.93 30.97 160.9784 24.7 6.063038 5996.7341045.3 0.555247 15810.82 15.11 78.1052 22.3 3.786695 2222.7283866.9 1.210898 70813.30 24.03 60.6050 15.2 3.335582 4972.4251450.2 0.000000 7217.25 10.33 163.1716 29.0 2.623226 1860.076
I Answer
Table 4.1 and Figure 4.2 summarizes the posteriors.
< R.Code >
58Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
constant ACES INSTABIL AAMBIG CSTAFF
Mean -95.5694 119.3877 0.0016 3.6618 1.9330
Std.Dev. 2.5e+02 5.9e+01 1.1e-03 5.2e+00 1.0e+00
ARD ASLACK AEXPEND σ2
Mean -0.9359 -53.4573 0.1242 453696
Std.Dev. 6.5e+00 4.4e+01 5.4e-02 105390
Table 5.1: Posteriors of Meier and Keiser (1996) model
ACES
0 200 400 600 800
0.00
00.
008
INSTABIL
0.000 0.005 0.010 0.015
020
0
AAMBIG
−40 −20 0 20 40
0.00
0.06
CSTAFF
0 5 10 15
0.0
0.3
ARD
−60 −40 −20 0 20 40
0.00
0.05
ASLACK
−400 −300 −200 −100 0 100 200
0.00
00.
008
AEXPEND
0.0 0.2 0.4 0.6 0.8 1.0
04
8
sigma^2
200000 600000 1000000 1400000
0e+
004e
−06
FIGURE 5.2: Posteriors of Meier and Keiser (1996) model
The Bayesian Linear Model 59
# data and setup
require(arm); require(mvtnorm)
mk <- as.matrix(read.table("http://artsci.wustl.edu/~stats/
BMSBSA/Chapter06/meier.keiser.dat", header=T))
y <- mk[,1]; x <- mk[,-1]
# priors
BBeta <- rep(0, 8); Sigma <- diag(c(5,5,5,5,5,5,5,5))
A <- 1; B <- 1
# posteriors
x <- cbind(1, x); n <- nrow(x); k <- ncol(x)
H.inv <- solve(t(x) %*% x)
bhat <- H.inv %*% t(x) %*% y
s2 <- t(y - x %*%bhat) %*% (y - x %*% bhat) / (n-k)
tB <- solve(solve(Sigma) + t(x) %*% x) %*%
(solve(Sigma) %*% BBeta + t(x) %*% x %*% bhat)
ts <- 2*B + s2*(n-k) + (t(BBeta) - t(tB)) %*%
solve(Sigma) %*% BBeta + t(bhat - tB) %*%
t(x) %*% x %*% bhat
R <- diag(ts / (n+A-k-3)) *
solve(solve(Sigma) + t(x) %*% x)
alpha <- (n-k-1)/2
beta <- s2*(n-k)/2
coef.draw <- mvrnorm(100000, tB, Sigma=R) /
sqrt(rchisq(100000, k-1))
sigma.draw <- 1/rgamma(100000, alpha, beta)
post.coef <- apply(coef.draw, 2, mean)
post.sd <- apply(coef.draw, 2, sd)
post.sigma <- c(mean(sigma.draw), sd(sigma.draw))
output <- list("coef.mean"=post.coef, "coef.sd"=post.sd,
"sigma.mean"=post.sigma[1],
"sigma.sd"=post.sigma[2])
output
# make graphs
par(mfrow=c(4,2))
for (i in 2:8) hist(coef.draw[,i], freq=FALSE, breaks=50,
main=dimnames(x)[[2]][i], xlab="", ylab="")
60Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
lines(density(coef.draw[,i]))
hist(sigma.draw, freq=FALSE, breaks=50,
main="sigma^2", xlab="", ylab="")
lines(density(sigma.draw))
9. Returning to the discussion of conjugate priors for the Bayesian linear
regression model starting on page 111, show that substituting (4.14) and
(4.15) into (4.13) produces (4.16).
I Answer
All we have to show is:
σ2(n− k) + (β − b)′X′X(β − b) + 2b+ (β −B)′Σ−1(β −B)
= s+ (β − β)′(Σ−1 + X′X)(β − β),
where β = (Σ−1 + X′X)−1(Σ−1B + X′Xb) and s = 2b + σ2(n − k) +
(B− β)Σ−1B + (b− β)′X′Xb.
(LHS) = σ2(n− k) + 2b︸ ︷︷ ︸use definition of s
+(β − b)′X′X(β − b)
+ (β −B)′Σ−1(β −B)
= s− (B− β)′Σ−1B− (b− β)′X′Xb
+ (β − b)′X′X(β − b) + (β −B)′Σ−1(β −B)
= s−B′Σ−1B + β′Σ−1B− b′X′Xb + β
′X′Xb
+ β′X′Xβ − 2β′X′Xb + b′X′Xb
+ β′Σ−1β − 2β′Σ−1B + B′Σ−1B
= s+ (β′− 2β′)Σ−1B + (β
′− 2β′)X′Xb
+ β′X′Xβ + β′Σ−1β
= s+ (β′− 2β′) (Σ−1B + X′Xb)︸ ︷︷ ︸
use definition of β
+β′X′Xβ + β′Σ−1β
The Bayesian Linear Model 61
= s+ (β′− 2β′)(Σ−1 + X′X)β + β′X′Xβ + β′Σ−1β
= s+ β′(Σ−1 + X′X)β − 2β′(Σ−1 + X′X)β + β′(Σ−1 + X′X)β
= s+ (β − β)′(Σ−1 + X′X)(β − β)
= (RHS) (Q.E.D.)
11. For the Bayesian linear regression model, prove that the posterior that
results from conjugate priors is asymptotically equivalent to the poste-
rior that results from uninformative priors. See Table 4.3.
I Answer
In the asymptotic world, data (i.e. likelihood) dominates prior, which
produces:
β → (X′X)−1(X′Xb) = b
s → σ2(n− k) + (b− b)′X′Xb = σ2(n− k)
Therefore, when n → ∞, the posteriors from either conjugate prior or
uninformative prior becomes:
β|X,y ∼ N(b, σ2(X′X)−1
)σ2|X,y ∼ IG(∞,∞)
(Q.E.D.)
13. Develop a Bayesian linear model for the following data that describe the
average weekly household spending on tobacco and alcohol (in pounds)
for the eleven regions of the United Kingdom (Moore and McCabe 1989,
originally from Family Expenditure Survey. Department of Employment,
1981, British Official Statistics). Specify both an informed conjugate
and uninformed prior using the level for alcoholic beverage as the out-
come variable and the level for tobacco products as the explanatory
variable. Do you notice a substantial difference in the resulting priors.
Describe.
62Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
Region Alcohol Tobacco
Northern Ireland 4.02 4.56
East Anglia 4.52 2.92
Southwest 4.79 2.71
East Midlands 4.89 3.34
Wales 5.27 3.53
West Midlands 5.63 3.47
Southeast 5.89 3.20
Scotland 6.08 4.51
Yorkshire 6.13 3.76
Northeast 6.19 3.77
North 6.47 4.03
I Answer
The two analyses produces Table 4.2. It turns out that the resulting
posteriors are pretty much different. Since we have only 11 observa-
tions the posteriors are dominated by the prior specification, not by the
likelihood (i.e. data).
From uninformed prior Mean SD 2.5% Quant Median 97.5% Quant
constant 4.351 1.822 0.230 4.351 8.472
tabacco 0.302 0.498 -0.824 0.302 1.427
σ2 0.963 0.281 0.589 0.906 1.681
From conjugate prior Mean SD 2.5% Quant Median 97.5% Quant
constant 1.671 1.505 -1.556 1.671 4.898
tabacco 1.022 0.419 0.122 1.022 1.921
σ2 0.595 0.082 0.460 0.586 0.777
Table 5.2: Posteriors of Alcohol vs. Tobacco model
< R.Code >
# data
x <- c(4.56, 2.92, 2.71, 3.34, 3.53, 3.47, 3.20, 4.51,
3.76, 3.77, 4.03)
X <- cbind(1, x)
y <- c(4.02, 4.52, 4.79, 4.89, 5.27, 5.63, 5.89, 6.08,
6.13, 6.19, 6.47)
The Bayesian Linear Model 63
# function to return a regression table with quantiles
t.ci.table <- function(coefs, cov.mat, level=0.95,
degrees=Inf, quantiles=c(0.025, 0.500, 0.975)) quantile.mat <- cbind(coefs, sqrt(diag(cov.mat)),
t(qt(quantiles, degrees) %o%
sqrt(diag(cov.mat))) +
matrix(rep(coefs, length(quantiles)),
ncol=length(quantiles)))
quantile.names <- c("Mean", "SD")
for (i in 1:length(quantiles)) quantile.names <- c(quantile.names,
paste(quantiles[i], "Quantile"))
colnames(quantile.mat) <- quantile.names
return(round(quantile.mat, 4))
# uninformed prior analysis
bhat <- solve(t(X) %*% X) %*% t(X) %*% y
s2 <- t(y - X %*% bhat) %*% (y - X %*% bhat) /
(nrow(X) - ncol(X))
R <- solve(t(X) %*% X) * ((nrow(X) - ncol(X)) *
s2 / (nrow(X) - ncol(X) - 2))[[1,1]]
uninformed.table <- t.ci.table(bhat, R,
degrees=nrow(X)-ncol(X))
alpha <- (nrow(X) - ncol(X) -1) / 2
beta <- s2 * (nrow(X) - ncol(X)) / 2
sort.inv.gamma.sample <- sort(1/rgamma(10000,
alpha, beta))
sqrt.sort.inv.gamma.sample <- sqrt(sort.inv.gamma.sample)
uninformed.table <- rbind(uninformed.table,
c(mean(sqrt.sort.inv.gamma.sample),
sqrt(var(sqrt.sort.inv.gamma.sample)),
sqrt.sort.inv.gamma.sample[250],
sqrt.sort.inv.gamma.sample[5000],
sqrt.sort.inv.gamma.sample[9750]))
# conjugate prior analysis
A <- 5; B <- 5
BBeta <- rep(0, 2); Sigma <- diag(c(2, 2))
64Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
tB <- solve(solve(Sigma) + t(X) %*% X) %*%
(solve(Sigma) %*% BBeta + t(X) %*% X %*% bhat)
ts <- 2*B + s2 * (nrow(X) - ncol(X)) +
(t(BBeta) - t(tB)) %*% solve(Sigma) %*% BBeta +
t(bhat - tB) %*% t(X) %*% X %*% bhat
R <- diag(ts / (nrow(X)+ A - ncol(X) - 3)) *
solve(solve(Sigma) + t(X) %*% X)
conjugate.table <- t.ci.table(tB, R,
degrees=nrow(X)+A-ncol(X))
a <- nrow(X) + A - ncol(X)
b <- s2 * (nrow(X) + A - ncol(X)) / 2
sort.iv.gamma.sample <- sort(1/rgamma(10000, a, b))
sqrt.sort.iv.gamma.sample <- sqrt(sort.iv.gamma.sample)
conjugate.table <- rbind(conjugate.table,
c(mean(sqrt.sort.inv.gamma.sample),
sqrt(var(sqrt.sort.inv.gamma.sample)),
sqrt.sort.inv.gamma.sample[250],
sqrt.sort.inv.gamma.sample[5000],
sqrt.sort.inv.gamma.sample[9750]))
# compare two analyses
uninformed.table
conjugate.table
15. The standard econometric approach to testing parameter restrictions in
the linear model is compare H0 :Rβ − q = 0 with H1 :Rβ − q 6= 0 (eg.
Greene [2003, 95]), where R is a matrix of linear restrictions and q is a
vector of values (typically zeros). Thus for example R =[
1 0 0 00 1 1 00 0 0 1
]and
q = [0, 1, 2]′ indicate β1 = 0, β2+β3 = 1, and β4 = 2. Derive expressions
for the posteriors π(β|X,y) and π(σ|X,y) using both conjugate and
uninformative priors.
I Answer
If we use conjugate priors,
β|σ2 ∼ N (B, σ2)
σ2 ∼ IG(a, b),
The Bayesian Linear Model 65
then we have posteriors,
β|X ∼ t(n+ a− k)
σ2|X ∼ IG(n+ a− k, 1
2σ2(n+ a− k)
).
And, if we use uniform priors (as an example of uninformative priors),
β ∝ c over [−∞,∞]
σ2 =1
σover [0,∞],
then we have posteriors,
β|X ∼ t(n− k)
σ2|X ∼ IG(
1
2(n− k − 1),
1
2σ2(n− k)
).
The next step is to define the hypotheses:
H0 : Rβ − q = 0
H1 : Rβ − q 6= 0,
where k is the number of parameters.
Consider the gth restriction from the gth row of R and the gth value of
q. Then, what we would be interested in are:
fg(β) =
k∑j=1
Rgjπ(βj |X)− qg,
which is a non-central mixture of t-distributions because R ∈ 0, 1.And, f(σ2) is unchanged by restrictions in test.
17. Using the Palm Beach County electoral data (page 109), calculate the
posterior predictive distribution of the data using a skeptical conjugate
prior (i.e. centered at zero for hypothesized effects and having large
variance).
I Answer
< R.Code >
66Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
# data and setup
require(arm); require(mvtnorm)
pbc.vote <- as.matrix(read.table("http://artsci.wustl.edu/
~stats/BMSBSA/Chapter06/pbc.dat"))
dimnames(pbc.vote)[[2]] <- c("badballots", "teach", "new",
"turnout", "rep", "whi")
x <- cbind(1, pbc.vote[,-1])
y <- pbc.vote[,1]
n <- nrow(x); k <- ncol(x)
xbar <- x
# posterior predictive distribution
bhat <- solve(t(x)%*%x) %*% t(x)%*%y
s2 <- t(y - x%*%bhat) %*% (y - x%*%bhat) / (n-k)
m <- t(x)%*%x + t(xbar)%*%x
h <- (diag(n) - xbar %*% solve(m) %*% t(xbar)) /
as.numeric(s2)
newb <- xbar %*% bhat
newR <- (n-k) * solve(h) / (n-k-2)
pred <- mvrnorm(10000, newb, Sigma=newR) /
sqrt(rchisq(10000, k-1))
p <- nrow(xbar)
pred.mean <- rep(0, p)
pred.sd <- rep(0, p)
for (i in 1:p) pred.mean[i] <- mean(pred[,i])
pred.sd[i] <- sd(pred[,1])
# actual data vs. predictive distribution
pred.table <- cbind(y, pred.mean, pred.sd)
pred.table
19. Rerun the Ancient China Conflict Model using the code in this chap-
ter’s Computational Addendum using informed prior specifications
of your choice (you should be willing to defend these decisions though).
Calculate the posterior mean for each marginal distribution of the pa-
rameters and create a predicted data values as customarily done in linear
models analysis. Graph the yi against yi and make a statement about
The Bayesian Linear Model 67
Mean SD 2.5% Quant Median 97.5% Quant
CONSTANT 15.54 2.32 10.927 15.54 20.16
EXTENT 5.39 1.12 3.168 5.39 7.61
DIVERSE 0.37 0.59 -0.802 0.37 1.55
ALLIANCE -1.22 0.76 -2.733 -1.22 0.28
DYADS -3.41 0.75 -4.900 -3.41 -1.91
TEMPOR 0.74 0.32 0.096 0.74 1.38
DURATION 0.15 0.46 -0.767 0.15 1.07
σ2 0.96 0.28 0.586 0.90 1.65
Table 5.3: Posteriors of Ancient China Conflict Model
the quality of fit for your model. Can you improve this fit by dramati-
cally changing the prior specification?
I Answer
The prior for β isN ([0, 0, 0, 0, 0, 0, 0]′, 2I) and the prior for σ2 is IG(2, 1).
Then, the linear model produces the posteriors as in Table 4.3. Figure
4.3 show that the model choice (i.e. linear model with conjugate priors)
does not perform very well. Furthermore, since the data dominates the
posteriors, the change in the prior specification improve the model fit.
< R.Code >
# data and setup
# data downloaded from http://dvn.iq.harvard.edu/dvn/dv/mra
require(foreign)
wars <- read.dta("data_16049.dta")
attach(wars)
X <- cbind(1, V3, V5, V6, V7, V12, V2-V1+1)
colnames(X) <- cbind("CONSTANT", "EXTENT", "DIVERSE",
"ALLIANCE", "DYADS", "TEMPOR",
"DURATION")
y <- 10*V8 + V9
missing <- c(4, 56, 78)
X <- X[-missing,]
y <- y[-missing]
n <- nrow(X); k <- ncol(X)
# function to return a regression table with quantiles
68Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
20 22 24 26 28 30
2224
2628
3032
34
observed y value
pred
icte
d y
valu
e
FIGURE 5.3: Model fit for Ancient China Conflict Model
t.ci.table <- function(coefs, cov.mat, level=0.95,
degrees=Inf, quantiles=c(0.025, 0.500, 0.975)) quantile.mat <- cbind(coefs, sqrt(diag(cov.mat)),
t(qt(quantiles, degrees) %o%
sqrt(diag(cov.mat))) +
matrix(rep(coefs, length(quantiles)),
ncol=length(quantiles)))
quantile.names <- c("Mean", "SD")
for (i in 1:length(quantiles)) quantile.names <- c(quantile.names,
paste(quantiles[i], "Quantile"))
colnames(quantile.mat) <- quantile.names
return(round(quantile.mat, 4))
# define priors
The Bayesian Linear Model 69
A <- 2; B <- 1
BBeta <- rep(0, 7)
Sigma <- diag(c(2,2,2,2,2,2,2))
# calculate posteriors
bhat <- solve(t(X)%*%X) %*% t(X)%*%y
s2 <- t(y - X%*%bhat) %*% (y - X%*%bhat) / (n-k)
tB <- solve(solve(Sigma) + t(X)%*%X) %*%
(solve(Sigma) %*% BBeta + t(X)%*%X%*%bhat)
ts <- 2*B + s2*(n-k) + (t(BBeta) - t(tB)) %*%
solve(Sigma) %*% BBeta + t(bhat - tB) %*%
t(X)%*%X%*%bhat
R <- (ts / (n+A-k))[1,1] *
solve(solve(Sigma) + t(X)%*%X)
conjugate.table <- t.ci.table(tB, R, degrees=n+A-k)
a <- n+A-k
b <- s2*(n+A-k)/2
sort.iv.gamma.sample <- sort(1/rgamma(10000,a,b))
sqrt.sort.iv.gamma.sample <- sqrt(sort.iv.gamma.sample)
conjugate.table <- rbind(conjugate.table,
c(mean(sqrt.sort.inv.gamma.sample),
sqrt(var(sqrt.sort.inv.gamma.sample)),
sqrt.sort.inv.gamma.sample[250],
sqrt.sort.inv.gamma.sample[5000],
sqrt.sort.inv.gamma.sample[9750]))
conjugate.table
# create predicted data vlues and make a graph
beta <- conjugate.table[,1]
beta <- beta[-8]
yhat <- X %*% beta
plot(yhat ~ y, xlab="observed y value",
ylab="predicted y value")
abline(0,1)
6
Assessing Model Quality
Exercises
1. Derive the marginal posterior for β in
π(β|X,y) ∝[s+ (β − β)′(Σ−1 + X′X)(β − β)
]−(n+a2 )+1
,
where:
β = (Σ−1 + X′X)−1(Σ−1B + X′Xb)
s = 2b+ σ2(n− k) + (B− β)′Σ−1B + (b− β)′X′Xb.
from the joint distribution given by
π(β, σ2|X,y) ∝ σ−n−a exp
[−
1
2σ2
(s+ (β − β)′(Σ−1 + X′X)(β − β)
)].
See Chapter 4 (page 112) for terminology details.
I Answer
Let’s integrate the joint distribution with respect to σ2 to get the marginal
for β:
π(β|X,y) ∝∫ ∞
0
σ−n−A exp
[−
1
2σ2
(s+ (β + β)′(Σ−1 + X′X)(β + β)
)]︸ ︷︷ ︸
inverse gamma PDF kernel
dσ2
We know that, ∫ ∞0
qp
Γ(p)x−(p+1) exp
[−qx
]︸ ︷︷ ︸
IG(x|p,q)
dx = 1
⇒∫ ∞
0
x−(p+1) exp
[−qx
]dx =
Γ(p)
qp
71
72Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
By setting p = n+a2 −1, q = 1
2
(s+ (β + β)′(Σ−1 + X′X)(β + β)
), and
x = σ2, we have:
π(β|X,y) ∝ q−p ∝[s+ (β + β)′(Σ−1 + X′X)(β + β)
]−(n+a2 )+1
,
which is the marginal posterior for β above. (Q.E.D.)
3. Derive the form of the posterior predictive distribution for the beta-
binomial model. See Section 6.4.
I Answer
From Equation (2.15) we know that,
p(θ|y) =Γ(n+A+B)
Γ(y +A)Γ(n− y +B)θ(y+A)−1(1− θ)(n−y+B)−1
Thus the posterior predictive distribution for new data y is:
p(y|y) =
∫ 1
0
p(y|θ)p(θ|y)dθ
=
∫ 1
0
[(n
y
)θy(1− θ)n−y
× Γ(n+A+B)
Γ(y +A)Γ(n− y +B)θ(y+A)−1(1− θ)(n−y+B)−1
]dθ
=Γ(n+A+B)
Γ(y +A)Γ(n− y +B)
Γ(n+ 1)
Γ(n− y + 1)Γ(y + 1)
×∫ 1
0
θ(y+y+A)−1(1− θ)(n−y+n−y+B)−1dθ
=Γ(n+A+B)
Γ(y +A)Γ(n− y +B)
Γ(n+ 1)
Γ(n− y + 1)Γ(y + 1)
× Γ(y + y +A)Γ(n− y +B + n− y)
Γ(A+B + n+ n)
To calculate mean and variance of y, I denote A′ = y+A, B′ = n−y+B.
Therefore, we have:
Assessing Model Quality 73
E(y|y) = E[E(y|θ)|y] = E(nθ|y)
= nA′
A′ +B′
and
V ar(y|y) = E[V ar(y|θ)|y] + V ar[E(y|θ)|y]
= E[[nθ(1− θ)]|y] + V ar(nθ|y)
= nE(θ|y)− nE(θ2|y) + V ar(nθ|y)
= nE(θ|y)− n[V ar(θ|y) + (E(θ|y))2] + n2V ar(θ|y)
= n
(A′
A′ +B′
)− n
(A′
A′ +B′
)2
+ n(n− 1)A′B′
(A′ +B′)2(A′ +B′ + 1)
=nA′B′(A′ +B′ + n)
(A′ +B′)2(A′ +B′ + 1)
5. Calculate the Kullback-Leibler distance between two gamma distribu-
tions: f(X|α, β), g(X|γ, δ).
I Answer
I(f, g) =
∫ ∞0
log
[f(x)
g(x)
]· f(x)dx
=
∫ ∞0
[log
(βα
δγΓ(γ)
Γ(α)xα−γ exp [−xβ + xδ]
)× βα
Γ(α)xα−1 exp[−xβ]
]dx
=
∫ ∞0
[(log
βαΓ(γ)
δγΓ(α)+ (α− γ) log x− (β − δ)x
)× βα
Γ(α)xα−1 exp[−xβ]
]dx
74Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
= logβαΓ(γ)
δγΓ(α)
∫ ∞0
βα
Γ(α)xα−1 exp[−xβ]dx︸ ︷︷ ︸
=1
+ (α− γ)
∫ ∞0
(log x) · βα
Γ(α)xα−1 exp[−xβ]dx︸ ︷︷ ︸
=ψ(α)
− (β − γ)
∫ ∞0
x · βα
Γ(α)xα−1 exp[−xβ]dx︸ ︷︷ ︸=αβ
= logβαΓ(γ)
δγΓ(α)+ (α− γ)ψ(α)− (β − γ)
α
β,
where ψ(α) is the digamma function.
7. Calculate the Polasek (6.15) linear model diagnostic for each of the
data points in your model from Exercise 4.4. Graphically summarize
identifying points of concern.
I Answer
From Section 6.3.2.1, the Polasek linear model diagnostic can be calcu-
lated by:∂b(wi)
∂wi= (X′X)−1Xi
[yi −Xiβ
],
where β is the posterior mean for β.
< R.Code >
# data and setup
mk <- as.matrix(read.table("http://artsci.wustl.edu/~stats/
BMSBSA/Chapter06/meier.keiser.dat", header=T))
y <- mk[,1]; x <- mk[,-1]
# priors
BBeta <- rep(0, 8); Sigma <- diag(c(5,5,5,5,5,5,5,5))
A <- 1; B <- 1
# posteriors
x <- cbind(1, x); n <- nrow(x); k <- ncol(x)
Assessing Model Quality 75
H.inv <- solve(t(x) %*% x)
bhat <- H.inv %*% t(x) %*% y
s2 <- t(y - x %*%bhat) %*% (y - x %*% bhat) / (n-k)
tB <- solve(solve(Sigma) + t(x) %*% x) %*%
(solve(Sigma) %*% BBeta + t(x) %*% x %*% bhat)
ts <- 2*B + s2*(n-k) + (t(BBeta) - t(tB)) %*%
solve(Sigma) %*% BBeta + t(bhat - tB) %*%
t(x) %*% x %*% bhat
R <- diag(ts / (n+A-k-3)) *
solve(solve(Sigma) + t(x) %*% x)
0 20 40
−30
0−
200
−10
00
100
intercept
0 20 40
−20
020
4060
80
ACES
0 20 40
−0.
0010
0.00
000.
0010
INSTABIL
0 20 40
−2
−1
01
23
AAMBIG
0 20 40
−1.
0−
0.5
0.0
0.5
1.0
CSTAFF
0 20 40
−3
−2
−1
01
2
ARD
0 20 40
−20
−10
010
2030
40
ASLACK
0 20 40
−0.
02−
0.01
0.00
0.01
0.02
0.03
AEXPEND
FIGURE 6.1: Polasek Diagnostic for Meier and Keiser (1996) model
76Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
# Polasek diagnostic
e <- as.vector(y - x %*% tB)
polasek <- NULL
for (i in 1:n) pol <- solve(t(x)%*%x) %*% x[i,] %*% e[i]
polasek <- cbind(polasek, pol)
colnames(polasek) <- obs <- c(1:50)
rownames(polasek) <- indv <- c("intercept", "ACES",
"INSTABIL", "AAMBIG", "CSTAFF",
"ARD", "ASLACK", "AEXPEND")
# graphs
par(mfrow=c(2,4))
for (i in 1:k) plot(polasek[i,] ~ obs, main=indv[i], xlab="", ylab="")
9. In Section 6.2.2, it was shown how to vary the prior variance 50% by ma-
nipulating the beta parameters individually. Perform this same analysis
with the two parameters in the Poisson-gamma conjugate specification.
Produce a graph like Figure 6.4.
I Answer
We know that the prior λ ∼ G(α, β) and the likelihood p(y|λ) ∼ P(λ)
produces the posterior λ|y ∼ G(α+ ny, β + n). By using the data from
Exercise 2.3, we can make Figure 6.2 below.
< R.Code >
A <- c(10, 5, 15)
B <- c(2, 2.828, 1.633)
nx <- 2701; n <- 25
d1 <- A + nx; d2 <- B + n
ruler <- seq(90, 110, length=100)
par(mfrow=c(1,2))
plot(c(90, 110), c(0, 0.25), type="n",
xlab="", ylab="", main="Varying alpha: [5, 10, 15]")
Assessing Model Quality 77
90 95 100 105 110
0.00
0.05
0.10
0.15
0.20
0.25
Varying alpha: [5, 10, 15]
90 95 100 105 110
0.00
0.05
0.10
0.15
0.20
0.25
Varying beta: [2.828, 2, 1.633]
FIGURE 6.2: Exponential Model Sensitivity
for (i in 1:3) lines(dgamma(ruler, d1[i], d2[1]) ~ ruler, col=i)
plot(c(90, 110), c(0, 0.25), type="n",
xlab="", ylab="",
main="Varying beta: [2.828, 2, 1.633]")
for (i in 1:3) lines(dgamma(ruler, d1[1], d2[i]) ~ ruler, col=i)
11. The following data are annual firearm-related deaths in the United
States per 100,000 from 1980 to 1997, by row (source: National Center
for Health Statistics, Health and Aging Chartbook, August 24, 2001).
78Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
14.9 14.8 14.3 13.3 13.3 13.3 13.9 13.6 13.9
14.1 14.9 15.2 14.8 15.4 14.8 13.7 12.8 12.1
Specify a normal model for these data with a diffuse normal prior for
µ and a diffuse inverse gamma prior for σ2 (see Section 3.5 for the
more complex multivariate specification that can be simplified). In order
to obtain local robustness of this specification, calculate ||D(q)|| (see
page 213) for the alternative priors: q(µ) = N (14, 1), q(µ) = N (12, 3),
q(σ−2) = G(1, 1), q(σ−2) = G(2, 4).
I Answer
I specify the diffuse priors as µ ∼ N (14, 100) and σ−2 ∼ G(1, 0.1).
The resulting posterior distributions are shown in Figure 6.3. And the
robustness measure ||D(q)|| is:
Alternative Prior for µ ||D(q)||N (14, 1) 0.0667
N (12, 3) 0.1030
Alternative Prior for σ−2 ||D(q)||G(1, 1) 0.2797
G(2, 4) 0.8352
Table 6.1: Robustness measure ||D(q)||
Observe that the posteriors for µ look very similar in Figure 6.3, reflect-
ing the low level of robustness measure ||D(q)|| in Table 6.1. On the
other hand, the posteriors for σ−2 do not look very similar in Figure 3,
which is consistent with the high level of robustness measure ||D(q)|| inTable 6.1.
< R.Code >
# setup
require(runjags)
y <- c(14.9, 14.8, 14.3, 13.3, 13.3, 13.3, 13.9, 13.6,
Assessing Model Quality 79
Posteriors for mu w/ Different Priors
12 13 14 15 16
0.0
0.5
1.0
1.5
2.0
−−− N(14, 100)−−− N(14, 1)−−− N(12, 3)
Posteriors for sigma^(−2) w/ Different Priors
0 1 2 3 4
0.0
0.5
1.0
1.5
−−− G(1, 0.1)
−−− G(1, 1)
−−− G(2, 4)
FIGURE 6.3: Posteriors for µ and σ−2 w/ different priors
13.9, 14.1, 14.9, 15.2, 14.8, 15.4, 14.8, 13.7,
12.8, 12.1)
n <- length(y)
data <- dump.format(list(y=y, n=n))
inits <- dump.format(list(mu=1, sigma=1))
# diffuse priors
diffuse.model <- "model for (i in 1:n) y[i] ~ dnorm(mu, sigma)
mu ~ dnorm(14, 0.01)
sigma ~ dgamma(1, 0.1)
"diffuse <- run.jags(model=diffuse.model, data=data,
monitor=c("mu", "sigma"),
inits=inits, n.chains=1, burnin=1000,
sample=5000, plots=TRUE)
mu.diffuse <- as.vector(diffuse$mcmc[,1][[1]])
sigma.diffuse <- as.vector(diffuse$mcmc[,2][[1]])
# alternative priors
normal1.model <- "model for (i in 1:n)
80Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
y[i] ~ dnorm(mu, sigma)
mu ~ dnorm(14, 1)
sigma ~ dgamma(1, 0.1)
"normal2.model <- "model for (i in 1:n) y[i] ~ dnorm(mu, sigma)
mu ~ dnorm(12, 3)
sigma ~ dgamma(1, 0.1)
"gamma1.model <- "model for (i in 1:n) y[i] ~ dnorm(mu, sigma)
mu ~ dnorm(14, 0.01)
sigma ~ dgamma(1, 1)
"gamma2.model <- "model for (i in 1:n) y[i] ~ dnorm(mu, sigma)
mu ~ dnorm(14, 0.01)
sigma ~ dgamma(2, 4)
"normal1 <- run.jags(model=normal1.model, data=data,
monitor=c("mu", "sigma"),
inits=inits, n.chains=1, burnin=1000,
sample=5000, plots=TRUE)
normal2 <- run.jags(model=normal2.model, data=data,
monitor=c("mu", "sigma"),
inits=inits, n.chains=1, burnin=1000,
sample=5000, plots=TRUE)
gamma1 <- run.jags(model=gamma1.model, data=data,
monitor=c("mu", "sigma"),
inits=inits, n.chains=1, burnin=1000,
sample=5000, plots=TRUE)
gamma2 <- run.jags(model=gamma2.model, data=data,
monitor=c("mu", "sigma"),
Assessing Model Quality 81
inits=inits, n.chains=1, burnin=1000,
sample=5000, plots=TRUE)
mu.normal1 <- as.vector(normal1$mcmc[,1][[1]])
mu.normal2 <- as.vector(normal2$mcmc[,1][[1]])
sigma.gamma1 <- as.vector(gamma1$mcmc[,2][[1]])
sigma.gamma2 <- as.vector(gamma2$mcmc[,2][[1]])
# graph
p.mu.diffuse <- hist(mu.diffuse, freq=F,
breaks=seq(12, 16, length=30),
main="Posteriors for mu w/ Different Priors",
xlab="", ylab="", ylim=c(0, 2))
lines(density(mu.diffuse))
lines(density(mu.normal1), col=2)
lines(density(mu.normal2), col=3)
text(15.26, 1.5, "--- N(14, 100)")
text(15.20, 1.4, "--- N(14, 1)", col=2)
text(15.20, 1.3, "--- N(12, 3)", col=3)
p.sigma.diffuse <- hist(sigma.diffuse, freq=F,
breaks=seq(0, 4.5, length=30),
main="Posteriors for sigma^(-2) w/ Different Priors",
xlab="", ylab="", ylim=c(0, 1.5))
lines(density(sigma.diffuse))
lines(density(sigma.gamma1), col=2)
lines(density(sigma.gamma2), col=3)
text(3.05, 1, "--- G(1, 0.1)")
text(3, 0.9, "--- G(1, 1)", col=2)
text(3, 0.8, "--- G(2, 4)", col=3)
# calculate local robustness
p.mu.normal1 <- hist(mu.normal1, freq=F,
breaks=seq(12, 16, length=30))
p.mu.normal2 <- hist(mu.normal2, freq=F,
breaks=seq(12, 16, length=30))
robust.mu1 <- max(p.mu.normal1$density -
p.mu.diffuse$density)
robust.mu2 <- max(p.mu.normal2$density -
p.mu.diffuse$density)
p.sigma.gamma1 <- hist(sigma.gamma1, freq=F,
breaks=seq(0, 4.5, length=30))
82Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
p.sigma.gamma2 <- hist(sigma.gamma2, freq=F,
breaks=seq(0, 4.5, length=30))
robust.sigma1 <- max(p.sigma.gamma1$density -
p.sigma.diffuse$density)
robust.sigma2 <- max(p.sigma.gamma2$density -
p.sigma.diffuse$density)
robust <- rbind(robust.mu1, robust.mu2,
robust.sigma1, robust.sigma2)
robust
13. Calculate the posterior predictive distribution of X in Exercise 6 as-
suming that σ20 = 0.75 and using the prior: p(µ) = N (15, 1). Perform
the same analysis using the prior: C(15, 0.675). Describe the difference
graphically and with quantiles.
I Answer
From (6.22), we know that the posterior predictive distribution of X
has “posterior mean of µ” for its mean, and “σ20 + posterior variance of
µ” for its variance. So, it suffices to find the posterior distributions for
µ in order to describe the posterior predictive distribution of X.
mean variance
from Normal prior 14.0965 0.7522
from Cauchy prior 14.1193 0.7523
Table 6.2: Posterior predictive distributions from two priors
By using two different prior specifications, we have two sets of infor-
mation on the posterior predictive distribution, which is shown in Table
6.2. Figure 6.4. is demonstrating two posterior distributions for µ (NOT
posterior predictive distributions of X), which eventually contribute to
the shape of the posterior predictive distributions of X.
As is obvious from both Table 6.2 and Figure 6.4, the two posterior
predictive distributions are very simular.
Assessing Model Quality 83
Den
sity
13.0 13.5 14.0 14.5 15.0
0.0
0.5
1.0
1.5
2.0
−− N(15, 1)
−− Cauchy(15, 0.675)
FIGURE 6.4: Posterior distributions of mu from two priors
< R.Code >
# setup
require(runjags)
y <- c(14.9, 14.8, 14.3, 13.3, 13.3, 13.3, 13.9,
13.6, 13.9, 14.1, 14.9, 15.2, 14.8, 15.4,
14.8, 13.7, 12.8, 12.1)
n <- length(y)
data <- dump.format(list(y=y, n=n))
inits.normal <- dump.format(list(mu=1))
inits.cauchy <- dump.format(list(top=1, bottom=1))
# normal prior
normal.model <- "model for (i in 1:n) y[i] ~ dnorm(mu, pow(0.75, -1))
84Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
mu ~ dnorm(15, 1)
"normal <- run.jags(model=normal.model, data=data,
monitor=c("mu"), inits=inits.normal,
n.chains=1, burnin=1000,
sample=5000, plots=TRUE)
mu.normal <- as.vector(normal$mcmc[[1]])
# cauchy prior
cauchy.model <- "model for (i in 1:n) y[i] ~ dnorm(mu, pow(0.75, -1))
mu <- 15 + 0.675 * cauchy # cauchy(a,b) = a+b*cauchy(0,1)
cauchy <- top / bottom # N(0,1) / N(0,1) ~ cauchy(0,1)
top ~ dnorm(0,1)
bottom ~ dnorm(0,1)
"cauchy <- run.jags(model=cauchy.model, data=data,
monitor=c("mu"), inits=inits.cauchy,
n.chains=1, burnin=1000,
sample=5000, plots=TRUE)
mu.cauchy <- as.vector(cauchy$mcmc[[1]])
# compare
compare <- rbind(cbind(mean(mu.normal), var(mu.normal)),
cbind(mean(mu.cauchy), var(mu.cauchy)))
compare <- cbind(compare, 0.75 + compare[,2]/18)
colnames(compare) <- c("mean", "var of mu", "var")
rownames(compare) <- c("normal", "cauchy")
compare
p.mu.normal <- hist(mu.normal, freq=F,
breaks=seq(13, 15, length=20),
ylim=c(0, 2.3), xlab="", main="")
lines(density(mu.normal))
lines(density(mu.cauchy), col=2)
text(13.25, 1.5, "-- N(15, 1)")
text(13.395, 1.38, "-- Cauchy(15, 0.675)", col=2)
Assessing Model Quality 85
15. Calculate the posterior predictive distribution for the Palm Beach County
model (Section 4.1.1) using both the conjugate prior and the uninfor-
mative prior.
I Answer
E[ynew|y,X] = E [E[ynew|β, y,X]|y,X]
= E [Xβ|y,X] = X× (posterior mean of β)
Var[ynew|y,X] = E [Var[ynew|β, y,X]|y,X]
+ Var [E[ynew|β, y,X]|y,X]
= E[σ2|y,X
]+ Var [Xβ|y,X]
= E[σ2|y,X
]+ 0 = (posterior mean of σ2)
(a) Conjugate prior: β ∼ N (0,Σ = I), σ2 ∼ IG(2, 1)
E[ynew|y,X] = Xβ
Var[ynew|y,X] =1
2σ2(n+ 2− k)/(n+ 2− k − 1),
where β = (Σ−1 + X′X)−1(0 + X′Xb), σ2 = (y−Xb)′(y−Xb)n−k , b =
(X′X)−1X′y, n = 516, and k = 6.
(b) Uninformed prior: p(β) ∝ c, p(σ2) ∝ 1σ
E[ynew|y,X] = Xb
Var[ynew|y,X] = σ2(n− k)/(n− k − 3),
where b = (X′X)−1X′y, σ2 = (y−Xb)′(y−Xb)n−k , n = 516, and k = 6.
< R.Code >
# data and setup
pbc.vote <- as.matrix(read.table("http://artsci.wustl.edu/
~stats/BMSBSA/Chapter06/pbc.dat"))
dimnames(pbc.vote)[[2]] <- c("badballots", "teach", "new",
86Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
"turnout", "rep", "whi")
x <- cbind(1, pbc.vote[,-1])
y <- pbc.vote[,1]
n <- nrow(x); k <- ncol(x)
xbar <- x
# posterior predictive distribution
bh <- solve(t(xbar) %*% xbar) %*% t(xbar) %*% y
bt <- solve(diag(1, k) + t(xbar) %*% xbar) %*%
(t(xbar) %*% xbar %*% bh)
s2 <- (t(y - xbar %*% bh) %*% (y - xbar %*% bh)) / (n - k)
st <- 2 + s2 * (n-k) +
t(bh - bt) %*% t(xbar) %*% xbar %*% bh
con.y.E <- xbar %*% bt
con.y.var <- (s2 * (n + 2 - k)) / (2 * (n + 2 - k - 1))
17. Using the model specification in (6.1), produce a posterior distribution
for appropriate survey data of your choosing. Manipulate the forms
of the priors and indicate the robustness of this specification for your
setting. Summarize the posterior results with tabulated quantiles.
I Answer
Try to utilize any types of survey data.
19. Leamer (1983) proposes reporting extreme bounds for coefficient values
during the specification search as a way to demonstrate the reliability of
effects across model-space. So the researcher would record the maximum
and minimum value that every coefficient took-on as different model
choices were attempted. Explain how this relates to Bayesian model
averaging.
I Answer
What Leamer’s proposal is doing is to record the maximum and the
minimum values that every coefficient takes as different model specifica-
tions are tried. And, the Bayesian model averaging is to 1) try multiple
parametric forms for the model specifications; 2) to weight each of them;
Assessing Model Quality 87
and 3) average them with their weights. Therefore, not just by recording
the extreme values, but by averaging all values (with proper weights),
Bayesian model averaging generalize the idea of Leamer’s proposal.
7
Bayesian Hypothesis Testing and the Bayes
Factor
Exercises
1. Perform a frequentist hypothesis test of H0 :θ = 0 versus H1 :θ = 500,
where X ∼ N (θ, 1) (one-tail at the α = 0.01 level). A single datapoint
is observed, x = 3. What decision do you make? Why is this not
a defensible procedure here? Does a Bayesian alternative make more
sense?
I Answer
The frequentist hypothesis test is:
(a) Under H0, t = 3−01 = 3 > 2.325. ∴ Reject H0.
(b) Under H1, t = 3−5001 = −497 < −2.325. ∴ Reject H1.
Since H0 and H1 are NOT mutually exclusive and exhaustive, the fre-
quentist procedure is a little problematic. However, if we use a Bayesian
alternatives, we can have the exact probability that a certain hypothesis
is correct. Then, we can discuss about how precise the hypotheses are.
3. Akaike (1973) states that models with negative AIC are better than the
saturated model and by extension the model with the largest negative
value is the best choice. Show that if this is true, then the BIC is a
better asymptotic choice for comparison with the saturated model.
I Answer
89
90Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
Note that:
BIC = −2l(θ|x) + p log n
AIC = −2l(θ|x) + 2p
Then, BIC has always a smaller negative value then AIC once we have
p log n > 2p (i.e. n > e2). Therefore, BIC would be much more “con-
servative” criterion when comparing with the saturated model. In other
words, BIC is a “better” asymptotic choice in the sense that it is a more
“conservative” choice when n becomes larger.
5. Derive the last line of (7.12) from: p(H0|data) = 1 − p(H1|data) =
1− 1B(x)
[p(data)p(H0) p(H0|data)
]p(H1)p(data) .
I Answer
Let p(H0|data) ≡ A. Then, we have:
A = 1− 1
B(x)·[p(data)
p(H0)·A]· p(H1)
p(data)
= 1− 1
B(x)· p(data)
p(H0)
p(H1)
p(data)·A
= 1− 1
B(x)· p(H1)
p(H0)·A
⇒ A+
[1
B(x)· p(H1)
p(H0)
]A = 1
⇒[1 +
1
B(x)· p(H1)
p(H0)
]A = 1
∴ A =
[1 +
1
B(x)· p(H1)
p(H0)
]−1
= p(H0|data) (Q.E.D.)
7. (Berger 1985). A child takes an IQ test with the result that a score over
100 will be designated as above average, and a score of under 100 will be
designated as below average. The population distribution is distributed
N (100, 225) and the child’s posterior distribution is N (110.39, 69.23).
Test competing designations on a single test, p(θ ≤ 100|x) vs. p(θ >
Bayesian Hypothesis Testing and the Bayes Factor 91
100|x), with a Bayes Factor using equally plausible prior notions (the
population distribution), and normal assumptions about the posterior.
I Answer
Let M1 : θ ≤ 100 and M2 : θ > 100. Then, we have
B(x) =p(M1|x)/p(M1)
p(M2|x)/p(M2)
=p(M1|x)
p(M2|x)(∵ p(M1) = p(M2) =
1
2)
p(M1|x) = 0.1059 can be calculated by pnorm(100, mean=110.39,
sd=sqrt(69.23)) in R . And, p(M2|x) = 1− p(M1|x) = 0.8941. There-
fore, B(x) = 0.10590.8941 = 0.1184, which is substantial evidence against M1.
9. Returning to the beta-binomial model from Chapter 2, set up a Bayes
Factor for: p < 0.5 versus p ≥ 0.5, using a uniform prior, and the data:
[0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]. Perform the integration step using
rejection sampling (Chapter 8).
I Answer
The resulting Bayes Factor is 3.3419, which is a strong evidence for
p < 0.5.
< R.Code >
# beta function
sdbeta <- function(x, alpha, beta) a <- factorial(alpha+beta-1)
b <- factorial(alpha-1) * factorial(beta-1)
c <- x^(alpha-1) * (1-x)^(beta-1)
return(a/b*c)
betamode <- function(alpha, beta)
(alpha - 1) / (alpha + beta - 2)
sdbetamode <- function(alpha, beta) sdbeta(betamode(alpha, beta), alpha, beta)
92Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
# rejection sampling for integration
int.beta <- function(n, alpha, beta) xvec <- runif(n, 0, 0.5)
yvec <- runif(n, 0, sdbetamode(alpha, beta))
fvec <- sdbeta(xvec, alpha, beta)
indicator <- ifelse(yvec < fvec, 1, 0)
return(mean(indicator)*sdbetamode(alpha, beta)/2)
# calculate Beta(7,10) and Bayes Facor
z <- int.beta(100000, 7, 10)
z/(1-z)
11. Demonstrate using rejection sampling the following equality from Bayes’
original (1763) paper, using different values of n and p:∫ 1
0
(n
p
)xp(1− x)n−pdx =
1
n.
I Answer
We have: ∫ 1
0
(n
p
)xp(1− x)n−pdx =
(n
p
)∫ 1
0
xp(1− x)n−pdx︸ ︷︷ ︸integral part
Since the integral part is bounded, the first thing is to find the maximum
value of xp(1− x)n−p under the bounded region of x (F.O.C.):
∂xp(1− x)n−p
∂x= pxp−1(1− x)n−p − x(n− p)(1− x)n−p−1 ≡ 0
∴ x =p
n
Therefore, the rejection method follows:
(a) Draw a from U [0, 1] and b from U [0, pn ].
(b) If ap(1− a)n−p > b then accept it.
Bayesian Hypothesis Testing and the Bayes Factor 93
(c) Repeat (a) and (b) many times.
(d) Calculate pn ×
#accepts#trials
From R , we can recognize that, even though we change n and p, the
values are always very similar to 1n .
< R.Code >
trial <- 10000
value <- function(n, p) max <- (p/n)^p * (1-p/n)^(n-p)
accept <- 0
for (i in 1:trial) a <- runif(1, 0, 1)
b <- runif(1, 0, max)
f <- a^p * (1-a)^(n-p)
if (f>b) accept <- accept + 1
else accept <- accept
int <- max*(accept/trial)*factorial(n) /
(factorial(n-p)*factorial(p))
return(int)
try1 <- value(10, 3); try2 <- value(10, 5)
try3 <- value(10, 8); try4 <- value(20, 3)
try5 <- value(20, 5); try6 <- value(20, 8)
ans <- rbind(c(1/10, try1, try2, try3),
c(1/20, try4, try5, try6))
colnames(ans) <- c("true", "1st try",
"2nd try", "3rd try")
> ans
true 1st try 2nd try 3rd try
[1,] 0.10 0.09104169 0.09026719 0.09132174
[2,] 0.05 0.04766731 0.04805365 0.04823303
13. Calculate the posterior distribution for β in a logit regression model
r(X′b) = p(y = 1|X′b) = 1/[1 + exp(−X′b)]
94Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
with a BE(A,B) prior. Perform a formal test of a BE(4, 4) prior versus
a BE(4, 1) prior. Calculate the Kullback-Leibler distance between the
two resulting posterior distributions.
I Answer
Let’s define f(b) as:
f(b) ≡n∏i=1
[(1
1 + exp(−X′b)
)yi ( exp(−X′b)
1 + exp(−X′b)
)1−yi]
The, the posterior distribution is:
π(b|X, y) =BE(b|A,B)f(b)∫ 1
0BE(b|A,B)f(b)db
The Bayes Factor for BE(4, 4) versus BE(4, 1) is:
BF =
∫ 1
0BE(b|4, 4)f(b)db∫ 1
0BE(b|4, 1)f(b)db
= 14×∫ 1
0b3(1− b)3f(b)db∫ 1
0b3f(b)db
And, the Kullback-Leibler distance is:
I (BE(4, 4),BE(4, 1)) =
∫ 1
0
log
[BE(b|4, 4)
BE(b|4, 1)· 1
BF
]BE(b|4, 4)f(b)db
= 70× log14
BF×∫ 1
0
b3(1− b)3f(b)db
15. (Aitkin 1991). Model 1 specifies y ∼ N (µ1, σ2) with µ1 specified and
σ2 known, Model 2 specifies y ∼ N (µ2, σ2) with µ2 unknown and the
same σ2. Assign the improper uniform prior to µ2: p(µ2) = C/2 over
the support [−C :C], where C is large enough value to make this a
reasonably uninformative prior. For a predefined standard significance
test level critical value z, the Bayes Factor for Model 1 versus Model 2
is given by:
B =2Cn1/2φ(z)
σ[Φ(n1/2 y+C
σ
)− Φ
(n1/2 y−C
σ
)] .
Bayesian Hypothesis Testing and the Bayes Factor 95
Show that B can be made as large as one likes as C → ∞ or n → ∞for any fixed value of z. This is an example of Lindley’s paradox for a
point null hypothesis test (Section 7.3.3.).
I Answer
First, as C →∞,
B =1
σ
∞[Φ(∞)− Φ(−∞)]
=1
σ
∞1− 0
=∞, for any fixed σ, n, z
Second, as n→∞,
B =1
σ
∞[Φ(∞)− Φ(∞)]
=1
σ
∞1− 1
=∞, for any fixed σ,C, z (Q.E.D.)
17. Given a Poisson likelihood function, instead of specifying the conjugate
gamma distribution, stipulate p(µ) = 1/µ2. Derive an expression for
the posterior distribution of µ by first finding the value of µ, which
maximizes the log density of the posterior, and then expanding a Taylor
series around this point (i.e., Laplace approximation). Compare the
resulting distribution with the conjugate result.
I Answer
From π(λ) = 1λ2 and Laplace approximation, we have the posterior
distribution as:
π(λ|x) = π(λo|x) +1
2(λ− λo)2π′′(λo|x),
where λo maximizes log π(λ|x).
In order to find λo, take F.O.C.:
∂ log π(λ|x)
∂λ=∂ log
[1λ2 ·
∏ni=1
λxie−λ
xi!
]∂λ
≡ 0
96Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
⇒∂ log
(λ∑xi−2 exp[−nλ]
)∂λ
= 0
⇒ ∂ [(∑xi − 2) log λ− nλ]
∂λ=
∑xi − 2
λ− n = 0
∴ λo =
∑xi − 2
n
Then, the posterior becomes:
π(λ|x) = λΣxi−2o exp[−nλo]
+1
2(λ− λo)2 · (Σxi − 2)(Σxi − 3)λΣxi−4 exp[−nλo]n2
=
(λ2o +
n2
2(λ− λo)2(Σxi − 2)(Σxi − 3)
)λΣxi−4o exp[−nλo]
=
[(Σxi − 2
n
)2
+n2
2
(λ− Σxi − 2
n
)2
(Σxi − 2)(Σxi − 3)
]
×(
Σxi − 2
n
)Σxi−4
exp[−(Σxi − 2)]
∝ n2
2(Σxi − 2)(Σxi − 3)
[λ− Σxi − 2
n
]2
Letting k1 ≡ Σxi−2n and k2 ≡ n2(Σxi−2)(Σxi−3) produces the posterior
distribution as:
π(λ|x) ∝ k2
2[λ− k1]
2
= log
(exp
[1
2/k2(λ− k1)2
])= logN
(k1,
1
k2
)Note that this is a little different from the posterior distribution obtained
from the conjugate prior G(α, β):
π(λ|x) = G(Σxi + α, β + n)
However, if we compare the means and variances from the two posterior
specifications, we can recognize that sufficient number of data would
make the two posterior converge.
Bayesian Hypothesis Testing and the Bayes Factor 97
19. Using the Palm Beach County voting model in Section 4.1.1. with unin-
formative priors, compare the full model with the nested model, leaving
out the technology variable using a local Bayes Factor. Experiment
with different training samples and compare the subsequent results. Do
you find that the Bayes Factor is sensitive to your selection of training
sample?
I Answer
The below R code utilizes the model specification that was discussed in
Section 4.1.1. Running this code multiple times would inform us how
sensitive the Bayes Factor is to the selection of training sample. For
the training sample below (keep) is designed to be a random sample
from the entire data. It turns out to be not sensitive. Additionally, it
is consistently in favor of keeping the technology variable in the model
(BF > 5× 1025).
< R.Code >
library(BaM); library(mvtnorm)
data(pbc.vote)
norm.like <- function(B,B.HAT,SIGMA.SQ,X)
SIGMA.SQ^(-nrow(X)/2) * exp( -0.5/SIGMA.SQ *
(-diff(dim(X)) * SIGMA.SQ + t(B-B.HAT) %*% t(X) %*%
X %*% (B-B.HAT)) )
# CREATE THE TWO DATASETS, STANDARDIZE FOR STABILITY
# X1 for bigger model, X2 for smaller model
# a for regular data, b for training data
attach(pbc.vote)
X1a <- cbind(new, size, Republican, white, technology)
X1b <- cbind(new, size, Republican, white )
Y1 <- badballots
detach(pbc.vote)
keep <- sample(1:nrow(X1a),200)
X2a <- X1a[keep,]; X2b <- X1b[keep,]; Y2 <- Y1[keep]
X1a <- apply(X1a,2,scale); X2a <- apply(X2a,2,scale)
X1b <- apply(X1b,2,scale); X2b <- apply(X2b,2,scale)
X1a <- as.matrix(cbind(rep(1,nrow(X1a)), X1a))
98Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
X1b <- as.matrix(cbind(rep(1,nrow(X1b)), X1b))
X2a <- as.matrix(cbind(rep(1,nrow(X2a)), X2a))
X2b <- as.matrix(cbind(rep(1,nrow(X2b)), X2b))
Y1 <- scale(Y1); Y2 <- scale(Y2)
# UNINFORMED PRIOR -> POSTERIOR PARAMETERS
bhat1a <- solve(t(X1a)%*%X1a)%*%t(X1a)%*%Y1
bhat1b <- solve(t(X1b)%*%X1b)%*%t(X1b)%*%Y1
s21a <- t(Y1- X1a%*%bhat1a)%*%(Y1- X1a%*%bhat1a)/
(nrow(X1a)-ncol(X1a))
s21b <- t(Y1- X1b%*%bhat1b)%*%(Y1- X1b%*%bhat1b)/
(nrow(X1b)-ncol(X1b))
R1a <- solve(t(X1a)%*%X1a)*((nrow(X1a)-ncol(X1a))* s21a/
(nrow(X1a)-ncol(X1a)-2))[1,1]
R1b <- solve(t(X1b)%*%X1b)*((nrow(X1b)-ncol(X1b))* s21b/
(nrow(X1b)-ncol(X1b)-2))[1,1]
bhat2a <- solve(t(X2a)%*%X2a)%*%t(X2a)%*%Y2
bhat2b <- solve(t(X2b)%*%X2b)%*%t(X2b)%*%Y2
s22a <- t(Y2- X2a%*%bhat2a)%*%(Y2- X2a%*%bhat2a)/
(nrow(X2a)-ncol(X2a))
s22b <- t(Y2- X2b%*%bhat2b)%*%(Y2- X2b%*%bhat2b)/
(nrow(X2b)-ncol(X2b))
R2a <- solve(t(X2a)%*%X2a)*((nrow(X2a)-ncol(X2a))* s22a/
(nrow(X2a)-ncol(X2a)-2))[1,1]
R2b <- solve(t(X2b)%*%X2b)*((nrow(X2b)-ncol(X2b))* s22b/
(nrow(X2b)-ncol(X2b)-2))[1,1]
# IMPORTANCE SAMPLING CALCULATION OF MARGINAL LIKELIHOODS
K <- 10000
bhat1 <- bhat1a; bhat2 <- bhat1b
R1 <- R1a; R2 <- R1b; s21 <- s21a; s22 <- s21b
X1 <- X1a; X2 <- X1b
g1 <- h1 <- rmvnorm(K,bhat1,R1)
g2 <- h2 <- rmvnorm(K,bhat2,R2)
G1 <- H1 <- rep(NA,K); G2 <- H2 <- rep(NA,K)
for (i in 1:K) G1[i] <- dmvnorm(g1[i,],bhat1,R1)
H1[i] <- norm.like(g1[i,],bhat1,s21,X1)
G2[i] <- dmvnorm(g2[i,],bhat2,R2)
H2[i] <- norm.like(g2[i,],bhat2,s22,X2)
Bayesian Hypothesis Testing and the Bayes Factor 99
m1 <- K/(sum(G1/H1)); m2 <- K/(sum(G2/H2))
BF.main <- m1/m2
bhat1 <- bhat2a; bhat2 <- bhat2b
R1 <- R2a; R2 <- R2b; s21 <- s22a; s22 <- s22b
X1 <- X2a; X2 <- X2b
g1 <- h1 <- rmvnorm(K,bhat1,R1)
g2 <- h2 <- rmvnorm(K,bhat2,R2)
G1 <- H1 <- rep(NA,K); G2 <- H2 <- rep(NA,K)
for (i in 1:K) G1[i] <- dmvnorm(g1[i,],bhat1,R1)
H1[i] <- norm.like(g1[i,],bhat1,s21,X1)
G2[i] <- dmvnorm(g2[i,],bhat2,R2)
H2[i] <- norm.like(g2[i,],bhat2,s22,X2)
m1 <- K/(sum(G1/H1)); m2 <- K/(sum(G2/H2))
BF.training <- m2/m1
(LBF <- BF.training*BF.main)
8
Bayesian Decision Theory
Exercises
To be provided.
101
9
Monte Carlo Methods
Exercises
1. Use the trapezoidal rule with n = 8 to evaluate the definite integral:
I(g) =
∫ π
0
excos(x)dx.
The correct answer is I(g) = −12.070346; can you explain why your
answer differs slightly?
I Answer
I(g) =π
8×[e0 cos(0) + eπ/8 cos(π/8)
]× 1
2
+π
8×[eπ/8 cos(π/8) + e2π/8 cos(2π/8)
]× 1
2
+ · · ·+ π
8×[e7π/8 cos(7π/8) + eπ cos(π)
]× 1
2
≈ −12.38216
Because n is quite small, the Trapezoid answer is a little different from
the true value. However, when n becomes larger, the Trapezoid answer
becomes much closer to the true value.
< R.Code >
trapezoid <- function(a, b, n) x.values <- seq(a, b, length=(n+1))
height <- NULL
area <- 0
for (i in 1:(n+1))
103
104Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
height[i] <- exp(x.values[i]) * cos(x.values[i])
for (i in 1:n) area <- area +
((b-a)/n) * (height[i] + height[i+1]) * 0.5
return(area)
> trapezoid(0, pi, 8)
[1] -12.38216
> trapezoid(0, pi, 25)
[1] -12.10213
> trapezoid(0, pi, 50)
[1] -12.07829
3. Use Simpson’s rule with n = 8 to evaluate the definite integral in Exer-
cise 8.1. Is your answer better?
I Answer
I(g) =π
8×
1
6×[e0 cos(0) + eπ/8 cos(π/8) + 4× eπ/16 cos(π/16)
]+π
8×
1
6×[eπ/8 cos(π/8) + e2π/8 cos(2π/8) + 4× e3π/16 cos(3π/16)
]· · ·+
π
8×
1
6×[e7π/8 cos(7π/8) + eπ cos(π) + 4× e15π/16 cos(15π/16)
]≈ −12.06995
Note that the Simpson’s Rule answer is much closer to the true value
even when n remains the same. Of course, as before, when n becomes
larger, the Simpson’s Rule answer becomes much closer to the true value.
< R.Code >
simpson <- function(a, b, n) x.value <- seq(a, b, length=(n+1))
height.point <- NULL
mid <- NULL
height.mid <- NULL
Monte Carlo Methods 105
area <- 0
for (i in 1:(n+1)) height.point[i] <- exp(x.value[i]) * cos(x.value[i])
for (i in 1:n) mid[i] <- (x.value[i] + x.value[i+1])/2
height.mid[i] <- exp(mid[i]) * cos(mid[i])
for (i in 1:n) area <- area + ((b-a)/n) * (1/6) * (height.point[i] +
height.point[i+1] + 4 * height.mid[i])
return(area)
> simpson(0, pi, 8)
[1] -12.06995
> simpson(0, pi, 25)
[1] -12.07034
> simpson(0, pi, 50)
[1] -12.07035
5. It is well known that the normal is the limiting distribution of the bino-
mial (i.e., that as the number of binomial experiments gets very large the
histogram of the binomial results increasing looks like a normal). Using
R , generate 25 BN (10, 0.5) experiments. Treat this as a posterior for
p = 0.5 even though you know the true value and improve your estimate
using importance sampling with a normal approximation density.
I Answer
The algorithm is:
(a) Draw 25 values from N (5, 25): x1, x2, · · · , x25.
(b) Round them because the binomial distribution allows only natural
numbers.
(c) Calculate the weights
wi =dbinom(xi, size=10, prob=0.5)
dnorm(xi, mean=5, var=2.5)
106Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
(d) Get
Mean =
∑25i=1 xiwi∑25i=1 wi
Variance =
∑25i=1 x
2iwi∑25
i=1 wi− [ Mean ]2
The importance sampling produces Mean = 4.8463 and Variance =
2.3802, which are not exactly same as the true values but pretty much
similar to those.
Just for the reference, the Monte Carlo analysis (about how the impor-
tance sampling is doing better as compared to the direct sampling from
the binomial distribution) is done. It turns out that, the importance
sampling is doing a great job when producing the mean. But, it is not
doing well when producing the variance (as the number of draws become
larger).
< R.Code >
app <- round(rnorm(25, mean=5, sd=sqrt(2.5)), 0)
weight <- dbinom(app, size=10, prob=0.5) /
dnorm(app, mean=5, sd=sqrt(2.5))
average <- sum(app*weight) / sum(weight)
variance <- sum(app^2 * weight)/ sum(weight) - average^2
imp <- cbind(average, variance)
> imp
average variance
[1,] 4.846275 2.380189
# compare with direct sampling from BINOM
# Monte Carlo analysis (w/ simulation number = 10000)
t <- c(25, 100, 1000, 5000, 10000)
improve <- NULL; estimate <- NULL
for (j in 1:5) m <- 0; v <- 0
for (i in 1:10000) y <- rbinom(t[j], size=10, prob=0.5)
m1 <- mean(y); v1 <- var(y)
x <- round(rnorm(t[j], mean=5, sd=sqrt(2.5)), 0)
Monte Carlo Methods 107
w <- dbinom(x, size=10, prob=0.5) /
dnorm(x, mean=5, sd=sqrt(2.5))
m2 <- sum(x*w) / sum(w)
v2 <- sum(x^2 *w) / sum(w) - m2^2
if (abs(m2-5) < abs(m1-5)) m <- m + 1
else m <- m
if (abs(v2-2.5) < abs(v1-2.5)) v <- v + 1
else v <- v
improve <- rbind(improve, c(m, v))
estimate <- rbind(estimate, c(m1, m2, v1, v2))
result <- cbind(estimate[,1:2], improve[,1]/100,
estimate[,3:4], improve[,2]/100)
colnames(result) <- c("Mean-Binom.", "Mean-Imp.",
"Better(%)?",
"Var-Binom.", "Var-Imp.",
"Better(%)?")
rownames(result) <- c("N=25", "N=100", "N=1000",
"N=5000", "N=10000")
> result
Mean-Binom. Mean-Imp. Better (%)?
N=25 4.6800 5.010181 49.01
N=100 4.9700 5.080935 50.72
N=1000 4.9450 4.992402 49.98
N=5000 4.9734 5.029711 49.34
N=10000 5.0176 5.001492 49.10
Var-Binom. Var-Imp. Better (%)?
N=25 2.143333 2.459887 51.21
N=100 2.332424 2.537924 50.46
N=1000 2.530506 2.569648 43.46
N=5000 2.560004 2.650143 23.55
N=10000 2.550345 2.563561 11.35
7. Casella and Berger (2001, 638) give the famous temperature and failure
data on space shuttle launches before the Challenger disaster, where
failure is dichotomized. Actually several o-ring failure events were mul-
tiple events. Treat the number of multiple events as missing data and
108Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
write an EM algorithm to estimate the Poisson parameter given the
missing data. Here are the dichotomized failure data in R form:
temp <- c(53,57,58,63,66,67,67,68,69,70,70,70,70,70,72,73,
75,75,76,76,78,79,81),
fail <- c(1,1,1,1,0,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0).
I Answer
Set all the values of 1 as missing. Then, the EM algorithm is:
(a) Start with an arbitrary value for λ: λ(0) = 20
(b) Draw m values for ymis from f(ymis|yobs, λ(0)),
where f(ymis|yobs, λ(0)) = f(ymis|λ(0)) (∵ data are i.i.d), and
f(ymis|λ(0)) = P(ymis|λ(0),ymis ≥ 1) (∵ missing data ≥ 1).
(c) For the ith draw of ymis, calculate
`(i)(λ|yobs,y(i)mis) =
23∑j=1
yj
log(λ)− 23λ
(d) Take the mean
¯=
∑mi=1
[(∑23j=1 yj
)log(λ)− 23λ
]m
(e) Find λ that maximizes the mean above
(f) Repeat the procedure until λ(t) and λ(t+1) becomes very close (say
0.001)
From R , we know that λ is estimated to be 0.3652. The number 1’s
should be mostly 1 and some other numbers greater than 1.
< R.Code >
m <- 100
y <- c(NA, NA, NA, NA, 0, 0, 0, 0, 0, 0, 0, 0,
NA, NA, 0, 0, 0, NA, 0, 0, 0, 0, 0)
fr <- function(lambda, y.draw) y.draw <- as.matrix(y.draw)
log.lik <- NULL
Monte Carlo Methods 109
for (j in 1:m) y.new <- c(as.vector(y.draw[j, 1:4]),
as.vector(y[5:12]),
as.vector(y.draw[j, 5:6]),
as.vector(y[15:17]),
as.vector(y.draw[j, 7]),
as.vector(y[19:23]))
log.lik[j] <- sum(y.new)*log(lambda) - 23*lambda
return(- sum(log.lik) / m)
y.draw <- matrix(rep(NA, 7*m), m)
lambda <- cbind(0, 20); c <- 2
while(abs(lambda[c] - lambda[c-1]) > 0.001) for (i in 1:7) count <- 1
while (count <= m) draw <- rpois(1, lambda[c])
if (draw >= 1) y.draw[count, i] <- draw
count <- count + 1
else count <- count
lambda.hat <- optimize(f=fr, y.draw=y.draw,
lower=0, upper=100)
lambda <- cbind(lambda, lambda.hat$minimum)
c <- c + 1
> print(lambda[c])
[1] 0.3652197
> y.draw[1,]
[1] 2 1 3 1 1 1 1
9. Replicate the calculation of the Kullback-Leibler distance between the
two beta distributions in Section 8.3.1 using rejection sampling. In
110Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
the cultural anthropology example given in Section 2.3.3 the beta priors
BE(15, 2) and BE(1, 1) produced beta posteriors BE(32, 9) and BE(18, 8)
with the beta-binomial specification. Are these two posterior distribu-
tions closer or farther apart in Kullback-Leibler distance than their re-
spective priors? What does this say about the effect of the data in this
example?
I Answer
The first thing that we have to do is to calculate the two integrals by
using the rejection method. For each integral, the algorithm is:
(a) Find the minimum value (min) of the function (h) that is inside
the integral.
(b) Randomly draw x from U [0, 1], and y from U [min, 0].
(c) If h(x) ≤ y, accept it. Otherwise, reject it.
(d) Calculate | min | × # accept# draws .
Then, the resulting two Kullback-Leibler distances are 1.5345 for two
priors and 0.4816 for two posteriors. So, we can say that two distribu-
tions becomes closer in posteriors than in priors. This is because the
data reduces the impact of the prior specification.
< R.Code >
m <- 1000 # number of draws
fr.1 <- function(p, A, B) return(log(p) * p^(A-1) * (1-p)^(B-1))
fr.2 <- function(p, A, B) return(log(1-p) * p^(A-1) * (1-p)^(B-1))
kl.distance <- function(a, b, c, d)
min.1 <- optimize(f=fr.1, A=a, B=b,
lower=0.5, upper=0.9)
min.2 <- optimize(f=fr.2, A=a, B=b,
lower=0.5, upper=0.9)
x <- runif(m, 0, 1)
y1 <- runif(m, min.1$objective, 0)
Monte Carlo Methods 111
y2 <- runif(m, min.2$objective, 0)
accept1 <- accept2 <- 0
for (i in 1:m) if (y1[i] < fr.1(x[i], a, b)) accept1 <- accept1
else accept1 <- accept1 + 1
if (y2[i] < fr.2(x[i], a, b)) accept2 <- accept2
else accept2 <- accept2 + 1
f.1 <- - min.1$objective * (accept1 / m)
f.2 <- - min.2$objective * (accept2 / m)
I <- log((gamma(a+b)*gamma(c)*gamma(d))/
(gamma(c+d)*gamma(a)*gamma(b))) -
(a-c)*(gamma(a+b)/(gamma(a)*gamma(b)))*f.1 -
(b-d)*(gamma(a+b)/(gamma(a)*gamma(b)))*f.2
return(I)
compare <- cbind(kl.distance(15, 2, 1, 1),
kl.distance(32, 9, 18, 8))
colnames(compare) <- c("distance b/w priors",
"distance b/w posterior")
> print(compare)
distance b/w priors distance b/w posterior
[1,] 1.534457 0.4815835
11. The Cauchy distribution (Appendix B) is a unimodal, symmetric density
like the normal but has much heavier tails. In fact the tails are suffi-
ciently heavy that the Cauchy distribution does not have finite moments.
To understand this characteristic, perform the following experiment at
least 10 times:
• Generate a sample of size 100 from C(x|θ, σ), where you choose
θ, σ.
• Perform Monte Carlo integration to attempt to calculate E[X].
• Compare your estimate with the mode (θ).
What do you conclude about the moments of the Cauchy?
I Answer
112Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
From the experiment in R , we observe that mean varies a lot (variance
of 203.4884) from around -80 to around 30. So, we cannot say about (or
define) the moments of the Cauchy distribution.
< R.Code >
cau.mean <- NULL
for (i in 1:50) cau <- rcauchy(100, location=0, scale=1)
mean <- sum(cau)/100
cau.mean <- c(cau.mean, mean)
cau.mean
result <- cbind(mean(cau.mean), var(cau.mean),
min(cau.mean), max(cau.mean))
colnames(result) <- c("mean of mean", "var of mean",
"min of mean", "max of mean")
> print(result)
mean of mean var of mean min of mean max of mean
[1,] -1.664704 203.4884 -80.29538 30.29719
13. Generate 100 standard normal random variables in R and use these
values to perform rejection sampling to calculate the integral of the nor-
mal PDF from 2 to ∞. Repeat this experiment 100 times and calculate
the empirical mean and variance from your replications. Now generate
10,000 Cauchy values and use this as an approximation distribution in
importance sampling to obtain an estimate of the same normal interval.
Which approach is more accurate?
I Answer
Using either algorithm produces no different result. The slight difference
in variance comes from the difference in number of draws (100 normal
draws for the rejection sampling vs. 10000 cauchy draws for the impor-
tance sampling). Once we adjust that difference, we have no different
result.
< R.Code >
Monte Carlo Methods 113
reject <- function(T, n) int <- NULL
for (i in 1:T) draw <- rnorm(n, 0, 1)
accept <- draw[draw >= 2]
int[i] <- length(accept)/n
result <- cbind(mean(int), var(int))
colnames(result) <- c("mean", "var")
return(result)
imp <- function(T, n) int <- NULL
for (i in 1:T) p <- rcauchy(n, 0, 1)
w <- dnorm(p, 0, 1) / dcauchy(p, 0, 1)
index <- ifelse (p >= 2, 1, 0)
int[i] <- mean(w*index)
result <- cbind(mean(int), var(int))
colnames(result) <- c("mean", "var")
return(result)
compare <- rbind(reject(100, 100), imp(100, 10000),
imp(100, 100))
rownames(compare) <- c("rejection (100 draws)",
"importance (10000 draws)",
"importance (100 draws)")
> print(compare)
mean var
rejection (100 draws) 0.02380000 1.995556e-04
importance (10000 draws) 0.02286258 1.041832e-06
importance (100 draws) 0.02213381 1.197089e-04
15. The time to failure of two groups are observed, which are assumed to
be distributed exponential with unknown θ. The time to failure for the
first group, X1, X2, . . . , Xm, are fully observed, but the second group is
censored at time t > 0 because the study grant ran out, therefore the
114Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
Y1, Y2, . . . , Yn values are either t or some value less than t, which can be
given an indicator function: Oi = 1 if Yi = t and zero otherwise. The
expected value of the sufficient statistic is:
E
[n∑i=1
(Yi|Oi)
]=
(n∑i=1
Oi
)(t+ θ−1)
+
(n−
n∑i=1
Oi
)(θ−1 − t[1 + exp(t/θ)]−1),
and the complete data log likelihood is:
`(θ) = −m∑i=1
(log(θ−1) + xiθ)−n∑i=1
(log(θ−1) + yiθ).
Substitute the value of the sufficient statistic into the log likelihood
to get the Q(θ(k+1)|θ(k)) expression and run the EM algorithm for the
following data:
Xi 1 9 4 3 11 7 2 2 - -
Yi 2 3 2 4 4 2 4 4 3 1
I Answer
The estimated θ is 0.1456, which implies that the censored group (Yi)
should have 6.8697 as its mean. So, the number 4’s in Yi should be far
bigger than 4 for most of the time.
< R.Code >
x <- c(1,9,4,3,11,7,2,2)
y <- c(2,3,2,4,4,2,4,4,3,1)
n <- length(y)
t <- max(y)
on <- length(y[y==t])
fr <- function(theta, x, y.stat) theta <- as.numeric(theta)
x <- as.vector(x)
y.stat <- as.numeric(y.stat)
ll <- -sum(log(1/theta) + x*theta) -
Monte Carlo Methods 115
n*log(1/theta) - theta*y.stat
return(-ll)
theta <- cbind(0,1)
c <- 2
while (abs(theta[c] - theta[c-1]) > 0.0001) new.theta <- theta[c]
y.stat <- on*(t + 1/new.theta) +
(n - on)*(1/new.theta - t/(1 + exp(t/new.theta)))
new.theta <- optimize(f=fr, x=x, y.stat=y.stat,
lower=0, upper=10)
theta <- cbind(theta, new.theta$minimum)
c <- c + 1
result <- cbind(theta[c], 1/theta[c])
colnames(result) <- c("est. theta", "est. mean of y")
> print(result)
est. theta est. mean of y
[1,] 0.1455665 6.869713
17. One of the easier, but very slow, ways to generate beta random variables
is Johnk’s method (Jonhk 1964; Kennedy and Gentle 1980), based on
the fact that order statistics from iid random uniforms are distributed
beta. The algorithm to deliver a single BE(A,B) random variable is
given by:
(a) Generate independent:
u1 ∼ U(0, 1), u2 ∼ U(0, 1)
(b) Transform:
v1 = (u1)1/A, v2 = (u2)1/B
(c) Calculate the sum:
w = v1 + v2
and return w if w ≤ 1, otherwise return to step 1.
116Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
Code this algorithm in R and produce 100 random variables each ac-
cording to: BE(15, 2), BE(1, 1), BE(2, 9), and BE(8, 8). For each of
these beta distributions, also produce 100 random variables with the
rbeta function in R and compare with the corresponding forms pro-
duced with Johnk’s method using qqplot. Graph these four com-
parisons in the same window by setting up the windowing command
par(mfrow=c(2,2)).
I Answer
Figure 8.1 shows that Johnk’s method produces the beta distribution
fairly well. As for the time, the first three cases do not take long. But,
BE(8, 8) case takes a little longer.
< R.Code >
draw.beta <- function(n, A, B) draw <- NULL
counter <- 0
while (counter < 100) u1 <- runif(1, 0, 1)
u2 <- runif(1, 0, 1)
v1 <- u1^(1/A)
v2 <- u2^(1/B)
w <- v1 + v2
if (w <=1) counter <- counter + 1
draw <- cbind(draw, v1/w)
return(draw)
par(mfrow=c(2,2))
qqplot(draw.beta(100, 15, 2), rbeta(100, 15, 2),
xlab="Johnk’s method", ylab="rbeta",
xlim=c(0,1), ylim=c(0,1),
main="BE(15, 2)")
abline(0, 1, lty=2)
qqplot(draw.beta(100, 1, 1), rbeta(100, 1, 1),
xlab="Johnk’s method", ylab="rbeta",
Monte Carlo Methods 117
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
BE(15, 2)
Johnk's method
rbet
a
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
BE(1, 1)
Johnk's method
rbet
a
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
BE(2, 9)
Johnk's method
rbet
a
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
BE(8, 8)
Johnk's method
rbet
a
FIGURE 9.1: Drawing Beta Distribution: Johnk’s Method vs. rbeta
xlim=c(0,1), ylim=c(0,1),
main="BE(1, 1)")
abline(0, 1, lty=2)
qqplot(draw.beta(100, 2, 9), rbeta(100, 2, 9),
xlab="Johnk’s method", ylab="rbeta",
xlim=c(0,1), ylim=c(0,1),
main="BE(2, 9)")
abline(0, 1, lty=2)
qqplot(draw.beta(100, 8, 8), rbeta(100, 8, 8),
xlab="Johnk’s method", ylab="rbeta",
xlim=c(0,1), ylim=c(0,1),
118Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
main="BE(8, 8)")
abline(0, 1, lty=2)
19. Hartley (1958, 182) fits a Poisson model to the following “drastically
censored” grouped data:
Number of Events 0 1 2 3+ Total
Group Frequency 11 37 64 128 240
Develop an EM algorithm to estimate the Poisson intensity parameter.
I Answer
Set all the values of 3 as missing. Then, the EM algorithm is:
(a) Start with an arbitrary value for λ: λ(0) = 20
(b) Draw m values for ymis from f(ymis|yobs, λ(0)),
where f(ymis|yobs, λ(0)) = f(ymis|λ(0)) (∵ data are i.i.d), and
f(ymis|λ(0)) = P(ymis|λ(0),ymis ≥ 3) (∵ missing data ≥ 3).
(c) For the ith draw of ymis, calculate
`(i)(λ|yobs,y(i)mis) =
240∑j=1
yj
log(λ)− 240λ
(d) Take the mean
¯=
∑mi=1
[(∑240j=1 yj
)log(λ)− 240λ
]m
(e) Find λ that maximizes the mean above
(f) Repeat the procedure until λ(t) and λ(t+1) becomes very close (say
0.001)
From R , we know that λ is estimated to be 2.8707, which is a little
bigger than the original λ value (2.2875). The λ was biased downwardly
because the data was censored. Put it differently, the number 3’s in the
data should be mostly 3 and some other numbers greater than 3.
< R.Code >
Monte Carlo Methods 119
m <- 100
y.orginal <- c(rep(0,11),rep(1,37),rep(2,64),rep(3,128))
y <- c(rep(0,11),rep(1,37),rep(2,64),rep(NA,128))
fr <- function(lambda, y.draw) y.draw <- as.matrix(y.draw)
ll <- NULL
for (j in 1:m) y.new <- c(y[1:112], y.draw[j,])
log.lik[j] <- sum(y.new)*log(lambda) - 240*lambda
return(- sum(log.lik) / n)
y.draw <- matrix(rep(NA, 128*m), m)
lambda <- cbind(0, 20); c <- 2
while(abs(lambda[c] - lambda[c-1]) > 0.001) for (k in 1:128) count <- 1
while (count <= m) draw <- rpois(1, lambda[c])
if (draw >= 3) y.draw[count, k] <- draw
count <- count + 1
else count <- count
lambda.hat <- optimize(f=fr, y.draw=y.draw,
lower=0, upper=100)
lambda <- cbind(lambda, lambda.hat$minimum)
c <- c + 1
compare <- cbind(mean(y.orginal), l[c])
colnames(compare) <- c("Before EM", "After EM")
rownames(compare) <- "lambda"
> compare
Before EM After EM
lambda 2.2875 2.870666
> y.draw[1,]
120Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
[1] 3 4 3 4 3 3 5 3 7 4 3 6 7 5 3 5 8 5 3 3 5 4 4 4 5
[26] 4 4 5 4 3 3 3 4 3 3 3 3 5 3 3 5 7 4 3 6 3 3 5 3 3
[42] 5 3 3 6 3 3 5 3 3 5 3 3 6 3 3 5 3 3 3 4 3 4 4 3 5
[67] 3 3 3 4 6 4 5 4 3 3 5 4 8 6 6 4 4 4 4 3 3 3 5 7 4
[92] 4 4 5 4 3 3 4 3 4 3 6 3 5 3 3 3 4 4 5 3 4 3 3 6 3
[117] 3 4 4 3 5 3 3 5 7 5 3 4
10
Basics of Markov Chain Monte Carlo
Exercises
1. Find the values of α1, α2, and α3 that make the following a transition
matrix: 0.4 α1 0.0
0.3 α2 0.6
0.0 α3 0.4
. (10.1)
I Answer
0.4 + α1 + 0.0 = 1
0.3 + α2 + 0.6 = 1
0.0 + α3 + 0.4 = 1
⇒
α1 = 0.6
α2 = 0.1
α3 = 0.6
3. Using the transition matrix from Section 9.1.2, run the Markov chain
mechanically (step by step) in R using matrix multiplication. Start with
at least two very different initial states. Run the chains for at least ten
iterations.
I Answer
< R.Code >
transition <- matrix(c(.8, .6, .2, .4), ncol=2)
markov <- function(n, A, B) init <- matrix(c(A, B), 1)
out <- NULL
121
122Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
for (i in 1:n) init <- init %*% transition
out <- rbind(out, init)
return(out)
try1 <- markov(10, .9, .1)
try2 <- markov(10, .1, .9)
compare <- cbind(try1, try2)
colnames(compare) <- c("Pr(A)_try1", "Pr(B)_try1",
"Pr(A)_try2", "Pr(B)_try2")
rownames(compare) <- c("interation1", "interation2",
"interation3", "interation4",
"interation5", "interation6",
"interation7", "interation8",
"interation9", "interation10")
> print(compare)
Pr(A)_try1 Pr(B)_try1 Pr(A)_try2 Pr(B)_try2
interation1 0.7800000 0.2200000 0.6200000 0.3800000
interation2 0.7560000 0.2440000 0.7240000 0.2760000
interation3 0.7512000 0.2488000 0.7448000 0.2552000
interation4 0.7502400 0.2497600 0.7489600 0.2510400
interation5 0.7500480 0.2499520 0.7497920 0.2502080
interation6 0.7500096 0.2499904 0.7499584 0.2500416
interation7 0.7500019 0.2499981 0.7499917 0.2500083
interation8 0.7500004 0.2499996 0.7499983 0.2500017
interation9 0.7500001 0.2499999 0.7499997 0.2500003
interation10 0.7500000 0.2500000 0.7499999 0.2500001
Notice that, even with n=10, the state is almost approaching toward
(0.75, 0.25), which is the steady state of the given transition matrix.
5. For the following transition matrix, which classes are closed?0.50 0.50 0.00 0.00 0.00
0.00 0.50 0.00 0.50 0.00
0.00 0.00 0.50 0.50 0.00
0.00 0.00 0.75 0.25 0.00
0.50 0.00 0.00 0.00 0.50
Basics of Markov Chain Monte Carlo 123
I Answer
(a) State 1 can go to 1 and 2 at the first move, go to 4 through 2, and
go to 3 through (2 and 4). So, state 1 cannot go to state 5.
(b) State 2 can go to 2 and 4 at the first move, and go to 3 through 4.
So, state 2 cannot go to state 1 and state 5.
(c) State 3 can go to 3 and 4 at the first move, and nowhere else. So,
state 3 cannot go to state 1, state 2, and state 5.
(d) State 4 can go to 3 and 4 at the first move, and nowhere else. So,
state 4 cannot go to state 1, state 2, and state 5.
(e) State 5 can go to 1 and 5 at the first move, go to 2 through 1, go
to 4 through (1 and 2), and go to 3 through (1, 2, and 4). So, state
5 can go everywhere.
Therefore, according to the definition of closedness (if P(A, B) = 0, then
A is closed with respect to B),
(a) state 1, 2, 3, and 4 are closed with respect to state 5;
(b) state 2, 3, and 4 are closed with respect to state 1; and
(c) state 3 and 4 are closed with respect to state 2.
7. From the following transition matrix, calculate the stationary distribu-
tion for proportions: 0.0 0.4 0.6
0.1 0.0 0.9
0.5 0.5 0.0
.What is the substantive conclusion from the zeros on the diagonal of
this matrix?
I Answer
[a b c
]0.0 0.4 0.6
0.1 0.0 0.9
0.5 0.5 0.0
=[a b c
]
124Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEYb+ 5c = 10a
4a+ 5c = 10b
6a+ 9b = 10c
⇒
a = 11
14b
c = 4835b
Since a + b + c = 1, a = 55221 , b = 70
221 , c = 96221 , which produces the
stationary distribution as:[55
221
70
221
96
221
]As, R result shows, the substantive conclusion from zeros on the diagonal
of the transition matrix is that it converges slowly.
< R.Code >
transition <- matrix(c(0, .1, .5, .4, 0, .5, .6, .9, 0), ncol=3)
markov <- function(n, A, B, C) init <- matrix(c(A, B, C), 1)
out <- NULL
for (i in 1:n) init <- init %*% transition
out <- rbind(out, init)
return(out)
try <- markov(50, .8, .1, .1)
> print(try)
[,1] [,2] [,3]
[1,] 0.0600000 0.3700000 0.5700000
[2,] 0.3220000 0.3090000 0.3690000
[3,] 0.2154000 0.3133000 0.4713000
:
:
[36,] 0.2488689 0.3167422 0.4343889
[37,] 0.2488687 0.3167420 0.4343893
[38,] 0.2488689 0.3167421 0.4343890
[39,] 0.2488687 0.3167421 0.4343892
[40,] 0.2488688 0.3167421 0.4343891
[41,] 0.2488688 0.3167421 0.4343892
[42,] 0.2488688 0.3167421 0.4343891
[43,] 0.2488688 0.3167421 0.4343892
Basics of Markov Chain Monte Carlo 125
[44,] 0.2488688 0.3167421 0.4343891
[45,] 0.2488688 0.3167421 0.4343891
[46,] 0.2488688 0.3167421 0.4343891
[47,] 0.2488688 0.3167421 0.4343891
[48,] 0.2488688 0.3167421 0.4343891
[49,] 0.2488688 0.3167421 0.4343891
[50,] 0.2488688 0.3167421 0.4343891
9. Develop a Bayesian specification using BUGS to model the following
counts of the number of social contacts for children in a daycare center
with the objective of estimating λi, the contact rate:
Person 1 2 3 4 5 6 7 8 9 10
Age (ai) 11 3 2 2 7 5 6 9 7 4
Social contacts (xi) 22 4 2 2 9 3 4 5 2 6
Use the following specification:
λi ∼ G(α, β)
α ∼ G(1, 1)
β ∼ G(0.1, 1)
to produce a posterior summary for λi.
I Answer
< R.Code >
require(runjags)
posterior.summary <- function(posterior) # this function is widely used throughout the text
output <- NULL
n <- ncol(posterior)
for (i in 1:n) summary <- c(mean(posterior[,i]),
sd(posterior[,i]),
quantile(posterior[,i], .025),
126Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
median(posterior[,i]),
quantile(posterior[,i], .975))
output <- rbind(output, summary)
rownames(output) <- colnames(posterior)
colnames(output) <- c("mean", "sd", "2.5%", "25%",
"50%", "75%", "97.5%")
return(round(output, 4))
x <- c(22, 4, 2, 2, 9, 3, 4, 5, 2, 6)
a <- c(11, 3, 2, 2, 7, 5, 6, 9, 7, 4)
n <- length(x)
data <- dump.format(list(x=x, a=a, n=n))
inits <- dump.format(list(alpha=1, beta=1,
lambda=rep(1,n)))
daycare.model <- "model for (i in 1:n) x[i] ~ dpois(theta[i])
theta[i] <- lambda[i]*a[i]
lambda[i] ~ dgamma(alpha, beta)
alpha ~ dgamma(1, 1)
beta ~ dgamma(0.1, 1)
"daycare <- run.jags(model=daycare.model,
data=data, inits=inits, monitor=c("alpha", "beta",
"lambda"), n.chain=1, burnin=1000, sample=5000,
plot=TRUE)
> posterior.summary(daycare$mcmc[[1]])
mean sd 2.5% 50% 97.5%
alpha 1.7584 0.7708 0.6501 1.6356 3.6209
beta 1.5432 0.7726 0.4134 1.4192 3.3878
lambda[1] 1.8980 0.3964 1.1860 1.8685 2.7581
lambda[2] 1.2866 0.5585 0.4359 1.2067 2.6165
lambda[3] 1.0768 0.5739 0.2556 0.9807 2.4177
lambda[4] 1.0708 0.5674 0.2575 0.9807 2.4223
lambda[5] 1.2588 0.3874 0.6394 1.2195 2.1506
lambda[6] 0.7312 0.3456 0.2160 0.6834 1.5303
Basics of Markov Chain Monte Carlo 127
lambda[7] 0.7550 0.3202 0.2590 0.7137 1.4928
lambda[8] 0.6361 0.2504 0.2478 0.6052 1.2199
lambda[9] 0.4364 0.2295 0.0988 0.3982 0.9694
lambda[10] 1.4091 0.5212 0.5835 1.3411 2.6206
11. The table below provides the results of nine postwar elections in Italy
by proportion per political party. The listed parties are: Democrazia
Cristiana (DC), Partito Comunista Italiano (PCI), Partito Socialista
Italiano (PSI), Partito Socialista Democratico Italiano (PSDI), Partito
Repubblicano Italiano (PRI), and Partito Liberale Italiano (PLI). The
“Others” category is a collapsing of smaller parties: Partito Radicale
(PR), Democrazia Proletaria (DP), Partito di Unita Proletaria per il
Comunismo (PdUP), Movimento Sociale Italiano (MSI), South Tyrol
Peoples Party (SVP), Sardinian Action Party (PSA), Valdotaine Union
(UV), the Monarchists (Mon), and the Socialist Party of Proletarian
Unity (PSIUP). In two cases parties presented joint election lists and
the returns are split across the two parties here. The compositional data
suggest a sense of stability for postwar Italian elections even though Italy
has averaged more than one government per year since 1945.
Party 1948 1953 1958 1963 1968 1972 1976 1979 1983
DC 0.485 0.401 0.424 0.383 0.3910 0.388 0.387 0.383 0.329
PCI 0.155 0.226 0.227 0.253 0.2690 0.272 0.344 0.304 0.299
PSI 0.155 0.128 0.142 0.138 0.0725 0.096 0.096 0.098 0.114
PSDI 0.071 0.045 0.045 0.061 0.0725 0.051 0.034 0.038 0.041
PRI 0.025 0.016 0.014 0.014 0.0200 0.029 0.031 0.030 0.051
PLI 0.038 0.030 0.035 0.070 0.0580 0.039 0.013 0.019 0.029
Others 0.071 0.154 0.113 0.081 0.1170 0.125 0.095 0.128 0.137
Source: Instituto Centrale di Statistica, Italia
Develop a multinomial-logistic model for pij as the proportion received
by party i and election j with BUGS , using time as a explanatory variable
128Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
and specifying uninformative priors.
I Answer
The model is specified as:
pj ∼MN (1, pr1j , pr2j , · · · , pr7j)
prij =eα+βj∑9j=1 e
α+βj
α ∼ N (0, 100)
β ∼ N (0, 100)
< R.Code >
require(runjags)
p <- matrix(c(.485, .155, .155, .071, .025, .038, .071,
.401, .226, .128, .045, .016, .030, .154,
.424, .227, .142, .045, .014, .035, .113,
.383, .253, .138, .061, .014, .070, .081,
.391, .269, .0725, .0725, .02, .058, .117,
.388, .272, .096, .051, .029, .039, .125,
.387, .344, .096, .034, .031, .013, .095,
.383, .304, .098, .038, .030, .019, .128,
.329, .299, .114, .041, .051, .029, .137),
nrow=7)
m <- nrow(p); t <- ncol(p)
italy.data <- dump.format(list(p=p, m=m, t=t))
italy.inits <- dump.format(list(alpha=1, beta=1))
italy.model <- "model for (j in 1:t) p[1:m, j] ~ dmulti(pr[1:m, j], 1)
for (i in 1:m) pr[i,j] <- mu[i,j] / sum(mu[,j])
log(mu[i,j]) <- alpha + beta * j
alpha ~ dnorm(0, 0.01)
beta ~ dnorm(0, 0.01)
"italy <- run.jags(model=italy.model, data=italy.data,
Basics of Markov Chain Monte Carlo 129
inits=italy.inits, n.chain=1,
monitor=c("alpha", "beta"),
burnin=1000, sample=5000, plot=TRUE)
# the function posterior.summary was defined in Q9.5.
> posterior.summary(italy$mcmc[[1]])
mean sd 2.5% 50% 97.5%
alpha 0.0375 10.0505 -19.1669 -0.1279 20.5815
beta 0.4247 10.0329 -19.4075 0.2927 19.8428
13. Blom, Holst, and Sandell (1994) define a “homesick” Markov chain as
one where the probability of returning to the starting state after 2m
(m > 1) iterations is at least as large as moving to any other state:
p2m(x0, x¬0) ≤ p2m(x0, x0). Does the Markov chain defined by the
following transition matrix have homesickness?
P =
0.5 0.5 0.0 0.0
0.5 0.0 0.5 0.0
0.0 0.5 0.0 0.5
0.0 0.0 0.5 0.5
Do Markov chains becomes less homesick over time?
I Answer
Since the stationary state is [0.25, 0.25, 0.25, 0.25], the Markov chain can-
not overcome “homesickness.” As shown below, this can be confirmed
in R , too.
< R.Code >
P <- matrix(c(.5, .5, 0, 0, .5, 0, .5, 0,
0, .5, 0, .5, 0, 0, .5, .5),
nrow=4)
out <- list(state1=c(1,0,0,0), state2=c(0,1,0,0),
state3=c(0,0,1,0), state4=c(0,0,0,1))
for (m in 1:20) state <- list(state1=c(1,0,0,0), state2=c(0,1,0,0),
state3=c(0,0,1,0), state4=c(0,0,0,1))
130Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
for (i in 1:4) for (j in 1:m) state[[i]] <- state[[i]] %*% P %*% P
out[[i]] <- rbind(out[[i]], state[[i]])
> print(out)
$state1
[,1] [,2] [,3] [,4]
[1,] 1.0000000 0.00 0.00 0.0000000
[2,] 0.5000000 0.25 0.25 0.0000000
[3,] 0.3750000 0.25 0.25 0.1250000
[4,] 0.3125000 0.25 0.25 0.1875000
[5,] 0.2812500 0.25 0.25 0.2187500
[6,] 0.2656250 0.25 0.25 0.2343750
[7,] 0.2578125 0.25 0.25 0.2421875
[8,] 0.2539062 0.25 0.25 0.2460938
[9,] 0.2519531 0.25 0.25 0.2480469
[10,] 0.2509766 0.25 0.25 0.2490234
[11,] 0.2504883 0.25 0.25 0.2495117
[12,] 0.2502441 0.25 0.25 0.2497559
[13,] 0.2501221 0.25 0.25 0.2498779
[14,] 0.2500610 0.25 0.25 0.2499390
[15,] 0.2500305 0.25 0.25 0.2499695
[16,] 0.2500153 0.25 0.25 0.2499847
[17,] 0.2500076 0.25 0.25 0.2499924
[18,] 0.2500038 0.25 0.25 0.2499962
[19,] 0.2500019 0.25 0.25 0.2499981
[20,] 0.2500010 0.25 0.25 0.2499990
[21,] 0.2500005 0.25 0.25 0.2499995
$state2
[,1] [,2] [,3] [,4]
[1,] 0.00 1.0000000 0.0000000 0.00
[2,] 0.25 0.5000000 0.0000000 0.25
[3,] 0.25 0.3750000 0.1250000 0.25
[4,] 0.25 0.3125000 0.1875000 0.25
[5,] 0.25 0.2812500 0.2187500 0.25
Basics of Markov Chain Monte Carlo 131
[6,] 0.25 0.2656250 0.2343750 0.25
[7,] 0.25 0.2578125 0.2421875 0.25
[8,] 0.25 0.2539062 0.2460938 0.25
[9,] 0.25 0.2519531 0.2480469 0.25
[10,] 0.25 0.2509766 0.2490234 0.25
[11,] 0.25 0.2504883 0.2495117 0.25
[12,] 0.25 0.2502441 0.2497559 0.25
[13,] 0.25 0.2501221 0.2498779 0.25
[14,] 0.25 0.2500610 0.2499390 0.25
[15,] 0.25 0.2500305 0.2499695 0.25
[16,] 0.25 0.2500153 0.2499847 0.25
[17,] 0.25 0.2500076 0.2499924 0.25
[18,] 0.25 0.2500038 0.2499962 0.25
[19,] 0.25 0.2500019 0.2499981 0.25
[20,] 0.25 0.2500010 0.2499990 0.25
[21,] 0.25 0.2500005 0.2499995 0.25
$state3
[,1] [,2] [,3] [,4]
[1,] 0.00 0.0000000 1.0000000 0.00
[2,] 0.25 0.0000000 0.5000000 0.25
[3,] 0.25 0.1250000 0.3750000 0.25
[4,] 0.25 0.1875000 0.3125000 0.25
[5,] 0.25 0.2187500 0.2812500 0.25
[6,] 0.25 0.2343750 0.2656250 0.25
[7,] 0.25 0.2421875 0.2578125 0.25
[8,] 0.25 0.2460938 0.2539062 0.25
[9,] 0.25 0.2480469 0.2519531 0.25
[10,] 0.25 0.2490234 0.2509766 0.25
[11,] 0.25 0.2495117 0.2504883 0.25
[12,] 0.25 0.2497559 0.2502441 0.25
[13,] 0.25 0.2498779 0.2501221 0.25
[14,] 0.25 0.2499390 0.2500610 0.25
[15,] 0.25 0.2499695 0.2500305 0.25
[16,] 0.25 0.2499847 0.2500153 0.25
[17,] 0.25 0.2499924 0.2500076 0.25
[18,] 0.25 0.2499962 0.2500038 0.25
[19,] 0.25 0.2499981 0.2500019 0.25
[20,] 0.25 0.2499990 0.2500010 0.25
[21,] 0.25 0.2499995 0.2500005 0.25
132Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
$state4
[,1] [,2] [,3] [,4]
[1,] 0.0000000 0.00 0.00 1.0000000
[2,] 0.0000000 0.25 0.25 0.5000000
[3,] 0.1250000 0.25 0.25 0.3750000
[4,] 0.1875000 0.25 0.25 0.3125000
[5,] 0.2187500 0.25 0.25 0.2812500
[6,] 0.2343750 0.25 0.25 0.2656250
[7,] 0.2421875 0.25 0.25 0.2578125
[8,] 0.2460938 0.25 0.25 0.2539062
[9,] 0.2480469 0.25 0.25 0.2519531
[10,] 0.2490234 0.25 0.25 0.2509766
[11,] 0.2495117 0.25 0.25 0.2504883
[12,] 0.2497559 0.25 0.25 0.2502441
[13,] 0.2498779 0.25 0.25 0.2501221
[14,] 0.2499390 0.25 0.25 0.2500610
[15,] 0.2499695 0.25 0.25 0.2500305
[16,] 0.2499847 0.25 0.25 0.2500153
[17,] 0.2499924 0.25 0.25 0.2500076
[18,] 0.2499962 0.25 0.25 0.2500038
[19,] 0.2499981 0.25 0.25 0.2500019
[20,] 0.2499990 0.25 0.25 0.2500010
[21,] 0.2499995 0.25 0.25 0.2500005
15. Koppel (1999) studies political control in hybrid organizations (semi-
governmental) through an analysis of government purchase of a spe-
cific type of venture capital funds: investment funds sponsored by the
Overseas Private Investment Corporation (OPIC). The following data
provide three variables as of January 1999.
Develop a model using BUGS where the size of the fund is modeled by the
age of the fund and its investment status according to the specification:
Yi ∼ N (mi, τ)
mi = β0 + β1X1i + β2X2i + εi
ε ∼ N (0, k),
where: Y is the size of the fund, X1 is the age of the fund and X2 is a
dichotomous explanatory variable equal to one if the fund is investing
Basics of Markov Chain Monte Carlo 133
and zero otherwise. Set a value for the constant k and an appropriate
prior distribution for τ . Summarize the posterior distributions of the
unknown parameters of interest.
Fund Age Status Size ($M)
AIG Brunswick Millennium 3 Investing 300
Aqua International Partners 2 Investing 300
Newbridge Andean Capital Partners 4 Investing 250
PBO Property 1 Investing 240
First NIS Regional 5 Investing 200
South America Private Equity Growth 4 Investing 180
Russia Partners 5 Investing 155
South Asia Capital 3 Investing 150
Modern Africa Growth and Investment 2 Investing 150
India Private Equity 4 Investing 140
New Africa Opportunity 3 Investing 120
Global Environmental Emerging II 2 Investing 120
Bancroft Eastern Europe 3 Investing 100
Agribusiness Partners International 4 Investing 95
Caucus 1 Raising 92
Asia Pacific Growth 7 Divesting 75
Global Environmental Emerging I 5 Invested 70
Poland Partners 5 Invested 64
Emerging Europe 3 Investing 60
West Bank/Gaza and Jordan 2 Raising 60
Draper International India 3 Investing 55
EnterArab Investment 3 Investing 45
Israel Growth 5 Investing 40
Africa Growth 8 Divesting 25
Allied Capital Small Business 4 Divesting 20
I Answer
I set the constant k = 1, and assign the gamma distribution, G(.01.01)
to the prior for τ .
< R.Code >
require(runjags)
size <- c(300, 300, 250, 240, 200, 180, 155, 150,
150, 140, 120, 120, 100, 95, 92, 75,
70, 64, 60, 60, 55, 45, 40, 25, 20)
age <- c(3, 2, 4, 1, 5, 4, 5, 3, 2, 4, 3, 2, 3,
134Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
4, 1, 7, 5, 5, 3, 2, 3, 3, 5, 8, 4)
status <- c(rep(1, 14), 0,0,0,0,1,0,1,1,1,0,0)
n <- length(size)
e <- rnorm(n, 0, 1)
opic.data <- dump.format(list(size=size, age=age,
status=status,
n=n, e=e))
opic.inits <- dump.format(list(beta0=1, beta1=1,
beta2=1, pres=1))
opic.model <- "model for (i in 1:n) size[i] ~ dnorm(mu[i], pres)
mu[i] <- beta0+beta1*age[i]+beta2*status[i]+e[i]
beta0 ~ dnorm(0, .01)
beta1 ~ dnorm(0, .01)
beta2 ~ dnorm(0, .01)
pres ~ dgamma(.01, .01)
"opic <- run.jags(model=opic.model, data=opic.data,
inits=opic.inits, n.chain=1,
monitor=c("beta0", "beta1",
"beta2", "pres"),
burnin=1000, sample=5000, plot=TRUE)
# the function posterior.summary was defined in Q9.5.
> posterior.summary(opic$mcmc[[1]])
mean sd 2.5% 50% 97.5%
beta0 10.0548 9.8456 -9.4405 9.9184 29.2703
beta1 16.8120 5.1294 6.5871 16.9374 26.6733
beta2 12.4065 9.8219 -6.7362 12.4348 31.2947
pres 0.0001 0.0000 0.0000 0.0001 0.0002
17. (Norris 1997) A Markov chain is reversible if the distribution of θn|θn+1 =
t is the same as the θn|θn−1 = t. This means that direction of time does
not alter the properties of the chain. Show that the following irreducible
Basics of Markov Chain Monte Carlo 135
matrix does not define a reversible Markov chain. 0 p 1− p1− p 0 p
p 1− p 0
.See Besag et al. (1995) for details on reversibility.
I Answer
According to the definition of “reversibility,” there exist π = (π1, π2, π3)
s.t. πipij = πjpji,∀i, j.π1p12 = π2p21
π1p13 = π3p31
π2p23 = π3p32
π1 + π2 + π3 = 1
⇒
π1p = π2(1− p)π1(1− p) = π3p
π2p = π3(1− p)π1 + π2 + π3 = 1
∴ π1 = π2 = π3 =1
3, p =
1
2
Therefore, the transition matrix that is provided in the question defines
a reversible Markov chain ONLY when p = 12 , and therefore not in
general.
19. (Grimmett and Stirzaker 1992) A random walk is recurrent if the mean
size of the jumps is zero. Define a random walk on the integers by the
transition from integer i to either integer i+2 or i−1 with probabilities:
p(i, i+ 2) = p, p(i, i− 1) = 1− p.
A random walk is recurrent if the mean recurrence time,∑nfii(n), is
finite, otherwise it is transient. What values of p make this random walk
recurrent?
I Answer
Let fii(n) be the probability that the state i is first visited at time n,
given X0 = i.
The state i can be first visited only after the third transition, meaning
fii(1) = fii(2) = 0. With the same logic, fii(4) = fii(5) = 0, and fii(n)
for all n that is not divisible by 3.
136Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
At the third transition, the state i is first reached through three ways:
1) (i) → (i − 1) → (i − 2) → (i); 2) (i) → (i − 1) → (i + 1) → (i); and
3) (i)→ (i+ 2)→ (i+ 1)→ (i). This gives us:
fii(3) = p(1− p)2 + p(1− p)2 + p(1− p)2 = 3p(1− p)2.
So, we can eliminate the three cases that we counted. With the same
procedure, we know that the state i is first visited through (i−3) at the
third transition AND through (i+ 3) at the third transition. This gives
us:
fii(6) = 3p2(1− p)4 + 3p2(1− p)4 = 6p2(1− p)4.
With the same logic, we can proceed more to get fii(9), fii(12), · · · .
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.5
1.0
1.5
2.0
2.5
value of p
mea
n re
curr
ence
tim
e
FIGURE 10.1: Mean recurrence time with respect to p
Basics of Markov Chain Monte Carlo 137
Therefore, the mean recurrence time is:∑nfii(n) = 3× 3p(1− p)2 + 6× 6p2(1− p)4 + · · ·
= 32(12x+ 22x2 + 32x3 + 42x4 + · · ·
)=
32x(1 + x)
(1− x)3
where x = p(1− p)2.
We can make a graph for the mean recurrence time with respect to the
value of p. As we can observe in Figure 9.1, the mean recurrence time
is always finite within the region of 0 ≤ p ≤ 1. Therefore, the random
walk specified in the question is recurrent for all 0 ≤ p ≤ 1.
11
Implementing Bayesian Models with Markov
Chain Monte Carlo
Exercises
To be provided.
139
12
Bayesian Hierarchical Models
Exercises
1. Show that inserting (10.2) into (10.1) produces (10.3).
I Answer
yij = βj0 + βj1Xij + εij
= (γ00 + γ10Zj0 + uj0) + (γ01 + γ11Zj1 + uj1)Xij + εij
= γ00 + γ01Xij + γ10Zj0 + γ11Zj1Xij + uj1Xij + uj0 + εij ,
which is the equation (10.3).
3. For the following simple hierarchical form,
Xi|θ ∼ N (θ, 1)
θ|σ2 ∼ N (0, σ2)
σ2 ∼ 1/k, 0 < σ ≤ k,
express the full joint distribution p(θ, σ,X).
I Answer
p(θ, σ,X) ∝ p(X|θ)p(θ|σ2)p(σ2)
= N (θ, 1) · N (0, σ2) · 1
k
141
142Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
=
(1√2π
)nexp
[n∑i=1
−1
2(Xi − θ)2
]1√
2πσ2exp
[− 1
2σ2θ2
]1
k
∝ 1
σexp
[−1
2
n∑i=1
(Xi − θ)2 − 1
2σ2θ2
]
=1
σexp
[−1
2
n∑i=1
(Xi − θ)2+
(θ
σ
)2]
5. Using the data below on firearm-related deaths per 100,000 in the United
States from 1980 to 1998 (Source: National Center for Health Statistics.
Health, United States, 1999, With Health and Aging Chartbook, Hy-
attsville, Maryland: 1999.), test the hypothesis that there is a period of
higher rates by specifying a normal mixture model and comparing the
posterior distributions of λ1 and λ2 in the model:
xi ∼ N (λki |τ)
ki ∼ Categorical2(K)
τ ∼ G(εg1 , εg2)
λi ∼ N (0, εli)
K ∼ D(1, 1),
where K is the proportion of the observations distributed N (λk2 |τ) and
1−K is the proportion of the observations distributed N (λk1 |τ). Spec-
ify a BUGS model choosing values of the ε.. parameters that specify
somewhat diffuse hyperpriors.
14.9 14.8 14.3 13.3 13.3 13.3 13.9 13.6 13.9
14.1 14.9 15.2 14.8 15.4 14.8 13.7 12.8 12.1
I Answer
< R.Code >
require(runjags)
x <- c(14.9, 14.8, 14.3, 13.3, 13.3, 13.3,
13.9, 13.6, 13.9, 14.1, 14.9, 15.2,
Bayesian Hierarchical Models 143
14.8, 15.4, 14.8, 13.7, 12.8, 12.1)
n <- length(x)
d <- c(1, 1)
firearm.data <- dump.format(list(x=x, n=n, d=d))
firearm.inits <- dump.format(list(tau=1,
lambda=c(1,1),
K=c(.5,.5)))
firearm.model <- "model for (i in 1:n) x[i] ~ dnorm(mu[i], tau)
mu[i] <- lambda[k[i]]
k[i] ~ dcat(K[])
tau ~ dgamma(.01, .01)
lambda[1] ~ dnorm(0, .01)
lambda[2] ~ dnorm(0, .01)
K[1:2] ~ ddirch(d[])
"firearm <- run.jags(model=firearm.model,
data=firearm.data,
inits=firearm.inits, n.chain=1,
monitor=c("tau", "lambda", "K"),
burnin=1000, sample=5000, plot=TRUE)
# the function posterior.summary was defined in Q9.5.
> posterior.summary(firearm$mcmc[[1]])
mean sd 2.5% 50% 97.5%
tau 1.4672 0.8043 0.5649 1.2882 3.6936
lambda[1] 10.5694 7.7897 -12.7068 13.9496 15.2927
lambda[2] 8.5250 9.2400 -14.9768 13.6290 15.5635
K[1] 0.5768 0.3896 0.0063 0.7252 0.9965
K[2] 0.4232 0.3896 0.0035 0.2748 0.9937
Even though there is no strictly strong evidence, the hypothesis of a
higher rate period could be reasonable: λ1 and λ2 are a little different
with fairly equal proportion of K[1] and K[2].
144Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
7. (Carlin and Louis 2001). Given the hierarchical model:
Yi|λi ∼ P(λiti)
λi ∼ G(α, β)
β ∼ IG(A,B), α known
where the hyperpriors A and B are fixed. Express the full conditional
distributions for all of the unknown parameters.
I Answer
p(Yi, λi, β) ∝ p(Yi|λi)p(λi|β)p(β)
=
n∏i=1
(λiti)yi
yi!exp[−λiti]×
βα
Γ(α)λα−1i exp[−λiti]
× BA
Γ(A)β−(A+1) exp[−B/β]
∴ π(λi|β, Yi) ∝ λ∑yi+α−1
i exp [−(ti + β)λi]
= G(λi|∑
yi + α, ti + beta)
π(β|λi, Yi) ∝ βα−A−1 exp [−λiβ −B/β]
9. Hobert and Casella (1998) specify the following one-way random effects
model:
yij ∼ N (β + ui, σ2ε )
ui ∼ N (0, σ2)
p(β) ∝ 1
p(σ2ε ) ∼ 1/σ2
ε ), 0 < σ2ε <∞
p(σ2) ∼ 1/σ2), 0 < σ2 <∞,
where i = 1, . . . ,K and j = 1, . . . , J . Using the R function rnorm() gen-
erate contrived data for yij with K = 50, J = 5, and estimate the model
in BUGS . When improper priors are specified in hierarchical models, it
is important to make sure that the resulting posterior is not itself then
Bayesian Hierarchical Models 145
improper. Non-detection of posterior impropriety from looking at chain
values was a problem in the early MCMC literature. This model gives
improper posteriors although it is not possible to observe this problem
without running the chain for quite some time. Compare the standard
diagnostics for a short run with a very long run. What do you see? Now
specify proper priors and contrast the results.
I Answer
< R.Code >
# generating data
u <- matrix(rep(NA, 250), nrow=50)
for (i in 1:50) u[i,1:5] <- rnorm(5, 0, 1)
b <- matrix(rnorm(250, 3, 1), nrow=50)
e <- matrix(rnorm(250, 0, 1), nrow=50)
y <- u + b + e
# modeling
require(runjags)
K <- nrow(y); J <- ncol(y)
casella.data <- dump.format(list(y=y, K=K, J=J))
casella.inits <- dump.format(list(beta=1, sige=1,
sigma=1))
casella.improper.model <- "model for (i in 1:K) for (j in 1:J) y[i,j] ~ dnorm(mu[i,j], sige)
mu[i,j] <- beta + u[i]
u[i] ~ dnorm(0, sigma)
beta ~ dunif(-10000, 10000)
sige ~ dunif(-10000, 10000)
sigma ~ dunif(-10000, 10000)
"casella.improper1 <- run.jags(model=casella.improper.model,
data=casella.data,
146Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
inits=casella.inits,
n.chain=1, burnin=1000,
sample=5000, plot=TRUE,
monitor=c("beta", "sige",
"sigma"))
casella.improper2 <- run.jags(model=casella.improper.model,
data=casella.data,
inits=casella.inits,
n.chain=1, burnin=10000,
sample=20000, plot=TRUE,
monitor=c("beta", "sige",
"sigma"))
casella.proper.model <- "model for (i in 1:K) for (j in 1:J) y[i,j] ~ dnorm(mu[i,j], sige)
mu[i,j] <- beta + u[i]
u[i] ~ dnorm(0, sigma)
beta ~ dnorm(2, 1)
sige ~ dgamma(.5, 1)
sigma ~ dgamma(.5, 1)
"casella.proper <- run.jags(model=casella.proper.model,
data=casella.data,
inits=casella.inits,
n.chain=1, burnin=1000,
sample=5000, plot=TRUE,
monitor=c("beta", "sige",
"sigma"))
par(mfrow=c(3,3))
traceplot(casella.improper1$mcmc[[1]][,1],
main="beta (short run)",
ylim=c(2.2, 3.5))
traceplot(casella.improper2$mcmc[[1]][,1],
main="beta (long run)",
ylim=c(2.2, 3.5))
traceplot(casella.proper$mcmc[[1]][,1],
main="beta (proper prior, short)",
Bayesian Hierarchical Models 147
ylim=c(2.2, 3.5))
traceplot(casella.improper1$mcmc[[1]][,2],
main="sigma_e (short run)",
ylim=c(.2, .5))
traceplot(casella.improper2$mcmc[[1]][,2],
main="sigma_e (long run)",
ylim=c(.2, .5))
traceplot(casella.proper$mcmc[[1]][,2],
main="sigma_e (proper prior, short run)",
ylim=c(.2, .5))
traceplot(casella.improper1$mcmc[[1]][,3],
main="sigma (short run)",
ylim=c(0, 10000))
traceplot(casella.improper2$mcmc[[1]][,3],
main="sigma (long run)",
ylim=c(0, 10000))
traceplot(casella.proper$mcmc[[1]][,3],
main="sigma (proper prior, short run)",
ylim=c(0, 15))
# the function posterior.summary was defined in Q9.5.
> posterior.summary(casella.improper1$mcmc[[1]])
mean sd 2.5% 50% 97.5%
beta 2.8828 0.109 2.6783 2.8826 3.0994
sige 0.3268 0.030 0.2698 0.3262 0.3864
sigma 5011.0088 2910.723 386.2101 5284.9250 9677.6095
> posterior.summary(casella.improper2$mcmc[[1]])
mean sd 2.5% 50% 97.5%
beta 2.8830 0.1105 2.6590 2.8845 3.0979
sige 0.3269 0.0296 0.2717 0.3259 0.3879
sigma 4850.2739 3007.6788 59.9005 4831.7800 9742.1902
> posterior.summary(casella.proper$mcmc[[1]])
mean sd 2.5% 50% 97.5%
beta 2.8689 0.1337 2.6106 2.8684 3.1394
sige 0.3493 0.0329 0.2873 0.3478 0.4166
sigma 3.0508 1.2014 1.3813 2.8223 5.9766
As shown in Figure 10.1, improper priors cause bad convergences, which
requires a very long iterations. However, proper priors produce relatively
148Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
FIGURE 12.1: Traceplots for One-Way Random Effects Model
well converged traceplots even with a fairly small number of iterations.
11. To introduce the “near-ignorance prior” Sanso and Pericchi (1994) spec-
ify the hierarchical model starting with a multivariate normal specifica-
tion:
Y|θ ∼MVN (Xθ, σ2I)
θi|λi ∼ N (µi, λ2i )
λ2i ∼ EX (1/τ2
i )
Bayesian Hierarchical Models 149
where i = 1, . . . , p and µi, τi, σ2 are known. Show that by substitution
and integration over λ that this model is equivalent to the hierarchical
specification:
Y|θ ∼MVN (Xθ, σ2I)
θi ∼ DE(µi, τi).
I Answer
f(θi) =
∫ ∞0
f(θi|λ2i )f(λ2
i )dλ2i
=
∫ ∞0
(2πλ2
i
)− 12 exp
[− 1
2λ2i
(θi − µi)2
]1
τ2i
exp
[−λ2
i
(1
τ2i
)]dλ2
i
= (2π)1
τ2i
∫ ∞0
(λ2i )− 1
2︸ ︷︷ ︸=xν−1
exp
− 1
2λ2i
(θi − µi)2︸ ︷︷ ︸=−β/x
−λ2i
(1
τ2i
)︸ ︷︷ ︸
=γx
dλ2i ,
where ν = 12 ,β = 1
2 (θi − µi)2, and γ = 1/τ2i .
Then, we have
f(θi) =
∫ ∞0
xν−1 exp
[−β
x− γx
]dx = 2
(β
γ
) ν2
Kν(2√βγ),
where Kν(z) is a modified Bessel function of imaginary argument:
Kν(z) = −πi2
exp[−π
2νi]
H(2)ν
(ze−
12πi)
H(2)ν (z) = ιν(z)− iΥν(z)
ιν(z) =zν
2ν
∞∑k=0
(−1)kz2k
22kk!Γ(ν + k + 1)
Υν(z) =1
sin(νπ)[cos(νπ)ιν(z)− ι−ν(z)]
Since z = 2√βγ = 2
√1
2τ2i
(θi − µi)2 = 2√2τi| θi − µi | and ν = 1
2 , we
150Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
have:
ι 12(z) =
(1√2τi| θi − µi |
) 12∞∑k=0
(−1)k(
1√2τi| θi − µi |
)2k
k!Γ(k + 1.5)
ι− 12(z) =
(1√2τi| θi − µi |
)− 12∞∑k=0
(−1)k(
1√2τi| θi − µi |
)2k
k!Γ(k − 0.5)
Υ 12(z) =
1
sin(π2 )
[cos(
π
2)ι 1
2(z)− ι− 1
2(z)]
Plugging these in gives us
f(θi) ∝1
τiexp
[−| θi − µi |
τi
],
which is a canonical form for DE(µi, τi).
13. Clyde, Muller, and Parmigiani (1995) specify the following hierarchical
model for 10 Bernoulli outcomes:
yi|βi, γi ∼ BR(p(β, λ,x))
p(yi = 1|βi, λi,x) = (1 + exp[−βi(x− λi)− log(0.95/0.05)])−1[
log βi
log λi
]∼MVN (µ,Σ)
µ ∼MVN
[ 0
2
],
[25 5
5 25
]−1
Σ ∼ W
10, 10×
[0.44 −0.12
−0.12 0.14
]−1 .
Using BUGS, obtain the posterior distribution for log β and log λ using the
data: y = [1, 0, 1, 1, 1, 1, 0, 1, 1, 0] with x generated in R from N (1, 2).
I Answer
< R.Code >
require(runjags)
Bayesian Hierarchical Models 151
y <- c(1, 0, 1, 1, 1, 1, 0, 1, 1, 0)
x <- rnorm(10, 1, 2)
n <- length(y)
mumu <- c(0, 2)
musig <- matrix(c(25, 5, 5, 25), 2)
sigsig <- matrix(c(0.44, -0.12, -0.12, 0.14), 2)
theta.in <- cbind(rep(1,10), rep(1, 10))
clyde.data <- dump.format(list(y=y, x=x, mumu=mumu, n=n,
musig=musig,
sigsig=sigsig))
clyde.inits <- dump.format(list(theta=theta.in))
clyde.model <- "model for (i in 1:n) y[i] ~ dbern(p[i])
p[i] <- 1/(1 + g[i])
log(g[i]) <- - beta[i] * (x[i] - lambda[i])
- log(0.95/0.05)
log(beta[i]) <- theta[i,1]
log(lambda[i]) <- theta[i,2]
theta[i,1:2] ~ dmnorm(mu[1:2], S[1:2,1:2])
mu[1:2] ~ dmnorm(mumu[1:2], musig[1:2,1:2])
S[1:2,1:2] ~ dwish(sigsig[1:2,1:2], 10)
"clyde <- run.jags(model=clyde.model, data=clyde.data,
inits=clyde.inits, n.chain=1,
burnin=1000, sample=5000, plot=TRUE,
monitor=c("theta"))
colnames(clyde$mcmc[[1]]) <- c("beta[1]", "beta[2]",
"beta[3]", "beta[4]",
"beta[5]", "beta[6]",
"beta[7]", "beta[8]",
"beta[9]", "beta[10]",
"lambda[1]", "lambda[2]",
"lambda[3]", "lambda[4]",
"lambda[5]", "lambda[6]",
"lambda[7]", "lambda[8]",
"lambda[9]", "lambda[10]")
# the function posterior.summary was defined in Q9.5.
152Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
> posterior.summary(clyde$mcmc[[1]])
mean sd 2.5% 50% 97.5%
beta[1] -0.6457 0.2816 -1.2149 -0.6394 -0.1038
beta[2] -0.3872 0.3527 -1.0017 -0.4259 0.3786
beta[3] -0.7028 0.2982 -1.2711 -0.7021 -0.1088
beta[4] -0.5088 0.3380 -1.1601 -0.5026 0.1884
beta[5] -0.7409 0.3038 -1.3286 -0.7283 -0.1150
beta[6] -0.5157 0.3470 -1.1704 -0.5212 0.2559
beta[7] -0.4188 0.3648 -1.0883 -0.4360 0.3569
beta[8] -0.6620 0.3076 -1.2663 -0.6573 -0.0916
beta[9] -0.6162 0.3132 -1.3290 -0.6057 -0.0341
beta[10] -0.4134 0.3546 -1.0529 -0.4510 0.3488
lambda[1] 1.8409 0.2021 1.4365 1.8437 2.1969
lambda[2] 1.8685 0.2009 1.4436 1.8926 2.2618
lambda[3] 1.8479 0.2104 1.4033 1.8578 2.2298
lambda[4] 1.8396 0.2090 1.4415 1.8510 2.2186
lambda[5] 1.8548 0.2117 1.4356 1.8546 2.2480
lambda[6] 1.8393 0.2090 1.3686 1.8545 2.2326
lambda[7] 1.8717 0.2021 1.4624 1.8821 2.2402
lambda[8] 1.8404 0.2103 1.4083 1.8441 2.2452
lambda[9] 1.8412 0.2154 1.4007 1.8515 2.2478
lambda[10] 1.8920 0.1991 1.4910 1.9040 2.2714
15. Scollnik (2001) considers the following actuarial claims data for three
groups of insurance policyholders, with missing values.
Group 1 Group 2 Group 3
Year Payroll Claims Payroll Claims Payroll Claims
1 280 9 260 6 NA NA
2 320 7 275 4 145 8
3 265 6 240 2 120 3
4 340 13 265 8 105 4
5 NA NA 285 NA 115 NA
Bayesian Hierarchical Models 153
Replicate Scollnik’s hierarchical Poisson model using BUGS :
Yij ∼ P(λij)
λij = Pijθj
θj ∼ G(α, β) Pij ∼ G(γ, δ)
α ∼ G(5, 5) γ ∼ U(0, 100)
β ∼ G(25, 1) δ ∼ U(0, 100),
where i = 1, 2, 3 and j = 1, . . . , 5.
I Answer
< R.Code >
require(runjags)
y <- matrix(c(9, 6, NA, 7, 4, 8, 6, 2, 3,
13, 8, 4, NA, NA, NA), nrow=3)
K <- nrow(y); J <- ncol(y)
scollnik.data <- dump.format(list(y=y, K=K, J=J))
scollnik.inits <- dump.format(list(alpha=1, beta=1,
gamma=1, delta=1))
scollnik.model <- "model for (j in 1:J) for (i in 1:K) y[i,j] ~ dpois(lambda[i,j])
lambda[i,j] <- p[i,j] * theta[j]
p[i,j] ~ dgamma(gamma, delta)
theta[j] ~ dgamma(alpha, beta)
alpha ~ dgamma(5,5)
beta ~ dgamma(25,1)
gamma ~ dunif(0,100)
delta ~ dunif(0,100)
"scollnik <- run.jags(model=scollnik.model,
data=scollnik.data,
inits=scollnik.inits,
n.chain=1, burnin=1000,
154Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
sample=5000, plot=TRUE,
monitor=c("alpha", "beta",
"gamma", "delta"))
> posterior.summary(scollnik$mcmc[[1]])
mean sd 2.5% 50% 97.5%
alpha 1.3470 0.4674 0.5819 1.2878 2.4244
beta 24.7008 4.7755 16.3794 24.4065 34.8211
gamma 53.7372 24.5962 14.3150 50.2538 97.7965
delta 0.4334 0.2205 0.1232 0.3869 0.8882
17. Show that de Finetti’s property holds true for a mixture of normals.
Hint: see Freedman and Diaconis (1982).
I Answer
Let’s say
x1[1], x
1[2], · · · , x
1[n] ∼ N 1
x2[1], x
2[2], · · · , x
2[n] ∼ N 2
The exchangeability in N 1 and N 2 means that changing the order of
x1’s and x2’s do NOT matter.
Construct a mixture normal model by w1·N 1+w2·N 2 where w1+w2 = 1.
So, the draws from the mixture normal model are:
w1 · x1[1] + w2 · x2
[1],
w1 · x1[2] + w2 · x2
[2],
· · · ,w1 · x1
[n] + w2 · x2[n]
What we care about is the order of the above draws. And, since the
orders of x1’s and x2’s do NOT matter, the order of(w1 · x1 + w2 · x2
)’s
does NOT matter, either. Therefore, de Finetti’s property holds true
for a mixture of normals.
19. Plug (10.39) in to (10.36) to obtain (10.40), the empirical Bayes esti-
mates of the posterior mean. Using the same logic and (10.37), produce
Bayesian Hierarchical Models 155
the empirical Bayes estimate of the posterior variance, assuming that it
is now not know.
I Answer
µi =σ2
0m+ s2Xi
σ20 + s2
=σ2
0
σ20 + s2
m+s2
σ20 + s2
Xi
=σ2
0
σ20 + s2
m+
(1− σ2
0
σ20 + s2
)Xi
= E
[(p− 3)σ2
0∑(Xi − ¯X)2
]¯X +
(1− E
[(p− 3)σ2
0∑(Xi − ¯X)2
])Xi
∴ µEBi =(p− 3)σ2
0∑(Xi − ¯X)2
¯X +
(1− (p− 3)σ2
0∑(Xi − ¯X)2
)Xi,
which is (10.40).
From (10.39), we have:
s2 =σ2
0 − σ20E[
(p−3)σ20∑
(Xi− ¯X)2
]E[
(p−3)σ20∑
(Xi− ¯X)2
] ,
which gives us:
σ2µ = E
[(p− 3)σ2
0∑(Xi − ¯X)2
]×σ2
0 − σ20E[
(p−3)σ20∑
(Xi− ¯X)2
]E[
(p−3)σ20∑
(Xi− ¯X)2
]=
(1− E
[(p− 3)σ2
0∑(Xi − ¯X)2
])σ2
0
∴ σ2EB
µ =
(1− E
[(p− 3)σ2
0∑(Xi − ¯X)2
])σ2
0 ,
which is the empirical Bayes estimate of the posterior variance.
13
Some Markov Chain Monte Carlo Theory
Exercises
1. In Section , three conditions were given for F to be an associated field
of Ω. Show that the first condition could be replaced with ∅ ∈ F using
properties of one of the other two conditions. Similarly, prove that the
Kolmogorov axioms can be stated with respect the probability of the
null set or the probability of the complete set.
I Answer By assumption,
φ ∈ F
If A ∈ F → AC ∈ F
If A ∈ F ,∀i →∞⋃i=1
Ai ∈ F
Since φC = Ω, from the conditions (1) and (2), φ ∈ F → Ω ∈ F .
3. Suppose we have the probability space (Ω,F , P ), sometimes called a
triple, and A1, A2, . . . , Ak ∈ F . Prove the finite sub-additive property
that:
P
(k⋃i=1
Ai
)≤
k∑i=1
p(Ai)
(Boole’s Inequality).
I Answer Start with a basic principle:
p(Aj⋃A`) = p(Aj) + p(A`)− p(Aj
⋂A`).
157
158Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
Which is generalized as:
p(
k⋃i=1
Ai) =
k∑i=1
p(Ai)−k∑i 6=1
(A1
⋂Ai)−
k∑i 6=2
(A2
⋂Ai) . . .−
k∑i6=k
(Ak⋂Ai).
Since each of the subtracted terms must be non-negative, we know that
p(k⋃i=1
Ai) ≤k∑i=1
p(Ai).
5. Prove that uniform ergodicity gives a faster rate of convergence than
geometric ergodicity and that geometric ergodicity gives a faster rate of
convergence than ergodicity of degree 2.
I Answer Using notation from Tierny (1994). Consider two Markov
chains P1 and P2 at step n:
||Pn1 (x, ·)− π(·)||TV ≤M(x)rn1
||Pn2 (x, ·)− π(·)||TV ≤ K0rn2
where 0 < r1, r2 < 1, K0 is a positive constant, and || ||TV denotes the
total variation norm. Therefore the first chain is geometrically ergodic
and the second chain is uniformly ergodic. If M(x) is bounded then the
first chain is also uniformly ergodic since we can set the upper bound
as K.
Given a specific n, let Kn =rn2rn1
(K0), then there is a An ⊂ X with
π(An) > 0 such that for all x ∈ An:
||Pn1 (x, ·)− π(·)||TV ≥ Krn1
=
(K0
rn2rn1
)rn1 = K0r
n2
≥ ||Pn2 (x, ·)− π(·)||TV .
7. Construct a stochastic process without the Markovian property and con-
struct deliberately non-homogeneous Markov chain.
I Answer (A) Define a stochastic process, θ[t](t ≥ 0) on the prob-
ability space (Ω,F , P ) as discussed in Section 11.2 where the random
Some Markov Chain Monte Carlo Theory 159
variable θ[t] is conditional on the history of the process in the following
manner:
• E|θ[t]| < inf
• E(θ[t+1]|X1, X2, . . . , Xt) = θ[t]
• θ[t] = f(n(X1, X2, . . . , Xt)
where fn() often the arithmetic mean. This is called a Martingale.
(B) A deliberately nonhomogeneous Markov chain on I+, with k, c ∈ Iis defined by:
p(θ[t+1]) =∣∣∣f(θ[t] + t(modk) + c
∣∣∣ ,where f() is any positive bounded function. For example,
X <- 50; k <- 10
move <- function(X,t,k) abs( rpois(1,X) + t %% k - 5 )
for (i in 1:5000)
X <- move(X,i,k)
print(X)
9. Give an example of a maximally ψ-irreducible measure and one that it
dominates.
I Answer Define ψ as a postiive, σ-finite measure on (Ω,F) with
arbitrary A ∈ F 3 ψ(A) > 0, where for any other positive σ-finite
measure on (Ω,F) that ψ > ψ′ (maximally).
When F is discrete and A ⊂ F :
x −→ y ∀x ∈ F , y ∈ A,
and the irreducible measure is just the counting measure on A. Thus
for matrices:
x −→ y ∀x, y ∈ F .
Denote this as:
CardB(F) = Card(A ∩B), B ⊂ F ,
160Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
for some other subset B.
Now suppose that k is a CardB-irreducible transition matrix, with:
G = y ∈ F : x −→ y for somex ∈ B.
Then the counting measure on G, CardG, is maximally irriducible and
dominates CardF .
11. A single-dimension Markov chain, M(θt, t ≥ 0), with transition matrix
P and unique stationary distribution π(θ) is time-reversible iff:
π(θi)p(θi, θj) = π(θj)p(θj , θi).
for all θi and θj in Ω. Prove that the reversed chain, defined by M(θs, s ≤0) = M(θt,−t ≥ 0), has the identical transition matrix.
I Answer Notate Xn as the nth value from a Markov chain with
stationary distribution π(θ) and transition kernel p(θi, θj) = pij . Now
Yn = X−n is the nth value from a Markov chain with stationary distri-
bution also π(θj), but transition kernel p(θj , θi) = qji. Then:
qij = p(Yn+1 = j|Yn = i)
= p(X−n−1 = j|X−n = i) [set k = −n]
= p(Xk−1 = j|Xk = i)
=p(Xk−1 = j,Xk = i)
p(Xk = i)
=p(Xk−1 = j)
p(Xk = i)p((Xk = i|Xk−1 = ji).
Therefore π(θi)p(θi, θj) = π(θj)p(θj , θi) holds only when qij = pij ,
meaning that the transition kernels are the same but run in opposite
directions with regard to time.
13. Meyn and Tweedie (1993, 73) define an n-step taboo probability as:∑AP
n(x,B) = Px(Φn ∈ B, τA ≥ n), x ∈ X, A,B ∈ B(X),
meaning the probability that the Markov chain, Φ, transitions to set Bin n steps avoiding (not hitting) set A. Here X is a general state space
Some Markov Chain Monte Carlo Theory 161
with countably generated σ-field B(X), and τA is the return time to A(their notation differs slightly from that in this chapter). Show that:∑
APn(x,B) =
∫AcP (x, dy)AP
n−1(y,B),
where Ac denotes the complementary set to A.
I Answer For some y ∈ X as well:∑AP
n(x,B) = p(moves from x to y at step n without hitting A)
× p(reached step n− 1 without hitting A)
= p(Φn 6∈ A|Φn−1 = x)p(Φ1, . . . ,Φn−1 6∈ A)
=
∫AcP (x, dy)AP
n−1(y,B).
15. The Langevin algorithm is a Metropolis-Hastings variant where on each
step a small increment is added to the proposal point in the direction
of higher density. Show that making this this increment an increment
of the log gradient in the positive direction produces an ergodic Markov
chain by preserving the detailed balance equation.
I Answer Define the Langevin Diffusion (LD) for an n-dimensional
posterior distribution, smooth πn, having variance σ2, as the diffusion
process that satisfies the stochastic differential equation:
d∧t = σdBt +σ2
2∇ log [πn(∧t)] dt,
where Bt is n-dimensional standard Brownian motion, and ∇ denotes a
gradient df/dx. A discrete approximation is given by:
d∧t+1 = ∧t + σnZt +σ2n
2∇ log [πn(∧t)] dt,
where Ztiid∼ N(0, 1), t = 1, . . . , n, and σ2
n is a chosen step variance.
This can be transient without a Metropolis step (Roberts and Tweedie
1996), so for producing a candidate from the current position, Xt −→Yt+1, we calculate:
Yt+1 = Xt + σnZt + delformXt,
162Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
which is a random where σnZt+1 is random step andσ2n
2 ∇ log [πn(Xt)]
another random step but always towards higher density. Note that in
practice the choice of σ2n is critical to the performance of this chain.
We accept the candidate Yt+1 as move Xt+1 with probabiliity:
αn(Xt, Yt+1) =πn(Yt+1)qn(Yt+1, Xt)
πn(Yt+1)qn(Xt, Yt+1),
where:
qn(x, y) = (2πσ2n)−
n2 exp
[− 1
2σ2n
∥∥∥∥y − x− σ2n
2∇ log [πn(x)]
∥∥∥∥2
2
]≡
n∏i=1
q(xni , yi),
(‖ ‖ denotes the standard L2 norm for vectors), otherwise set Xt+1 =
Xt. We need to show detailed balance holds, π(x)p(x, y) = π(y)p(y, x),
where
p(x, y) = qn(x, y)min
[1,π(y)
π(x)
], x 6= y.
Substituting in the definitions. . .
π(x)(2πσ2n)−
n2 exp
[− 1
2σ2n
∥∥∥∥y − x− σ2n
2∇ log [πn(x)]
∥∥∥∥2
2
]min
[1,π(y)
π(x)
]?= π(y)(2πσ2
n)−n2 exp
[− 1
2σ2n
∥∥∥∥x− y − σ2n
2∇ log [πn(y)]
∥∥∥∥2
2
]min
[1,π(x)
π(y)
]
min [π(y), π(x)] exp
[− 1
2σ2n
∥∥∥∥y − x− σ2n
2∇ log [πn(x)]
∥∥∥∥2
2
]?= min [π(x), π(y)] exp
[− 1
2σ2n
∥∥∥∥x− y − σ2n
2∇ log [πn(y)]
∥∥∥∥2
2
]
exp
[− 1
2σ2n
∥∥∥∥y − x− σ2n
2∇ log [πn(x)]
∥∥∥∥2
2
]?= exp
[− 1
2σ2n
∥∥∥∥x− y − σ2n
2∇ log [πn(y)]
∥∥∥∥2
2
]
n∑i=1
[yi − xi
σ2n
2∇ log [πn(xi)]
]2?=
n∑i=1
[xi − yi
σ2n
2∇ log [πn(yi)]
]2
Some Markov Chain Monte Carlo Theory 163
n∑i=1
[y2i − yixi − yi
σ2n
2∇ log [πn(xi)]− xiyi + x2
i + xiσ2n
2∇ log [πn(xi)]
−σ2n
2∇ log [πn(xi)] yi +
σ2n
2∇ log [πn(xi)]xi +
(σ2n
2∇ log [πn(xi)]
)2]
?=
n∑i=1
[x2i − xiyi − xi
σ2n
2∇ log [πn(yi)]− yixi + y2
i + yiσ2n
2∇ log [πn(yi)]
−σ2n
2∇ log [πn(yi)]xi +
σ2n
2∇ log [πn(yi)] yi +
(σ2n
2∇ log [πn(yi)]
)2]
n∑i=1
[−2y
σ2n
2∇ log [πn(xi)] + 2xi
σ2n
2∇ log [πn(xi)] +
(σ2n
2∇ log [πn(xi)]
)2]
?=
n∑i=1
[−2x
σ2n
2∇ log [πn(yi)] + 2yi
σ2n
2∇ log [πn(yi)] +
(σ2n
2∇ log [πn(yi)]
)2].
Now use the discrete approximation:σ2n
2 ∇ log [πn(xi)] ≈ yi− xi + σnZi:
n∑i=1
[−2y(yi − xi + σnZi) + 2xi(yi − xi + σnZi) + (yi − xi + σnZi)
2]
?=
n∑i=1
[−2x
σ2n
2∇ log [πn(yi)] + 2yi
σ2n
2∇ log [πn(yi)] +
(σ2n
2∇ log [πn(yi)]
)2]
n∑i=1
[−2y2
i + 2yixi − 2yiσnZi + 2xiyi − 2x2i + 2xiσnZi + y2
i
−yixi + yiσnZi − xiyi + x2i − xiσnZi + σnZiyi − σnZixiσ2
nZ2i
]?=
n∑i=1
[−2x
σ2n
2∇ log [πn(yi)] + 2yi
σ2n
2∇ log [πn(yi)] +
(σ2n
2∇ log [πn(yi)]
)2]
n∑i=1
[−y2
i + 2yixi − x2i + σ2
nZ2i
]?=
n∑i=1
[−2x
σ2n
2∇ log [πn(yi)] + 2yi
σ2n
2∇ log [πn(yi)] +
(σ2n
2∇ log [πn(yi)]
)2].
Langevin diffusion is reversible, soσ2n
2 ∇ log [πn(yi)] ≈ xi − yi + σnZi.
164Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
Inserting this into the RHS above:
n∑i=1
[−y2
i + 2yixi − x2i + σ2
nZ2i
] ?=
n∑i=1
[−2xxi − yi + σnZi + 2yixi − yi + σnZi + (xi − yi + σnZi)
2]
n∑i=1
[−y2
i + 2yixi − x2i + σ2
nZ2i
] ?=
n∑i=1
[−2x2
i + 2xiyi − 2xiσnZi + 2yixi − 2y2i + 2yiσnZi + x2
i
−xiyi + xiσnZi − yixi + y2i − yiσnZi + σnZixi = σnZiyi + σ2
nZ2i
]n∑i=1
[−y2
i + 2yixi − x2i + σ2
nZ2i
] ?=
n∑i=1
[−y2
i − x2i + xiyi + σ2
nZ2i
].
This last equality is cleary true in expectation since Zi is iid.
17. Consider a stochastic process, θ[t] (t ≥ 0), on the probability space
(Ω,F , P ). If: Ft ⊂ Ft+1, θ[t] is mearsureable on Fn with finite first
moment, and with probability one E[θ[t+1]|Ft] = θ[t], then this process
is called a Martingale (Billingsley 1995, 458). Prove that Martingales
do or do not have the Markovian property.
I Answer
Since E[θ[t+1]|Ft] = θ[t], then E[θ[t+1] − θ[t]|Ft] = 0, by the definition
of conditional expectation (see Norris [1997, p.130] for details). Define
∇t = E[θ[t+1] − θ[t], which indicates that:
θ[t+1] = ∇t + θ[t]
= ∇t +∇t−1 + θ[t−1]
= ∇t +∇t−1 +∇t−2 + θ[t−2]
=
t∑1
∇i.
Note that the ∇i are independent and integrable random variables with
mean zero and finite variance. Further note that for some fixed num-
Some Markov Chain Monte Carlo Theory 165
ber of iterations θ[1], . . . , θ[τ ] and ∇1, . . . ,∇τ generate the same σ-
algebra:
σ(θ[1], . . . , θ[τ ]) = σ(∇1, . . . ,∇τ ) ⊂ Fτ .
So either both are Markovian or neither are Markovian. But since the
random variable defined by a sum of iid random variables cannot be
Markovian, the generation of θ[i] cannot be either.
19. Prove that a Markov chain that is positive Harris recurrent, ∃σ−finite
probability measure, P, on S so that for an irreducible Markov chain,
Xt, at time t, p(Xn ∈ A) = 1, ∀A ∈ S where P(A) > 0, and aperiodic
is also α-mixing,
α(t) = supA,B
∣∣∣∣p(θt ∈ B,θ0 ∈ A)− p(θt ∈ B)p(θ0 ∈ A)
∣∣∣∣ −→t→∞ 0
I Answer This proof follows that of Jones (2004).∗ The coupling
inequality states that if there are two ergodic (implied by the condi-
tions above) chains, one started in the stationarity distribution and one
started at some arbitrary point, then:
‖Pn(X, ·)− π(·)‖TV ≤ p(C > n),
where C is the coupling time of these two Markov chains defined in
Lindvall (1992).† See also Rosenthal (1995, p.560). Therefore, both of
the folloing hold:
|Pn(X,A)− π(A)| ≤ p(C > n)
|Pn(X,B)− π(B)| ≤ p(C > n),
for A and B defined above. Under the assumptions of the problem,
p(C > n) → 0 as n → ∞. By Chen (1999)‡ the integral of the
∗Jones, Galin. (2004). “On the Markov Chain Central Limit Theorem.” Probability Surveys
1, 299320.†Lindvall, T. (1992). Lectures on the Coupling Method. Wiley, New York.‡Chen, X. (1999). Limit Theorems for Functionals of Ergodic Markov Chains With General
State Space. Memoirs of the American Mathematical Society, 139.
166Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
posterior distance in A over the set B is also smaller than the coupling
time: ∫B
|Pn(X,A)− π(A)|π(dx) ≤ p(C > n).
With standard regularity conditions and Minkowski’s inequality the ab-
solute value function can be moved outside of the integral, allowing:
p(C > n) ≥∣∣∣∣∫B
[Pn(X,A)− π(A)]π(dx)
∣∣∣∣≥∣∣∣∣∫B
Pn(X,A)π(dx)−∫B
π(A)π(dx)
∣∣∣∣=∣∣Pn(Xn ∈ A,X0 ∈ B)− π(A)π(B)
∣∣ = α(t).
Therefore α(t) also goes to zero as n goes to infinity since the rate
of TVN convergence bounds the rate of α-mixing. For demonstrated
geometric and uniform ergodicity this rate can be shown to be much
faster still.
14
Utilitarian Markov Chain Monte Carlo
Exercises
1. For the hierarchical model of firearm-related deaths in Chapter 10, per-
form the following diagnostic tests from Section 12.2 in CODA or BOA
: Gelman and Rubin, Geweke, Raftery and Lewis, Heidelberger and
Welsh.
I Answer
< R.Code >
require(runjags)
x <- c(14.9, 14.8, 14.3, 13.3, 13.3, 13.3,
13.9, 13.6, 13.9, 14.1, 14.9, 15.2,
14.8, 15.4, 14.8, 13.7, 12.8, 12.1)
n <- length(x)
d <- c(1, 1)
firearm.data <- dump.format(list(x=x, n=n, d=d))
firearm.init1 <- dump.format(list(tau=1, lambda=c(1,20),
K=c(.5,.5)))
firearm.init2 <- dump.format(list(tau=1, lambda=c(20,1),
K=c(.5,.5)))
firearm.model <- "model for (i in 1:n) x[i] ~ dnorm(mu[i], tau)
mu[i] <- lambda[k[i]]
k[i] ~ dcat(K[])
tau ~ dgamma(.01, .01)
lambda[1] ~ dnorm(0, .01)
167
168Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
lambda[2] ~ dnorm(0, .01)
K[1:2] ~ ddirch(d[])
"firearm <- run.jags(model=firearm.model, n.chain=2,
data=firearm.data, plot=TRUE,
inits=c(firearm.init1,firearm.init2),
burnin=5000, sample=10000,
monitor=c("tau","lambda","K"))
> geweke.diag(firearm$mcmc)
[[1]]
Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5
tau lambda[1] lambda[2] K[1] K[2]
1.7860 -0.1373 -0.1122 -0.1828 0.1828
[[2]]
Fraction in 1st window = 0.1
Fraction in 2nd window = 0.5
tau lambda[1] lambda[2] K[1] K[2]
1.5434 -0.5641 2.3849 -1.5759 1.5759
> gelman.diag(firearm$mcmc)
Potential scale reduction factors:
Point est. 97.5% quantile
tau 1.00 1.01
lambda[1] 1.00 1.00
lambda[2] 1.00 1.01
K[1] 1.00 1.00
K[2] 1.00 1.00
Multivariate psrf
1.00
Utilitarian Markov Chain Monte Carlo 169
> raftery.diag(firearm$mcmc)
[[1]]
Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95
Burn-in Total Lower bound Dependence
(M) (N) (Nmin) factor (I)
tau 3 4338 3746 1.16
lambda[1] 6 10809 3746 2.89
lambda[2] 3 4129 3746 1.10
K[1] 16 18252 3746 4.87
K[2] 3 4129 3746 1.10
[[2]]
Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95
Burn-in Total Lower bound Dependence
(M) (N) (Nmin) factor (I)
tau 3 4163 3746 1.11
lambda[1] 9 13413 3746 3.58
lambda[2] 4 7346 3746 1.96
K[1] 12 12060 3746 3.22
K[2] 3 4061 3746 1.08
> heidel.diag(firearm$mcmc)
[[1]]
Stationarity start p-value
test iteration
tau passed 1 0.706
lambda[1] passed 1 0.631
lambda[2] passed 1 0.763
170Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
K[1] passed 1 0.640
K[2] passed 1 0.640
Halfwidth Mean Halfwidth
test
tau passed 1.507 0.0523
lambda[1] passed 10.785 1.0125
lambda[2] failed 8.959 1.2892
K[1] failed 0.571 0.0841
K[2] failed 0.429 0.0841
[[2]]
Stationarity start p-value
test iteration
tau passed 1 0.252
lambda[1] passed 1 0.530
lambda[2] passed 1 0.311
K[1] passed 1 0.355
K[2] passed 1 0.355
Halfwidth Mean Halfwidth
test
tau passed 1.478 0.0499
lambda[1] passed 10.636 1.0311
lambda[2] failed 8.987 1.2375
K[1] failed 0.557 0.0782
K[2] failed 0.443 0.0782
3. Plot in the same graph a time sequence of 0 to 100 versus the annealing
cooling schedules: logarithmic, geometric, semi-quadratic, and linear.
What are the tradeoffs associated with each with regard to convergence
and processing time?
I Answer
Different cooling schedules are as follows:
• Geometric: Tt = εTt−1, 0 < ε < 1
Utilitarian Markov Chain Monte Carlo 171
• Logarithmic: Tt = KT1
log(t)
• Linear: Tt = T1 − at, a > 0
• Semiquadratic: Tt = T1 − t− at2, a > 0
And, as shown below in the figure, the semiquadratic form is the slowest
annealing schedule whereas the geometric form is the fastest annealing
schedule. When annealing is slower, the chain is more likely to converge
to its stationary distribution, but it requires longer processing time. So,
the convergence and the processing time has a trade-off relationship.
Time Sequence
Tem
erat
ure
1 25 50 75 100
010
020
030
040
050
0
logarithmic
geometric
semiquadratic
linear
FIGURE 14.1: Different Annealing Cooling Schedules
< R.Code >
Logarithmic <- function(t, T0) T <- rep(NA, t+1)
T[1] <- T0
for (i in 1:t) T[i+1] <- 0.8*T[1] / (log(i+1)) -
172Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
0.8*T[1] / log(t+1)
return(T)
Geometric <- function(t, T0)
T <- rep(NA, t+1)
T[1] <- T0
for (i in 1:t) T[i+1] <- 0.3 * T[i]
return(T)
Semiquadratic <- function(t, T0) T <- rep(NA, t+1)
T[1] <- T0
for (i in 1:t) T[i+1] <- T[1] - i - 0.04*i^2
return(T)
Linear <- function(t, T0) T <- rep(NA, t+1)
T[1] <- T0
for (i in 1:t) T[i+1] <- T[1] - 5*i
return(T)
ruler <- seq(1, 101)
l1 <- Logarithmic(100,500)
l2 <- Geometric(100,500)
l3 <- Semiquadratic(100,500)
l4 <- Linear(100,500)
plot(ruler, l1, type="n",
axes=F, xaxs="i", yaxs="i",
xlim=c(1,100), ylim=c(0,500),
xlab="Time Sequence", ylab="Temerature")
box(col="gray80")
axis(1, at=c(1,25,50,75,100),col="gray80")
axis(2,col="gray80")
Utilitarian Markov Chain Monte Carlo 173
lines(l1,lwd=2)
lines(l2)
lines(l3,lty=3, lwd=2)
lines(l4,lty=2,lwd=2)
text(30, 50, "logarithmic")
text(12, 10, "geometric")
text(60, 360, "semiquadratic")
text(45, 250, "linear")
5. Show how the full conditional distributions for Gibbs sampling in the
Bayesian Tobit model on page 466 can be changed into grouped or col-
lapsed Gibbs sampler forms (Section 12.3.2).
I Answer
Since the full conditional distribution of β (k × 1 vector) is “Multivari-
ate Normal” distribution (with k degree of freedom), we can sample
β1, β2, · · · , βk at the same time. Therefore, it is (by itself) a grouped
Gibbs sampler form.
7. BUGS includes a quick diagnostic command diag which implements a
version of Geweke’s (1992) time-series diagnostic for the first 25% of
the data and the last 50% of the data. Rather than calculate spec-
tral densities for the denominator of the G statistic, BUGS divides the
two periods into 25 equal-sized bins and then takes the variance of the
means from within these bins. Run the example in Section 12.2 using
the BUGS code in the Computational Addendum with 10,000 itera-
tions, and load the data into R with the command military.mcmc <-
boa.importBUGS("my.bugs.dir/bugs") . Write a function to calculate
the shortcut for Geweke’s diagnostic for each of the five chains (columns
here). Compare your answer with BUGS .
I Answer
As shown in the table below, Geweke’s diagnostic from BUGS tends to
make the statistics smaller in absolute term. It is especially so when a
chain doesn’t converge. However, the decision on whether the chain is
174Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
non-convergent (statistic 2) or not doesn’t seem to be affected by this
change.
From BUGS From math
αµ -5.14526874 -30.69368272
βµ -0.05462386 -0.05198337
ατ -4.25638592 -28.65358373
βτ -2.09015361 -2.37649673
τc -0.74985086 -0.82777295
Table 14.1: Geweke’s Diagnostics from Different Sources
< R.Code >
require(runjags)
require(BaM) # BaM pacakage contains data - military
y <- as.matrix(t(military[,-1]))
x <- as.vector(military[,1])
n <- nrow(y); k <- ncol(y)
military.data <- dump.format(list(x=x, y=y, n=n, k=k))
military.inits <- dump.format(list(alpha.mu=1, beta.mu=1,
alpha.tau=1,
beta.tau=1, tau.c=1))
military.model <- "model for (i in 1:n) for (j in 1:k) mu[i,j] <- alpha[i] + beta[i]*x[j]
y[i,j] ~ dnorm(mu[i,j], tau.c)
alpha[i] ~ dnorm(alpha.mu, alpha.tau)
beta[i] ~ dnorm(beta.mu, beta.tau)
alpha.mu ~ dnorm(1, 0.01)
beta.mu ~ dnorm(1, 0.01)
alpha.tau ~ dgamma(0.01, 0.01)
beta.tau ~ dgamma(0.01, 0.01)
tau.c ~ dgamma(0.01, 0.01)
Utilitarian Markov Chain Monte Carlo 175
"military.jags <- run.jags(model=military.model,
data=military.data,
inits=military.inits,
burnin=10000, sample=10000,
monitor=c("alpha.mu",
"beta.mu", "alpha.tau",
"beta.tau", "tau.c"),
n.chain=1, plot=TRUE)
almu <- military.jags$mcmc[[1]][,1]
bemu <- military.jags$mcmc[[1]][,2]
altau <- military.jags$mcmc[[1]][,3]
betau <- military.jags$mcmc[[1]][,4]
tau <- military.jags$mcmc[[1]][,5]
geweke.hand <- function(theta) n <- length(theta)
n1 <- n*0.25
n2 <- n*0.5+1
theta25 <- theta[1:n1]
theta50 <- theta[n2:n]
G <- abs(mean(theta25) - mean(theta50)) /
(sqrt(var(theta25)/n1 + var(theta50)/(n2-1)))
return(G)
geweke.bugs <- geweke.diag(military.jags$mcmc,
frac1=0.25, frac2=0.5)[[1]]$z
geweke.math <- c(geweke.hand(almu), geweke.hand(bemu),
geweke.hand(altau),geweke.hand(betau),
geweke.hand(tau))
geweke.compare <- cbind(geweke.bugs, geweke.math)
> print(geweke.compare)
geweke.bugs geweke.math
alpha.mu -5.14526874 -30.69368272
beta.mu -0.05462386 -0.05198337
alpha.tau -4.25638592 -28.65358373
beta.tau -2.09015361 -2.37649673
tau.c -0.74985086 -0.82777295
9. For the model of marriage rates in Italy given in Section 10.3, calculate
176Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
the Zellner and Min diagnostic with the segmentation: θ1 = α, θ2 = β,
for the posteriors defined at the iteration points: π1,000(α, β,λ,X) and
π10,000(α, β,λ,X). Test H0: η = 0 versus H1: η 6= 0 with the statistic
KOA assuming serial independence.
I Answer
First begin with some background on the Zellner and Min diagnos-
tic for two parameters (α and β) whose posterior is described by a
Markov chain. For this diagnostic the Markov chain typically comes
from a Gibbs sampler, so in addition to a statement of the joint poste-
rior, π(α, β|y) = cp(α, β)`(α, β|y), we require the full conditional dis-
tributions π(α|β,y), π(β|α,y). We obtain empirical estimates of the
marginals from MCMC sample vlaues (“smoothed empirical estimates”
in the paper) by:
pN (αi|y) =1
N
N∑j=1
p(αi|β(j),y)
pN (β|y) =1
N
N∑j=1
p(β|α(j),y),
which is smoothed over the period 1, . . . , N for parameters that are
conditioned upon and given for some marginal estimate at the point i
in the chain. Normally two values for i are selected and compared.
Zellner and Min (1995) give three statistics (ARC2, DC2, RC2) and a
set of formal tests. To perform the formal test of convergence hypothe-
ses, we need to first calculate the Difference of Convergence Criterion
statistic, which is based on:
η = p(α|y)p(β|α,y)− p(β|y)p(α|β,y) = 0.
Under the null hypothesis of convergence at the point i in the Markov
chain, we get an empirical statistic from this expression by using the
empirical estimates for the first part of each product and plug-in values
to the conditionals for the second part of each product:
ηi = pN (αi|y)p(βi|αi,y)− pN (βi|y)p(αi|βi,y),
Utilitarian Markov Chain Monte Carlo 177
which is approximately zero under H0 :η = 0. So the user selected two
parameters from the model (α and β here in the general sense), and some
point (or points) and obtains a statistic that is normally distributed for
large sample sizes. The Ratio Convergence Criterion simply converts
the difference above to a ratio and provides a posterior odds statistic.
The corresponding variance term for ηi is:
φ2i = [p(αi|βi,y)]2V ar[pN (βi|y)] + [p(βi|αi,y)]2V ar[pN (αi|y)]
− 2p(αi|βi,y)p(βi|αi,y)Cov[pN (αi|y), pN (βi|y)].
So we can use this to create:
zi =ηi − ηi
ˆphiiasym
∼N(0, 1).
Furthermore, the posterior odds for this null of convergence are given
in the paper’s appendix as:
KOA =√Nπ/2(1 + (zi
2/N)) exp[−z2i /2]
(there are different notations for this expression).
Question 12.5 asks for this setup to be used on the Italian marriage rates
example in Section 10.3 for iteration points i = 1, 000 and i = 10, 000.
The three relevant full conditional distributions are:
p(α|β, λ) = BAΓ(A)−1 exp[−αB]αA−1
p(β|α, λ) = DCΓ(C)−1 exp[−βD]αC−1
p(λ|α, β) = (1 + β)∑
yi+αΓ(∑yi + α)−1 exp[−λ(1 + β)]λ∑
yi+α−1.
This test is created for two parameters at a time. It does not make sense
to test α and β together since they are parent nodes in the model, but
we could test either or both against any of the 16 λ parameters. We can
also test two λ parameters together.
The most interesting test is a pair of tests for α and β, against a λ,
and the first one is chosen arbitrarily here. This sets-up the following
178Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
starting point from above:
p(α, λ1|β,y) = p(α|β,y)p(λ1|α, β, ]y)
= p(λ1|β,y)p(α|λ1, β, ]y)
p(β, λ1|α,y) = p(β|α,y)p(λ1|α, β, ]y)
= p(λ1|α,y)p(β|λ1, β, ]y).
This sets up two parralel tests as described above. The following BUGS
code creates the MCMC iterations.
< Bugs.Code >
model italy;
for (i in 1:N)
lambda[i] ~ dgamma(alpha,beta);
y[i] ~ dpois(lambda[i]);
alpha ~ dgamma(1,1);
beta ~ dgamma(1,1);
list(y=c(7,9,8,7,7,6,6,5,5,7,9,10,8,8,8,7), N=16)
list(alpha=1, beta=1)
list(alpha=10, beta=10)
Now use the following R code to create the tests.
< R.Code >
library(boa)
italy.mcmc <- boa.importBUGS(
"Users/jgill/Bugs/Italy.Model/italy")
y=c(7,9,8,7,7,6,6,5,5,7,9,10,8,8,8,7)
A <- B <- C <- D <- 1
# SMOOTHED VALUES
p.alpha.1000 <- B^A * gamma(A+1)^(-1) *
exp(-italy.mcmc[1000 ,17]*B) *
Utilitarian Markov Chain Monte Carlo 179
italy.mcmc[1000 ,17]^(A-1)
p.alpha.10000 <- B^A * gamma(A+1)^(-1) *
exp(-italy.mcmc[10000,17]*B) *
italy.mcmc[10000,17]^(A-1)
p.beta.1000 <- C^D * gamma(C+1)^(-1) *
exp(-italy.mcmc[1000 ,18]*D) *
italy.mcmc[1000 ,18]^(C-1)
p.beta.10000 <- C^D * gamma(C+1)^(-1) *
exp(-italy.mcmc[10000,18]*D) *
italy.mcmc[10000,18]^(C-1)
p.lambda.1000 <- mean ( (1+italy.mcmc[1:1000,18])^(
sum(y)+italy.mcmc[1:1000,17]) *
gamma(sum(y)+
italy.mcmc[1:1000,17])^(-1) *
exp(-italy.mcmc[1:1000,1]*
(1+italy.mcmc[1:1000,18])) *
italy.mcmc[1000,1]^(
sum(y)+italy.mcmc[1:1000,17]-1) )
p.lambda.10000 <- mean ( (1+italy.mcmc[1:10000,18])^(
sum(y)+italy.mcmc[1:10000,17]) *
gamma(sum(y)+
italy.mcmc[1:10000,17])^(-1) *
exp(-italy.mcmc[1:10000,1]*
(1+italy.mcmc[1:10000,18])) *
italy.mcmc[10000,1]^(
sum(y)+italy.mcmc[1:10000,17]-1) )
v.lambda.1000 <- var ( (1+italy.mcmc[1:1000,18])^(
sum(y)+italy.mcmc[1:1000,17]) *
gamma(sum(y)+
italy.mcmc[1:1000,17])^(-1) *
exp(-italy.mcmc[1:1000,1]*
(1+italy.mcmc[1:1000,18])) *
italy.mcmc[1000,1]^(
sum(y)+italy.mcmc[1:1000,17]-1) )
v.lambda.10000 <- var ( (1+italy.mcmc[1:10000,1])^(
sum(y)+italy.mcmc[1:10000,17]) *
gamma(sum(y)+
italy.mcmc[1:10000,17])^(-1) *
exp(-italy.mcmc[1:10000,17]*
(1+italy.mcmc[1:10000,18])) *
180Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
italy.mcmc[10000,1]^(
sum(y)+italy.mcmc[1:10000,17]-1) )
# PLUG-IN VALUES
p.alpha.plug.1000 <- p.alpha.1000
p.alpha.plug.10000 <- p.alpha.10000
p.beta.plug.1000 <- p.beta.1000
p.beta.plug.10000 <- p.beta.10000
p.lambda.plug.1000 <- (1+italy.mcmc[1000,18])^(
sum(y)+italy.mcmc[1000,17]) *
gamma(sum(y)+
italy.mcmc[1000,17])^(-1) *
exp(-italy.mcmc[1000,17]*
(1+italy.mcmc[1000,18])) *
italy.mcmc[1000,1]^(
sum(y)+italy.mcmc[1000,17]-1)
p.lambda.plug.10000 <- (1+italy.mcmc[10000,18])^(
sum(y)+italy.mcmc[10000,17]) *
gamma(sum(y)+
italy.mcmc[10000,17])^(-1) *
exp(-italy.mcmc[10000,17]*
(1+italy.mcmc[10000,18])) *
italy.mcmc[10000,1]^(
sum(y)+italy.mcmc[10000,17]-1)
# TEST STATISTICS
eta.alpha.lambda.1000 <- p.alpha.1000*p.lambda.plug.1000 -
p.lambda.1000*p.alpha.plug.1000
eta.alpha.lambda.10000 <- p.alpha.10000*p.lambda.plug.10000 -
p.lambda.10000*p.alpha.plug.10000
eta.beta.lambda.1000 <- p.beta.1000*p.lambda.plug.1000 -
p.lambda.1000*p.beta.plug.1000
eta.beta.lambda.10000 <- p.beta.10000*p.lambda.plug.10000 -
p.lambda.10000*p.beta.plug.10000
phi.alpha.lambda.1000 <- p.alpha.plug.1000^2 *
v.lambda.1000
phi.alpha.lambda.10000 <- p.alpha.plug.1000^2 *
v.lambda.10000
phi.beta.lambda.1000 <- p.beta.plug.1000^2 *
Utilitarian Markov Chain Monte Carlo 181
v.lambda.1000
phi.beta.lambda.10000 <- p.beta.plug.10000^2 *
v.lambda.10000
( z.alpha.lambda.1000 <- eta.alpha.lambda.1000/
sqrt(phi.alpha.lambda.1000) )
[1] -0.034080
( z.alpha.lambda.10000 <- eta.alpha.lambda.10000/
sqrt(phi.alpha.lambda.10000) )
[1] -1.6812e-110
( z.beta.lambda.1000 <- eta.beta.lambda.1000/
sqrt(phi.beta.lambda.1000) )
[1] -0.034080
( z.beta.lambda.10000 <- eta.beta.lambda.10000/
sqrt(phi.beta.lambda.10000) )
[1] -1.7071e-110
# POSTERIOR ODDS FOR ETA=0 (CONVERGENCE)
( K.0A.alpha.lambda.1000 <- sqrt(pi*1000/2) *
(1+z.alpha.lambda.1000^2/1000) *
exp(-z.alpha.lambda.1000^2/2) )
[1] 39.61
( K.0A.alpha.lambda.10000 <- sqrt(pi*10000/2) *
(1+z.alpha.lambda.10000^2/10000) *
exp(-z.alpha.lambda.10000^2/2) )
[1] 125.33
( K.0A.beta.lambda.1000 <- sqrt(pi*1000/2) *
(1+z.beta.lambda.1000^2/1000) *
exp(-z.beta.lambda.1000^2/2) )
[1] 39.61
( K.0A.beta.lambda.10000 <- sqrt(pi*10000/2) *
(1+z.beta.lambda.10000^2/10000) *
exp(-z.beta.lambda.10000^2/2) )
[1] 125.33
# NOTE: SIMILARITY OCCURS BECAUSE THEY ARE BOTH PARENTAL
NODES TO LAMBDA
11. (Raftery and Lewis 1996). Write a Gibbs sampler in R to produce
a Markov chain to sample for θ1, θ2, and θ3, which are distributed
182Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
according to the trivariate normal distribution θ1
θ2
θ3
∼ N0
0
0
, 99 −7 −7
−7 1 0
−7 0 1
by cycling through the normal conditional distributions with variances:
V (θ1|θ2, θ3) = 1, V (θ2|θ1, θ3) = 1/50 and V (θ3|θ1, θ2) = 1/50. Contrast
the Geweke diagnostic with the Gelman and Rubin diagnostic for 1,000
iterations.
I Answer
If we have the joint distribution,
X =
[X1
X2
]∼ N
([µ1
µ2
],
[Σ11 Σ12
Σ21 Σ22
]),
then the conditional distribution is
X1|X2 ∼ N(µ1|2,Σ1|2
),
where µ1|2 = µ1 + Σ12Σ−122 (X2 − µ2) and Σ1|2 = Σ11 −Σ′12Σ
−122 Σ12.
Therefore, the full conditional distributions are:
θ1|θ2, θ3 ∼ N (−7θ2 − 7θ3, 1)
θ2|θ1, θ3 ∼ N(− 7
50θ1 −
49
50θ3,
1
50
)θ3|θ1, θ2 ∼ N
(− 7
50θ1 −
49
50θ2,
1
50
)Since we have all full conditional distribution, we use Gibbs sampler.
And, based on the posterior draws from the Gibbs sampler, we can man-
nually calculate the Geweke’s diagnostic and Gelman-Rubin’s diagnostic
(shown in the table below). It turns out that Gelman-Rubin diagnos-
tic doesn’t report any evidence for non-convergence whereas some of
Geweke’s diagnostic does report evidence for non-convergence (θ2 with
θ[0]2 = 1, and θ3 with θ
[0]3 = 0).
< R.Code >
Utilitarian Markov Chain Monte Carlo 183
θ1 θ2 θ3
Geweke(init=0) 0.65881323 0.01406866 2.611673
Geweke(init=1) 0.05630465 2.26367041 1.165261
Gelman-Rubin 1.00126375 0.99950074 1.001659
Table 14.2: Geweke versus Gelman-Rubin
post.draw <- function(bi, sim, inits) n <- bi + sim + 1
chain <- length(inits)
theta1 <- theta2 <- theta3 <- NULL
for (k in 1:chain) draw1 <- draw2 <- draw3 <- NULL
draw1[1] <- draw2[1] <- draw3[1] <- 0
for (i in 1:n) draw1.mean <- - 7*(draw2[i] + draw3[i])
draw1[i+1] <- rnorm(1, draw1.mean, 1)
draw2.mean <- - 0.07*(draw1[i+1] + 7*draw3[i])
draw2[i+1] <- rnorm(1, draw2.mean, 0.02)
draw3.mean <- - 0.07*(draw1[i+1] + 7*draw2[i+1])
draw3[i+1] <- rnorm(1, draw3.mean, 0.02)
theta1 <- cbind(theta1, draw1[(bi+2):n])
theta2 <- cbind(theta2, draw2[(bi+2):n])
theta3 <- cbind(theta3, draw3[(bi+2):n])
output <- list(theta1=theta1, theta2=theta2,
theta3=theta3)
theta1.post <- post.draw(500, 1000, c(0,1))$theta1
theta2.post <- post.draw(500, 1000, c(0,1))$theta2
theta3.post <- post.draw(500, 1000, c(0,1))$theta3
geweke.hand <- function(theta) n <- nrow(theta)
k <- ncol(theta)
G <- NULL
for (i in 1:k) n1 <- n*0.25
n2 <- n*0.5+1
184Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
theta25 <- theta[1:n1,i]
theta50 <- theta[n2:n,i]
geweke <- abs(mean(theta25) - mean(theta50)) /
(sqrt(var(theta25)/n1 + var(theta50)/(n2-1)))
G <- cbind(G, geweke)
return(G)
gelman.hand <- function(theta) n <- nrow(theta)
k <- ncol(theta)
inmean <- NULL
invariance <- NULL
bwvariance <- 0
for (i in 1:k) inmean[i] <- mean(theta[,i])
allmean <- mean(theta)
for (i in 1:k) invariance[i] <- 0
for (j in 1:n) stat1 <- (theta[j,i] - inmean[i])^2
invariance[i] <- invariance[i] + stat1
stat2 <- (inmean[i] - allmean)^2
bwvariance <- bwvariance + stat2
invar <- sum(invariance) / (k*(n-1))
bwvar <- n*bwvariance / (k-1)
estvar <- (1 - 1/n) * invar + (1/n) * bwvar
R <- sqrt(estvar / invar)
return(R)
diag.compare <- matrix(c(geweke.hand(theta1.post),
gelman.hand(theta1.post),
geweke.hand(theta2.post),
gelman.hand(theta2.post),
geweke.hand(theta3.post),
gelman.hand(theta3.post)),
ncol=3)
colnames(diag.compare) <- c("theta1", "theta2", "theta3")
rownames(diag.compare) <- c("Geweke(init=0)",
Utilitarian Markov Chain Monte Carlo 185
"Geweke(init=1)",
"Gelman-Rubin")
> print(diag.compare)
theta1 theta2 theta3
Geweke(init=0) 0.65881323 0.01406866 2.611673
Geweke(init=1) 0.05630465 2.26367041 1.165261
Gelman-Rubin 1.00126375 0.99950074 1.001659
13. Prove that data augmentation is a special case of Gibbs sampling.
I Answer
Data augmentation runs as follows: at the ith iteration,
• generate θ[i] from p[i−1] (θ|Xobs)
• generate m values of Xmis from p(Xmis|θ[i],Xobs
)• p[i] (θ|Xobs) = 1
m
∑mj=1 p
(θ|Xobs,X
[j]mis
)Substitute m = 1,Xmis = δ,Xobs = y.
Then, the algorithm becomes: at the ith iteration,
• generate θ[i] from p(θ|y, δ[i−1]
)• generate δ[i] from p
(δ|y, θ[i−1]
)This is a two-block (θ, δ) Gibbs sampler. (Q.E.D.)
15. For the following simple bivariate normal model[θ1
θ2
]∼ N
([0
0
],
[1 ρ
ρ 1
]),
the Gibbs sampler draws iteratively according to:
θ1|θ2 ∼ N (ρθ2, 1− ρ2)
θ2|θ1 ∼ N (ρθ1, 1− ρ2).
Write a simple implementation of the Gibbs sampler in R , calculate the
unconditional and conditional (Rao-Blackwellized) posterior standard
186Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
error, for both ρ = 0.05 and ρ = 0.95. Compare these four values and
comment.
I Answer
The unconditional posterior is from the Gibbs sampler:
at the (g + 1)th iteration,
• draw θ[g+1]1 from N
(ρθ
[g]2 , 1− ρ2
)• draw θ
[g+1]2 from N
(ρθ
[g+1]1 , 1− ρ2
)And, the Rao-Blackwellized posterior is from the Gibbs sampler:
at the (g + 1)th iteration,
• draw k values of θ1 from N(ρθ
[g]2 , 1− ρ2
)• θ[g+1]
1 = mean of k values of θ1
• draw k values of θ2 from N(ρθ
[g+1]1 , 1− ρ2
)• θ[g+1]
2 = mean of k values of θ2
As shown in the table below, standard errors of posterior distributions
becomes significantly smaller in the Rao-Blackwellized case.
θ1 θ2
ρ=0.05, Unconditional 0.899678592 0.896445864
ρ=0.05, Rao-Blackwellized 0.019853010 0.021344061
ρ=0.95, Unconditional 0.079108334 0.081591576
ρ=0.95, Rao-Blackwellized 0.001961144 0.001969048
Table 14.3: Unconditional vs. Rao-Blackwellized Posteriors
< R.Code >
gibbs.uc <- function(rho, bi, sim) n <- bi + sim + 1
theta1 <- theta2 <- NULL
theta1[1] <- theta2[1] <- 0
Utilitarian Markov Chain Monte Carlo 187
for (i in 1:n) theta.var <- 1 - rho^2
theta1.mean <- rho*theta2[i]
theta1[i+1] <- rnorm(1, theta1.mean, theta.var)
theta2.mean <- rho*theta1[i+1]
theta2[i+1] <- rnorm(1, theta2.mean, theta.var)
theta1.post <- theta1[(bi+2):n]
theta2.post <- theta2[(bi+2):n]
theta1.post.mean <- mean(theta1.post)
theta2.post.mean <- mean(theta2.post)
stat1 <- stat2 <- 0
for (k in 1:sim) stat1 <- stat1+(theta1.post[k]-theta1.post.mean)^2
stat2 <- stat2+(theta2.post[k]-theta2.post.mean)^2
theta1.post.sd <- stat1 / (sim - 1)
theta2.post.sd <- stat2 / (sim - 1)
post.sd <- c(theta1.post.sd, theta2.post.sd)
return(post.sd)
gibbs.rb <- function(rho, m, bi, sim) n <- bi + sim + 1
theta1 <- theta2 <- NULL
theta1[1] <- theta2[1] <- 0
for (i in 1:n) theta.var <- 1 - rho^2
theta1.mean <- rho*theta2[i]
theta1[i+1] <- mean(rnorm(m, theta1.mean, theta.var))
theta2.mean <- rho*theta1[i+1]
theta2[i+1] <- mean(rnorm(m, theta2.mean, theta.var))
theta1.post <- theta1[(bi+2):n]
theta2.post <- theta2[(bi+2):n]
theta1.post.mean <- mean(theta1.post)
theta2.post.mean <- mean(theta2.post)
stat1 <- stat2 <- 0
for (k in 1:sim) stat1 <- stat1+(theta1.post[k]-theta1.post.mean)^2
stat2 <- stat2+(theta2.post[k]-theta2.post.mean)^2
188Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
theta1.post.sd <- stat1 / (sim - 1)
theta2.post.sd <- stat2 / (sim - 1)
post.sd <- c(theta1.post.sd, theta2.post.sd)
return(post.sd)
sd.compare <- rbind(gibbs.uc(0.05, 500, 1000),
gibbs.rb(0.05, 50, 500, 1000),
gibbs.uc(0.95, 500, 1000),
gibbs.rb(0.95, 50, 500, 1000))
colnames(sd.compare) <- c("theta1", "theta2")
rownames(sd.compare) <- c("rho=0.05, Unconditional",
"rho=0.05, Rao-Blackwellized",
"rho=0.95, Unconditional",
"rho=0.95, Rao-Blackwellized")
> print(sd.compare)
theta1 theta2
rho=0.05, Unconditional 0.899678592 0.896445864
rho=0.05, Rao-Blackwellized 0.019853010 0.021344061
rho=0.95, Unconditional 0.079108334 0.081591576
rho=0.95, Rao-Blackwellized 0.001961144 0.001969048
17. Produce the ARE result for the Rao-Blackwellization from (12.33) with
a two-variable Metropolis-Hastings algorithm instead of the Gibbs sam-
pler that was used.
I Answer
To demonstrate the Rao-Blackwell property for Markov chains, consider
the following code that gives a comparison of simulated students-t dis-
tribution values that are simulated from R’s regular function rt() and
a Rao-Blackwellized version based on Dickey’s decomposition (1968),
which is an old idea but conditions on a sufficient statistic using mix-
tures of normals and a chi-square variate.
< R.Code >
nu <- 5
Utilitarian Markov Chain Monte Carlo 189
x <- y <- (rgamma(2000,nu/2,nu/2))^(-1)
for (i in 1:length(y)) x[i] <- rnorm(1,0,mean(y))
delta.m.star <- delta.m <- x <- rt(10000,5)
y <- 1/(rgamma(length(x),nu/2,nu/2))
for (j in 1:length(x)) delta.m[j] <- sum(exp(-x[1:j]^2))/j
for (j in 1:length(x)) delta.m.star[j] <- sum(1/sqrt(2*y[1:j]+1))/j
par(mfrow=c(1,1),col="grey80")
plot(1:length(delta.m), delta.m, type="l",
col="darkblue", ylim=c(0.4,0.6))
lines(1:length(delta.m.star), delta.m.star,
col="firebrick")
This code is self-contained and the interested reader should simply cut-
and-paste it into their R window. Notice that the red line, which is the
Rao-Blackwellized simulated estimate, converges much more quickly to
the empirical mean than the blue line.
The outline of this solution is as follows. If we can show that the
Metropolis-Hastings algorithm constitutes an interleaved Markov chain,
then we can use Theorem 9.19 in Robert and Casella to show the desired
ARE property.
Consider a M-H algorithm where the candidate φ at time t is generated
from a conditional distribution according to p(φ|θ[t−1]). That is, θ is our
value of interest from the Markov chain and the candidate is generated
conditionally on the current position (a random walk Metropolis kernel).
Define the the transition function for φ according to:
φ[t+1] =
φ[t] with probability 1− pt+1
θ[t+1] with probability pt+1
where:
pt+1 =π(φ[t+1])/p(φ[t+1]|θ[t])
π(θ[t])/p(θ[t]|φ[t+1])∧ 1.
190Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
This is an ergodic Markov chain provided that the support of π and p
are identical. In addition, the Metropolis step indicated by “∧” implies
a second ancillary stochastic process of uniforms. This algorithm meets
the interleaving process since at each step θ and φ are independent of
their previous values conditionally on the current value of the other,
and θ is iid in stationarity (asymptotically guaranteed by the ergodic
property).
Now define the standard generic Monte Carlo estimator from the runs of
the Metropolis-Hastings algorithm for both Standard and Rao-Blackwellized
forms according to:
τS =1
n
n∑t=1
h(θ[t])
τRB =1
n
n∑t=1
h(θ[t]|φ[t])
From Lemma 9.17 and Proposition 9.18 in Robert and Casella (2004,
p.355), we know that for consecutive draws of these interleaved random
variables:
Cov[h(θ[t]), h(θ[t+1])
]= V ar(E[h(θ)|φ]),
and this is positive and decreasing in t to t+ 1, t+ 2, . . . , t+ n. Define
δt+j as the indicating a comparison between the tth iteration of θ in the
Markov chain and the t+ jth of θ iteration to be used in a summation
such that j = 1 means a comparison with the next value and j = n the
last value. Now define the variances of the two estimators with their
summed pairwise covariance calculations:
V ar(τS) =1
n(n− 1)
n∑δt+j ,j=1
Cov(h(θ[t], h(θ[t+j]))
V ar(τRB) =1
n(n− 1)
n∑δt+j ,j=1
Cov(h(θ[t]|φ[t]), h(θ[t+j]|φ[t]))
from the definitions above. Using the Lemma 9.17 and Proposition 9.18
Utilitarian Markov Chain Monte Carlo 191
results iteratively, these become nested expectations:
V ar(τS) =1
n(n− 1)
n∑δt+j ,j=1
V ar(E[· · ·E[E[h(θ)|φ] · · · ])
V ar(τRB) =1
n(n− 1)
n∑δt+j ,j=1
V ar(E[· · ·E[E[h(θ)|φ]θ] · · · ])
Since the second variance has an identical number of summed terms all
of which are identical or smaller by the Rao-Blackwell theorem, then
this total for V ar(τRB) can never be larger than this total for V ar(τS).
19. Modify the Metropolis-Hastings R code on Section 9.4.6 to produce the
marginal likelihood with Chib’s method.
I Answer
The marginal likelihood is defined as:
p(x) =L(x|θ′)π(θ′)
π(θ′|x),
where θ′ is an arbitrary point in the sample space (say θ′ = (1, 2)).
For all practical cases, the likelihood function (L(x|θ)) and the prior
distribution (π(θ)) are known to us. We can just plug in a specific value
there (θ = θ′). So, the main task below in R is to show how to produce
π(θ′|x).
We know that the posterior distribution is Bivariate Normal and the
proposal density function is Cauchy (following the example in 9.4.6).
And, π(θ′|x) can be produced by the following formula:
π(θ′|x) =
∫α(θ′, θ)q(θ′|θ)π(θ|x)dθ∫
α(θ, θ′)q(θ|θ′)dθ
=1G
∑Gg=1 α(θ′, θ[g])q(θ′|θ[g])
1G
∑Gm=1 α(θ[m], θ′)
,
where θ[g] are drawn from π(θ|x) and θ[m] are drawn from q(θ|θ′).
< R.Code >
192Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
require(BaM)
bi <- 1000; sim <- 5000; n <- bi + sim + 1
start1 <- start2 <- 0
mu <- c(1,2)
s <- matrix(c(1,-.9,-.9,1), ncol=2)
alpha <- function(from1, from2, to1, to2) output <- min(1, dmultinorm(to1, to2, mu, s) /
dmultinorm(from1, from2, mu, s))
return(output)
qvalue <- function(th1, th2) qsum <- 0
for (i in 1:1000) qind <- dmultinorm(th1,th2,mu,s)/sqrt(rchisq(1,5)/5)
qsum <- qsum + qind
return(qsum/1000)
qmean <- qvalue(mu[1],mu[2])
theta1 <- theta2 <- Theta1 <- Theta2 <- NULL
theta1[1] <- start1; theta2[1] <- start2
nom <- denom <- NULL
for (i in 1:(n-1)) cand <- mvrnorm(1, mu, s)/(sqrt(rchisq(2,5)/5))
a <- alpha(theta1[i], theta2[i], cand[1], cand[2])
if (a > runif(1)) theta1[i+1] <- cand[1]
theta2[i+1] <- cand[2]
else
theta1[i+1] <- theta1[i]
theta2[i+1] <- theta2[i]
Theta1[i+1] <- cand[1]
Theta2[i+1] <- cand[2]
nom[i+1] <- alpha(theta1[i+1], theta2[i+1],
mu[1], mu[2]) * qmean
denom[i+1] <- alpha(mu[1], mu[2], Theta1[i+1],
Theta2[i+1])
Utilitarian Markov Chain Monte Carlo 193
post.value <- sum(nom[(bi+2):n])/sum(denom[(bi+2):n])
post.sim <- matrix(c(mean(theta1[(bi+2):n]),
sd(theta1[(bi+2):n]),
mean(theta2[(bi+2):n]),
sd(theta2[(bi+2):n])), ncol=2)
colnames(post.sim) <- c("theta1", "theta2")
rownames(post.sim) <- c("mean", "sd")
post.summary <- list(posterior=post.sim,
posterior.value=post.value)
> print(post.summary)
$posterior
theta1 theta2
mean 1.0410134 1.9487204
sd 0.6934439 0.7035153
$posterior.value
[1] 1.511966
15
Markov Chain Monte Carlo Extensions
Exercises
To be provided.
195
A
Generalized Linear Model Review
Exercises
1. Suppose X1, . . . , Xn are iid exponential: f(x|θ) = θe−θx, θ > 0. Find
the maximum likelihood estimate of θ by constructing the joint distri-
bution, express the log likelihood function, take the first derivative with
respect to θ, set this function equal to zero, and solve for θ the maximum
likelihood value.
I Answer
Since Xi ∼iid f(x|θ), the joint distribution is:
f(X|θ) =
n∏i=1
θe−θxi ,
which gives us the log likelihood function as:
` (θ|X) =
n∑i=1
(log θ − θxi)
= n log θ − nxθ.
Hence, set the first derivative with respect to θ equal to zero:
∂`
∂θ=n
θ− nx ≡ 0
Therefore, the maximum likelihood value is:
θ =1
x
197
198Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
3. Below are two sets of data each with least square regression lines cal-
culated (y = 6 + 0x). Answer the following questions by looking at the
plots.
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
• •
•
•
•
•
••
•
•
•
•
•
•
•
••
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
•
•
•
•
••
•
•
•
••
•
•
•
••
•
•
•
•
•
• •
•
•
•
•
••
••
•
•
•
•
•
•
••
•
•
•
•
•
•
•
••••
•••
•••• •
•••
•
•
•
•
••
•
••
•••
•
••
••
•
•
•••
•
•••••
•
•
•
•
•
•
•••
•
•
••
•
••
•
•
••
•
•
•
•
• •••
•
•••
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
•
••
•
(a) Does the construction of the least squares line in panel 1 violate
any of the Gauss-Markov assumptions?
(b) Does the construction of the least squares line in panel 2 violate
any of the Gauss-Markov assumptions?
(c) Does the identified point (identically located in both panels) have
a substantively different interpretation?
I Answer
(a) No.
(b) Yes. The construction of the least squares line in panel 2 appears to
violate the assumption where the error terms must be independent
of one another. That is, the residuals in panel 2 appear to be
correlated with one another.
(c) Yes. The point in panel 2 could be an outlier, while it looks normal
in panel 1.
5. Consider the bivariate normal PDF:
f(x1, x2) =(
2πσ1σ2
√1− ρ2
)−1
×
exp
[− 1
2(1− ρ2)
((x1 − µ1)2
σ21
− 2ρ(x1 − µ1)(x2 − µ2)
σ1σ2+
(x2 − µ2)2
σ22
)].
Generalized Linear Model Review 199
for −∞ < µ1, µ2 <∞, σ1, σ2 > 0, and ρ ∈ [−1 : 1].
For µ1 = 3, µ2 = 2, σ1 = 0.5, σ2 = 1.5, ρ = 0.75, calculate a grid search
using R for the mode of this bivariate distribution on R2. A grid search
bins the parameter space into equal space intervals on each axis and
then systematically evaluates each resulting subspace. First set up a
two dimensional coordinate system stored in a matrix covering 99% of
the support of this bivariate density, then do a systematic analysis of
the density to show the mode without using “for” loops. Hint: see the R
help menu for the outer function. Use the contour function to make a
figure depicting bivariate contours lines at 0.05, 0.1, 0.2, and 0.3 levels.
I Answer
< R.Code >
dmultnorm <- function(x, y, mu1, mu2, sigma.mat) rho <- sigma.mat[1,2]/prod(sqrt(diag(sigma.mat)))
nlizer <- 1/(2*pi*prod(sqrt(diag(sigma.mat)))*
sqrt(1-rho^2))
e.term1 <- (x - mu1)/sqrt(sigma.mat[1,1])
e.term2 <- (y - mu2)/sqrt(sigma.mat[2,2])
like <- exp(- 1/(2*(1-rho^2)) *
(e.term1^2 + e.term2^2 -
2*rho*e.term1*e.term2) )
return(nlizer*like)
par(mfrow=c(1,2), mar=c(12, 5, 12, 0.25))
x.ruler1 <- seq(1.8, 4.2, length=30)
y.ruler1 <- seq(-1.2, 5, length=30)
xy.cov <- matrix(c(0.5^2,0.75*0.5*1.5,0.75*0.5*1.5,
1.5^2),2,2)
xy.grid1 <- outer(x.ruler1, y.ruler1, dmultnorm,
3, 2, xy.cov)
contours1 <- c(0.05, 0.1, 0.2, 0.3)
contour(x.ruler1, y.ruler1, xy.grid1, levels=contours1,
xlab="X",ylab="Y", cex=2)
x.ruler2 <- seq(2.8, 3.2, length=30)
y.ruler2 <- seq(1.5, 2.5, length=30)
xy.grid2 <- outer(x.ruler2, y.ruler2, dmultnorm,
200Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
X
Y
0.05
0.1
0.2
0.3
2.0 2.5 3.0 3.5 4.0
−1
01
23
45
X
Y
0.31
0.315
0.32
2.8 2.9 3.0 3.1 3.2
1.6
1.8
2.0
2.2
2.4
FIGURE A.1: Contour Plot of Bivariate Normal Distribution
3, 2, xy.cov)
contours2 <- c(0.31, 0.315, 0.320, 0.3205)
contour(x.ruler2, y.ruler2, xy.grid2,
levels=contours2,
xlab="X",ylab="Y", cex=2)
7. Derive the exponential family form and b(θ) for the Inverse Gaussian
distribution: (2 points.)
f(x|µ, λ) =
(λ
2πx3
)1/2
exp
[− λ
2µ2x(x− µ)2
], x > 0, µ > 0.
Generalized Linear Model Review 201
Assume λ = 1.
I Answer
f(x|µ, λ) = exp
[−1
2log(2πx3)− 1
2µ2x(x− µ)2
]= exp
[−1
2log(2πx3)− 1
2µ−2x+ µ−1 − 1
2x−1
]
= exp
−1
2µ−2x︸ ︷︷ ︸θy
+ µ−1︸︷︷︸−b(θ)
−1
2log(2πx3)− 1
2x−1︸ ︷︷ ︸
c(y)
Since θ = − 1
2µ−2, we have µ = (−2θ)−
12 , which also gives us:
b(θ) = − 1
µ= −√−2θ, θ < 0.
9. Show that the Weibull distribution, f(y|γ, β) = γβ y
γ−1 exp(−yγ/β), is
an exponential family form when γ = 2, labeling each part of the final
answer.
I Answer
f(y|γ = 2, β) =2
βy exp(−y2/β)
= exp[log(2)− log(β) + log(y)− y2/β
]= exp
y2(−1/β)︸ ︷︷ ︸zθ
− log(β)︸ ︷︷ ︸b(θ)
+ log(y) + log(2)︸ ︷︷ ︸c(y)
Since θ = − 1
β , we have β = − 1θ , which also gives us:
b(θ) = log(−1
θ), θ < 0 (i.e. β > 0)
11. For normally distributed data a robust estimate of the standard devia-
tion can be calculated by (Devore 1995):
σ =1
0.6745n
n∑i=1
∣∣Xi − X∣∣ .
202Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
Histogram of FARM
FARM
Fre
quen
cy
0 20000 40000 60000 80000
02
46
810
1214
FIGURE A.2: Finding Influential Outliers
Write an R function to calculate σ for the FARM variable in Exercise 15
below. Compare this value to the standard deviation. Is there evidence
of influential outliers in this variable?
I Answer
The robust estimate of the standard deviation is 12269.88, which is
smaller than the original standard deviation (13251.51). So, we can say
that there are outliers. From Figure A.2, we can also confirm that we
have one outlier.
< R.Code >
sum <- 0
for (i in 1:length(FARM)) dis <- abs(FARM[i] - mean(FARM))
sum <- sum + dis
Generalized Linear Model Review 203
robustsd <- (1/(0.6745*length(FARM)))*sum
originalsd <- sd(FARM)
result <- cbind(originalsd, robustsd)
colnames(result) <- c("original s.d.", "robust s.d.")
hist(FARM, breaks=25) # this shows there is one outlier
> print(result)
original s.d. robust s.d.
[1,] 13251.51 12269.88
13. Derive the generalized hat matrix from β produced the final (nth) IWLS
step. Specifically, give the quantities of interest from the nth and (n −1)th step, show how the hat matrix is produced. Secondly, give the form
of Cook’s D appropriate to a GLM and show that it contains hat matrix
values.
I Answer
(a) Derive the generalized hat matrix.
At the nth (last) step of IWLS, we have:
βn = (X′wn−1X)−1X′wn−1zn−1,
where zn−1 and wn−1 are defined as:
zn−1 = ηn−1 +
(∂η
∂µ
∣∣∣∣µn−1
)(y − µn−1),
w−1n−1 =
(∂η
∂µ
∣∣∣∣µn−1
)2(v(µ)
∣∣∣∣µn−1
), and
v(µ) =∂2b(θ)
∂θ2
Then, we have:
X∗ = w1/2n−1X,
and
H∗ =(w
1/2n−1X
)(Xw
1/2n−1X
)1/2 (w
1/2n−1X
)′.
204Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
(b) Cook’s distance for a GLM
Consider the Cook’s distance for a linear model:
Di =(β(i) − β)(X′X)(β(i) − β)
p MSE,
where p is the number of parameters in the model, and MSE is the
mean square error of the regression model.
Then, it can be modified by inserting the IWLS weighting:
Di(GLM) =(β(i) − β)(X′wn−1X)(β(i) − β)
pφ,
where φ is the standard GLM dispersion parameter.
And, the relationship to the hat matrix is found in inserting the
equivalence:
β(i) − β =−(X′X)−1Xir
′i
p(1− hii), r′i =
yi − µi√1− hii
,
where hii is the i-th diagonal element of the hat matrix, H∗.
15. The likelihood function for dichotomous choice regression is given by:
L(b,y) =
n∏i=1
[F (xib)]yi [1− F (xib)]
1−yi .
There are several common choices for the F () function:
Logit: Λ(xib) = 11+exp[xib]
Probit: Φ(xib) =∫ xib
−∞1√2π
exp[−t2/2]dt
Cloglog: CLL(xib) = 1− exp (−exp(xib)).
Generalized Linear Model Review 205
State FDR PRE.DEP POST.DEP FARM
Alabama 1 323 162 4067
Arizona 1 600 321 6100
Arkansas 1 310 157 8134
California 1 991 580 83371
Colorado 1 634 354 10167
Connecticut 0 1024 620 10167
Delaware 0 1032 590 3050
District of Columbia 1 1269 1054 0
Florida 1 518 319 14234
Georgia 1 347 200 10167
Idaho 1 507 274 7117
Illinois 1 948 486 27451
Indiana 1 607 310 11184
Iowa 1 581 297 28468
Kansas 1 532 266 22368
Kentucky 1 393 211 8134
Louisiana 1 414 241 10167
Maine 0 601 377 6100
Maryland 1 768 512 14234
Massachusetts 1 906 613 14234
Michigan 1 790 394 14234
Minnesota 1 599 363 25418
Mississippi 1 286 127 4067
Missouri 1 621 365 15251
Montana 1 592 339 10167
Nebraska 1 596 307 17284
Nevada 1 868 550 4067
New Hampshire 0 686 427 3050
New Jersey 1 918 587 18301
New Mexico 1 410 208 5084
New York 1 1152 1676 38635
North Carolina 1 332 187 8134
North Dakota 1 382 176 14234
Ohio 1 771 400 18301
Oklahoma 1 455 216 9150
Oregon 1 668 379 11184
Pennsylvania 0 772 449 25418
Rhode Island 1 874 575 2033
South Carolina 1 271 159 7117
South Dakota 1 426 189 8134
Tennessee 1 378 198 6100
Texas 1 479 266 33552
Utah 1 551 305 4067
Vermont 0 634 365 5084
Virginia 1 434 284 15251
Washington 1 741 402 14234
West Virginia 1 460 257 6100
Wisconsin 1 673 362 21351
Wyoming 1 675 374 5084
Using the depression era economic and electoral data, calculate a di-
chotomous regression model for using each of these link functions ac-
cording to the following model specification:
FDR ∼ f [(POST.DEP − PRE.DEP ) + FARM ]
where: FDR indicates whether or not Roosevelt carried that state in the
1932 presidential elections, PRE.DEP is the mean per-state income be-
fore the onset of the Great Depression (1929) in dollars, POST.DEP is
the mean per-state income after the onset of the great depression (1932)
in dollars, and FARM is the total farm wage and salary disbursements
in thousands of dollars per state in 1932.
206Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
Do you find substantially different results with the three link functions?
Explain. Which one would you use to report results? Why?
I Answer
Table A.2 shows the results from three different models. In general, all
three models lead to the same substantive conclusion (i.e. Z-values from
different models do not differ a lot). While AIC of the logit model is
the lowest, all AIC’s are very similar. So, we cannot tell which model
is better. Nevertheless, if we have to report only one result, it would be
better to report the logit model because it has the lowest AIC.
Table A.1: Three Different Binomial Models
Estimate Std. Error z value Pr(>|z|)
LOGIT
(intercept) 4.9261 1.9459 2.5315 0.0114
(POST.DEP - PRE.DEP) 0.0152 0.0070 2.1728 0.0298
FARM 0.0001 0.0001 1.5229 0.1278
AIC 34.7525
PROBIT
(intercept) 2.8516 1.0065 2.8332 0.0046
(POST.DEP - PRE.DEP) 0.0085 0.0037 2.3094 0.0209
FARM 0.0001 0.0000 1.4493 0.1473
AIC 34.8669
CLOGLOG
(intercept) 2.1827 0.8285 2.6346 0.0084
(POST.DEP - PRE.DEP) 0.0071 0.0033 2.1643 0.0304
FARM 0.0000 0.0000 1.2725 0.2032
AIC 35.1390
< R.Code >
FDR <- c(1,1,1,1,1,0,0,1,1,1,1,1,1,1,1,1,1,
0,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,
1,1,0,1,1,1,1,1,1,0,1,1,1,1,1)
PREDEP <- c(323, 600, 310, 991, 634, 1024, 1032, 1269,
Generalized Linear Model Review 207
518, 347, 507, 948, 607, 581, 532, 393,
414, 601, 768, 906, 790, 599, 286, 621,
592, 596, 868, 686, 918, 410, 1152, 332,
382, 771, 455, 668, 772, 874, 271, 426,
378, 479, 551, 634, 434, 741, 460, 673, 675)
POSTDEP <- c(162, 321, 157, 580, 354, 620, 590, 1054,
319, 200, 274, 486, 310, 297, 266, 211,
241, 377, 512, 613, 394, 363, 127, 365,
339, 307, 550, 427, 587, 208, 1676, 187,
176, 400, 216, 379, 449, 575, 159, 189,
198, 266, 305, 365, 284, 402, 257, 362, 374)
FARM <- c(4067, 6100, 8134, 83371, 10167, 10167, 3050,
0, 14234, 10167, 7117, 27451, 11184, 28468,
22368, 8134, 10167, 6100, 14234, 14234,
14234, 25418, 4067, 15251, 10167, 17284,
4067, 3050, 18301, 5084, 38635, 8134, 14234,
18301, 9150, 11184, 25418, 2033, 7117, 8134,
6100, 33552, 4067, 5084, 15251, 14234, 6100,
21351, 5084)
DEP <- POSTDEP - PREDEP
logit <- glm(FDR ~ I(POSTDEP - PREDEP) + FARM,
family=binomial(link=logit))
probit <- glm(FDR ~ I(POSTDEP - PREDEP) + FARM,
family=binomial(link=probit))
cloglog <- glm(FDR ~ I(POSTDEP - PREDEP) + FARM,
family=binomial(link=cloglog))
summary(logit); summary(probit); summary(cloglog)
result <- rbind(summary(logit)$coef,
summary(probit)$coef,
summary(cloglog)$coef)
aic <- c(AIC(logit), AIC(probit), AIC(cloglog))
17. Tobit regression (censored regression) deals with an interval-measured
outcome variable that is censored such that all values that would have
naturally been observed as negative are reported as zero, generalizeable
to other values (Tobin 1958, Amemiya (1985, Chapter 10), Chib (1992).
There can be left censoring and right censoring at any arbitrary value,
single and double censoring, mixed truncating and censoring. If z is a
208Bayesian Methods: A Social and Behavioral Sciences Approach, ANSWER KEY
latent outcome variable in this context with the assumed relation:
z = xβ + η and zi ∼ N (xβ, σ2),
then for left censoring at zero, the observed outcome variable is produced
according to:
yi =
zi if zi > 0
0 if zi ≤ 0.
The resulting likelihood function is:
L(β, σ2|y,x) =∏yi=0
[1−Φ
(xiβ
σ
)] ∏yi>0
(σ−1) exp
[− 1
2σ2(yi − xiβ)2
],
where σ2 is called the scale. Explain the structure of the likelihood
function: what parts accommodate observed values and what parts ac-
commodate unobserved values? Replicate Tobin’s original analysis in
R using the survreg function in the survival package. The data are
obtainable by:
tobin <- read.table(
"http://artsci.wustl.edu/~jgill/data/tobin.dat",header=TRUE)
or in the BaM package.
I Answer < R.Code >
tobin.tob <- survreg(Surv(durable,durable>0,type=’left’) ~ age + quant, dist=’gaussian’,data=tobin)
summary(tobin.tob)
Value Std. Error z p
(Intercept) 15.1449 16.0795 0.942 3.46e-01
age -0.1291 0.2186 -0.590 5.55e-01
quant -0.0455 0.0583 -0.782 4.34e-01
Log(scale) 1.7179 0.3103 5.536 3.10e-08
Scale= 5.57
Gaussian distribution
Loglik(model)= -28.9 Loglik(intercept only)= -29.5
Generalized Linear Model Review 209
Chisq= 1.1 on 2 degrees of freedom, p= 0.58
Number of Newton-Raphson Iterations: 3
n= 20
19. Sometimes when estimation is problematic, Restricted Maximum Like-
lihood (REML) estimation is helpful (Bartlett 1937). REML uses a
likelihood function calculated from a transformed set of data so that
some parameters have no effect on the estimation on the others. For a
One-Way ANOVA, y|G ∼ N (Xβ + ZG, σ2I), G ∼ N (0, σ2D), such
that V ar[y] = V ar[ZG] + V ar[ε] = σ2ZDZ′ + σ2I. The uncondi-
tional distribution is now: y ∼ N (Xβ, σ2(I + ZDZ′) where D needs
to be estimated. If V = I + ZDZ′, then the likelihood for the data is:
`(β, σ,D|y) = −n2 log(2π) − 12 log |σ2V | − 1
2σ2 (y −Xβ)′V−1(y −Xβ).
The REML steps for this model are: (1) find a linear transformation k
such that k′X = 0, so k′y ∼ N (0, k′σk), (2) run MLE on this model
to get D, which no longer has any fixed effects, (3) then estimate the
fixed effects with ML in the normal way. Code this procedure in R and
run the function on the depression era economic and electoral data in
Exercise ??.