
ISyE8843A, Brani Vidakovic Handout 11

1 Various Models and Related MCMC Stuff.

1.1 Latent Variable Binary Regression.

In binary response settings, the observations are of the form

binary response (from $\{0,1\}$)    covariates (predictors)
$Y_1$    $X_1 = (X_{1,1}, X_{1,2}, \ldots, X_{1,p})'$
$Y_2$    $X_2 = (X_{2,1}, X_{2,2}, \ldots, X_{2,p})'$
...      ...
$Y_n$    $X_n = (X_{n,1}, X_{n,2}, \ldots, X_{n,p})'$

The responses $Y_i$ are indicators of a particular attribute (event, property, outcome). In such situations the researcher is interested in modeling $p = P(Y = 1|X)$, i.e., the probability that the attribute is present, given the vector of covariates $X$. Assume that

$$Z_i = X_i'\beta + \epsilon_i, \quad i = 1, \ldots, n, \qquad \epsilon_i \overset{\text{iid}}{\sim} F,$$

is a multivariate regression model in which the $Z_i$'s are not observable, but the indicators $Y_i = 1(Z_i > 0)$ are. Then,

$$p_i = P(Y_i = 1) = P(Z_i > 0) = P(X_i'\beta + \epsilon_i > 0) = P(\epsilon_i > -X_i'\beta) = 1 - F(-X_i'\beta).$$

If the error distribution is symmetric about zero (as usually assumed),

$$p_i = 1 - F(-X_i'\beta) = F(X_i'\beta),$$

which is equivalent to probit, logit, and related models. However, the formulation with the latent variables $Z_i$ allows a Gibbs sampling scheme (e.g., Albert and Chib (1993) and Johnson and Albert (1999)), based on successive sampling from the full conditionals (i) $[\beta|Z, Y]$ and (ii) $[Z|\beta, Y]$.

Assume that $F$ is the normal distribution, so that the above model is probit. Then the distribution of $\beta$ given $Z$ is simply the multivariate normal distribution from least squares theory,

$$[\beta|Z, Y] \sim \mathcal{MVN}_p\big((X'X)^{-1}X'Z,\; (X'X)^{-1}\big),$$

where $X$ is the design matrix with rows $X_1', \ldots, X_n'$.

Next, we find the conditional distribution for each component of $Z$. If $Y_i$ and $\beta$ are given, $Z_i$ has a normal distribution with mean $X_i'\beta$, truncated at 0. Recall the connection $P(Y_i = 1) = P(Z_i > 0)$: the truncation is to the left if $Y_i = 1$ and to the right if $Y_i = 0$. The conditional for $Z_i$ given $Y_i$ and $\beta$ is

$$[Z_i|\beta, Y_i] \sim \begin{cases} \mathcal{TN}(X_i'\beta, 1, -\infty, 0), & \text{if } Y_i = 0,\\ \mathcal{TN}(X_i'\beta, 1, 0, \infty), & \text{if } Y_i = 1, \end{cases}$$

where $\mathcal{TN}(\mu, \sigma^2, a, b)$ is the truncated normal with density proportional to $\exp\{-\frac{(z-\mu)^2}{2\sigma^2}\}\,\mathbf{1}(a \le z \le b)$.

These two conditional distributions now define an easy Gibbs sampler. Part of the matlab file albertmc3.m is given below. It requires two m-files from BAYESLAB: (i) rand_nort.m, which simulates the truncated normal $\mathcal{TN}$ distribution, and (ii) rand_MVN.m, which simulates the multivariate normal distribution.


for i=1:M
   % Z | beta, Y: truncated normals, one per observation
   Z = rand_nort(designX*beta, ones(size(designX*beta)), left, right);
   Zs = [Zs Z];
   % beta | Z, Y: MVN with least-squares mean and covariance (X'X)^{-1}
   sigma = inv(designX'*designX);
   betaMLE = inv(designX'*designX)*designX'*Z;
   beta = rand_MVN(1, betaMLE, sigma)';
   betas = [betas beta];
   % success probabilities p_i = Phi(X_i'beta)
   piz = [piz .5*(1 + erf(designX*beta/sqrt(2)))];
end
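The snippet assumes that the truncation bounds and starting values have been set up before the loop. A minimal sketch of that setup (a plausible reconstruction, since albertmc3.m is only partially shown; Y is the column vector of 0/1 responses):

% truncation bounds for Z_i: (-Inf, 0] if Y_i = 0, [0, Inf) if Y_i = 1
left  = -Inf(size(Y));  left(Y==1)  = 0;
right =  Inf(size(Y));  right(Y==0) = 0;
beta = zeros(size(designX,2), 1);    % starting value for beta
Zs = []; betas = []; piz = [];       % storage for the MCMC draws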

Arrhythmia Example.¹ Patients who undergo Coronary Artery Bypass Graft surgery (CABG) have an approximate 19-40% chance of developing atrial fibrillation (AF). AF can lead to blood clots forming, causing greater in-hospital mortality, strokes, and longer hospital stays. While this can be prevented with drugs, it is very expensive and sometimes dangerous if not warranted. Ideally, several risk factors indicating an increased risk of developing AF in this population could save lives and money by identifying which patients need pharmacological intervention. Researchers began collecting data from CABG patients during their hospital stay, such as demographics like age and sex, as well as heart rate, cholesterol, operation time, etc. Then the researchers recorded which patients developed AF during their hospital stay. Researchers now want to find those pieces of data which indicate high risk of AF. In the past, indicators like age, hypertension, and body surface area (BSA) have been good indicators, though these alone have not produced a satisfactory solution.

¹This example and data have been provided by Matthew Wiggins, BioE Graduate Research Assistant at GaTech.

The data set has 81 records; the first 3 and the last 3 are shown:

ARR INT AGE ACCT CBT  ICUT  AHR  LVEF
1   1   68  64   126  20.25 85   81
1   1   75  111  167  13.5  50   75
1   1   69  63   94   12    74   62
...
0   1   71  51   83   16.75 102  59
0   1   73  68   122  17.83 78.2 50
0   1   55  44   78   11.5  94.8 40

The key: ARR – presence (1) or absence (0) of arrhythmia; INT – intercept in the model, a vector of 1's in the design matrix; AGE – age of the patient; ACCT – aortic cross clamp time; CBT – cardiopulmonary bypass time; ICUT – ICU time; AHR – average heart rate; LVEF – left ventricle ejection fraction.

Figure 1 gives the posterior summaries for $\beta_0, \beta_1, \ldots, \beta_6$ in the model

$$p_i = \Phi(\beta_0 + \beta_1\,\text{AGE} + \beta_2\,\text{ACCT} + \beta_3\,\text{CBT} + \beta_4\,\text{ICUT} + \beta_5\,\text{AHR} + \beta_6\,\text{LVEF}).$$

Figure 1: Latent variable Gibbs sampler.

The latent variables $Z_i$, although artificially introduced, are useful in model diagnostics. For example, $Z_i - X_i'\beta$ should be standard normal, which allows for a standard analysis of model fit.

1.2 Slice Sampler.

The slice sampler generates pairs $(X, Y)$ that fill the area under a (possibly unnormalized) target density in a uniform fashion. The $X$-projection is then a sample from the density.

The slice sampler works as follows. Suppose the target density $f(x)$ is known up to a normalizing constant, i.e., we know $g(x)$ such that $f(x) \propto g(x)$.


STEP 1 Start with some $X_0$ from the domain of the target density.
STEP 2 At step $i$ generate $Y_i|X_{i-1} \sim \mathcal{U}(0, g(X_{i-1}))$. This determines the height of the slice.
STEP 3 Simulate $X_i|Y_i \sim \mathcal{U}(S(Y_i))$, where $S(Y_i)$ is the horizontal slice through $g(x)$ at height $Y_i$. The slice is an interval, or a union of intervals, whose bounds are found by solving $g(X_i) \ge Y_i$.
STEP 4 Increase $i$ and go to STEP 2 until tired.

It is interesting that only uniform generators are needed to generate a sample from various distributions. This looks like a great universal method, but its theoretical universality is impaired by the practical inability to determine the bounds of the horizontal slices when the density is complex, i.e., when the solution of the inequality $g(X_i) \ge Y_i$ is not available in closed form.

Figure 2 demonstrates the slice method for generating a standard normal random variable. The following matlab code illustrates the algorithm.

xsys=[]; xs=[]; x=0;
for i=1:5000
   y = exp(-x^2/2)*rand(1,1);              % get the height y|x
   xsys = [xsys; x y]; xs=[xs x];
   % slice {x: exp(-x^2/2) >= y} is |x| <= sqrt(-2 log y)
   x = sqrt(-2*log(y))*(2*rand(1,1)-1);    % horizontal slice x|y
end


Figure 2: (a) Slice sampling below the unnormalized normal density; (b) QQ-plot for the $x$ coordinate.

If $g(x)$ is more complicated but allows a product representation,

$$g(x) = \prod_{j=1}^{k} g_j(x),$$

then the slice sampler can be applied factor-by-factor. The algorithm finds a height $y^{(j)}$ for each factor $g_j$ separately and forms the intersection slice. Steps 2 and 3 of the basic slice algorithm are modified as follows:

STEP 2 At step $i$ generate $Y_i^{(j)}|X_{i-1} \sim \mathcal{U}(0, g_j(X_{i-1}))$, for each component $g_j$, $j = 1, \ldots, k$.
STEP 3 Simulate $X_i|Y_i^{(1)}, \ldots, Y_i^{(k)} \sim \mathcal{U}\big(\cap_{j=1}^{k} S_j(Y_i^{(j)})\big)$, where $S_j(Y_i^{(j)})$ is the horizontal slice through $g_j(x)$ at height $Y_i^{(j)}$.

Slice Example. Assume that $f(x|\theta) \propto \sqrt{1 - (x-\theta)^2}$ and that the prior on $\theta$ is $\pi(\theta) = \cos^2(\pi\theta/2)$. If $x = 0$ is observed, the Bayes rule is 0, because of the symmetry of the posterior about 0. We might be interested in testing that $\theta \in I$, for some interval $I$. Find an MCMC approximation to the posterior probability of the hypothesis $H_0: |\theta| \le 1/4$, using the slice sampler. (Numerical approximation gives $P^{\theta|x=0}(|\theta| \le 0.246117) = 1/2$, so $H_0$ should be barely accepted.)

The exact posterior $f(\theta|x = 0)$ is sort of possible, since the marginal at 0 is $m(0) = \frac{\pi + 2J_1(\pi)}{4}$, almost in a finite form. The special function $y = J_n(z)$ in the marginal is the Bessel function (of the first kind) and satisfies the differential equation

$$z^2 y'' + z y' + (z^2 - n^2) y = 0.$$
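Indeed, writing $\cos^2(\pi\theta/2) = (1 + \cos\pi\theta)/2$ and using the standard Bessel integral $\int_{-1}^{1}\sqrt{1-t^2}\cos(at)\,dt = \pi J_1(a)/a$,

$$m(0) = \int_{-1}^{1}\sqrt{1-\theta^2}\,\cos^2(\pi\theta/2)\,d\theta = \frac{1}{2}\cdot\frac{\pi}{2} + \frac{1}{2}\,J_1(\pi) = \frac{\pi + 2J_1(\pi)}{4}.$$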

The algorithm is


STEP 1 Start with $\theta_0 = 0$.
STEP 2 At step $i$ generate $Y_i^{(1)}|\theta_{i-1} \sim \mathcal{U}\big(0, \sqrt{1 - \theta_{i-1}^2}\big)$ and $Y_i^{(2)}|\theta_{i-1} \sim \mathcal{U}\big(0, \cos^2(\pi\theta_{i-1}/2)\big)$.
STEP 3 Simulate $\theta_i|Y_i^{(1)}, Y_i^{(2)} \sim \mathcal{U}\Big(\big[-\sqrt{1-(Y_i^{(1)})^2},\ \sqrt{1-(Y_i^{(1)})^2}\big] \cap \big[-\tfrac{2}{\pi}\arccos\sqrt{Y_i^{(2)}},\ \tfrac{2}{\pi}\arccos\sqrt{Y_i^{(2)}}\big]\Big)$.
STEP 4 Increase $i$ and return to STEP 2.
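A minimal matlab implementation of these steps might look as follows (a sketch; variable names are ours, and the last line estimates the posterior probability of $H_0$ from the draws):

M = 50000; thetas = zeros(1,M);
theta = 0;                                        % STEP 1
for i = 1:M
   y1 = sqrt(1 - theta^2)*rand;                   % height under g1 = sqrt(1-theta^2)
   y2 = cos(pi*theta/2)^2*rand;                   % height under g2 = cos^2(pi*theta/2)
   b = min(sqrt(1 - y1^2), (2/pi)*acos(sqrt(y2)));% intersection slice is [-b, b]
   theta = -b + 2*b*rand;                         % uniform draw from the slice
   thetas(i) = theta;
end
mean(abs(thetas) <= 1/4)                          % approximates P(|theta| <= 1/4 | x = 0)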

Figure 3: Slice Example. (a) Component densities of the posterior; (b) the posterior.

1.3 Rao-Blackwellization.

The Rao-Blackwell theorem states that the risk of an estimator $\delta_0$ is not increased by conditioning that estimator on a sufficient statistic. Let $T$ be a sufficient statistic for $\theta$ and $\delta_0$ any estimator with finite variance, and define $\delta_1(t) = E(\delta_0(X)|T = t)$. If $\delta_0$ is unbiased, then $\delta_1$ is unbiased as well, and

$$R(\theta, \delta_1) \le R(\theta, \delta_0).$$

As a simple illustration, consider $X_1, \ldots, X_n \sim \mathcal{N}(\theta, 1)$. The estimator $\delta_0(X_1, \ldots, X_n) = X_1$ is possible: it is unbiased for $\theta$, and its risk with respect to squared error loss is equal to its variance, i.e., $R(\theta, \delta_0) = 1$. The sufficient statistic is $\sum_i X_i$. Since $t = E(\sum X_i | \sum X_i = t) = \sum_i E(X_i | \sum X_i = t) = n\,E(X_1 | \sum_i X_i = t)$, the conditioning gives

$$\delta_1(X_1, \ldots, X_n) = E\Big(X_1 \,\Big|\, \sum_i X_i\Big) = \bar{X}.$$

The risk of $\delta_1(X_1, \ldots, X_n)$ is $1/n$.

The next example comes from the "Gibbs for kids" manuscript of Casella and George. Assume

$$\pi(\theta|\lambda) \propto \lambda e^{-\theta\lambda}, \quad 0 < \theta < B,$$
$$\pi(\lambda|\theta) \propto \theta e^{-\lambda\theta}, \quad 0 < \lambda < B.$$

The marginals cannot be computed, but the conditional distributions are easy to simulate from. Suppose we obtained samples

$$\theta_1, \theta_2, \ldots, \theta_M, \quad \text{and} \quad \lambda_1, \lambda_2, \ldots, \lambda_M.$$


Then $E\theta$ can be approximated by $\frac{1}{M}\sum_{i=1}^{M}\theta_i$. Since, for $B$ large, $E(\theta|\lambda) \simeq \frac{1}{\lambda}$, $E\theta$ can be approximated by either

$$\frac{1}{M}\sum_{i=1}^{M}\theta_i \quad \text{or} \quad \frac{1}{M}\sum_{i=1}^{M}\frac{1}{\lambda_i}.$$

Similarly,

$$\pi(\theta) \simeq \frac{1}{M}\sum_{i=1}^{M}\pi(\theta|\lambda_i)$$

is an alternative estimator of the marginal. This estimator is much simpler than the kernel-based nonparametric estimator that uses the MC sample $\theta_1, \theta_2, \ldots, \theta_M$.

The conditional structure of Gibbs sampling usually gives dual samples from all parameters in the model. Thus, if $E^{\theta|x} g(\theta)$ is needed, two estimators $\delta_0$ and $\delta_1$ are available,

$$\delta_0 = \frac{1}{M}\sum_{i=1}^{M} g(\theta_i) \quad \text{and} \quad \delta_1 = \frac{1}{M}\sum_{i=1}^{M} E^{\theta|x, \lambda_i} g(\theta),$$

and the second, $\delta_1$, has uniformly better risk (not that Bayesians care much, but anyway). Thus, when the calculation of $E^{\theta|x, \lambda_i} g(\theta)$ is easy, one should use the Rao-Blackwellized estimator $\delta_1$.
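For the truncated exponential pair above, both estimators of $E\theta$ are easy to compare in matlab. The sketch below is our code, not from the handout; it samples the conditionals by inverse CDF and uses $E(\theta|\lambda) = 1/\lambda - Be^{-\lambda B}/(1 - e^{-\lambda B})$, the mean of an exponential truncated to $(0, B)$:

M = 10000; B = 5; theta = 1;
thetas = zeros(1,M); lams = zeros(1,M);
for i = 1:M
   % inverse-CDF draws from Exp(rate) truncated to (0,B)
   lam   = -log(1 - rand*(1 - exp(-theta*B)))/theta;       % lambda | theta
   theta = -log(1 - rand*(1 - exp(-lam*B)))/lam;           % theta | lambda
   thetas(i) = theta; lams(i) = lam;
end
delta0 = mean(thetas)                                      % naive estimator of E(theta)
delta1 = mean(1./lams - B*exp(-lams*B)./(1 - exp(-lams*B)))% Rao-Blackwellized estimator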

1.4 Hidden Pathology.

Here is a "Gibbs solution" when the joint posterior does not exist. Casella and George show that for the well-defined pair of conditional densities

$$\pi(\theta|\lambda) \propto \lambda e^{-\theta\lambda}, \quad \theta > 0,$$
$$\pi(\lambda|\theta) \propto \theta e^{-\lambda\theta}, \quad \lambda > 0,$$

the joint density does not exist. By a result concerning the existence of joint distributions given the conditionals (Robert and Casella, Monte Carlo Statistical Methods, 1999, page 298), the joint density exists if

$$\int \frac{\pi(\theta|\lambda)}{\pi(\lambda|\theta)}\, d\theta < \infty.$$

In this case the integral (here $\int_0^\infty \frac{\lambda}{\theta}\,d\theta$) is $\infty$.

The formal MCMC analysis can be done on well-defined conditionals, but if the joint posterior distribution does not exist, the resulting realizations cannot be positive recurrent Markov chains. Consider the balanced random effects ANOVA model

$$Y_{ij} = \beta + u_i + \epsilon_{ij}, \quad j = 1, \ldots, J, \; i = 1, \ldots, I.$$

Assume that

$$u_i \sim \mathcal{N}(0, \tau^2), \quad i = 1, \ldots, I,$$
$$\epsilon_{ij} \sim \mathcal{N}(0, \sigma^2), \quad j = 1, \ldots, J, \; i = 1, \ldots, I,$$

with the prior

$$\pi(\beta, \tau^2, \sigma^2) = 1 \times \frac{1}{\tau^2} \times \frac{1}{\sigma^2}.$$


The full posterior,

$$\pi(\beta, \tau^2, \sigma^2|y) \propto \tau^{-2-I}\sigma^{-2-IJ} \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i,j}(y_{ij} - \bar{y}_i)^2\Big\} \times \exp\Big\{-\frac{J\sum_i(\bar{y}_i - \beta)^2}{2(\sigma^2 + J\tau^2)}\Big\}\,(J\sigma^{-2} + \tau^{-2})^{-I/2},$$

is improper. Indeed, if $\beta$ is integrated out, the marginal posterior density for $\tau^2, \sigma^2$ is

$$\pi(\tau^2, \sigma^2|y) \propto \frac{\tau^{-2-I}\sigma^{-2-IJ}}{(J\sigma^{-2} + \tau^{-2})^{I/2}} \exp\Big\{-\frac{1}{2\sigma^2}\sum_{i,j}(y_{ij} - \bar{y}_i)^2 - \frac{J\sum_i(\bar{y}_i - \bar{y})^2}{2(\sigma^2 + J\tau^2)}\Big\}.$$

If $\sigma^2$ is fixed, $\pi(\tau^2, \sigma^2|y)$ behaves as $\tau^{-2}$ in the neighborhood of 0; therefore the joint posterior does not exist. Nevertheless, the full conditionals are available (prove this by considering the "joint" distribution),

$$[u_i|\beta, \tau^2, \sigma^2, Y] \sim \mathcal{N}\Big(\frac{J(\bar{Y}_i - \beta)}{J + \sigma^2/\tau^2},\; (J/\sigma^2 + 1/\tau^2)^{-1}\Big), \quad i = 1, \ldots, I,$$

$$[\beta|u, \tau^2, \sigma^2, Y] \sim \mathcal{N}\Big(\bar{Y} - \bar{u},\; \frac{\sigma^2}{IJ}\Big),$$

$$[\tau^2|u, \beta, \sigma^2, Y] \sim \mathcal{IG}amma\Big(\frac{I}{2},\; \frac{1}{2}\sum_i u_i^2\Big),$$

$$[\sigma^2|u, \beta, \tau^2, Y] \sim \mathcal{IG}amma\Big(\frac{IJ}{2},\; \frac{1}{2}\sum_{i,j}(Y_{ij} - u_i - \beta)^2\Big),$$

where $Y$ stands for all $Y_{ij}$'s, $\bar{Y}_i = \frac{1}{J}\sum_{j=1}^{J} Y_{ij}$, $\bar{u} = \frac{1}{I}\sum_i u_i$, and $\bar{Y} = \frac{1}{IJ}\sum_{i,j} Y_{ij}$.
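One scan of this Gibbs sampler can be sketched as follows (our reconstruction, not the handout's mcmc2.m; Y is the I x J data matrix, beta, t2, s2 hold the current state, and rand_gamma(shape, rate) is the BAYESLAB generator used elsewhere in the handout; an inverse gamma draw is the reciprocal of a gamma draw):

Ybari = mean(Y, 2);                             % group means, I x 1
% u | beta, tau2, sigma2, Y
u = J*(Ybari - beta)/(J + s2/t2) + sqrt(1/(J/s2 + 1/t2))*randn(I,1);
% beta | u, tau2, sigma2, Y
beta = mean(Ybari) - mean(u) + sqrt(s2/(I*J))*randn;
% tau2 and sigma2 | rest
t2 = 1/rand_gamma(I/2, sum(u.^2)/2);
s2 = 1/rand_gamma(I*J/2, sum(sum((Y - repmat(u,1,J) - beta).^2))/2);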

The m-file mcmc2.m formally performs this MCMC simulation. A run of $M = 20000$ with a burn-in of 2000 gave MCMC chains for $\beta$, $\tau^2$, and $\sigma^2$ sampled from their posterior distributions. Normal variates with group means 7, 9, and 1 and unit variance were simulated as data. Although the realizations are not a recurrent MC (see the trace for $\tau^2$ and its bursts), the medians of these runs are $\beta_M = 6.8942$, $\tau^2_M = 10.0213$, and $\sigma^2_M = 0.9543$, quite close to the true values 7, 9, and 1. The mean of the runs is not used here because of the presence of outliers (bursts in the MCMC sequence), as evidenced in Figure 4(a), the middle run for $\tau^2$. The histograms are given in Panel (b); the histogram for $\tau^2$ (middle) is truncated at 30, since the distribution is not proper. In $M = 20000$ runs quite a few simulations for $\tau^2$ reach magnitudes of several thousand.

1.5 Vanilla Regression: Improving the Gibbs Sampler.

The well-known data set (Efron and Tibshirani's bootstrap monograph) consists of 15 pairs of numbers (LSAT, GPA) for a sample of American law schools.

LSAT (x)  576  635  558  578  666  580  555  661  651  605  653  575  545  572  594
GPA (y)  3.39 3.30 2.81 3.03 3.44 3.07 3.00 3.43 3.36 3.13 3.12 2.74 2.76 2.88 2.96

The model is

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \ldots, n,$$


where the $\epsilon_i$ are iid normal $\mathcal{N}(0, \frac{1}{\tau})$. Often, instead of the variance $\sigma^2$, we use its reciprocal $\tau = 1/\sigma^2$ (called the precision parameter) because of modeling issues; namely, precision parameters usually get gamma priors and/or gamma posteriors.

Figure 4: (a) Final 500 simulations of $\beta$, $\tau^2$ and $\sigma^2$; (b) histograms of the posteriors of $\beta$, $\tau^2$ and $\sigma^2$. The histogram for $\tau^2$ is truncated because the distribution is improper and the simulations exhibit extremely heavy tails.

When the prior is $\pi(\beta_0, \beta_1, \tau) = 1$, $\tau > 0$, the joint posterior simplifies to

$$\pi(\beta_0, \beta_1, \tau|y) \propto \tau^{n/2} \exp\Big\{-\frac{\tau}{2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\Big\}.$$

The full conditionals are easily completed:

$$[\beta_0|\beta_1, \tau, y] \sim \mathcal{N}\big((S_y - \beta_1 S_x)/n,\; (n\tau)^{-1}\big),$$
$$[\beta_1|\beta_0, \tau, y] \sim \mathcal{N}\big((S_{xy} - \beta_0 S_x)/S_{xx},\; (S_{xx}\tau)^{-1}\big),$$
$$[\tau|\beta_0, \beta_1, y] \sim \mathcal{G}amma\Big(n/2 + 1,\; \frac{1}{2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\Big),$$

where $S_x = \sum_i x_i$, $S_y = \sum_i y_i$, $S_{xx} = \sum_i x_i^2$, and $S_{xy} = \sum_i x_i y_i$.

Figure 5 shows the output of the Gibbs sampler implemented in lawsc.m. The initial values were $\beta_0 = \beta_1 = 0$ and $\tau = 1$; the number of iterations was $M = 10000$ with a burn-in of 1000. As is evident from panels (a) and (c) in Figure 5, there is strong correlation between $\beta_0$ and $\beta_1$, as well as strong autocorrelation within each chain (the chains for $\beta_0$ and $\beta_1$ resemble a random walk and move slowly across the space of possible values). Thus we are not sure that stationarity of the traces has been achieved. One way to improve performance is to orthogonalize the regression. If $(\beta_0, \beta_1, \tau) \mapsto (\beta_0', \beta_1', \tau')$ with $\beta_0' = \beta_0 + \beta_1\bar{x}$, $\beta_1' = \beta_1$, and $\tau' = \tau$ (i.e., the covariate is centered, $x_i' = x_i - \bar{x}$), the Gibbs sampler is readily adjusted, and the output is shown in Figure 6. Now the traces for $\beta_0'$ and $\beta_1'$ look better mixed (Panel (a)), the histograms of the posterior distributions are smoother (Panel (b)), and the posterior correlation is removed (Panel (c)).

Orthogonalized regression clearly was more amenable to the Gibbs solution. Another way to improve the original solution is to block parameters. In this case one can sample $(\beta_0, \beta_1)$ in a block, since the distribution of $[\beta_0, \beta_1|\tau, y]$ is known:

$$[\beta_0, \beta_1|\tau, y] \sim \mathcal{MVN}_2\big((X'X)^{-1}X'y,\; (X'X)^{-1}\tau^{-1}\big). \quad (1)$$
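A single scan of the blocked sampler might look like this (a sketch, reusing rand_MVN and rand_gamma from BAYESLAB; X = [ones(n,1) x] is the design matrix):

XtXi = inv(X'*X);
bhat = XtXi*(X'*y);                                  % least-squares estimate
betab = rand_MVN(1, bhat, XtXi/tau)';                % (beta0, beta1) | tau, y, as in (1)
tau = rand_gamma(n/2 + 1, sum((y - X*betab).^2)/2);  % tau | beta0, beta1, y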


Figure 5: (a) Traces of the chains for $\beta_0$, $\beta_1$ and $\tau$; (b) histograms for $\beta_0$, $\beta_1$ and $\tau$ with burn-in observations omitted; (c) a scatterplot of $(\beta_0, \beta_1)$ showing strong posterior correlation.

Figure 6: (a) Traces of the chains for $\beta_0'$, $\beta_1'$ and $\tau'$; (b) histograms for $\beta_0'$, $\beta_1'$ and $\tau'$ with burn-in observations omitted; (c) a scatterplot of $(\beta_0', \beta_1')$ showing that their posterior correlation is removed.

1.6 Production Example.

To be added!

1.7 Exercises.

1. Blocking in Law School Regression. Implement a Gibbs sampler in the Law School data example by utilizing blocking as in (1).

2. Hemodilution. Clark et al.² examined the fat filtration characteristics of a packed polyester-and-wool filter used in the arterial lines during clinical hemodilution. They collected data on the filter's recovery of solids for 10 patients who underwent surgery. The table below shows removal rates of lipids and cholesterol. Fit a regression line to the data, where cholesterol is the $Y$ variable and lipids are the $X$ variable.

²Clark, R., Margraf, H., and Beauchamp, R. (1975). Fat and Solid Filtration in Clinical Perfusions. Surgery, 77, 216-224.

Removal rates, mg/kg/L ×10⁻²

patient  Lipids (X)  Cholesterol (Y)
1        3.81        1.90
2        2.10        1.03
3        0.79        0.44
4        1.99        1.18
5        1.03        0.62
6        2.07        1.29
7        0.74        0.39
8        3.88        2.30
9        1.43        0.93
10       0.41        0.29

References

[1] Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). Annals of Statistics, 22, 1701-1762.

[2] Damien, P., Wakefield, J., and Walker, S. (1999). Gibbs sampling for non-conjugate and hierarchical models by using auxiliary variables. JRSSB.

[3] Neal, R. (1997). Markov chain Monte Carlo methods based on "slicing" the density function. Technical report, University of Toronto.

[4] Robert, C. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer.

[5] Roberts, G. and Rosenthal, J. (1999). Convergence of slice sampler Markov chains. JRSSB.

1.8 Matlab

1.8.1 Law Schools.

Here is the file lawsc.m, in which cosmetics and plotting statements are omitted.

M=10000; burnin=1000;
laws=[ ...
    4 653 3.12; ...
    6 576 3.39; ...
   13 635 3.30; ...
   15 661 3.43; ...
   31 605 3.13; ...
   35 578 3.03; ...
   36 572 2.88; ...
   45 545 2.76; ...
   47 651 3.36; ...
   50 555 3.00; ...
   52 580 3.07; ...
   53 594 2.96; ...
   70 666 3.44; ...
   79 558 2.81; ...
   82 575 2.74];
x=laws(:,2); y=laws(:,3);
n = length(x);
beta0=0; beta0s=[];
beta1=0; beta1s=[];
tau = 1; taus =[];
Sx=sum(x); Sy=sum(y); Sxx=sum(x.^2); Sxy=sum(x.*y);
for i = 1:M
   beta0 = (n*tau)^(-1/2)*randn + (Sy - beta1*Sx)/n;           % beta0 | beta1, tau, y
   beta1 = (Sxx*tau)^(-1/2)*randn + (Sxy - beta0*Sx)/Sxx;      % beta1 | beta0, tau, y
   tau = rand_gamma(n/2 + 1, sum((y - beta0 - beta1*x).^2)/2); % tau | beta0, beta1, y
   beta0s = [beta0s beta0];
   beta1s = [beta1s beta1];
   taus = [taus tau];
end
figure(1)
subplot(3,1,1); plot(beta0s(burnin+1:M))
subplot(3,1,2); plot(beta1s(burnin+1:M))
subplot(3,1,3); plot(taus(burnin+1:M))

For orthogonal regression, the following modification is needed.

xx=laws(:,2);
x=xx-mean(xx); y=laws(:,3);   % center the covariate, so Sx = 0
n = length(x);
beta0=0; beta0s=[];
beta1=0; beta1s=[];
tau = 1; taus =[];
Sx=sum(x); Sy=sum(y); Sxx=sum(x.^2); Sxy=sum(x.*y);
for i = 1:M
   beta0 = (n*tau)^(-1/2)*randn + Sy/n;
   beta1 = (Sxx*tau)^(-1/2)*randn + Sxy/Sxx;
   tau = rand_gamma(n/2 + 1, sum((y - beta0 - beta1*x).^2)/2);
   beta0s = [beta0s beta0];
   beta1s = [beta1s beta1];
   taus = [taus tau];
end
