
A SIMULATION STUDY FOR BAYESIAN HIERARCHICALMODEL SELECTION METHODS

Fang Fang

A Thesis Submitted to theUniversity of North Carolina Wilmington in Partial Fulfillment

of the Requirements for the Degree ofMaster of Science

Department of Mathematics and Statistics

University of North Carolina Wilmington

2009

Approved by

Advisory Committee

Chair

Accepted by

Dean, Graduate School

Dr. Matthew TenHuisen

Dr. Yishi Wang

Dr. Susan Simmons

TABLE OF CONTENTS

ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

DEDICATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii

1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1 BAYESIAN STATISTICS . . . . . . . . . . . . . . . . . . . . 3

2.2 HIERARCHICAL BAYESIAN MODEL . . . . . . . . . . . . 4

2.3 GIBBS SAMPLER . . . . . . . . . . . . . . . . . . . . . . . . 6

2.4 PRIOR DISTRIBUTION . . . . . . . . . . . . . . . . . . . . 9

2.5 POSTERIOR DISTRIBUTION . . . . . . . . . . . . . . . . . 10

3 TWO MODEL SEARCH METHODS . . . . . . . . . . . . . . . . . . 12

3.1 ACTIVATION PROBABILITY . . . . . . . . . . . . . . . . . 12

3.2 MODEL SEARCH BY SYSTEMATIC PROCESS . . . . . . . 13

3.3 MODEL SEARCH BY STOCHASTIC PROCESS . . . . . . . 16

4 SIMULATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.1 DATA SET . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.2 RESULTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

APPENDIX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

A. EXAMPLE OF USING R CODE TO GENERATE SIMULATED

DATA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

B. SYSTEMATIC SEARCH METHOD . . . . . . . . . . . . . . . . . 32


C. STOCHASTIC SEARCH METHOD . . . . . . . . . . . . . . . . . 43


ABSTRACT

Model selection is a useful method for determining the important features that control a response. This thesis explores two model selection strategies in a hierarchical setting. The first method is the systematic search proposed in Haikun Bao's thesis (2006), and the second is a stochastic search proposed in Yi Chen's thesis (2007) and further developed by UNCW student Qijun Fang (2009). An intensive simulation study investigates the usefulness of these two methodologies under a number of situations. Both methods identify important features in the study, but the stochastic search produces slightly better results.


DEDICATION

I would like to dedicate this thesis to my family, especially my husband Qijun Fang, whose love and support have made me able to do more than I thought possible.


ACKNOWLEDGMENTS

I would like to thank my advisor Dr. Susan Simmons from the bottom of my heart. Without her great insight, guidance, patience and encouragement all the way through my study and research over the past two and a half years, this would never have been possible.

I would also like to thank all the faculty members, staff and fellow students in the Department of Mathematics and Statistics for their help throughout my studies here.


LIST OF TABLES

1 Marker Information (X Matrix) . . . . . . . . . . . . . . . . . . . . . 21

2 Quantitative Trait Information (Y Matrix) . . . . . . . . . . . . . . . 22

3 Result of one QTL detection . . . . . . . . . . . . . . . . . . . . . . . 22

4 Result of two QTLs detection (same chromosome) . . . . . . . . . . . 23

5 Result of two QTLs detection (different chromosomes) . . . . . . . . 24

6 Result of Three QTLs detection . . . . . . . . . . . . . . . . . . . . . 25

7 Result of two QTLs detection (same chromosome) . . . . . . . . . . . 26


LIST OF FIGURES

1 Structure of hierarchical model . . . . . . . . . . . . . . . . . . . . . 5

2 Detect QTL by systematic method . . . . . . . . . . . . . . . . . . . 15

3 Model vectors inside a Markov Chain . . . . . . . . . . . . . . . . . . 18

4 Genetic map for Bay×Sha . . . . . . . . . . . . . . . . . . . . . . . . 19


1 INTRODUCTION

In the published literature, there exist many methods to identify potentially useful models. For example, in many regression books, forward selection, stepwise forward selection, backward elimination and best subsets are just a few of the methods discussed (Kutner et al.).

However, the choice of the best model for the data is still a question of interest. Model selection is the task of choosing the best statistical model for a given data set from a group of potential models. By choosing the best model for the data, the most important features that control the response are identified. One example is the Quantitative Trait Loci (QTL) experiment, in which researchers often need to identify the locations, or loci, on a genetic map responsible for controlling a quantitative trait from many potential markers. The markers responsible for controlling the response, or quantitative trait, are referred to as QTL. QTL greatly help researchers understand the biochemical basis of these traits and their evolution in populations over time (Bao, et al 2006). However, if a bad model is selected, then inference based on the data and the model will usually lead to erroneous conclusions and can potentially lead researchers in the wrong direction.

Therefore, one must be careful when attempting to discover the "best" model for the data. But the word "best" can be somewhat controversial. Usually, the greater the number of features the model has, the better that model fits the data; however, the complexity of the model increases as well. The more potential features there are for a data set, the more possible models there are. For example, when there are P features, there are a total of 2^P possible models. For large P, the above-mentioned methodologies may lead to numerous different "best" models.

There are a number of frequentist and Bayesian model selection procedures available to identify the "best" model (Ibrahim et al.). Several criterion-based procedures have been proposed and are frequently used. Among the most widely used criteria are Akaike's Information Criterion (Akaike, 1973), Schwarz's Bayesian Information Criterion (Schwarz, 1978), Bedrick and Tsai's modification of AIC (1994), and the simultaneous test procedures for model selection in the multivariate linear model discussed by Gabriel (1968) and McKay (1977). This thesis investigates two different model search strategies. Both of these strategies are useful in hierarchical structures.

Hierarchical models are useful in modeling complex data structures. For example, Dominici, Samet, and Zeger (2000) used a hierarchical approach to combine dose response estimates characterizing the health effects of air pollution in a series of U.S. cities, drawing on results from Daniel and Kass (1988) to argue that the approach will give similar results to a full analysis based on raw data when the study-specific sample sizes are moderate to large. Simmons, Piegorsch, Nitcheva, and Zeiger (2003) used Bayesian hierarchical models to synthesize information from multiple studies in an environmental mutagenesis context. Coull, Mezzetti, and Ryan used a Bayesian hierarchical model to quantify the adverse health effects associated with in-utero exposure to methylmercury. Bayesian hierarchical models are flexible and are able to handle many features.

This thesis compares two model search strategies developed by graduate students at UNCW for hierarchical models. One method involves a systematic search (Boone, et al) and the other involves a stochastic search (Simmons, et al). Section 2 discusses the background information for the hierarchical model. Section 3 introduces the two model search strategies, and Section 4 outlines the simulation study. Section 5 is the conclusion of this study.


2 BACKGROUND

2.1 BAYESIAN STATISTICS

Unlike frequentists, who take parameters as fixed but unknown quantities, Bayesian statisticians regard parameters as random variables with a probability distribution. The process of Bayesian data analysis can be illustrated by the following three steps:

1. Set up a full probability model for all observable and unobservable data that is consistent with knowledge about the problem and the data collection processes.

2. Conditioning on the observed data, calculate and interpret the appropriate posterior distribution: the conditional probability distribution of the unobserved quantities of ultimate interest after the data are observed.

3. Evaluate the fit of the model and the implications of the resulting posterior distribution. Test whether the model fits the data well, check whether the substantive conclusions are reasonable, and see how sensitive the results are to the modeling assumptions. If necessary, one can alter or expand the model and repeat the three steps.

Bayesians first set up a prior distribution for the parameters, which comes from prior knowledge and/or expert advice. This information is then combined with the observed data to obtain the posterior distribution by Bayes' Theorem. Assume θ is an unknown parameter with prior probability distribution p(θ), and the distribution of the data is represented as p(y|θ). Using the prior distribution and the information from the data, the posterior distribution is found by Bayes' Theorem:

\[
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta} \tag{1}
\]


The Bayesian analysis is then based on inferences from the posterior distribution[1][2][3].
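As a small numerical illustration of Bayes' Theorem, the posterior can be approximated on a discrete grid of candidate parameter values, replacing the integral in the denominator by a sum. The binomial data and the grid below are hypothetical, not from the thesis:

```python
# Hypothetical illustration of Bayes' Theorem on a discrete grid:
# posterior(theta) is proportional to likelihood(theta) * prior(theta).
from math import comb

def posterior_grid(y, n, thetas, prior):
    """Posterior over a grid of candidate theta values for binomial data."""
    likelihood = [comb(n, y) * t**y * (1 - t)**(n - y) for t in thetas]
    unnorm = [l * p for l, p in zip(likelihood, prior)]
    z = sum(unnorm)          # plays the role of the integral in the denominator
    return [u / z for u in unnorm]

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
prior = [0.2] * 5            # uniform prior p(theta) over the grid
post = posterior_grid(y=7, n=10, thetas=thetas, prior=prior)
```

With 7 successes in 10 trials, the grid point 0.7 receives the largest posterior weight, as expected.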

2.2 HIERARCHICAL BAYESIAN MODEL

The hierarchical Bayesian model is flexible enough to deal adequately with observed data that have multiple levels, such as QTL detection in a plant experiment. Suppose the quantitative trait in the plant QTL experiment is represented as y_ij, with i = 1, ..., L (L = number of plant lines, or genotypes) and j = 1, ..., n_i (n_i = number of replicates). In the first level, the observed data from the ith genotype are assumed to be independently distributed with mean θ_i and variance σ_i².

The probability density function of the data given the parameters θ_i and σ_i² is written p(y_ij | θ_i, σ_i²), which is called the likelihood function. In the second level, the mean θ_i is conditional on further parameters, known as hyper-parameters. The Bayesian hierarchical model therefore constructs a relationship between multiple parameters in a layered data structure[4].

Figure 1 illustrates the structure of the data in a plant QTL experiment. The data y_ij are obtained from a distribution with mean θ_i, which depends on the hyper-parameters β and τ².


[Figure: the hyper-parameters β and τ² sit at the top level; each genotype mean θ_1, θ_2, θ_3, ..., θ_i is drawn given β and τ²; and the replicates y_i1, y_i2, ..., y_in_i within each line are drawn given θ_i.]

Figure 1: Structure of hierarchical model

2.3 GIBBS SAMPLER

The Gibbs sampler is a particular Markov chain Monte Carlo algorithm which has been found useful in many multidimensional problems. In this thesis, we use the Gibbs sampler to obtain samples from the joint posterior distribution. In each iteration of the Gibbs sampler, we draw a sample of each parameter conditional on the current values of all the other parameters. At each iteration t, the parameters θ, σ², β, τ² are sampled and updated conditional on the latest values of the other parameters:

\[ \beta^{(t+1)} \sim P(\beta \mid \theta^{(t)}, \sigma^{2(t)}, \tau^{2(t)}, y) \tag{2} \]

\[ \tau^{2(t+1)} \sim P(\tau^2 \mid \theta^{(t)}, \sigma^{2(t)}, \beta^{(t+1)}, y) \tag{3} \]

\[ \theta^{(t+1)} \sim P(\theta \mid \sigma^{2(t)}, \tau^{2(t+1)}, \beta^{(t+1)}, y) \tag{4} \]

\[ \sigma^{2(t+1)} \sim P(\sigma^2 \mid \theta^{(t+1)}, \tau^{2(t+1)}, \beta^{(t+1)}, y) \tag{5} \]

In addition, in order to generate a Gibbs sequence containing samples from the stationary joint posterior distribution, we use the following initial values:

1. θ^(0): the sample average of the observed data in the ith line

2. σ^2(0): the sample variance of the observed data in the ith line

3. β^(0): estimates from a regression model based on the marker origin information matrix

4. τ^2(0): the variance between sample means

We are now able to start the Gibbs sampler from these starting points and obtain samples of θ, σ², β, τ² by updating these parameters sequentially from their full conditional distributions. The four full conditional distributions are:


1. To obtain samples of θ from the distribution of θ conditional on the other parameters, p(θ | σ², β, τ², y), we have

\[
\begin{aligned}
p(\theta \mid \sigma^2, \beta, \tau^2, y) &= \frac{p(\theta, \sigma^2, \beta, \tau^2 \mid y)}{p(\sigma^2, \beta, \tau^2 \mid y)} = \frac{p(\theta \mid X\beta, \tau^2)\, p(y \mid \theta, \sigma^2)}{\int p(\theta \mid X\beta, \tau^2)\, p(y \mid \theta, \sigma^2)\, d\theta} \\
&\propto \exp\Big[ -\frac{1}{2\tau^2}(\theta - X\beta)'(\theta - X\beta) - \sum_{i=1}^{L}\sum_{j=1}^{n_i} \frac{1}{2\sigma_i^2}(y_{ij} - \theta_i)^2 \Big] \\
&\propto \exp\Big\{ \sum_{i=1}^{L} \Big[ -\frac{1}{2}\Big(\frac{1}{\tau^2} + \frac{n_i}{\sigma_i^2}\Big)\theta_i^2 + \Big(\frac{X_i\beta}{\tau^2} + \frac{\sum_{j=1}^{n_i} y_{ij}}{\sigma_i^2}\Big)\theta_i \Big] \Big\} \\
&\propto \exp\Bigg\{ \sum_{i=1}^{L} -\frac{1}{2}\Big(\frac{1}{\tau^2} + \frac{n_i}{\sigma_i^2}\Big)\Bigg(\theta_i - \frac{\frac{X_i\beta}{\tau^2} + \frac{\sum_{j=1}^{n_i} y_{ij}}{\sigma_i^2}}{\frac{1}{\tau^2} + \frac{n_i}{\sigma_i^2}}\Bigg)^2 \Bigg\}
\end{aligned}
\]

where X_i is the ith line of genotypes in the experiment. Therefore,

\[
\theta_i \mid \sigma^2, \tau^2, \beta, y \sim N\Bigg[ \frac{\frac{X_i\beta}{\tau^2} + \frac{\sum_{j=1}^{n_i} y_{ij}}{\sigma_i^2}}{\frac{1}{\tau^2} + \frac{n_i}{\sigma_i^2}},\; \frac{1}{\frac{1}{\tau^2} + \frac{n_i}{\sigma_i^2}} \Bigg] \tag{6}
\]

2. To obtain samples of σ² from the distribution of σ² conditional on the other parameters, p(σ² | θ, β, τ², y), we have

\[
\begin{aligned}
P(\sigma^2 \mid \theta, \tau^2, \beta, y) &= \frac{P(y \mid \theta, \sigma^2)\, P(\sigma^2)}{\int P(y \mid \theta, \sigma^2)\, P(\sigma^2)\, d\sigma^2} \\
&\propto \prod_{i=1}^{L} (\sigma_i^2)^{-\left(\frac{\sigma_0^2 + n_i}{2} + 1\right)} \exp\Big\{ -\sum_{i=1}^{L} \frac{1}{2\sigma_i^2}\Big[\sum_{j=1}^{n_i}(y_{ij} - \theta_i)^2 + 1\Big] \Big\}
\end{aligned}
\]

so that

\[
\sigma_i^2 \mid \theta, \tau^2, \beta, y \sim \text{Inv-Gamma}\Bigg[ \frac{\sigma_0^2 + n_i}{2},\; \frac{\sum_{j=1}^{n_i}(y_{ij} - \theta_i)^2 + 1}{2} \Bigg] \tag{7}
\]

3. To obtain samples of β from the distribution of β conditional on the other parameters, p(β | θ, σ², τ², y), we have

\[
\begin{aligned}
P(\beta \mid \theta, \sigma^2, \tau^2, y) &= \frac{P(\theta \mid \tau^2, \beta)\, P(\beta)}{\int P(\theta \mid \tau^2, \beta)\, P(\beta)\, d\beta} \\
&\propto \exp\Big\{ -\frac{\beta'\beta}{200} - \frac{1}{2\tau^2}(\theta - X\beta)'(\theta - X\beta) \Big\} \\
&\propto \exp\Big\{ -\frac{1}{2}\Big[ \beta'\Big(\frac{I}{100} + \frac{X'X}{\tau^2}\Big)\beta - \frac{2}{\tau^2}\theta' X \beta \Big] \Big\} \\
&\propto \exp\Big\{ -\frac{1}{2}\Big[\beta - \Big(\frac{I}{100} + \frac{X'X}{\tau^2}\Big)^{-1}\frac{X'\theta}{\tau^2}\Big]'\Big(\frac{I}{100} + \frac{X'X}{\tau^2}\Big)\Big[\beta - \Big(\frac{I}{100} + \frac{X'X}{\tau^2}\Big)^{-1}\frac{X'\theta}{\tau^2}\Big] \Big\}
\end{aligned}
\]

where I is the identity matrix. Therefore,

\[
\beta \mid \theta, \sigma^2, \tau^2, y \sim N\Bigg[ \Big(\frac{I}{100} + \frac{X'X}{\tau^2}\Big)^{-1}\frac{X'\theta}{\tau^2},\; \Big(\frac{I}{100} + \frac{X'X}{\tau^2}\Big)^{-1} \Bigg] \tag{8}
\]

4. To obtain samples of τ² from the distribution of τ² conditional on the other parameters, p(τ² | θ, σ², β, y), we have

\[
\begin{aligned}
P(\tau^2 \mid \theta, \sigma^2, \beta, y) &= \frac{p(\tau^2, \theta, \sigma^2, \beta \mid y)}{p(\theta, \sigma^2, \beta \mid y)} = \frac{P(\theta \mid \tau^2, \beta)\, P(\tau^2)}{\int P(\theta \mid \beta, \tau^2)\, P(\tau^2)\, d\tau^2} \\
&\propto (\tau^2)^{-\left(\frac{L + \tau_0^2}{2} + 1\right)} \exp\Big\{ -\frac{[(\theta - X\beta)'(\theta - X\beta) + 1]/2}{\tau^2} \Big\}
\end{aligned}
\]

where τ_0² = 1 and L is the number of genotypes in the experiment. Therefore,

\[
\tau^2 \mid \theta, \sigma^2, \beta, y \sim \text{Inv-Gamma}\Bigg( \frac{L + \tau_0^2}{2},\; \frac{(\theta - X\beta)'(\theta - X\beta) + 1}{2} \Bigg) \tag{9}
\]

In the QTL plant experiment, there are 38 candidate markers in the Bay-0×Shahdara recombinant inbred lines from Arabidopsis thaliana, which means there are 2^38 possible regression models in the experiment. In this thesis, we consider a much smaller subset of the model space, and for each model we run only 52,000 iterations of the Gibbs sampler due to the computational intensity. In order to diminish the effect of a possibly bad starting distribution, the first 2,000 of the 52,000 iterations for each parameter are discarded. We assume that the distribution of the simulated parameter values, for large enough iteration t, converges to the stationary joint posterior distribution p(θ, σ², β, τ²|y), which is the target distribution we are trying to simulate[5][6].
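The alternation of full conditional draws above can be sketched on a much smaller toy model. The following Python sketch (the thesis's own code, in R, appears in the Appendix) uses an assumed two-parameter normal model rather than the four-parameter hierarchical model, but shows the same mechanics: each parameter is redrawn from its full conditional given the current value of the other, and an initial burn-in is discarded:

```python
# A minimal Gibbs sampler sketch on an assumed toy model (not the thesis code):
# y_j ~ N(mu, sigma2), updating mu and sigma2 from their full conditionals.
import random
random.seed(1)

y = [random.gauss(40.0, 2.0) for _ in range(200)]
n, ybar = len(y), sum(y) / len(y)

mu, sigma2 = 0.0, 1.0          # deliberately poor starting values
draws = []
for t in range(3000):
    # mu | sigma2, y  ~  N(ybar, sigma2 / n)   (flat prior on mu)
    mu = random.gauss(ybar, (sigma2 / n) ** 0.5)
    # sigma2 | mu, y  ~  Inv-Gamma(n/2, sum((y_j - mu)^2)/2)
    rate = sum((v - mu) ** 2 for v in y) / 2.0
    sigma2 = 1.0 / random.gammavariate(n / 2.0, 1.0 / rate)
    if t >= 500:               # discard burn-in, as the thesis does
        draws.append(mu)

posterior_mean = sum(draws) / len(draws)
```

Despite the bad starting values, the retained draws concentrate around the true mean of 40, illustrating the convergence-to-stationarity assumption stated above.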

2.4 PRIOR DISTRIBUTION

Reasonable prior distributions can be set up in terms of the given information and knowledge, and in order to simplify the results we try to use conjugate prior distributions. Conjugacy means that the posterior distribution follows the same parametric form as the prior distribution[6]. In many cases, conjugate prior distributions put the posterior distribution in analytic form. Here is an example to illustrate this case:

Suppose the data are obtained from a Poisson distribution, so the likelihood function has the form

\[
p(y \mid \theta) = \frac{\theta^y e^{-\theta}}{y!}
\]

Then the conjugate prior for this distribution is the Gamma distribution, p(θ) = Gamma(α, β), and the posterior distribution has the form:

\[
\begin{aligned}
p(\theta \mid y) &= \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta}
= \frac{\frac{\theta^y e^{-\theta}}{y!} \cdot \frac{\theta^{\alpha-1} e^{-\theta/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}}}{\int_0^{\infty} \frac{\theta^y e^{-\theta}}{y!} \cdot \frac{\theta^{\alpha-1} e^{-\theta/\beta}}{\Gamma(\alpha)\,\beta^{\alpha}}\, d\theta} \\
&= \frac{\theta^{y+\alpha-1}\, e^{-\theta(1 + \frac{1}{\beta})}}{\int_0^{\infty} \theta^{y+\alpha-1}\, e^{-\theta(1 + \frac{1}{\beta})}\, d\theta}
= \frac{\theta^{y+\alpha-1}\, e^{-\theta(1 + \frac{1}{\beta})}}{\Gamma(y+\alpha)\left(\frac{\beta}{\beta+1}\right)^{y+\alpha}}
\end{aligned}
\]

We can see the posterior distribution is Gamma: θ | y ~ Gamma(y + α, β/(β+1)).
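The Poisson-Gamma conjugacy above can be checked numerically: normalizing likelihood × prior on a fine grid should reproduce the stated Gamma(y + α, β/(β+1)) posterior. The parameter values below are hypothetical, chosen only for the check:

```python
# Numerical check of Poisson-Gamma conjugacy: the grid-normalized posterior
# mean should match the mean (y + alpha) * beta/(beta+1) of the stated
# Gamma(y + alpha, beta/(beta+1)) posterior.
from math import exp, lgamma, log

def log_prior(theta, alpha, beta):      # Gamma(alpha, scale beta) density, log scale
    return (alpha - 1) * log(theta) - theta / beta - lgamma(alpha) - alpha * log(beta)

def log_lik(theta, y):                  # Poisson likelihood, log scale
    return y * log(theta) - theta - lgamma(y + 1)

alpha, beta, y = 2.0, 1.5, 4
grid = [i * 0.001 + 0.0005 for i in range(40000)]   # theta in (0, 40)
w = [exp(log_lik(t, y) + log_prior(t, alpha, beta)) for t in grid]
z = sum(w)                              # numerical normalizing constant
numeric_mean = sum(t * wi for t, wi in zip(grid, w)) / z

analytic_mean = (y + alpha) * (beta / (beta + 1))   # mean of the Gamma posterior
```

For these values the analytic posterior mean is 6 × 0.6 = 3.6, and the grid computation agrees to several decimal places.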

2.5 POSTERIOR DISTRIBUTION

The posterior distribution summarizes the current state of knowledge about all the uncertain quantities (including unobservable parameters as well as missing, latent, and unobserved potential data) in a Bayesian analysis. Analytically, the posterior density is proportional to the product of the prior density and the likelihood function.

Now, let us derive the posterior distribution of the parameters, p(β, θ, σ², τ²|y):

\[
p(\beta, \theta, \sigma^2, \tau^2 \mid y) = \frac{p(y \mid \beta, \theta, \sigma^2, \tau^2)\, p(\beta, \theta, \sigma^2, \tau^2)}{\int\!\!\int\!\cdots\!\int p(y \mid \beta, \theta, \sigma^2, \tau^2)\, p(\beta, \theta, \sigma^2, \tau^2)\, d\theta\, d\sigma^2\, d\beta\, d\tau^2} \tag{10}
\]

where

\[
\int\!\!\int\!\cdots\!\int p(y \mid \beta, \theta, \sigma^2, \tau^2)\, p(\beta, \theta, \sigma^2, \tau^2)\, d\theta\, d\sigma^2\, d\beta\, d\tau^2 = p(y) = p(D \mid M) \tag{11}
\]

This is the probability of the data given the model, which is of great interest, and we use the Monte Carlo method to estimate the integral. The quantity p(D|M) can be approximated by averaging the product of p(y|β, θ, σ², τ²) and p(β, θ, σ², τ²) after substituting samples obtained from the Gibbs sampler[6]:

\[
\int\!\!\int\!\cdots\!\int p(y \mid \beta, \theta, \sigma^2, \tau^2)\, p(\beta, \theta, \sigma^2, \tau^2)\, d\theta\, d\sigma^2\, d\beta\, d\tau^2 \approx \frac{1}{N} \sum p(y \mid \beta, \theta, \sigma^2, \tau^2)\, p(\beta, \theta, \sigma^2, \tau^2) \tag{12}
\]

where N is the number of iterations in the Gibbs sampler; N = 50,000 in the QTL experiment once the first 2,000 iterations have been discarded.

Our ultimate goal is to obtain posterior probabilities for each model:

\[
p(M_i \mid D) = \frac{p(D \mid M_i)\, p(M_i)}{\sum_{k=1}^{K} p(D \mid M_k)\, p(M_k)} \tag{13}
\]

Here p(M_i) is the prior probability of the model, and since we have no prior knowledge of which model is best, we assign each p(M_i) equal probability. After simplification we obtain the posterior probability of each model as

\[
p(M_i \mid D) = \frac{p(D \mid M_i)}{\sum_{k=1}^{K} p(D \mid M_k)} \tag{14}
\]
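Under equal prior model probabilities, the normalization in equation (14) reduces to dividing each marginal likelihood by their sum. A minimal sketch, with hypothetical p(D|M_i) values in place of the Monte Carlo estimates from equation (12):

```python
# Sketch of equation (14): with equal prior model probabilities, each model's
# posterior probability is its marginal likelihood p(D|M_i) normalized over
# all models considered.
def model_posteriors(marginal_liks):
    total = sum(marginal_liks)
    return [m / total for m in marginal_liks]

# Hypothetical p(D|M_i) values for four candidate models.
pDM = [1.2e-8, 9.6e-8, 3.0e-9, 1.5e-8]
post = model_posteriors(pDM)
best = max(range(len(post)), key=post.__getitem__)   # index of the best model
```

The posterior probabilities sum to one, and the model with the largest marginal likelihood (the second one here) receives the largest posterior probability.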


3 TWO MODEL SEARCH METHODS

The observed quantitative traits in the plant QTL experiment are represented as y_ij, with i = 1, ..., L (L = number of plant lines, or genotypes) and j = 1, ..., n_i (n_i = number of replicates). The true mean of y_ij for genotype i is represented as θ_i, and we assume y_ij ~ N(θ_i, σ²). Each θ_i is assumed to be linearly dependent on the genetic composition of the plant, which can be expressed as

\[
\theta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_M x_{iM} + \varepsilon \tag{15}
\]

where x_im = 0.5 if marker m is from parent A, x_im = -0.5 if marker m is from parent B, and x_im = 0 if the information for this marker was lost.

3.1 ACTIVATION PROBABILITY

After finding the posterior probabilities for each model, we need to determine which marker or markers are most important. We answer this by finding activation probabilities for each marker. The activation probability of the jth marker is defined as

\[
p(\beta_j \neq 0 \mid D) = \sum_{i=1}^{K} p(\beta_j \neq 0 \mid M_i, D)\, p(M_i \mid D) \tag{16}
\]

where K is the total number of models and M_i is the ith model. By Bayesian model averaging[8], β_j ≠ 0 depends on whether the jth marker is included in the model. That is,

P(β_j ≠ 0 | M_i, D) = 1 if the jth marker is in the ith model

P(β_j ≠ 0 | M_i, D) = 0 if the jth marker is not in the ith model

By using the activation probability, we can detect which markers have a significant effect on the plant QTL. In our experiment, we set 0.5 as the cut-off for a marker's activation probability.

Since many genomes have a large number of markers (sometimes more than 100), suppose we have M markers; the number of potential models is then 2^M, which may be an extremely large number. The computation can become prohibitively intensive in these situations, so we need a method to simplify the procedure.
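Equation (16) says the activation probability of marker j is simply the total posterior probability of the models that include marker j. A minimal sketch, with hypothetical model vectors and posterior probabilities:

```python
# Sketch of equation (16): the activation probability of marker j sums the
# posterior probabilities p(M_i|D) of every model whose vector includes
# marker j. The model vectors and posteriors below are hypothetical.
def activation_probabilities(model_vectors, model_posts):
    n_markers = len(model_vectors[0])
    return [sum(p for v, p in zip(model_vectors, model_posts) if v[j] == 1)
            for j in range(n_markers)]

models = [[1, 0, 1, 0, 0],
          [1, 0, 0, 0, 0],
          [1, 1, 1, 0, 0],
          [0, 0, 1, 0, 1]]
posts = [0.50, 0.25, 0.15, 0.10]        # p(M_i | D), summing to 1
act = activation_probabilities(models, posts)
important = [j for j, a in enumerate(act) if a > 0.5]   # the 0.5 cut-off
```

Here markers 1 and 3 (indices 0 and 2) exceed the 0.5 cut-off and would be flagged as significant.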

3.2 MODEL SEARCH BY SYSTEMATIC PROCESS

The systematic process breaks the genome down into smaller regions by conditioning on the regions of importance. In the systematic search method, we first break the genome into N chromosomes, yielding 2^N models to be evaluated. We then obtain the activation probability for each segment using

\[
p(S_j \neq 0 \mid D) = \sum_{i=1}^{K} p(S_j \neq 0 \mid M_i, D)\, p(M_i \mid D) \tag{17}
\]

where S_j denotes the jth segment (here, a chromosome).

First, we identify which chromosome(s) are important: if a chromosome has an activation probability greater than 0.5, then in the next step we divide it into halves. We keep repeating this procedure until the important marker(s) containing the QTL(s) are identified. For instance, in our plant experiment there are 5 chromosomes, which contain 9, 7, 6, 8 and 8 markers respectively. The search algorithm first finds which chromosomes make a significant contribution to the QTL by searching through all potential models and calculating the activation probability for each chromosome (we denote them S1, S2, S3, S4, S5). Suppose we obtained p(S1 ≠ 0|D) = 0.6, p(S2 ≠ 0|D) = 0.4, p(S3 ≠ 0|D) = 0.7, p(S4 ≠ 0|D) = 0.3, p(S5 ≠ 0|D) = 0.2. Then chromosomes 1 (S1) and 3 (S3) have activation probabilities greater than 0.5, so we carry out further analysis on chromosomes 1 and 3. Chromosome 1 has 9 markers and chromosome 3 has 6 markers, so we divide each of these two chromosomes into two parts, as 5+4 and 3+3; we now have 4 segments, denoted S11, S12, S31, S32 (2^4 models). Suppose that, after calculating the new activation probabilities for the segments, we have p(S11 ≠ 0|D) = 0.5, p(S12 ≠ 0|D) = 0.4, p(S31 ≠ 0|D) = 0.9, p(S32 ≠ 0|D) = 0.3. The algorithm is rerun and continues to divide S11 and S31. Finally it picks out the markers with activation probability higher than 0.5, and these are of great interest.
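The recursive score-then-split procedure can be sketched as follows. The function `segment_activation` is a hypothetical stand-in for the Gibbs-sampler-based activation probability of equation (17), here replaced by a toy oracle so the control flow can be demonstrated:

```python
# A sketch of the systematic search: score whole segments, keep those with
# activation probability above 0.5, split them in half, and recurse until
# single markers remain. `toy_activation` is a hypothetical stand-in for
# the actual activation probabilities of equation (17).
def systematic_search(segment, segment_activation):
    if segment_activation(segment) <= 0.5:
        return []
    if len(segment) == 1:
        return segment                      # an identified QTL marker
    mid = (len(segment) + 1) // 2           # e.g. 9 markers split as 5 + 4
    return (systematic_search(segment[:mid], segment_activation)
            + systematic_search(segment[mid:], segment_activation))

true_qtl = {4, 19}                          # markers 4 and 19, as in Figure 2
def toy_activation(segment):
    return 0.99 if true_qtl & set(segment) else 0.1

chromosomes = [list(range(1, 10)), list(range(10, 17)), list(range(17, 23)),
               list(range(23, 31)), list(range(31, 39))]
found = sorted(m for chrom in chromosomes
               for m in systematic_search(chrom, toy_activation))
```

With this oracle the search narrows 38 markers down to markers 4 and 19 while scoring only a handful of segments, which is the point of the bisection: far fewer than 2^38 models are ever evaluated.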

[Figure: a tree of activation probabilities from the systematic search. Level 1: Chr 1 (M1-9, 0.99), Chr 2 (M10-16, 0.01), Chr 3 (M17-22, 0.98), Chr 4 (M23-30, 0.32), Chr 5 (M31-38, 0.01). Level 2: M1-5 (0.99), M6-9 (0.11), M17-19 (0.98), M20-22 (0.22). Level 3: M1-3 (0.02), M4-5 (0.99), M17-18 (0.12), M19 (0.99). Level 4: M4 (0.99), M5 (0.30). Final markers identified: M4 (0.99) and M19 (0.99).]

Figure 2: Detect QTL by systematic method

3.3 MODEL SEARCH BY STOCHASTIC PROCESS

In the stochastic search method we use Markov chain Monte Carlo Model Composition (MC³), a widely used stochastic search algorithm. We define the model selection vector for the ith model as M_i. The length of M_i is M, the total number of candidate markers in the experiment. Each mth element (m ≤ M) of M_i corresponds to the mth marker, and its value is either 1 or 0, according to whether or not the mth marker is in the model. For example, suppose we have 5 markers in our experiment; if M_i is [1,1,0,0,1], the model includes markers 1, 2 and 5, that is,

\[
\theta_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_5 x_{i5} + \varepsilon_i
\]

where i = 1, ..., L.

The stochastic search begins by randomly choosing a starting model and calculating its posterior model probability p(M_i|D). It then randomly flips one entry of the model vector (from 0 to 1, or from 1 to 0). Referring to the example above, if M_i is [1,1,0,0,1], the vector M_{i+1} for the (i+1)th model could be [0,1,0,0,1], [1,0,0,0,1], [1,1,1,0,1], [1,1,0,1,1], or [1,1,0,0,0]. Suppose the first location was chosen, giving [0,1,0,0,1]; the markers included in the model change to markers 2 and 5, and the posterior model probability p(M_{i+1}|D) is calculated for this model.

With the posterior model probability p(M_i|D) for the ith model, we can identify which model or models better fit the given quantitative trait data, and locate potential markers associated with the quantitative trait. The posterior model probabilities can be used to guide the search through the model space. Based on two consecutive posterior model probabilities, we use an acceptance probability to compare the two models. Similar to the Metropolis-Hastings algorithm, the acceptance probability α_{i+1,i} is defined as the minimum of 1 and the ratio of the posterior model probability of the (i+1)th model to that of the ith model[5][7], that is,

\[
\alpha_{i+1,i} = \min\left(1,\; \frac{p(M_{i+1} \mid D)}{p(M_i \mid D)}\right) \tag{18}
\]

Using the acceptance probability α_{i+1,i} as a success probability, we generate a Bernoulli random variable with success probability α_{i+1,i}. If the generated variable is 1, the chain moves to the new model; that is, the new model (the (i+1)th model) replaces the old one (the ith model) and becomes the best-fit model given the data so far. A tally of each best model is kept; this tally records the frequency with which each model is identified as the best-fit model[3].

In our plant QTL experiment, we run 20 chains with 2,000 steps in each chain. Therefore, we consider 40,000 models in total. We then find the activation probabilities for each marker based on these models and identify the significant markers from the result table.
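The flip-propose-accept-tally loop above can be sketched on a toy problem. The function `model_score` is a hypothetical stand-in for the posterior model probability p(M|D) (up to a constant), designed so that one model is clearly best:

```python
# A sketch of the stochastic search: flip one random entry of the model
# vector, accept with probability min(1, ratio of posterior model
# probabilities) as in equation (18), and tally visited models.
# `model_score` is a hypothetical stand-in for p(M|D) up to a constant.
import random
random.seed(7)

def model_score(vec):                       # peaks at the "true" model [1,1,0,0,1]
    truth = [1, 1, 0, 0, 1]
    return 0.05 ** sum(a != b for a, b in zip(vec, truth))

def stochastic_search(n_markers, steps):
    tally = {}
    current = [random.randint(0, 1) for _ in range(n_markers)]
    for _ in range(steps):
        proposal = current[:]
        j = random.randrange(n_markers)
        proposal[j] = 1 - proposal[j]       # flip one marker in or out
        accept = min(1.0, model_score(proposal) / model_score(current))
        if random.random() < accept:        # Bernoulli(accept) move
            current = proposal
        tally[tuple(current)] = tally.get(tuple(current), 0) + 1
    return tally

tally = stochastic_search(n_markers=5, steps=2000)
best_model = max(tally, key=tally.get)
```

Over a 2,000-step chain the tally concentrates on the highest-scoring model vector, which is exactly the frequency information the thesis uses to compute activation probabilities afterwards.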


01101100111111001000101 Step 1

11101100111111001000101 Step 2

11111100111111001000101 Step 3

10111100111111001000101 Step 4

00110101101001111011010 Step 2000

Figure 3: Model vectors inside a Markov Chain


4 SIMULATION

4.1 DATA SET

In our plant experiment we use the marker information of the Bay-0×Shahdara (Bay×Sha) population, which forms a 165 lines × 38 markers structure (the X matrix). The Bay×Sha population was created by Olivier Loudet and Sylvain Chaillou[11]. Figure 4 illustrates the genetic map of the Bay×Sha population, including the locations of the genetic markers within the five chromosomes and their relative distances.

[Figure: genetic map of Arabidopsis thaliana.]

Figure 4: Genetic map for Bay×Sha

The Bay×Sha population includes 5 chromosomes, which contain between 6 and 9 markers each. Marker values are set to -0.5, 0, or 0.5. If the marker value is -0.5, the marker is from parent A; if the marker value is 0.5, the marker is from parent B; if the marker value is 0, the marker information is missing. The simulated response matrix y is an L × J matrix, where L is the number of lines and J is the number of replicate observations within each line.

In this simulation, L = 165 and J = 10. Simulations are made to identify 1 QTL up to 6 QTLs; different effect sizes are assigned, and different levels of gamma noise are added. The simulated responses are created by

\[
y_{ij} = \mu + \sum a_i \times x_i + R_{gamma} \tag{19}
\]

where i = 1, 2, ..., 165, j = 1, 2, ..., 10, μ is the true mean, a_i is the effect size of the ith marker, x_i is the ith marker value, and R_gamma is random noise from the gamma distribution. For example, suppose markers 2 and 9 have significant effects. We assign a mean of 40, effect sizes of 1 and 3 for the quantitative trait, and random noise following a gamma distribution (e.g. Gamma(0.5, 1)). The simulated response y_ij is then created by

\[
y_{ij} = 40 + 1 \cdot x_2 + 3 \cdot x_9 + R_{gamma}(0.5, 1) \tag{20}
\]

where x_2 and x_9 are determined by the marker values in the ith line of the Bay×Sha matrix; the effect size of marker 2 is 1 and the effect size of marker 9 is 3. For each ith line in the simulated data matrix, we obtain different y_ij values due to the different marker information of the line and the type of gamma noise added. In this thesis, 60 simulated response y matrices are created. Tables 3-7 show the creation of these simulated response y matrices.
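Equation (20) can be sketched directly in code. The small X matrix below is hypothetical (three lines, nine markers) standing in for the thesis's 165 × 38 Bay×Sha marker matrix, and the thesis's own generator is the R code in Appendix A:

```python
# A sketch of equation (20): simulate replicate responses for each line,
# with markers 2 and 9 active (effect sizes 1 and 3) plus Gamma(0.5, 1)
# noise. The 3 x 9 marker matrix X here is hypothetical.
import random
random.seed(3)

X = [[0.5, 0.5, -0.5, -0.5, 0.5, 0.5, -0.5, -0.5, 0.5],
     [0.5, -0.5, -0.5, 0.5, -0.5, 0.5, 0.5, 0.5, -0.5],
     [-0.5, -0.5, 0.5, 0.5, 0.5, -0.5, -0.5, 0.5, 0.5]]
mu, effects, J = 40.0, {2: 1.0, 9: 3.0}, 10   # marker -> effect size

def simulate_y(X, mu, effects, J):
    y = []
    for line in X:
        signal = mu + sum(a * line[m - 1] for m, a in effects.items())
        # gammavariate(alpha, beta) draws Gamma(shape=alpha, scale=beta) noise
        y.append([signal + random.gammavariate(0.5, 1.0) for _ in range(J)])
    return y

y = simulate_y(X, mu, effects, J)
```

For line 1 (marker 2 = 0.5, marker 9 = 0.5), the noiseless signal is 40 + 0.5 + 1.5 = 42, so every replicate in that row exceeds 42 because the gamma noise is strictly positive.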


Line   M1    M2    M3    M4   · · ·  M35   M36   M37   M38
1      0.5   0.5  -0.5  -0.5  · · ·  0.5   0.5  -0.5  -0.5
2      0.5   0.5  -0.5  -0.5  · · ·  0.5   0    -0.5  -0.5
3      0.5  -0.5  -0.5  -0.5  · · ·  0.5   0.5   0.5   0.5
4     -0.5  -0.5  -0.5  -0.5  · · · -0.5  -0.5  -0.5  -0.5
5     -0.5  -0.5  -0.5  -0.5  · · ·  0.5   0.5   0.5   0.5
6     -0.5  -0.5  -0.5   0.5  · · ·  0.5   0.5   0.5   0.5
7     -0.5  -0.5  -0.5  -0.5  · · ·  0.5   0.5   0.5   0.5
8      0.5   0.5   0.5  -0.5  · · · -0.5   0.5   0.5   0.5
9      0.5   0.5   0.5   0.5  · · · -0.5  -0.5   0.5   0.5
10    -0.5  -0.5  -0.5   0.5  · · · -0.5   0.5  -0.5  -0.5
...
156    0.5  -0.5  -0.5  -0.5  · · ·  0.5   0.5   0.5   0.5
157    0.5   0.5   0.5   0.5  · · ·  0    -0.5  -0.5  -0.5
158   -0.5  -0.5  -0.5   0.5  · · ·  0.5  -0.5  -0.5   0.5
159   -0.5  -0.5  -0.5   0.5  · · ·  0.5  -0.5   0.5   0.5
160    0.5   0.5   0.5   0.5  · · ·  0    -0.5   0     0.5
161    0.5  -0.5  -0.5   0.5  · · ·  0.5   0.5   0.5   0.5
162    0.5   0.5   0.5   0.5  · · ·  0    -0.5  -0.5  -0.5
163   -0.5  -0.5   0.5   0    · · ·  0.5   0.5  -0.5  -0.5
164   -0.5  -0.5  -0.5  -0.5  · · ·  0.5   0.5  -0.5  -0.5
165    0.5   0.5   0.5   0.5  · · · -0.5  -0.5   0.5   0.5

Table 1: Marker Information (X Matrix)


Line  Replicate 1  Replicate 2  Replicate 3  · · ·  Replicate 8  Replicate 9  Replicate 10
1     41.18938131  41.56474455  41.23084718  · · ·  40.63483704  43.00570651  40.56871827
2     40.55281128  40.54592232  40.50762652  · · ·  40.91177787  41.19623356  40.56581552
3     39.50224105  40.82294396  41.39494765  · · ·  39.54800084  39.52676063  39.50163355
4     40.08673491  40.20206574  39.50099809  · · ·  41.98796159  39.93326206  41.67587663
5     39.67481233  39.99182259  39.56960166  · · ·  40.06223603  40.62162165  40.19104787
6     39.650172    39.65022736  39.76031288  · · ·  39.72163767  40.15230126  39.57595733
7     39.5000614   40.83352131  40.21347511  · · ·  39.94490492  39.63723997  41.26369055
8     40.8803921   40.5132346   41.54494996  · · ·  40.57912699  40.85283013  42.07666229
9     42.05764982  41.12081364  41.07372557  · · ·  40.50223531  42.12374071  41.05077239
10    40.82399762  39.71636811  39.7291572   · · ·  40.39473932  39.56300054  39.54394469
...
156   39.51070138  39.759929    40.44457607  · · ·  39.61957826  39.79321536  39.78441496
157   40.84933151  41.69942735  41.09785163  · · ·  40.67930273  40.67975313  40.511839
158   39.6581928   39.59629037  39.9879948   · · ·  41.56171914  39.90985647  39.86658909
159   40.29898451  39.58609607  39.50691827  · · ·  40.9229873   39.60580775  39.59049143
160   40.71519641  40.66436995  41.28936524  · · ·  41.16070788  40.65967771  41.26533944
161   39.50003279  40.00588332  39.92925531  · · ·  39.50237111  41.25149955  39.76797481
162   40.74623311  41.7616776   40.52894885  · · ·  41.3252194   40.80819499  40.99630735
163   39.66519567  40.77677182  39.6134509   · · ·  41.46895354  39.59331165  40.67266395
164   39.66885692  39.63394557  39.57497036  · · ·  39.98043048  39.5397068   39.76860469
165   40.68137354  40.67683189  41.01929184  · · ·  40.60035641  40.65613246  40.51191945

Table 2: Quantitative Trait Information (Y Matrix)

Effect size(s)   Objective Marker(s)   Gamma noise parameters   Result of systematic search   Result of stochastic search
1                C1M2                  α = 0.5, β = 1           C1M2, C1M5                    C1M2
1                C1M2                  α = 1, β = 3             C1M2                          C1M2
3                C1M2                  α = 0.5, β = 1           C1M2                          C1M2
3                C1M2                  α = 1, β = 3             C1M2                          C1M2
5                C1M2                  α = 0.5, β = 1           C1M1, C1M2                    C1M2
5                C1M2                  α = 1, β = 3             C1M5                          C1M2
7                C1M2                  α = 0.5, β = 1           C1M2                          C1M2
7                C1M2                  α = 1, β = 3             C1M2                          C1M2
9                C1M2                  α = 0.5, β = 1           C1M1, C1M2                    C1M2
9                C1M2                  α = 1, β = 3             C1M1, C1M2                    C1M2

Table 3: Result of one QTL detection


Effect size(s)   Objective Marker(s)   Gamma noise parameters   Result of systematic search   Result of stochastic search
1,3              C1M2, C1M5            α = 0.5, β = 1           C1M2, C1M5                    C1M2, C1M5
1,3              C1M2, C1M5            α = 1, β = 3             C1M1, C1M2, C1M5              C1M2, C1M5
1,5              C1M2, C1M5            α = 0.5, β = 1           C1M1, C1M2, C1M5              C1M2, C1M5
1,5              C1M2, C1M5            α = 1, β = 3             C1M2, C1M5                    C1M2, C1M5
1,7              C1M2, C1M5            α = 0.5, β = 1           C1M2, C1M5                    C1M2, C1M5
1,7              C1M2, C1M5            α = 1, β = 3             C1M1, C1M2, C1M5, C1M9        C1M2, C1M5
1,9              C1M2, C1M5            α = 0.5, β = 1           C1M1, C1M2, C1M4, C1M5        C1M2, C1M5
1,9              C1M2, C1M5            α = 1, β = 3             C1M2, C1M5                    C1M2, C1M5
3,5              C1M2, C1M5            α = 0.5, β = 1           C1M2, C1M4, C1M5              C1M2, C1M5
3,5              C1M2, C1M5            α = 1, β = 3             C1M1, C1M2, C1M4, C1M5        C1M2, C1M5
3,7              C1M2, C1M5            α = 0.5, β = 1           C1M1, C1M2, C1M3, C1M5        C1M2, C1M5
3,7              C1M2, C1M5            α = 1, β = 3             C1M2, C1M5                    C1M2, C1M5
3,9              C1M2, C1M5            α = 0.5, β = 1           C1M1, C1M2, C1M4, C1M5        C1M2, C1M5
3,9              C1M2, C1M5            α = 1, β = 3             C1M2, C1M5                    C1M2, C1M5
5,7              C1M2, C1M5            α = 0.5, β = 1           C1M2, C1M3                    C1M2, C1M5
5,7              C1M2, C1M5            α = 1, β = 3             C1M1, C1M2, C1M4, C1M5        C1M2, C1M5
5,9              C1M2, C1M5            α = 0.5, β = 1           C1M2, C1M5                    C1M2, C1M5
5,9              C1M2, C1M5            α = 1, β = 3             C1M1, C1M2, C1M3, C1M5        C1M2, C1M5
7,9              C1M2, C1M5            α = 0.5, β = 1           C1M1, C1M2, C1M4, C1M5        C1M2, C1M5
7,9              C1M2, C1M5            α = 1, β = 3             C1M1, C1M2, C1M4, C1M5        C1M2, C1M5

Table 4: Result of two QTLs detection (same chromosome)


Effect size(s)   Objective Marker(s)   Gamma noise parameters   Result of systematic search        Result of stochastic search
2,4              C1M5, C2M15           α = 0.5, β = 1           C1M4, C1M5, C2M14, C2M15           C1M5, C2M15
2,4              C1M5, C2M15           α = 1, β = 3             C1M4, C1M5, C2M14, C2M15           C1M5, C2M15
2,6              C1M5, C2M15           α = 0.5, β = 1           C1M2, C1M5, C2M14, C2M15           C1M5, C2M15
2,6              C1M5, C2M15           α = 1, β = 3             C1M5, C2M12, C2M13, C2M14, C2M15   C1M5, C2M15
2,8              C1M5, C2M15           α = 0.5, β = 1           C1M5, C2M15                        C1M5, C2M15
2,8              C1M5, C2M15           α = 1, β = 3             C1M4, C1M5, C2M14, C2M15           C1M5, C2M15
4,6              C1M5, C2M15           α = 0.5, β = 1           C1M5, C1M8, C2M14, C2M15           C1M5, C2M15
4,6              C1M5, C2M15           α = 1, β = 3             C1M5, C2M15                        C1M5, C2M15
4,8              C1M5, C2M15           α = 0.5, β = 1           C1M5, C2M15                        C1M5, C2M15
4,8              C1M5, C2M15           α = 1, β = 3             N/A                                C1M5, C2M15
6,8              C1M5, C2M15           α = 0.5, β = 1           C1M4, C1M5, C2M14, C2M15           C1M5, C2M15
6,8              C1M5, C2M15           α = 1, β = 3             C1M5, C2M15                        C1M5, C2M15

Table 5: Result of two QTLs detection (different chromosomes)


Effect size(s)   Objective Marker(s)    Gamma noise parameters   Result of systematic search   Result of stochastic search
2,4,6            C1M6, C2M15, C3M21     α = 0.5, β = 1           N/A                           C1M6, C2M15, C3M21
2,4,6            C1M6, C2M15, C3M21     α = 1, β = 3             C1M6, C2M15, C3M20, C3M21     C1M6, C2M15, C3M21
2,4,8            C1M6, C2M15, C3M21     α = 0.5, β = 1           N/A                           C1M6, C2M15, C3M21
2,4,8            C1M6, C2M15, C3M21     α = 1, β = 3             N/A                           C1M6, C2M15, C3M21
2,6,8            C1M6, C2M15, C3M21     α = 0.5, β = 1           N/A                           C1M6, C2M15, C3M21
2,6,8            C1M6, C2M15, C3M21     α = 1, β = 3             N/A                           C1M6, C2M15, C3M21
4,6,8            C1M6, C2M15, C3M21     α = 0.5, β = 1           N/A                           C1M6, C2M15, C3M21
4,6,8            C1M6, C2M15, C3M21     α = 1, β = 3             N/A                           C1M6, C2M15, C3M21
1,5,9            C1M2, C1M9, C2M15      α = 0.5, β = 1           C1M2, C1M9, C2M15, C2M16      C1M2, C1M9, C2M15
1,5,9            C1M2, C1M9, C2M15      α = 1, β = 3             N/A                           C1M2, C1M9, C2M15
3,6,9            C1M2, C3M19, C4M26     α = 0.5, β = 1           C1M1, C1M2, C3M19, C4M26      C1M2, C3M19, C4M26
3,6,9            C1M2, C3M19, C4M26     α = 1, β = 3             C1M1, C1M2, C3M19, C4M26      C1M2, C3M19, C4M26

Table 6: Result of three QTLs detection


Effect size(s)   Objective Marker(s)                       Gamma noise parameters   Result of systematic search   Result of stochastic search
1,3,5,9          C1M2, C1M9, C2M15, C5M31                  α = 0.5, β = 1           N/A                           C1M2, C1M9, C2M15, C5M31
1,3,5,9          C1M2, C1M9, C2M15, C5M31                  α = 1, β = 3             N/A                           C1M2, C1M9, C2M15, C5M31
1,3,5,7,9        C1M2, C1M9, C4M23, C5M31, C5M35           α = 0.5, β = 1           N/A                           C1M2, C1M9, C4M23, C5M31, C5M35
1,3,5,7,9        C1M2, C1M9, C4M23, C5M31, C5M35           α = 1, β = 3             N/A                           C1M2, C1M9, C4M23, C5M31, C5M35
1,2,5,7,8,9      C1M2, C1M5, C1M9, C2M15, C4M27, C5M35     α = 0.5, β = 1           N/A                           C1M2, C1M5, C1M9, C2M15, C4M27, C5M35
1,2,5,7,8,9      C1M2, C1M5, C1M9, C2M15, C4M27, C5M35     α = 1, β = 3             N/A                           C1M2, C1M5, C1M9, C2M15, C4M27, C5M35

Table 7: Result of four or more QTLs detection


4.2 RESULTS

The systematic search method and the stochastic search method are compared in this simulation study. We use 0.5 as the threshold value for the marker activation probability. The Gibbs sampler is run for 52,000 iterations with a burn-in of 2,000 iterations. The stochastic search method uses 20 Markov chains, and each chain runs for 2,000 steps.
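As a minimal illustration of the decision rule (the probabilities below are hypothetical, not taken from the result tables), a marker is declared an active QTL when its activation probability exceeds the 0.5 threshold:

```python
# Hypothetical marker activation probabilities (illustration only;
# names follow the C<chromosome>M<marker> convention used in the tables).
activation = {"C1M2": 0.97, "C1M5": 0.88, "C1M1": 0.31, "C2M15": 0.12}

THRESHOLD = 0.5  # threshold value used in this simulation study

# Keep every marker whose activation probability exceeds the threshold.
selected = sorted(m for m, p in activation.items() if p > THRESHOLD)
print(selected)  # ['C1M2', 'C1M5']
```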

The program is written in FORTRAN 77 and compiled with the Fortran PowerStation 4.0 development environment. We ran the executable on a Dell OptiPlex GX745 PC running Microsoft Windows XP Professional SP2, with an Intel Core 2 6400 CPU at 2.13 GHz and 512 MB of RAM.

The stochastic search method takes about one week to produce its results, while the systematic search method takes anywhere from several minutes to half an hour, depending on the number of segments in each step.

Tables 3-7 summarize the results from the systematic and stochastic search methods, together with the activation probabilities of all important QTLs. For each method, we test the effect of multiple QTLs, of effect size, and of different levels of noise. In general, both search methods correctly identify the QTLs in the one-QTL and two-QTL cases. With more than two QTLs, however, the systematic search method often identifies spurious QTLs alongside the correct ones.

The greatest advantage of the systematic search method is the speed with which it identifies the significant markers. However, this method may perform poorly under smaller effect sizes and stronger noise. As Tables 3-7 show, apart from a few special cases, the systematic search method may identify more QTLs than intended, depending on the effect sizes and the level of Gamma noise. When dealing with more than two QTLs, the systematic search method fails to obtain the correct activation probability information whenever eight or more segments appear in a search step, which is why the tables contain many N/A entries.
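The model space the systematic search faces at each step grows exponentially: with numseg segments it must fit every non-empty on/off pattern, i.e. 2^numseg − 1 models, which is already 255 fits when eight segments appear. A minimal Python sketch of this enumeration, mirroring the binary-digit decoding used in the appendix program (the function name is hypothetical):

```python
def segment_models(numseg):
    """Enumerate all 2**numseg - 1 non-empty segment on/off patterns,
    decoding each model index into binary digits (lowest bit = first
    segment), as the systematic search program does."""
    return [[(i >> j) & 1 for j in range(numseg)]
            for i in range(1, 2 ** numseg)]

models = segment_models(3)
print(len(models))             # 7 candidate models for 3 segments
print(models[0])               # [1, 0, 0]: only the first segment active
print(len(segment_models(8)))  # 255 model fits once 8 segments appear
```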

Compared with the systematic search method, the main shortcoming of the stochastic search method is that it takes much longer to identify the correct QTLs. Its strength, however, is its robustness across simulation environments. The result tables show that the stochastic search method always identifies the desired QTLs, regardless of the number of QTLs considered, the effect sizes assigned, or the level of Gamma noise added. Its results are consistently sound: the significant QTLs have clearly high activation probabilities, while most of the insignificant markers have activation probabilities below the threshold value.
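The activation probabilities reported in the tables are obtained by weighting each visited model's marker-inclusion indicators by that model's posterior weight, its likelihood times its visit count divided by the total, which is what the final loop of the appendix program computes. A small Python sketch with hypothetical numbers:

```python
# Each entry: (marker inclusion indicators, model likelihood, visit count).
# The values are hypothetical; the program stores the same three quantities
# per visited model in its ModelTable.
visited = [
    ([1, 0, 1], 0.60, 5),
    ([1, 1, 0], 0.30, 2),
    ([0, 0, 1], 0.10, 1),
]

total = sum(lik * visits for _, lik, visits in visited)

activation = [0.0] * 3
for indicators, lik, visits in visited:
    weight = lik * visits / total           # posterior weight of this model
    for j, included in enumerate(indicators):
        activation[j] += included * weight  # marker j's activation probability

print([round(p, 3) for p in activation])    # [0.973, 0.162, 0.838]
```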


5 CONCLUSION

The Bayesian hierarchical regression model is an effective method for detecting QTLs because complex data structures can be incorporated into the model. In this thesis, we compare two search methods under this model: the stochastic search method and the systematic search method. Since fitting every possible model would be computationally challenging, the stochastic search method randomly chooses 20 chains of models for calculating the activation probabilities, while the systematic search method divides the genome into smaller and smaller segments until QTLs are identified. Comparing the two, the stochastic search method performs better because it identifies the correct QTLs without error.
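Each chain in the stochastic search moves through the model space by flipping one randomly chosen marker indicator and accepting the move with probability min(1, L_new/L_old), the ratio the appendix program feeds to its binomial draw. A hedged Python sketch of one such step (the likelihood function here is a toy stand-in, not the thesis's Gibbs-based estimate, and the function name is hypothetical):

```python
import random

def propose_and_accept(model, lik_fn, old_lik, rng):
    """One stochastic-search step: flip a random marker's indicator and
    accept the candidate with probability min(1, new/old)."""
    j = rng.randrange(len(model))
    candidate = model.copy()
    candidate[j] = 1 - candidate[j]          # flip one inclusion indicator
    new_lik = lik_fn(candidate)
    accept = (old_lik == 0 or new_lik >= old_lik
              or rng.random() < new_lik / old_lik)
    return (candidate, new_lik) if accept else (model, old_lik)

# With a constant likelihood every proposal is accepted, so one step
# toggles exactly one marker on.
model, lik = propose_and_accept([0, 0, 0], lambda m: 1.0, 1.0,
                                random.Random(1))
print(sum(model))  # 1
```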

Both methods have advantages: the stochastic search method accurately detects the QTLs of interest, while the systematic search method saves time and is very efficient. In future research, we may study ways to combine the two methods so as to identify QTLs both efficiently and correctly.


REFERENCES

[1] Gelman A., Carlin J.B., Stern H.S. and Rubin D.B., Bayesian Data Analysis, 2nd Edition, Chapman & Hall/CRC, Boca Raton, 2004.

[2] Lavine M., What is Bayesian Statistics and Why Everything Else is Wrong, ISDS, Duke University, Durham, North Carolina.

[3] Chen Y., QTL Detection from Stochastic Processes by Bayesian Hierarchical Regression Model, UNCW, 2007.

[4] Dunson D.B., "Practical Advantages of Bayesian Analysis of Epidemiologic Data", American Journal of Epidemiology, Vol. 153, No. 12: 1222-1226, 2001.

[5] Walsh B., Markov Chain Monte Carlo and Gibbs Sampling, Lecture Notes for EEB 581, version 26 April 2004.

[6] Bao H., Bayesian Hierarchical Regression Model to Detect Quantitative Trait Loci, UNCW, 2006.

[7] Boone E.L., Simmons S.J., Ye K. and Stapleton A.E., "Analyzing Quantitative Trait Loci for the Arabidopsis thaliana Using Markov Chain Monte Carlo Model Composition with Restricted and Unrestricted Model Spaces", Statistical Methodology, 3: 69-78, 2006.

[8] Congdon P., Bayesian Statistical Modelling, 2nd Edition, John Wiley and Sons, Ltd.

[9] Loudet O., Chaillou S. and Daniel-Vedele F., "Bay-0 × Shahdara Recombinant Inbred Line Population: A Powerful Tool for the Genetic Dissection of Complex Traits in Arabidopsis", Theoretical and Applied Genetics, Vol. 104: 1173-1184, 2002.


APPENDIX

A. EXAMPLE OF USING R CODE TO GENERATE SIMULATED DATA

X <- read.table('bayxsha2.csv', sep=',')
yinew <- matrix(nrow=165, ncol=10)
for (i in 1:165) {
  for (j in 1:10) {
    # y = 40 + 3*M2 + 6*M19 + 9*M26 + Gamma(shape=1, rate=3) noise
    yinew[i,j] <- 40 + 3*X[i,2] + 6*X[i,19] + 9*X[i,26] + rgamma(1,1,3)
  }
}
write.table(yinew, 'i369b.csv', sep=',', row.names=FALSE, col.names=FALSE)
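For readers without R, the same response matrix (baseline 40, effect sizes 3, 6 and 9 on markers 2, 19 and 26, plus Gamma(α = 1, β = 3) noise) can be sketched with Python's standard library. Note that `random.gammavariate` takes a scale parameter, so R's rate β = 3 becomes scale 1/3, and the marker matrix below is a random stand-in for the one read from bayxsha2.csv:

```python
import random

rng = random.Random(1)

# Random stand-in for the 165 x 38 marker matrix (values -0.5 / 0.5).
X = [[rng.choice([-0.5, 0.5]) for _ in range(38)] for _ in range(165)]

# y = 40 + 3*M2 + 6*M19 + 9*M26 + Gamma(1, 3) noise, 10 replicates per line.
# gammavariate(alpha, beta) uses beta as the SCALE, hence 1/3 for rate 3.
Y = [[40 + 3 * X[i][1] + 6 * X[i][18] + 9 * X[i][25]
      + rng.gammavariate(1.0, 1.0 / 3.0)
      for _ in range(10)]
     for i in range(165)]

print(len(Y), len(Y[0]))  # 165 10
```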


B. SYSTEMATIC SEARCH METHOD

program test

USE MSIMSL

PARAMETER (M=38,L=165,taunot=0.5,sigmanot=0.5,KK=52000,

& kutoff=2000,sigbeta=100.d0,numseg=5)

! M is number of Markers (column) and L is number of lines

DOUBLE PRECISION taua,sigmaa(L),

& Xinit(L,M),Yinit(L,20),sumy(L),sumy2(L),ybar(L),

& thetas(L),sigma2(L),ybar2(L),tau2(1),Xuse(L,M),

& XB(L),XTXsend(M+1,M+1),adjust,

& temp,tau2init,thetasinit(L),Marksegprob(numseg),

& likesum , segPostProb(numseg) ,

& sigma2init(L), postmodelprob(L)

INTEGER ni(L),NOBS,Modelvec(M),Muse,Mtemp,numval, digit,

& ni_seg(numseg),segvec(numseg),Modelmatrix(L,numseg),

& temp11,temp12

!Setting parameters

taua = 1 + taunot + (L/2)

NOBS = 0

sumtheta = 0.0d0

sumtheta2 = 0.0d0

open(10, file='ni.csv', status='old')

do i=1,L

read(10,*) ni(i)

NOBS =NOBS + ni(i)

enddo

close(10)

open(55, file='ni_seg.csv', status='old')

do i=1,numseg

read(55,*) ni_seg(i)

enddo

close(55)

do i = 1,L

sigmaa(i)=(ni(i)/2) + 1 + sigmanot

enddo

!Read data

open(16, file='bayxsha2.csv', status='old')

do i=1,L

read(16,*) (Xinit(i,j),j=1,M)

enddo

close(16)

open(19, file='a1a.csv', status='old')


do i=1,L

mtemp=ni(i)

read(19,*) (Yinit(i,j), j=1,mtemp)

enddo

close(19)

! Get initial estimates

do i=1,L

sumy(i) = 0.d0

sumy2(i) = 0.d0

do j=1,ni(i)

sumy(i) =sumy(i) + Yinit(i,j) !Create ybar

sumy2(i) = sumy2(i) + Yinit(i,j)*Yinit(i,j)

enddo

ybar(i) = sumy(i)/ni(i)

thetas(i) = ybar(i)

sigma2(i) = (sumy2(i) - ni(i)*(ybar(i)**2))/(ni(i) - 1)

ybar2(i) = sumy2(i)/ni(i)

thetasinit(i) = thetas(i)

sigma2init(i) = sigma2(i)

enddo

do i = 1,L

sumtheta = sumtheta + thetas(i)

sumtheta2 = sumtheta2 + (thetas(i)**2)

enddo

thetabar = sumtheta/L

tau2(1) = (sumtheta2 - L*(thetabar**2))/(L - 1)

tau2init = tau2(1)

do i=1,numseg

segvec(i) = 1

Marksegprob(i) =0.d0

segPostProb(i) = 0 !Set all the posterior probabilities of segments to 0

enddo

knum=0

do i =1, numseg

if (segvec(i).eq.1) then

do j = 1, ni_seg(i)

Modelvec(j+knum) = 1

enddo

endif

knum=knum + ni_seg(i)

enddo

Muse=M

Mtemp=Muse+1

numval=1


adjust=0.d0

CALL GETX (M,Xinit,Xuse,Modelvec,L)

CALL REGRESSION (Xuse,Muse,M,L,Yinit,ni,NOBS,XTXsend,XB)

write (*,*) 'Program initializing... Please stand by'

CALL Gibbs(Xuse,Yinit,ni,L,M,Muse,taua,sigmaa,

& ybar,thetas,ybar2,tau2,KK,kutoff,

& sigma2,sigbeta,XB,XTXsend,Mtemp,numval,adjust,

& temp)

write (*,*) 'Completed adjustment'

write (*,*) 'Adjustment = ', adjust

nummodel=(2**numseg)-1

do i=1, nummodel

knum = 0

write(*,*) 'i = ', i

temp11=i

temp12=temp11/2

Muse = 0 !initiate the Muse and Modelvec

do j=1, M

Modelvec(j) = 0

enddo

do j=1, numseg

digit=temp11-temp12*2

write(*,*) 'digit for ', j, ' is ', digit

temp11=temp12

temp12=temp11/2

do jj=1,ni_seg(j)

Modelvec(jj+knum)=1*digit

enddo

if (digit.eq.1) then

Muse=Muse+ni_seg(j)

endif

Modelmatrix(i,j) = digit

knum = knum + ni_seg(j)

!write(*,*) ’modelvector’, Modelvec

enddo

Mtemp=Muse+1

numval = 2 !numval=2 indicates that Gibbs will compute the likelihood

CALL GETX (M,Xinit,Xuse,Modelvec,L)

write(*,*) 'Muse=', Muse

CALL REGRESSION (Xuse,Muse,M,L,Yinit,ni,NOBS,XTXsend,XB)

tau2(1)=tau2init

do ii = 1, L

thetas(ii) = thetasinit(ii)

sigma2(ii) = sigma2init(ii)


enddo

! write(*,*) ’modelvector’, Modelvec

CALL Gibbs(Xuse,Yinit,ni,L,M,Muse,taua,sigmaa,

& ybar,thetas,ybar2,tau2,KK,kutoff,

& sigma2,sigbeta,XB,XTXsend,Mtemp,numval,adjust,

& temp)

postmodelprob(i)=temp !temp is the likelihood value

write(*,*) 'likelihood value is ', postmodelprob(i)

enddo

!***************************Part Four********************************

likesum = 0.d0

do i=1,nummodel !Find the posterior probability of each model

likesum=likesum+postmodelprob(i)

enddo

do i = 1,nummodel

postmodelprob(i) = postmodelprob(i)/likesum

enddo

do i=1,numseg !Find the posterior probability of each segment

do j = 1,nummodel

MarksegProb(i)=Marksegprob(i)+

& Modelmatrix(j,i)* postmodelprob(j)

enddo

enddo

open(3,file="MarkerProbability.txt",status="new")

do i=1,numseg

write (3,*) 'Probability of segment', i

write(3,*) MarksegProb(i)

enddo

close(3)

!write (*,*) 'Posterior probabilities of segments are', segPostProb

end !Main program ends here

!***********************************************************************

!******************Subroutine Part**************************************

!***********************************************************************

! Get the correct X columns in the beginning of matrix

SUBROUTINE GETX (M,Xinit,Xuse,Modelvec,L)

doubleprecision Xinit(L,M),Xuse(L,M)

integer M, Modelvec(M),L

j=1

do i=1,M

if (Modelvec(i).eq.1) then

do s=1,L

Xuse(s,j) = Xinit(s,i)

enddo


j = j + 1

endif

enddo

return

end

SUBROUTINE REGRESSION (Xuse,Muse,M,L,Yinit,ni,NOBS,XTXsend,XB)

doubleprecision Xuse(L,M), X(L,Muse),Yinit(L,20),yregress(NOBS),

& xregress(NOBS,Muse),betastemp(Muse+1),

& SST, SSE,Xbetas(L,Muse+1),XB(L),XTXsend(M+1,M+1),

& XTX(Muse+1,Muse+1)

integer Muse,M,L,ni(L),num,num2

! This part of the routine subsets the full X matrix to

! get the correct X

do j=1,Muse

do i=1,L

X(i,j) = Xuse(i,j)

enddo

enddo

! Rearrange the Y matrix to a vector for regression

num = 1

do i = 1,L

do j = 1,ni(i)

yregress(num) = Yinit(i,j)

num = num + 1

enddo

enddo

! Expand the correct X matrix for regression

num2 = 1

do i = 1,L

do s = 1,ni(i)

do j = 1,Muse

xregress(num2,j) = X(i,j)

enddo

num2 = num2 + 1

enddo

enddo

! This does the regression

CALL DRLSE (NOBS, yregress, Muse, xregress, NOBS, 1, betastemp,

& SST, SSE)

! From the regression routine

!write(*,*) betastemp

! Create appropriate X matrix by adding a column of 1’s for the

! intercept term

do i = 1,L


Xbetas(i,1) = 1.d0

XB(i) = 0.d0

do j = 1, Muse

Xbetas(i,j+1) = Xuse(i,j)

enddo

enddo

! Calculate XB

do i = 1,L

do j = 1,Muse+1

XB(i) = XB(i) + Xbetas(i,j)*betastemp(j)

enddo

enddo

! Calculates XTX

CALL DMXTXF (L, Muse+1, Xbetas, L, Muse+1, XTX, Muse+1)

! Need to send XTX out of this function. In order to do so

! must save this to an M by M matrix

do i = 1,Muse+1

do j=1,Muse+1

XTXsend(i,j) = XTX(i,j)

enddo

enddo

return

end

!Subroutine for Gibbs sampler

SUBROUTINE Gibbs (Xuse,Y,ni,L,M,Muse,taua,sigmaa,

& ybar,thetas,ybar2,tau2,KK,kutoff,

& sigma2,sigbeta,XB,XTXsend,Mtemp,numval,adjust,

& bayesfac)

DOUBLE PRECISION Xuse(L,M),XB(L),tau2(1),

& taua,taub(1),sigmab(L),Y(L,20),betamu(Muse+1),

& covarbeta(Muse+1,Muse+1),sigma2(L),thetamu(L),

& thetas(L),thetasig(L),ybar(L),

& stdtau2(1),betasst(Muse+1),stdsig(L),ybar2(L),

& stdtheta(L),sigmaa(L),minloglik,

& liktemp(KK),temp5,maxloglik,

& sumtemp4,bayesfac,RSIG(Muse+1,Muse+1),TOL,

& X(L,Muse),DMACH,betasst2(Muse+1),

& XTXsend(M+1,M+1),Xbetas(L,Muse+1),betas(Muse+1),

& XTX(Muse+1,Muse+1),adjust

INTEGER ni(L),IRANK,KK,kutoff,Mtemp,L,Muse,icount,numsim,numval

!Set up a groups of new parameters

Mtemp=Muse+1

TOL = 100.0*DMACH(4)

minloglik = 1.d8


maxloglik = -1.d8

sumtemp4 = 0.d0

icount = 0

!sigbeta = 100.d0

! Get the correct X

do j=1,Muse

do i=1,L

X(i,j) = Xuse(i,j)

enddo

enddo

! Get the correct XTX

do j=1,Muse+1

do i=1,Muse+1

XTX(i,j) = XTXsend(i,j)

enddo

enddo

!Gibbs Sampler

do numsim=1,KK

!write (*,*) numsim

!***** THETAS ***************************

CALL thetapar (tau2,sigma2,XB,L,ybar,ni,thetamu,thetasig) !parameter

CALL DRNNOR (L,stdtheta)

do i=1,L

thetas(i) = stdtheta(i)*thetasig(i) + thetamu(i)

enddo

!***** TAU ***************************

CALL tauparm (thetas,XB,L,taub)

CALL drngam(1,taua,stdtau2)

tau2(1) = taub(1)/stdtau2(1)

!***** BETA ***************************

CALL betapar (XTX,Muse,tau2,L,thetas,X,betamu,covarbeta,sigbeta)

CALL DCHFAC (Mtemp, covarbeta, Mtemp, TOL, IRANK, RSIG, Mtemp)

! Cholesky factor

CALL DRNNOR(Mtemp,betasst)

do i=1,Mtemp

betasst2(i) = 0.d0

do j=1,Mtemp

betasst2(i) = betasst2(i) + RSIG(i,j)*betasst(j)

enddo

betas(i) = betasst2(i) +betamu(i)

enddo

do i = 1,L

Xbetas(i,1) = 1.d0

XB(i) = 0.d0


do j = 1, Muse

Xbetas(i,j+1) = Xuse(i,j)

enddo

enddo

do i = 1,L

do j = 1,Muse+1

XB(i) = XB(i) + Xbetas(i,j)*betas(j)

enddo

enddo

! ***** SIGMA ***************************

CALL sigmaparm (ybar,ybar2,ni,thetas,L,sigmab)

CALL drngam(L,sigmaa(1),stdsig)

do i = 1,L

sigma2(i) = sigmab(i)/stdsig(i)

enddo

!write (*,*) ’taub=’, taub

!write (*,*) ’tau2=’, tau2

CALL llike (betas,XB,tau2,Y,sigma2,thetas,

& L,Muse,sigmaa,taua,temp5,sigbeta,icountup,ni,

& adjust)

if (temp5.le.10) liktemp(numsim)=dexp(temp5)

if (temp5.gt.10) liktemp(numsim)=0

if (numval.eq.1) then

if ((temp5.ge.maxloglik) .and. (numsim.ge.kutoff))

& maxloglik = temp5

!write(*,*) "temp5 = ", temp5

!write(*,*) "numsim = ", numsim

if ((temp5.le.minloglik) .and. (numsim.ge.kutoff))

& minloglik = temp5

if (numsim.ge.kutoff) icount = icount + icountup

endif

enddo ! Here ends the simulation for the Gibbs Sampler

if (numval.eq.1) adjust = maxloglik

if (numval.eq.2) then

do s=(kutoff+1),KK

sumtemp4 = sumtemp4 + liktemp(s)

enddo

denom = (KK-(kutoff+1.0)+0.d0)

bayesfac = sumtemp4/denom

!write(*,*) ’bayesfac = ’, bayesfac ,’icount = ’, icount

endif

return

end

!Subroutine for updating the Tau parameter


SUBROUTINE tauparm (thetas,XB,L,taub)

DOUBLE PRECISION sumTXB,taub(1),thetas(L),XB(L)

INTEGER L

sumTXB=0.d0

do i=1,L

sumTXB=sumTXB + (thetas(i) - XB(i))*(thetas(i) - XB(i))

& +1.d0

enddo

taub(1)=0.5*sumTXB

return

end

!Subroutine for updating the Sigma parameter

SUBROUTINE sigmaparm (ybar,ybar2,ni,thetas,L,sigmab)

DOUBLEPRECISION ybar(L),thetas(L),sumythetas,sigmab(L),ybar2(L),

& dni(L)

INTEGER ni(L)

sumythetas=0.d0

do i=1,L

dni(i) = ni(i) + 0.0

sigmab(i) = 0.5*(1+(dni(i)*ybar2(i) - 2*thetas(i)*dni(i)*

& ybar(i) + dni(i)*thetas(i)*thetas(i)))

enddo

return

end

!Subroutine for updating the Beta parameter

SUBROUTINE betapar (XTX,Muse,tau2,L,thetas,X,betamu,covarbeta,

& sigbeta)

DOUBLE PRECISION XTX(Muse+1,Muse+1),step1(Muse+1,Muse+1),

& covarbeta(Muse+1,Muse+1),mupart2(Muse+1),thetas(L),

& tau2(1) , Xbetas(L,Muse+1),X(L,Muse), betamu(Muse+1)

INTEGER Muse,L

do i=1,Muse+1

do j=1,Muse+1

if (i.eq.j) then

step1(i,j)=(1/sigbeta)+((1/tau2(1))*XTX(i,j))

else

step1(i,j) = ((1/tau2(1))*XTX(i,j))

endif

enddo

enddo

do i = 1,L

Xbetas(i,1) = 1.d0

do j = 1, Muse

Xbetas(i,j+1) = X(i,j)


enddo

enddo

CALL DLINDS (Muse+1, step1, Muse+1, covarbeta, Muse+1)

! CALL DMURRV (L, Muse+1, Xbetas, L, Muse+1, thetas, 1, L,

!& mupart2)

do j = 1,Muse+1

mupart2(j)=0.d0

do i = 1,L

mupart2(j)=mupart2(j)+Xbetas(i,j)*thetas(i)

enddo

mupart2(j) = mupart2(j)/tau2(1) !Scale by 1/tau^2

enddo

! CALL DMURRV (Mtemp, Mtemp, covarbeta, Mtemp, Mtemp, mupart2,

! & 1,Mtemp, betamu)

do i= 1, Muse+1

betamu(i)=0.d0

do j =1, Muse+1

betamu(i)=betamu(i)+covarbeta(i,j)*mupart2(j)

enddo

enddo

return

end

!Subroutine for updating the Theta parameter

SUBROUTINE thetapar (tau2,sigma2,XB,L,ybar,ni,thetamu,thetasig)

DOUBLE PRECISION tau2(1),sigma2(L),XB(L),ybar(L),thetamu(L),

& thetasig(L),dni(L)

INTEGER L ,ni(L)

do i=1,L

dni(i)=ni(i) + 0.0

thetamu(i) = (1/tau2(1))*(tau2(1)*sigma2(i)/(dni(i)*tau2(1)

& +sigma2(i)))*XB(i) +(1/sigma2(i))

& *(tau2(1)*sigma2(i)/(dni(i)*tau2(1)+sigma2(i)))*

& dni(i)*ybar(i)

enddo

do i=1,L

thetasig(i) = sqrt(tau2(1)*sigma2(i)/(dni(i)*tau2(1)

& +sigma2(i)))

enddo

return

end

!Subroutine for the likelihood function

SUBROUTINE llike (betas,XB,tau2,Y,sigma2,thetas,

& L,M1,sigmaa,taua,likehood2,sigbeta,icountup,ni,

& adjust)


DOUBLE PRECISION betas(M1+1),XB(L),tau2(1),

& taua,Y(L,20),btb,thetas(L),

& sigma2(L),sigmaa(L),lik1,lik2,likehood,

& likehood2 ,adjust

INTEGER M1,L ,ni(L)

lik1=0.d0

lik2=0.d0

btb=0.d0

icountup = 0

Mtemp=M1+1

do i=1,L

lik1= lik1 - (sigmaa(i))*dlog(sigma2(i)) -

& (1/(2.d0*sigma2(i))) -

& (1/(2.d0*tau2(1)))*

& (thetas(i) - XB(i))*

& (thetas(i) - XB(i))

end do

do i=1,L

do j=1,ni(i)

lik2 = lik2 -(1/(2.d0*sigma2(i)))*(Y(i,j)-thetas(i))*

& (Y(i,j)-thetas(i))

end do

end do

do i = 1,M1+1

btb=btb + betas(i)*betas(i)

end do

likehood = lik1 + lik2 - (taua)*dlog(tau2(1))

& - (1/(2.d0*tau2(1))) - (1/(2.d0*sigbeta)) * btb

likehood2=likehood - adjust !Adjusting likelihood

!write(*,*) "likelihood =", likehood2

return

end


C. STOCHASTIC SEARCH METHOD

program test

USE MSIMSL

PARAMETER (M=38,L=165,taunot=0.5,sigmanot=0.5,KK=52000,

& kutoff=2000,sigbeta=100.d0,knum=2000,nn=10)

! M is number of Markers (column) and L is number of lines

DOUBLE PRECISION taua,sigmaa(L),

& Xinit(L,M),Yinit(L,20),sumy(L),sumy2(L),ybar(L),

& thetas(L),sigma2(L),ybar2(L),tau2(1),Xuse(L,M),

& tempModel(M),XB(L),XTXsend(M+1,M+1),adjust,

& temp,curlikhood,oldlikhood,

& likesum , MarkPostProb(M) ,tau2init, thetasinit(L),

& sigma2init(L), DummyVal1, DummyVal2,

& ModelTable(knum*20,42)

REAL probsucc

INTEGER ni(L),NOBS,Modelvec(M),Muse ,Mtemp,numval,newvector(1),LL,

& Newmodel, Modelvecbuf(M),lbinval(1),Locator,Previous,ii,tt

!Setting parameters

taua = 1 + taunot + (L/2)

NOBS = 0

sumtheta = 0.0d0

sumtheta2 = 0.0d0

open(10, file='ni.csv', status='old')

do i=1,L

read(10,*) ni(i)

NOBS =NOBS + ni(i)

enddo

close(10)

do i = 1,L

sigmaa(i)=(ni(i)/2) + 1 + sigmanot

enddo

!Read data

open(16, file='bayxsha2.csv', status='old')

do i=1,L

read(16,*) (Xinit(i,j),j=1,M)

enddo

close(16)

open(19, file='newy.csv', status='old')

do i=1,L

mtemp=ni(i)

read(19,*) (Yinit(i,j), j=1,mtemp)

enddo


close(19)

! Get initial estimates

do i=1,L

sumy(i) = 0.d0

sumy2(i) = 0.d0

do j=1,ni(i)

sumy(i) =sumy(i) + Yinit(i,j) !Create ybar

sumy2(i) = sumy2(i) + Yinit(i,j)*Yinit(i,j)

enddo

ybar(i) = sumy(i)/ni(i)

thetas(i) = ybar(i)

sigma2(i) = (sumy2(i) - ni(i)*(ybar(i)**2))/(ni(i) - 1)

ybar2(i) = sumy2(i)/ni(i)

thetasinit(i) = thetas(i)

sigma2init(i) = sigma2(i)

enddo

do i = 1,L

sumtheta = sumtheta + thetas(i)

sumtheta2 = sumtheta2 + (thetas(i)**2)

enddo

thetabar = sumtheta/L

tau2(1) = (sumtheta2 - L*(thetabar**2))/(L - 1)

tau2init = tau2(1)

do i=1,M

Modelvec(i) = 1

MarkPostProb(i) = 0 !Set all the posterior probabilities of markers to 0

enddo

Muse=M

Mtemp=Muse+1

numval=1

adjust=0.d0

CALL GETX (M,Xinit,Xuse,Modelvec,L)

CALL REGRESSION (Xuse,Muse,M,L,Yinit,ni,NOBS,XTXsend,XB)

write (*,*) 'Program initializing... Please stand by'

CALL Gibbs(Xuse,Yinit,ni,L,M,Muse,taua,sigmaa,

& ybar,thetas,ybar2,tau2,KK,kutoff,

& sigma2,sigbeta,XB,XTXsend,Mtemp,numval,adjust,

& temp)

write (*,*) 'Finished the first Gibbs...'

!***********************************************************************

!*******************Start the Stochastic search*************************

!***********************************************************************

do ii = 1, nn !Over all loop start here

write (*,*) 'ii=', ii


CALL DRNUN (M, tempModel)

!Initialize the starting model vector

do i=1,M

if (tempModel(i).ge.0.5) Modelvec(i) = 1

if (tempModel(i).lt.0.5) Modelvec(i) = 0

enddo

Newmodel=1

oldlikhood=0

curlikhood=0

likesum=0

probsucc=0

Locator=1

previous=0

do k = 1,knum !Each iteration loop start here

tt = k+(ii-1)*knum !Model indicator

write (*,*) 'k=', k

!*****************************Part One**********************************

if (k.eq.1) then

Newmodel = 1

endif

select case (Newmodel)

case (0) !When Newmodel is 0, reserve the modelvectors

do i=1,38

Modelvec(i) = Modelvecbuf(i)

enddo

CALL RNUND(1, M, newvector) !Change the model vectors

newtemp=newvector(1)

if (Modelvec(newtemp).eq.1) then

Modelvec(newtemp) = 0

else

Modelvec(newtemp) = 1

endif

case (1)

CALL RNUND(1, M, newvector) !Change the model vectors

newtemp=newvector(1)

if (Modelvec(newtemp).eq.1) then

Modelvec(newtemp) = 0

else

Modelvec(newtemp) = 1

endif

endselect

!write(*,*) ’modelvectors are’,Modelvec

!Initialize the current row of the table

ModelTable(tt,1) = 0 !First column is the Model Value


ModelTable(tt,2) = 0 !Second column is the Model likelihood

ModelTable(tt,3) = 0 !Third column is the visit count

ModelTable(tt,4) = 0 !Fourth column is the posterior probability

do j=1,38

ModelTable(tt,(j+4))=Modelvec(j)

enddo !Find the new model vector

DummyVal1 = 0

DummyVal2 = 0

do i = 1,19

if (Modelvec(i).eq.1) then

DummyVal1 = DummyVal1 + 2**(i-1)

endif

enddo

do i = 1,19

if (Modelvec(i+19).eq.1) then

DummyVal2 = DummyVal2 + 2**(i-1)

endif

enddo

ModelTable(tt,1) = DummyVal1*10000000+DummyVal2

!Find the Model values

!write (*,*) ’Model value initially=’,Modelinfo(k,1)

!write (*,*) ’Model vectors=’,Modelvec

!*****************************Part two**********************************

if (tt.ne.1) then

!Search if the model has been done before (for K>1)

LL = 1

do while ((ModelTable(tt,1).ne.

& ModelTable(LL,1)).AND.(LL.ne.tt))

LL = LL + 1

enddo

!write (*,*) ’LL=’,LL

!write (*,*) ’k(>1)=’,k

if (LL.lt.tt) then

!write (*,*) ’Modelinfo(LL,2)=’,Modelinfo(LL,2)

curlikhood = ModelTable(LL,2)

probsucc = curlikhood/oldlikhood

!write (*,*) ’curlikhood=’,curlikhood

!write (*,*) ’oldlikhood=’,oldlikhood

if ((probsucc.lt.1).AND.(probsucc.gt.0)) then

CALL RNBIN(1,1,probsucc,lbinval)

Newmodel=lbinval(1)

elseif (probsucc.eq.0) then

Newmodel=0

elseif (probsucc.ge.1) then


Newmodel=1

endif

!write (*,*) ’Newmodel=’, Newmodel

if (Newmodel.eq.1) then

ModelTable(LL,3) = ModelTable(LL,3) + 1

!write (*,*) ’The deleted k is (1)’,k

do j=1,42

ModelTable(tt,j)=0

enddo !Delete the information of this row

Locator = LL

else

!write (*,*) ’The delete k is (0)’,k

ModelTable(Locator,3) = ModelTable(Locator,3) + 1

do j=1,42

ModelTable(tt,j)=0

enddo

endif

else

Muse = 0

do i = 1,M

Muse = Muse + Modelvec(i)

enddo

Mtemp = Muse+1

numval = 2 !numval=2 indicates that Gibbs will compute the likelihood

CALL GETX (M,Xinit,Xuse,Modelvec,L)

CALL REGRESSION (Xuse,Muse,M,L,Yinit,ni,NOBS,XTXsend,XB)

tau2(1)=tau2init

do i = 1, L

thetas(i) = thetasinit(i)

sigma2(i) = sigma2init(i)

enddo

CALL Gibbs(Xuse,Yinit,ni,L,M,Muse,taua,sigmaa,

& ybar,thetas,ybar2,tau2,KK,kutoff,

& sigma2,sigbeta,XB,XTXsend,Mtemp,numval,adjust,

& temp)

curlikhood=temp !temp is the likelihood value

!write (*,*) ’Curentlikelihood of more than 2=’,curlikhood

if (curlikhood.ne.0) then

if ((oldlikhood.eq.0).or

& .(curlikhood.gt.oldlikhood)) then

probsucc = 1 !Make sure it is greater than 1

else

probsucc = curlikhood/oldlikhood

!write (*,*) ’oldlikhood of more than 2=’,oldlikhood


!write (*,*) ’probsucc=’,probsucc

endif !Get the right value for Probsucc

if ((probsucc.lt.1).and.(probsucc.gt.0)) then

CALL RNBIN(1,1,probsucc,lbinval)

Newmodel=lbinval(1)

elseif (probsucc.ge.1) then

Newmodel=1

elseif (probsucc.eq.0) then

Newmodel=0

endif

if (Newmodel.eq.1) then

ModelTable(tt,3) = ModelTable(tt,3) + 1

ModelTable(tt,2) = curlikhood

Locator = tt

elseif (Newmodel.eq.0) then

ModelTable(Locator,3) = ModelTable(Locator,3) + 1

ModelTable(tt,2) = curlikhood

endif

else

do j = 1, 42

ModelTable(tt,j) = 0

enddo

Newmodel = 0

endif

endif

elseif (tt.eq.1) then

!write (*,*) ’k1=’,k

Muse = 0

do i = 1,M

Muse = Muse + Modelvec(i)

enddo

Mtemp = Muse+1

numval = 2 !numval=2 indicates that Gibbs will compute the likelihood

CALL GETX (M,Xinit,Xuse,Modelvec,L)

CALL REGRESSION (Xuse,Muse,M,L,Yinit,ni,NOBS,XTXsend,XB)

tau2(1) = tau2init

do i = 1, L

sigma2(i) = sigma2init(i)

thetas(i) = thetasinit(i)

enddo

CALL Gibbs(Xuse,Yinit,ni,L,M,Muse,taua,sigmaa,

& ybar,thetas,ybar2,tau2,KK,kutoff,

& sigma2,sigbeta,XB,XTXsend,Mtemp,numval,adjust,

& temp)


curlikhood=temp !temp is the likelihood value

if (curlikhood.ne.0) then

ModelTable(1,3) = ModelTable(1,3) + 1

ModelTable(1,2) = curlikhood

else

do j = 1, 42

ModelTable(1,j) = 0

enddo

NewModel = 0

endif

!write (*,*) ’Currentlikelihood=’,curlikhood

endif !If the model doesn’t move, visiting times will be 0

if ((k.eq.1).or.(Newmodel.eq.1)) then

if (NewModel.eq.1) then

oldlikhood=curlikhood !Store the likelihood

do i=1,38 !Store the model vectors

Modelvecbuf(i) = Modelvec(i)

enddo

else

NewModel = 1

endif

endif

!write (*,*) 'Newmodel=',Newmodel

!write (*,*) 'modelinfo(k,2)=',modelinfo(k,2)

!write (*,*) 'Model value finally is=', Modelinfo(k,1)

enddo !Each iteration loop ends here

enddo !Over all loop ends here

!***************************Part Four********************************

do i=1,(knum*nn) !Find the Posterior Probability for each Model

likesum=likesum+ModelTable(i,2)*ModelTable(i,3)

enddo

!write (*,*) 'likesum=',likesum

do i=1,(knum*nn)

ModelTable(i,4)=(ModelTable(i,2)/likesum)*ModelTable(i,3)

!write (*,*) 'Modelinfo(i,1)=',Modelinfo(i,1)

!write (*,*) 'Modelinfo(i,2)=',Modelinfo(i,2)

!write (*,*) 'Modelinfo(i,3)=',Modelinfo(i,3)

!write (*,*) 'Modelinfo(i,4)=',Modelinfo(i,4)

enddo

do i=1,38 !Find the Posterior Probability for each Marker

do k=1,(knum*nn)

MarkPostProb(i)=MarkPostProb(i)+

&ModelTable(k,(i+4))*ModelTable(k,4)

enddo


enddo
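Part Four weights each model's stored likelihood by its visit count, normalizes over all visited models, and then sums the resulting model weights over the models that contain each marker. An illustrative Python version of that bookkeeping (function and variable names are mine, not the program's):

```python
def posterior_summaries(liks, visits, inclusion):
    """Model posterior weights and per-marker inclusion probabilities.

    liks[m]      -- stored likelihood of model m (ModelTable column 2)
    visits[m]    -- times model m was visited    (ModelTable column 3)
    inclusion[m] -- 0/1 marker indicators for model m (columns 5 on)
    """
    likesum = sum(l * v for l, v in zip(liks, visits))
    model_post = [l * v / likesum for l, v in zip(liks, visits)]
    n_markers = len(inclusion[0])
    marker_post = [
        sum(inclusion[m][i] * model_post[m] for m in range(len(liks)))
        for i in range(n_markers)
    ]
    return model_post, marker_post
```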

open(3,file="MarkerProbability.txt",status="new") !fails if file exists

do i=1,38

write (3,*) 'Probability of Marker',i

write(3,*) MarkPostProb(i)

enddo

close(3)

!write (*,*) 'Posterior Probabilities of Markers are', MarkPostProb

end !Main program ends here

!***********************************************************************

!******************Subroutine Part**************************************

!***********************************************************************

! Get the correct X columns in the beginning of matrix

SUBROUTINE GETX (M,Xinit,Xuse,Modelvec,L)

double precision Xinit(L,M),Xuse(L,M)

integer M, Modelvec(M),L

j=1

do i=1,M

if (Modelvec(i).eq.1) then

do s=1,L

Xuse(s,j) = Xinit(s,i)

enddo

j = j + 1

endif

enddo

return

end
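GETX packs the columns of the full design matrix whose model indicator is 1 into the leading columns of Xuse, preserving their order. The same subsetting in Python, as a sketch:

```python
def get_x(x_init, model_vec):
    """Keep only the columns of x_init flagged 1 in model_vec,
    in their original order, as GETX copies them element by element."""
    keep = [j for j, flag in enumerate(model_vec) if flag == 1]
    return [[row[j] for j in keep] for row in x_init]
```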

SUBROUTINE REGRESSION (Xuse,Muse,M,L,Yinit,ni,NOBS,XTXsend,XB)

double precision Xuse(L,M), X(L,Muse),Yinit(L,20),yregress(NOBS),

& xregress(NOBS,Muse),betastemp(Muse+1),

& SST, SSE,Xbetas(L,Muse+1),XB(L),XTXsend(M+1,M+1),

& XTX(Muse+1,Muse+1)

integer Muse,M,L,ni(L),num,num2

! This part of the routine subsets the full X matrix to

! get the correct X

do j=1,Muse

do i=1,L

X(i,j) = Xuse(i,j)

enddo

enddo

! Rearrange the Y matrix to a vector for regression

num = 1

do i = 1,L

do j = 1,ni(i)


yregress(num) = Yinit(i,j)

num = num + 1

enddo

enddo

! Expand the correct X matrix for regression

num2 = 1

do i = 1,L

do s = 1,ni(i)

do j = 1,Muse

xregress(num2,j) = X(i,j)

enddo

num2 = num2 + 1

enddo

enddo

! This does the regression

CALL DRLSE (NOBS, yregress, Muse, xregress, NOBS, 1, betastemp,

& SST, SSE)

! From the regression routine

!write(*,*) betastemp

! Create appropriate X matrix by adding a column of 1's for the

! intercept term

do i = 1,L

Xbetas(i,1) = 1.d0

XB(i) = 0.d0

do j = 1, Muse

Xbetas(i,j+1) = Xuse(i,j)

enddo

enddo

! Calculate XB

do i = 1,L

do j = 1,Muse+1

XB(i) = XB(i) + Xbetas(i,j)*betastemp(j)

enddo

enddo

! Calculates XTX

CALL DMXTXF (L, Muse+1, Xbetas, L, Muse+1, XTX, Muse+1)

! Need to send XTX out of this function. In order to do so

! must save this to an M by M matrix

do i = 1,Muse+1

do j=1,Muse+1

XTXsend(i,j) = XTX(i,j)

enddo

enddo

return


end
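REGRESSION expands X to one row per observation, fits least squares via DRLSE, and then forms the fitted line means XB with an explicit intercept column of 1's. The XB computation alone looks like this in Python (a sketch, with an intercept-first coefficient vector):

```python
def fitted_means(x_use, betas):
    """XB for each line: prepend 1.0 for the intercept, then take the
    dot product with the coefficient vector (intercept first)."""
    return [
        sum(b * x for b, x in zip(betas, [1.0] + list(row)))
        for row in x_use
    ]
```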

!Subroutine for Gibbs sampler

SUBROUTINE Gibbs (Xuse,Y,ni,L,M,Muse,taua,sigmaa,

& ybar,thetas,ybar2,tau2,KK,kutoff,

& sigma2,sigbeta,XB,XTXsend,Mtemp,numval,adjust,

& bayesfac)

DOUBLE PRECISION Xuse(L,M),XB(L),tau2(1),

& taua,taub(1),sigmab(L),Y(L,20),betamu(Muse+1),

& covarbeta(Muse+1,Muse+1),sigma2(L),thetamu(L),

& thetas(L),thetasig(L),ybar(L),

& stdtau2(1),betasst(Muse+1),stdsig(L),ybar2(L),

& stdtheta(L),sigmaa(L),minloglik,

& liktemp(KK),temp5,maxloglik,

& sumtemp4,bayesfac,RSIG(Muse+1,Muse+1),TOL,

& X(L,Muse),DMACH,betasst2(Muse+1),

& XTXsend(M+1,M+1),Xbetas(L,Muse+1),betas(Muse+1),

& XTX(Muse+1,Muse+1),adjust

INTEGER ni(L),IRANK,KK,kutoff,Mtemp,L,Muse,icount,numsim,numval

!Set up a group of new parameters

Mtemp=Muse+1

TOL = 100.0*DMACH(4)

minloglik = 1.d8

maxloglik = -1.d8

sumtemp4 = 0.d0

icount = 0

!sigbeta = 100.d0

! Get the correct X

do j=1,Muse

do i=1,L

X(i,j) = Xuse(i,j)

enddo

enddo

! Get the correct XTX

do j=1,Muse+1

do i=1,Muse+1

XTX(i,j) = XTXsend(i,j)

enddo

enddo

!Gibbs Sampler

do numsim=1,KK

!write (*,*) numsim

!***** THETAS ***************************

CALL thetapar (tau2,sigma2,XB,L,ybar,ni,thetamu,thetasig) !parameter

CALL DRNNOR (L,stdtheta)


do i=1,L

thetas(i) = stdtheta(i)*thetasig(i) + thetamu(i)

enddo

!***** TAU ***************************

CALL tauparm (thetas,XB,L,taub)

CALL drngam(1,taua,stdtau2)

tau2(1) = taub(1)/stdtau2(1)

!***** BETA ***************************

CALL betapar (XTX,Muse,tau2,L,thetas,X,betamu,covarbeta,sigbeta)

CALL DCHFAC (Mtemp, covarbeta, Mtemp, TOL, IRANK, RSIG, Mtemp)

! Cholesky factor

CALL DRNNOR(Mtemp,betasst)

do i=1,Mtemp

betasst2(i) = 0.d0

do j=1,Mtemp

betasst2(i) = betasst2(i) + RSIG(i,j)*betasst(j)

enddo

betas(i) = betasst2(i) +betamu(i)

enddo

do i = 1,L

Xbetas(i,1) = 1.d0

XB(i) = 0.d0

do j = 1, Muse

Xbetas(i,j+1) = Xuse(i,j)

enddo

enddo

do i = 1,L

do j = 1,Muse+1

XB(i) = XB(i) + Xbetas(i,j)*betas(j)

enddo

enddo

! ***** SIGMA ***************************

CALL sigmaparm (ybar,ybar2,ni,thetas,L,sigmab)

CALL drngam(L,sigmaa(1),stdsig)

do i = 1,L

sigma2(i) = sigmab(i)/stdsig(i)

enddo

!write (*,*) 'taub=', taub

!write (*,*) 'tau2=', tau2

CALL llike (betas,XB,tau2,Y,sigma2,thetas,

& L,Muse,sigmaa,taua,temp5,sigbeta,icountup,ni,

& adjust)

if (temp5.le.10) liktemp(numsim)=dexp(temp5) !avoid overflow

if (temp5.gt.10) liktemp(numsim)=0


if (numval.eq.1) then

if ((temp5.ge.maxloglik) .and. (numsim.ge.kutoff))

& maxloglik = temp5

!write(*,*) "temp5 = ", temp5

!write(*,*) "numsim = ", numsim

if ((temp5.le.minloglik) .and. (numsim.ge.kutoff))

& minloglik = temp5

if (numsim.ge.kutoff) icount = icount + icountup

endif

enddo ! Here ends the simulation for the Gibbs Sampler

if (numval.eq.1) adjust = maxloglik

if (numval.eq.2) then

do s=(kutoff+1),KK

sumtemp4 = sumtemp4 + liktemp(s)

enddo

denom = (KK - kutoff + 0.d0) !number of retained draws

bayesfac = sumtemp4/denom

!write(*,*) 'bayesfac = ', bayesfac ,'icount = ', icount

endif

return

end
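Inside the BETA step above, the sampler draws beta from N(betamu, covarbeta) by factoring the covariance (DCHFAC), multiplying a vector of standard normals (DRNNOR) by the factor, and adding the mean. With a factor R such that R R^T equals the covariance, the draw is beta = mu + R z; a hedged Python sketch of that step:

```python
import random

def mvn_draw(mu, chol, rng=random.gauss):
    """One draw from N(mu, C), given a matrix `chol` whose product
    chol * chol^T equals C: return mu + chol z for standard-normal z.
    `rng` can be replaced for deterministic testing."""
    n = len(mu)
    z = [rng(0.0, 1.0) for _ in range(n)]
    return [mu[i] + sum(chol[i][j] * z[j] for j in range(n))
            for i in range(n)]
```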

!Subroutine for updating the Tau parameter

SUBROUTINE tauparm (thetas,XB,L,taub)

DOUBLE PRECISION sumTXB,taub(1),thetas(L),XB(L)

INTEGER L

sumTXB=0.d0

do i=1,L

sumTXB=sumTXB + (thetas(i) - XB(i))*(thetas(i) - XB(i))

& +1.d0

enddo

taub(1)=0.5*sumTXB

return

end
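tauparm computes the rate of the inverse-gamma full conditional for tau^2; the Gibbs sampler then draws a Gamma(taua) variate (drngam) and sets tau2 = taub/gamma, i.e. an inverse-gamma draw. The rate itself, translated to Python as a sketch:

```python
def tau_rate(thetas, xb):
    """taub of tauparm: 0.5 * sum over lines of
    (theta_i - XB_i)^2 + 1."""
    return 0.5 * sum((t - x) ** 2 + 1.0 for t, x in zip(thetas, xb))
```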

!Subroutine for updating the Sigma parameter

SUBROUTINE sigmaparm (ybar,ybar2,ni,thetas,L,sigmab)

DOUBLE PRECISION ybar(L),thetas(L),sumythetas,sigmab(L),ybar2(L),

& dni(L)

INTEGER ni(L)

sumythetas=0.d0

do i=1,L

dni(i) = ni(i) + 0.0

sigmab(i) = 0.5*(1+(dni(i)*ybar2(i) - 2*thetas(i)*dni(i)*

& ybar(i) + dni(i)*thetas(i)*thetas(i)))

enddo


return

end
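sigmaparm plays the same role for each sigma_i^2, using the per-line means of y (ybar) and of y^2 (ybar2) rather than the raw observations. Its rate in Python, again as an illustrative translation:

```python
def sigma_rate(n_i, ybar_i, ybar2_i, theta_i):
    """sigmab(i) of sigmaparm, with ybar_i the group mean of y and
    ybar2_i the group mean of y^2:
    0.5 * (1 + n_i*ybar2_i - 2*theta_i*n_i*ybar_i + n_i*theta_i^2)."""
    return 0.5 * (1.0 + n_i * ybar2_i
                  - 2.0 * theta_i * n_i * ybar_i
                  + n_i * theta_i ** 2)
```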

!Subroutine for updating the Beta parameter

SUBROUTINE betapar (XTX,Muse,tau2,L,thetas,X,betamu,covarbeta,

& sigbeta)

DOUBLE PRECISION XTX(Muse+1,Muse+1),step1(Muse+1,Muse+1),

& covarbeta(Muse+1,Muse+1),mupart2(Muse+1),thetas(L),

& tau2(1) , Xbetas(L,Muse+1),X(L,Muse), betamu(Muse+1)

INTEGER Muse,L

do i=1,Muse+1

do j=1,Muse+1

if (i.eq.j) then

step1(i,j)=(1/sigbeta)+((1/tau2(1))*XTX(i,j))

else

step1(i,j) = ((1/tau2(1))*XTX(i,j))

endif

enddo

enddo

do i = 1,L

Xbetas(i,1) = 1.d0

do j = 1, Muse

Xbetas(i,j+1) = X(i,j)

enddo

enddo

CALL DLINDS (Muse+1, step1, Muse+1, covarbeta, Muse+1)

! CALL DMURRV (L, Muse+1, Xbetas, L, Muse+1, thetas, 1, L,

!& mupart2)

do j = 1,Muse+1

mupart2(j)=0.d0

do i = 1,L

mupart2(j)=mupart2(j)+Xbetas(i,j)*thetas(i)

enddo

mupart2(j) = mupart2(j)/tau2(1) !scale by the precision 1/tau2

enddo

! CALL DMURRV (Mtemp, Mtemp, covarbeta, Mtemp, Mtemp, mupart2,

! & 1,Mtemp, betamu)

do i= 1, Muse+1

betamu(i)=0.d0

do j =1, Muse+1

betamu(i)=betamu(i)+covarbeta(i,j)*mupart2(j)

enddo

enddo

return

end


!Subroutine for updating the Theta parameter

SUBROUTINE thetapar (tau2,sigma2,XB,L,ybar,ni,thetamu,thetasig)

DOUBLE PRECISION tau2(1),sigma2(L),XB(L),ybar(L),thetamu(L),

& thetasig(L),dni(L)

INTEGER L ,ni(L)

do i=1,L

dni(i)=ni(i) + 0.0

thetamu(i) = (1/tau2(1))*(tau2(1)*sigma2(i)/(dni(i)*tau2(1)

& +sigma2(i)))*XB(i) +(1/sigma2(i))

& *(tau2(1)*sigma2(i)/(dni(i)*tau2(1)+sigma2(i)))*

& dni(i)*ybar(i)

enddo

do i=1,L

thetasig(i) = sqrt(tau2(1)*sigma2(i)/(dni(i)*tau2(1)

& +sigma2(i)))

enddo

return

end
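thetapar's mean and standard deviation simplify to the usual precision-weighted normal full conditional: the variance is tau^2 sigma_i^2/(n_i tau^2 + sigma_i^2), and the mean is that variance times (XB_i/tau^2 + n_i ybar_i/sigma_i^2). A Python check of that algebra, with illustrative names:

```python
def theta_conditional(tau2, sigma2_i, xb_i, n_i, ybar_i):
    """(mean, sd) of theta_i's normal full conditional, matching
    thetamu/thetasig in thetapar after factoring out the variance."""
    var = tau2 * sigma2_i / (n_i * tau2 + sigma2_i)
    mean = var * (xb_i / tau2 + n_i * ybar_i / sigma2_i)
    return mean, var ** 0.5
```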

!Subroutine for likelihood function

SUBROUTINE llike (betas,XB,tau2,Y,sigma2,thetas,

& L,M1,sigmaa,taua,likehood2,sigbeta,icountup,ni,

& adjust)

DOUBLE PRECISION betas(M1+1),XB(L),tau2(1),

& taua,Y(L,20),btb,thetas(L),

& sigma2(L),sigmaa(L),lik1,lik2,likehood,

& likehood2 ,adjust

INTEGER M1,L ,ni(L)

lik1=0.d0

lik2=0.d0

btb=0.d0

icountup = 0

Mtemp=M1+1

do i=1,L

lik1= lik1 - (sigmaa(i))*dlog(sigma2(i)) -

& (1/(2.d0*sigma2(i))) -

& (1/(2.d0*tau2(1)))*

& (thetas(i) - XB(i))*

& (thetas(i) - XB(i))

end do

do i=1,L

do j=1,ni(i)

lik2 = lik2 -(1/(2.d0*sigma2(i)))*(Y(i,j)-thetas(i))*

& (Y(i,j)-thetas(i))

end do


end do

do i = 1,M1+1

btb=btb + betas(i)*betas(i)

end do

likehood = lik1 + lik2 - (taua)*dlog(tau2(1))

& - (1/(2.d0*tau2(1))) - (1/(2.d0*sigbeta)) * btb

likehood2=likehood - adjust !Adjusting likelihood

!write(*,*) "likelihood =", likehood2

return

end
