
Using Profile Regression Mixture Models and Dirichlet Processes to explore the combined effect of risk factors; the R package PReMiuM

Silvia Liverani and Michail Papathomas

British-German meeting on classification - UCL, November 2013

SL and MP (ICL/MRC and St Andrews)


Outline

- Motivation
- Method
- The R package PReMiuM
- Examples


People

- David Hastie
- John Molitor
- Sylvia Richardson


Motivation

Multicollinearity

- Goal of epidemiological studies is to investigate the joint effect of different covariates / risk factors on a phenotype...
- ... but highly correlated risk factors create collinearity problems!

Example: Researchers are interested in determining if a relationship exists between blood pressure (y = BP, in mm Hg) and

- weight (x1 = Weight, in kg)
- body surface area (x2 = BSA, in sq m)
- duration of hypertension (x3 = Dur, in years)
- basal pulse (x4 = Pulse, in beats per minute)
- stress index (x5 = Stress)


Motivation

[Figure: scatter plot of BSA (vertical axis) against Weight (horizontal axis); Weight = x1, BSA = x2]


Motivation

- Highly correlated risk factors create collinearity problems, causing instability in model estimation

Model          β̂1     SE(β̂1)   β̂2       SE(β̂2)
y ∼ x1         2.64   0.30     –        –
y ∼ x2         –      –        3.34     1.33
y ∼ x1 + x2    6.58   0.53     -20.44   2.28

- Effect 1: the estimated regression coefficient of any one variable depends on which other predictor variables are included in the model.
- Effect 2: the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.
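
Both effects can be reproduced on simulated data. The following is a minimal R sketch and does not use the blood pressure data from the slides: the sample size, the true coefficients and the noise levels are arbitrary assumptions, chosen only so that x1 and x2 are strongly correlated.

```r
# Simulated illustration of collinearity (all numbers are assumptions).
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)   # x2 is almost a copy of x1
y  <- x1 + x2 + rnorm(n)

fit_single <- lm(y ~ x1)        # x1 alone
fit_joint  <- lm(y ~ x1 + x2)   # x1 together with the correlated x2

# Estimate and standard error of the x1 coefficient in each model
cols <- c("Estimate", "Std. Error")
coef_single <- summary(fit_single)$coefficients["x1", cols]
coef_joint  <- summary(fit_joint)$coefficients["x1", cols]

print(rbind(single = coef_single, joint = coef_joint))
```

Comparing the two rows shows how the estimate for x1 shifts and its standard error grows once the correlated x2 enters the model.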


Motivation

When it is of interest to detect interactions between covariates, standard regression modelling may become problematic.

- In a classical setting, fitting a linear model with many parameters sometimes requires an impractically large vector of observations to produce valid inferences (Burton et al., IJE, 2009). Also, identifiability and collinearity problems are often present.
- In Bayesian model comparison, the space of models becomes vast, and model search algorithms like the Reversible Jump approach (Green, Bka 1995) require an impractically large number of iterations before they converge (Dobra and Massam, St Meth 2010).


Profile Regression

Issues caused by

- correlated risk factors
- interacting risk factors

Profile regression

- partitions the subjects into groups according to covariate profile
- investigation of the joint effects of multiple risk factors
- jointly models the covariate patterns and health outcomes
- flexible but tractable Bayesian model


[Figure panels: "x1 and x2" and "only x1"]


Profile Regression

Notation

For individual $i$:

- $y_i$: outcome of interest
- $x_i = (x_{i1}, \ldots, x_{iP})$: covariate profile
- $w_i$: fixed effects
- $z_i = c$: the allocation variable indicates the cluster to which individual $i$ belongs


Profile Regression

Statistical Framework

- Mixture model for the covariates

\[ f(x_i \mid \phi, \psi) = \sum_c \psi_c \, f(x_i \mid z_i = c, \phi_c) \]

- For example, for discrete covariates

\[ f(x_i \mid z_i = c, \phi_c) = \prod_{j=1}^{J} \phi_{z_i, j, x_{i,j}} \]


Profile Regression

Statistical Framework

- Mixture model for the covariates

\[ f(x_i \mid \phi, \psi) = \sum_c \psi_c \, f(x_i \mid z_i = c, \phi_c) \]

- Prior model for the mixture weights $\psi_c$
  - stick-breaking priors (constructive definition of the Dirichlet Process)

\[ P(Z_i = c \mid \psi) = \psi_c, \qquad \psi_1 = V_1, \qquad \psi_c = V_c \prod_{l<c} (1 - V_l), \qquad V_c \sim \mathrm{Beta}(1, \alpha) \]

  - The larger the concentration parameter $\alpha$, the more evenly distributed the resulting distribution.
  - The smaller the concentration parameter $\alpha$, the more sparsely distributed the resulting distribution, with all but a few parameters having a probability near zero.
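
The stick-breaking construction is easy to simulate. The sketch below is an illustration in base R, not code from the package: the function name `stick_breaking` and the truncation level `n_sticks` are assumptions introduced here, and truncating means the simulated weights sum to slightly less than one.

```r
# Truncated stick-breaking construction: psi_1 = V_1,
# psi_c = V_c * prod_{l < c} (1 - V_l), with V_c ~ Beta(1, alpha).
# The truncation level n_sticks is an assumption of this sketch.
stick_breaking <- function(alpha, n_sticks = 50) {
  v <- rbeta(n_sticks, 1, alpha)
  remaining <- c(1, cumprod(1 - v)[-n_sticks])  # stick left before each break
  v * remaining
}

set.seed(42)
psi_small <- stick_breaking(alpha = 0.5)
psi_large <- stick_breaking(alpha = 10)

# Number of weights above 1% of the total mass: typically only a few for a
# small alpha and many more for a large alpha.
sum(psi_small > 0.01)
sum(psi_large > 0.01)
```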


Profile Regression

Statistical Framework

- Joint covariate and response model

\[ f(x_i, y_i \mid \phi, \theta, \psi, \beta) = \sum_c \psi_c \, f(x_i \mid z_i = c, \phi_c) \, f(y_i \mid z_i = c, \theta_c, \beta, w_i) \]

- Mixture model jointly for covariate and response
- For example, for Bernoulli outcome

\[ \mathrm{logit}\{p(y_i = 1 \mid \theta_c, \beta, w_i)\} = \theta_c + \beta^T w_i \]

- The association of the profiles with the response is characterised by the risk effect parameters $\theta_c$
- Note, the above framework, adopted in Molitor et al. (Biostatistics, 2010), is similar to the one in Bigelow and Dunson (JASA, 2009).
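
To make the joint model concrete, the following base R sketch evaluates the mixture density for one individual with discrete covariates and a Bernoulli outcome. It is an illustration only: the function names, the storage of $\phi$ as a list of matrices, the coding of levels as 1..K and all parameter values are assumptions made here, not the package's internals.

```r
# Bernoulli response model: logit P(y = 1) = theta_c + beta' w
response_density <- function(y, theta_c, beta, w) {
  p <- plogis(theta_c + sum(beta * w))   # inverse logit
  if (y == 1) p else 1 - p
}

# Joint density: sum_c psi_c f(x | z = c, phi_c) f(y | z = c, theta_c, beta, w).
# phi[[k]] is a J x K matrix; row j holds the level probabilities of covariate j.
joint_density <- function(x, y, w, psi, phi, theta, beta) {
  per_cluster <- vapply(seq_along(psi), function(k) {
    f_x <- prod(phi[[k]][cbind(seq_along(x), x)])  # discrete covariate model
    f_y <- response_density(y, theta[k], beta, w)
    psi[k] * f_x * f_y
  }, numeric(1))
  sum(per_cluster)
}

# Two clusters, J = 3 covariates with K = 2 levels each (made-up values)
phi <- list(
  matrix(c(0.9, 0.1,  0.8, 0.2,  0.7, 0.3), nrow = 3, byrow = TRUE),
  matrix(c(0.2, 0.8,  0.3, 0.7,  0.4, 0.6), nrow = 3, byrow = TRUE)
)
psi   <- c(0.6, 0.4)
theta <- c(-1, 1)
beta  <- c(0.5, -0.5)
x <- c(1, 1, 2)
w <- c(1, 1)

p1 <- joint_density(x, 1, w, psi, phi, theta, beta)
p0 <- joint_density(x, 0, w, psi, phi, theta, beta)
c(y1 = p1, y0 = p0)
```

Summing the two values over the outcome recovers the covariate mixture density of the profile, which is a quick sanity check on the factorisation.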


Profile Regression

Implementation

We have implemented profile regression in C++ (now wrapped in an R package) for

- binary, binomial, categorical, Normal and Poisson outcomes
- Normal and discrete covariates

It can do

- dependent or independent slice sampling (Kalli et al., 2011)
- or truncated Dirichlet process model (Ishwaran and James, 2001)

as well as

- handles missing data
  - we check which cluster the subject is allocated to and then sample
- includes some label switching moves


Profile Regression

Characterising the “best partition”

- Construct a score matrix $S = (S_{ij})$ that records the probability that individuals $i$ and $j$ are assigned to the same cluster.
- The task is to find a partition $z^{best}$ that best represents $S$
  - susceptible to Monte Carlo error
  - Dahl (2006) suggests using a least squares distance criterion
  - Alternatively, we process the similarity matrix through the Partitioning Around Medoids algorithm (Kaufman & Rousseeuw, 1990) to derive $z^{best}$.
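
The score matrix and the least squares criterion can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's implementation: the allocation samples `z_samples` are made up, and `coclustering` is a helper name introduced here.

```r
# z_samples: one row per MCMC sweep, one column per individual; entries are
# cluster labels. The tiny matrix below is made up for illustration.
z_samples <- rbind(
  c(1, 1, 2, 2),
  c(1, 1, 1, 2),
  c(2, 2, 1, 1),
  c(1, 2, 3, 3)
)
n_sweeps <- nrow(z_samples)

# Co-clustering indicator matrix for one sweep (invariant to label switching)
coclustering <- function(z) outer(z, z, "==") * 1

# Score matrix S: proportion of sweeps in which i and j share a cluster
S <- Reduce(`+`, lapply(seq_len(n_sweeps),
                        function(s) coclustering(z_samples[s, ]))) / n_sweeps

# Least squares criterion: pick the sampled partition closest to S
ls_distance <- apply(z_samples, 1, function(z) sum((coclustering(z) - S)^2))
z_best <- z_samples[which.min(ls_distance), ]
z_best
```

For the Partitioning Around Medoids alternative, one option would be to pass the dissimilarity 1 - S to `pam()` from the cluster package for a chosen number of clusters.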


Profile Regression

Post-processing the output

- Once $z^{best}$ is defined, it is important to have a measure of uncertainty / stability
- This can be done by model averaging quantities related to $z^{best}$ (effects and cluster related parameters) through post-processing
- If $z^{best}$ is a consistent clustering, then subgroup parameters will vary little from iteration to iteration, leading to narrow posterior credible intervals.
- Predicting
  - At each iteration of the sampler, calculate the probability that $i'$ belongs to each of the clusters

\[ p(z_{i'} = c \mid x_{i'}, \psi_c, \phi) \propto p(z_{i'} = c \mid \psi_c) \, p(x_{i'} \mid z_{i'} = c, \phi) \]

    and sample $\theta_c$ according to these probabilities
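
The prediction step for a single sweep can be sketched as follows for discrete covariates. The function name `allocation_probs` and every parameter value are assumptions for illustration; in a real run the values of $\psi$, $\phi$ and $\theta$ would come from the sampler output at that sweep.

```r
# Allocation probabilities for a new profile x_new at one sweep:
# p(z = c | x_new) is proportional to psi_c * f(x_new | z = c, phi_c).
allocation_probs <- function(x_new, psi, phi) {
  unnorm <- psi * vapply(phi, function(phi_c) {
    prod(phi_c[cbind(seq_along(x_new), x_new)])
  }, numeric(1))
  unnorm / sum(unnorm)
}

# Made-up state of the sampler at one sweep: two clusters, three covariates
phi <- list(
  matrix(c(0.9, 0.1,  0.8, 0.2,  0.7, 0.3), nrow = 3, byrow = TRUE),
  matrix(c(0.2, 0.8,  0.3, 0.7,  0.4, 0.6), nrow = 3, byrow = TRUE)
)
psi   <- c(0.6, 0.4)
theta <- c(-1, 1)

probs <- allocation_probs(c(1, 1, 2), psi, phi)

# Draw the risk parameter for the new profile according to these probabilities
set.seed(7)
theta_new <- sample(theta, size = 1, prob = probs)
```

Repeating this at every sweep gives a sample of risk parameters for the new profile, from which predictive summaries can be computed.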


The R package PReMiuM

R package PReMiuM

Program to implement Dirichlet Process Bayesian Clustering

- Allows user to exclude the response from the model
- Allows user to run with a fixed alpha or update alpha (default)
- Allows users to run predictive scenarios (basic or Rao-Blackwellised predictions)
- Handling of missing data
- Adaptive MCMC where appropriate
- Can generate simulation data
- Performs post processing
- Functions for flexible plotting


The R package PReMiuM

R package PReMiuM

- Can be downloaded from CRAN: http://cran.r-project.org/web/packages/PReMiuM/
- or

install.packages("PReMiuM")


The R package PReMiuM

Example: Simulated data

The profiles are given by

- y: outcome, Bernoulli
- x: 5 covariates, all discrete with 3 levels
- w: 2 fixed effects, continuous or discrete

> profRegr(yModel="Bernoulli", xModel="Discrete",
    nSweeps=1000, nBurn=1000,
    data="dataMatrix.txt", output="output/output",
    nCovariates=5, covNames=c("Var1","Var2","Var3","Var4","Var5"),
    nFixedEffects=2, fixedEffectsNames=c("FE1","FE2"),
    xLevels=c(3,3,3,3,3))


The R package PReMiuM

R package DiPBaC: Output

The programme creates a number of output files:

- output_z.txt → allocations z
- output_nClusters.txt → cluster sizes
- output_theta.txt → cluster-specific parameter for outcome: θ
- output_phi.txt → cluster-specific parameter for covariates: φ
- output_alpha.txt → parameter of the Dirichlet process α
- output_beta.txt → fixed effects parameter β
- output_entropy.txt
- output_nMembers.txt
- output_logPost.txt
- output_psi.txt
- output_betaProp.txt
- output_log.txt
- output_thetaProp.txt
- output_alphaProp.txt
- output_input.txt


The R package PReMiuM

R package PReMiuM: Postprocessing

- Returns file with output info

> runInfoObj<-readRunInfo(fileStem="output",directoryPath="output")

- Computes the score matrix that records the probability that each pair of individuals is in the same cluster

> dissimObj<-calcDissimilarityMatrix(runInfoObj)

- Computes the optimal partition

> clusObj<-calcOptimalClustering(dissimObj)


The R package PReMiuM

R package PReMiuM: Postprocessing

- Computes the average risk and profile for each cluster

> riskProfileObj<-calcAvgRiskAndProfile(clusObj)

- Plots

> clusterOrderObj<-plotRiskProfile(riskProfileObj,"output/summary.png")


The R package PReMiuM

Predictions

> profRegr(yModel="Bernoulli", xModel="Discrete",
    nSweeps=1000, nBurn=1000,
    data="dataMatrix.txt", output="output/output",
    nCovariates=5, covNames=c("Var1","Var2","Var3","Var4","Var5"),
    nFixedEffects=2, fixedEffectsNames=c("FE1","FE2"),
    xLevels=c(3,3,3,3,3),
    predict="predict.scenarios.txt")


The R package PReMiuM

Predictions

- The file predict.scenarios.txt contains the predictive scenarios for the 5 covariates and the 2 fixed effects.

0 1 0 0 0 -0.348608422340999 0.786982503814148
0 0 1 0 0 3.832601433577385 -0.67410633721989

- Postprocessing follows as before
- Predictions are postprocessed separately as follows

> predictions<-calcPredictions(riskProfileObj,doRaoBlackwell=F,fullSweepPredictions=T,fullSweepLogOR=T)


The R package PReMiuM

Variable selection

- In many applications, the number of predictors is large and the full profiles become difficult to interpret
  ⇒ Aim is to identify the predictors that contribute more than others to the formation of sub-populations
- Use statistical procedures embedding variable selection within profile regression (Papathomas et al., 2012), inspired by Tadesse et al. (2005) and Chung and Dunson (2009)


The R package PReMiuM

Variable selection

- Context of application: genetic epidemiology where there is interest in investigating “gene-gene interactions”
- As the number of markers genotyped is huge, blind search for interactions followed by adjustment for multiple testing has very low power → more focussed strategies are recommended
- In MP et al. (2012) we investigate how profile regression combined with variable selection can highlight combinations of SNPs associated with higher disease risk in a genetic association study of lung cancer (Hung et al 2008).


The R package PReMiuM

Using 0-1 variable selection switches

I Consider cluster-specific binary switches $\gamma_{c,j}$, so that
    I $\gamma_{c,j} = 1$ when predictor $x_j$ is important for allocating subjects to cluster $c$; in this case the associated probabilities are $\phi_{c,j}(k)$, $1 \le k \le m_j$
    I when $\gamma_{c,j} = 0$ the associated probability is $\phi_{0,j}(k)$, a common probability for all individuals for whom $x_{ij} = k$.

The discrete covariate model is modified accordingly to

$$f(x_i \mid \phi_c, \gamma_c) = \prod_{j:\, \gamma_{c,j}=1} \phi_{z_i,j}(x_{ij}) \prod_{j:\, \gamma_{c,j}=0} \phi_{0,j}(x_{ij})$$

I Prior for switches: given $\rho_j$, $\gamma_{c,j} \sim \mathrm{Bernoulli}(\rho_j)$.
I We consider a sparsity-inducing prior for $\rho$ with an atom at zero:

$$\rho_j \sim 1_{\{w_j=0\}}\,\delta_0(\rho_j) + 1_{\{w_j=1\}}\,\mathrm{Beta}(\alpha_\rho, \beta_\rho)$$

where $w_j \sim \mathrm{Bernoulli}(0.5)$.

This is similar to Chung and Dunson (JASA, 2009), but in their set-up covariate observations contribute to the likelihood through a regression model. In our case, covariate observations contribute directly to the likelihood, and we introduce $\phi_{0,j}(x_{ij})$.
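As a minimal base R sketch of the switch model, the code below evaluates the modified covariate likelihood for one subject and draws from the sparsity-inducing prior. It does not use PReMiuM itself; the function names `cov_lik_switch` and `draw_rho`, the toy probabilities and the Beta hyperparameters are all hypothetical choices made for illustration.

```r
# Likelihood of one subject's discrete covariates under 0-1 switches.
# x       : vector of length J, x[j] in 1..m_j (the subject's categories)
# phi_c   : list of length J; phi_c[[j]] holds phi_{c,j}(.) for the subject's cluster
# phi_0   : list of length J; phi_0[[j]] holds the common probabilities phi_{0,j}(.)
# gamma_c : 0/1 vector of length J; switches for the subject's cluster
cov_lik_switch <- function(x, phi_c, phi_0, gamma_c) {
  terms <- vapply(seq_along(x), function(j) {
    if (gamma_c[j] == 1) phi_c[[j]][x[j]] else phi_0[[j]][x[j]]
  }, numeric(1))
  prod(terms)
}

# Draw rho_j from the prior: an atom at zero with probability 0.5,
# otherwise a Beta(a_rho, b_rho) draw
draw_rho <- function(J, a_rho, b_rho) {
  w <- rbinom(J, size = 1, prob = 0.5)
  ifelse(w == 1, rbeta(J, a_rho, b_rho), 0)
}

# Toy example with J = 3 binary covariates (all numbers are made up)
phi_c <- list(c(0.9, 0.1), c(0.2, 0.8), c(0.5, 0.5))
phi_0 <- list(c(0.6, 0.4), c(0.5, 0.5), c(0.7, 0.3))
x <- c(1, 2, 1)
gamma_c <- c(1, 1, 0)

lik <- cov_lik_switch(x, phi_c, phi_0, gamma_c)  # 0.9 * 0.8 * 0.7

set.seed(1)
rho <- draw_rho(J = 3, a_rho = 0.5, b_rho = 0.5)  # hyperparameters assumed
gamma_draw <- rbinom(3, size = 1, prob = rho)     # switches given rho
```

Whenever a draw of $\rho_j$ lands on the atom at zero, the corresponding switch is 0 with certainty, so predictor $j$ is excluded from the clustering.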


The R package PReMiuM

Using latent selection probabilities

I The 0-1 variable selection switches model sometimes exhibits poor mixing, requiring long runs

I We introduce a new parametrisation involving continuous selection probabilities $\zeta_j$, taking values in $[0,1]$

I Each $\phi_{c,j}$ is replaced by a hybrid parameter,

$$\phi^*_{c,j}(x) = \zeta_j \times \phi_{c,j}(x) + (1 - \zeta_j) \times p_j(x) \qquad (1)$$

where $p_j(k)$ is the observed proportion of individuals for whom $x_{ij} = k$.

I Parameters $\zeta_j$ can be viewed as proxies for the probability that predictor $j$ supports the cluster structure: important predictors for the clustering will have values of $\zeta_j$ close to 1.

I Similarly, we use a sparsity-inducing prior for $\zeta_j$ with an atom at zero.
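The hybrid parameter in equation (1) can be sketched in base R as follows. This is an illustration under assumed toy values, not PReMiuM code; `hybrid_phi` and all numbers are hypothetical.

```r
# Hybrid parameter of equation (1) for one predictor j:
# phi*_{c,j}(x) = zeta_j * phi_{c,j}(x) + (1 - zeta_j) * p_j(x)
# phi_cj : cluster-specific category probabilities for predictor j
# p_j    : observed category proportions for predictor j
# zeta_j : selection probability in [0, 1]
hybrid_phi <- function(phi_cj, p_j, zeta_j) {
  stopifnot(zeta_j >= 0, zeta_j <= 1, length(phi_cj) == length(p_j))
  zeta_j * phi_cj + (1 - zeta_j) * p_j
}

# Observed proportions from toy data for a 3-level predictor (values made up)
x_j <- c(1, 1, 2, 3, 3, 3, 2, 1, 3, 3)
p_j <- as.numeric(table(factor(x_j, levels = 1:3))) / length(x_j)

phi_cj <- c(0.8, 0.1, 0.1)  # assumed cluster-specific probabilities

phi_irrelevant <- hybrid_phi(phi_cj, p_j, zeta_j = 0)    # equals p_j
phi_important  <- hybrid_phi(phi_cj, p_j, zeta_j = 1)    # equals phi_cj
phi_partial    <- hybrid_phi(phi_cj, p_j, zeta_j = 0.5)  # halfway between
```

Because the hybrid is a convex combination of two probability vectors, it is itself a valid probability vector for any $\zeta_j$ in $[0,1]$.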


The R package PReMiuM

Variable Selection: R commands

> profRegr(yModel="Bernoulli", xModel="Discrete",
    nSweeps=1000, nBurn=1000,
    data="dataMatrix.txt", output="output/output",
    nCovariates=10, covNames=paste("Var", 1:10, sep=""),
    nFixedEffects=0, xLevels=rep(2,10),
    varSelect="BinaryCluster")


Additional remarks - References

Other features

Other inputs for profRegr:

hyper       Set hyperparameters
nClusInit   Set initial number of clusters (an integer)
sampler     Set sampler: Truncated, SliceDependent or SliceIndependent
alpha       Can fix the value of α
excludeY
extraYVar


p(Z | X, Y, α) - simulated data


p(Z | X, Y, α) - real data


Additional remarks - References

Further work

I Mixing properties of the algorithm: further moves, tempering
I Extension to include spatial and time components
I New epidemiological and clinical applications, and gene-gene and gene-environment interaction search


Additional remarks - References

References

I Bigelow and Dunson (2009). Bayesian Semiparametric Joint Models for Functional Predictors. JASA, 104, 26-36.

I Molitor, J. T., Papathomas, M., Jerrett, M. and Richardson, S. (2010). Bayesian Profile Regression with an Application to the National Survey of Children's Health. Biostatistics, 11, 484-498.

I Papathomas, M., Molitor, J., Richardson, S., Riboli, E. and Vineis, P. (2011). Examining the joint effect of multiple risk factors using exposure risk profiles: lung cancer in non smokers. Environmental Health Perspectives, 119, 84-91.

I Papathomas, M., Molitor, J., Hoggart, C., Hastie, D. and Richardson, S. (2012). Exploring data from genetic association studies using Bayesian variable selection and the Dirichlet process: application to searching for gene-gene patterns. Genetic Epidemiology, 36, 663-674.

I Hastie, D. I., Liverani, S. and Richardson, S. (2013). Sampling from Dirichlet process mixture models with unknown concentration parameter: Mixing issues in large data implementations. Available at http://uk.arxiv.org/abs/1304.1778

I Liverani, S., Hastie, D. I., Papathomas, M. and Richardson, S. (2013). PReMiuM: An R package for Profile Regression Mixture Models using Dirichlet Processes. Available at http://uk.arxiv.org/abs/1303.2836
