incorporating linkage disequilibrium blocks in genome-wide ...€¦ · 4 currentworks alia dehman...
TRANSCRIPT
Incorporating linkage disequilibrium blocks inGenome-Wide Association Studies
Alia DehmanDirector: Christophe AmbroiseCo-director: Pierre Neuvial
Laboratoire "Statistique et Génome""Des Génomes Aux Organismes"
June 7th, 2013
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 1 / 23
1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD
2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods
3 ResultsTrue number of clustersUnder- and over-estimated clustering
4 Current works
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23
1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD
2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods
3 ResultsTrue number of clustersUnder- and over-estimated clustering
4 Current works
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23
1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD
2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods
3 ResultsTrue number of clustersUnder- and over-estimated clustering
4 Current works
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23
1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD
2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods
3 ResultsTrue number of clustersUnder- and over-estimated clustering
4 Current works
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23
Outline
1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD
2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods
3 ResultsTrue number of clustersUnder- and over-estimated clustering
4 Current works
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 3 / 23
The regression model
To identify genetic markers that are significantly associated with aphenotype of interest.
Phenotypic trait : qualitative or quantitativeGenetic markers : Single Nucleotide Polymorphisms (SNP)
The regression model
Yi = β0 +
p∑j=1
Xijβj + εi , i = 1, . . . , n
I n : number of individualsI p : number of covariatesI Yi : response for the individual iI X.j : observations for covariate j (coded in 0, 1 or 2)
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 4 / 23
Sparsity and high-dimension contexts
Sparsity : Only a subset of SNPs is significantly associated with thephenotype.
Card{j, βj 6= 0} � p
High-dimension : Many thousands of markers vs a few hundredobservations.
p� n
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 5 / 23
The LD measures
Linkage Disequilibrium (or Gametic Disequilibrium) : Is thenon-random association of alleles at two or more loci. Its amount dependson the difference between observed allelic frequencies and those expectedfrom a homogenous, randomly distributed model.
Zmj the indicator of the presence of minor allele for SNP j onmaternal copy.Zpj the indicator of the presence of minor allele for SNP j on paternalcopy.Zj ∼ B(pj)
D(j, k) = cov(Zj , Zk)
r2(j, k) = corr(Zj , Zk)
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 6 / 23
How to estimate it ?
snp vv vV VVuu a b cuU d e fUU g h i
snp v Vu α β
U γ δ
Only the genotype data table is observed : X.j = Zmj + Zpj .
α, β, γ, δ are estimateda system of equations. exmpl : α = 2a+ b+ d+ pe
with p the « probability » of the haplotype (uv, UV ).
⇒ estimating p, then (α, β, γ, δ) and finally D.
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 7 / 23
The LD-block structure
the r2 coefficients amongthe 50 first SNP of theChromosome 22(Dalmasso et al. 2008)
LD structured in blocks
Dimensions: 50 x 50Column
Row
10
20
30
40
10 20 30 40
0.0
0.2
0.4
0.6
0.8
1.0
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 8 / 23
The LD-block structure
the r2 coefficients amongthe 50 first SNP of theChromosome 22(Dalmasso et al. 2008)
LD structured in blocks
Dimensions: 50 x 50Column
Row
10
20
30
40
10 20 30 40
0.0
0.2
0.4
0.6
0.8
1.0
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 8 / 23
Outline
1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD
2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods
3 ResultsTrue number of clustersUnder- and over-estimated clustering
4 Current works
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 9 / 23
Classical approach : tag-SNP
To deal with high-dimensional problems and dependence among SNP :
based on LDselection of « representative » SNP of each LD-block : tagging(Chloé Friguet presentation)
Loss of information
Loss of power : tag-SNP not necessarly the causal SNP.
A different approach :a block-selection
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 10 / 23
A Two-Step Approach
Inference of blocks :
only the genotype data X are used.a p× p matrix LD pairwise measures is calculated.Ward Constrained Hierarchichal Clustering (R package rioja)
d(A,B) =nAnBnA + nB
(1
n2ASA,A +
1
n2BSB,B −
2
nAnBSA,B)
Selection of blocks associated with phenotype :
The Group Lasso : well-adapted to group-structured variables :
β̂λ = argminβ
(||y −Xβ||22 + λ
G∑g=1
√pg||βg||2).
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 11 / 23
Competing methods
Lasso :
β̂l1= argmin
β
∑i
(yi −Xi.β)2 + λ||β||1,
Elastic-Net :
β̂EN
= argminβ
∑i
(yi −Xi.β)2 + λ1||β||1 + λ2||β||22,
with λ, λ1 and λ2 three regularization parameters.
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 12 / 23
Evaluation
SNP-selection & block-selection :
SNP-level evaluation : causal SNP.
block-level evaluation : SNP in the same block that a causal SNP.
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 13 / 23
Simulation study
n = 200, p = 512.
9 groups of sizes (2, 2, 4, 8, 16, 32, 64, 128, 256).
the first 2 SNPs of groups of sizes 2, 2, 4, 8 are associated with thephenotype.
If j 6= j′ are in the same group, cov(X.j , X.j′) = ρ elsecov(X.j , X.j′) = 0.
R2 = 0.2 the coefficient of determination.
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 14 / 23
Outline
1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD
2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods
3 ResultsTrue number of clustersUnder- and over-estimated clustering
4 Current works
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 15 / 23
True number of clusters
ρ = 0, block-level ρ = 0.1, block-level ρ = 0.2, block-level
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2
0.0 0.2 0.4 0.6 0.8 1.00.
00.
20.
40.
60.
81.
0
FPR
TP
R
groupLassolassoelasticNet2
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2
ρ = 0, SNP-level ρ = 0.1, SNP-level ρ = 0.2, SNP-level
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2
0.0 0.2 0.4 0.6 0.8 1.00.
00.
20.
40.
60.
81.
0
FPR
TP
R
groupLassolassoelasticNet2
Figure: The number of clusters is set to 9.ALIA DEHMAN (Statistique et Génome) June 7th, 2013 16 / 23
Under-estimated clustering
ρ = 0.1, block-level ρ = 0.2, block-level
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2
ρ = 0.1, SNP-level ρ = 0.2, SNP-level
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2
Figure: The number of clusters is set to 5.ALIA DEHMAN (Statistique et Génome) June 7th, 2013 17 / 23
Over-estimated clustering
ρ = 0.1, block-level ρ = 0.2, block-level
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2
ρ = 0.1, SNP-level ρ = 0.2, SNP-level
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2
Figure: The number of clusters is set to 13.ALIA DEHMAN (Statistique et Génome) June 7th, 2013 18 / 23
Outline
1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD
2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods
3 ResultsTrue number of clustersUnder- and over-estimated clustering
4 Current works
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 19 / 23
Memory requirement of the clustering
rioja cWardType of entry p × p dissimilarity
matrixthe n × p designmatrix
Time complexity O(np2) O(np2)
Memory complexity O(p2) O(np)
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 20 / 23
Automatic model selection
Inferring the number of clusters :maximal jump (Bühlmann et. al., 2012, arXiv :1209.5908v1)BIC criterionGap Statistic (Tibshirani et. al., 2001, JRSSB)
Tree-Group Lasso
β̂Tree
= argminβ
∑i
(yi −Xi.β)2 + λ
d∑h=0
Gh∑g=1
ωhg ||βhg ||2.
ALIA DEHMAN (Statistique et Génome) June 7th, 2013 21 / 23
Automatic model selection
ρ = 0, block-level ρ = 0.1, block-level ρ = 0.2, block-level
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2groupLasso.Gaptree−lasso
0.0 0.2 0.4 0.6 0.8 1.00.
00.
20.
40.
60.
81.
0
FPR
TP
R
groupLassolassoelasticNet2groupLasso.Gaptree−lasso
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2groupLasso.Gaptree−lasso
ρ = 0, SNP-level ρ = 0.1, SNP-level ρ = 0.2, SNP-level
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2groupLasso.Gaptree−lasso
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2groupLasso.Gaptree−lasso
0.0 0.2 0.4 0.6 0.8 1.00.
00.
20.
40.
60.
81.
0
FPR
TP
R
groupLassolassoelasticNet2groupLasso.Gaptree−lasso
Figure: 1 cluster is inferred for ρ = 0, 5 clusters for ρ = 0.1 and 6 clusters forρ = 0.2ALIA DEHMAN (Statistique et Génome) June 7th, 2013 22 / 23
Current works
ρ = 0.5, block-level ρ = 0.7, block-level
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2groupLasso.Gaptree−lasso
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2groupLasso.Gaptree−lasso
ρ = 0.5, SNP-level ρ = 0.7, SNP-level
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2groupLasso.Gaptree−lasso
0.0 0.2 0.4 0.6 0.8 1.0
0.0
0.2
0.4
0.6
0.8
1.0
FPR
TP
R
groupLassolassoelasticNet2groupLasso.Gaptree−lasso
Figure: 7 clusters are inferred for ρ = 0.5 and 8 clusters for ρ = 0.7.ALIA DEHMAN (Statistique et Génome) June 7th, 2013 23 / 23