incorporating linkage disequilibrium blocks in genome-wide ...€¦ · 4 currentworks alia dehman...

27
Incorporating linkage disequilibrium blocks in Genome-Wide Association Studies Alia Dehman Director: Christophe Ambroise Co-director: Pierre Neuvial Laboratoire "Statistique et Génome" "Des Génomes Aux Organismes" June 7 th , 2013 ALIA DEHMAN (Statistique et Génome) June 7 th , 2013 1 / 23

Upload: others

Post on 19-Jul-2020

7 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Incorporating linkage disequilibrium blocks inGenome-Wide Association Studies

Alia DehmanDirector: Christophe AmbroiseCo-director: Pierre Neuvial

Laboratoire "Statistique et Génome""Des Génomes Aux Organismes"

June 7th, 2013

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 1 / 23

Page 2: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersUnder- and over-estimated clustering

4 Current works

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23

Page 3: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersUnder- and over-estimated clustering

4 Current works

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23

Page 4: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersUnder- and over-estimated clustering

4 Current works

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23

Page 5: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersUnder- and over-estimated clustering

4 Current works

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23

Page 6: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Outline

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersUnder- and over-estimated clustering

4 Current works

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 3 / 23

Page 7: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

The regression model

To identify genetic markers that are significantly associated with aphenotype of interest.

Phenotypic trait : qualitative or quantitativeGenetic markers : Single Nucleotide Polymorphisms (SNP)

The regression model

Yi = β0 +

p∑j=1

Xijβj + εi , i = 1, . . . , n

I n : number of individualsI p : number of covariatesI Yi : response for the individual iI X.j : observations for covariate j (coded in 0, 1 or 2)

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 4 / 23

Page 8: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Sparsity and high-dimension contexts

Sparsity : Only a subset of SNPs is significantly associated with thephenotype.

Card{j, βj 6= 0} � p

High-dimension : Many thousands of markers vs a few hundredobservations.

p� n

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 5 / 23

Page 9: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

The LD measures

Linkage Disequilibrium (or Gametic Disequilibrium) : Is thenon-random association of alleles at two or more loci. Its amount dependson the difference between observed allelic frequencies and those expectedfrom a homogenous, randomly distributed model.

Zmj the indicator of the presence of minor allele for SNP j onmaternal copy.Zpj the indicator of the presence of minor allele for SNP j on paternalcopy.Zj ∼ B(pj)

D(j, k) = cov(Zj , Zk)

r2(j, k) = corr(Zj , Zk)

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 6 / 23

Page 10: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

How to estimate it ?

snp vv vV VVuu a b cuU d e fUU g h i

snp v Vu α β

U γ δ

Only the genotype data table is observed : X.j = Zmj + Zpj .

α, β, γ, δ are estimateda system of equations. exmpl : α = 2a+ b+ d+ pe

with p the « probability » of the haplotype (uv, UV ).

⇒ estimating p, then (α, β, γ, δ) and finally D.

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 7 / 23

Page 11: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

The LD-block structure

the r2 coefficients amongthe 50 first SNP of theChromosome 22(Dalmasso et al. 2008)

LD structured in blocks

Dimensions: 50 x 50Column

Row

10

20

30

40

10 20 30 40

0.0

0.2

0.4

0.6

0.8

1.0

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 8 / 23

Page 12: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

The LD-block structure

the r2 coefficients amongthe 50 first SNP of theChromosome 22(Dalmasso et al. 2008)

LD structured in blocks

Dimensions: 50 x 50Column

Row

10

20

30

40

10 20 30 40

0.0

0.2

0.4

0.6

0.8

1.0

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 8 / 23

Page 13: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Outline

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersUnder- and over-estimated clustering

4 Current works

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 9 / 23

Page 14: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Classical approach : tag-SNP

To deal with high-dimensional problems and dependence among SNP :

based on LDselection of « representative » SNP of each LD-block : tagging(Chloé Friguet presentation)

Loss of information

Loss of power : tag-SNP not necessarly the causal SNP.

A different approach :a block-selection

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 10 / 23

Page 15: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

A Two-Step Approach

Inference of blocks :

only the genotype data X are used.a p× p matrix LD pairwise measures is calculated.Ward Constrained Hierarchichal Clustering (R package rioja)

d(A,B) =nAnBnA + nB

(1

n2ASA,A +

1

n2BSB,B −

2

nAnBSA,B)

Selection of blocks associated with phenotype :

The Group Lasso : well-adapted to group-structured variables :

β̂λ = argminβ

(||y −Xβ||22 + λ

G∑g=1

√pg||βg||2).

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 11 / 23

Page 16: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Competing methods

Lasso :

β̂l1= argmin

β

∑i

(yi −Xi.β)2 + λ||β||1,

Elastic-Net :

β̂EN

= argminβ

∑i

(yi −Xi.β)2 + λ1||β||1 + λ2||β||22,

with λ, λ1 and λ2 three regularization parameters.

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 12 / 23

Page 17: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Evaluation

SNP-selection & block-selection :

SNP-level evaluation : causal SNP.

block-level evaluation : SNP in the same block that a causal SNP.

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 13 / 23

Page 18: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Simulation study

n = 200, p = 512.

9 groups of sizes (2, 2, 4, 8, 16, 32, 64, 128, 256).

the first 2 SNPs of groups of sizes 2, 2, 4, 8 are associated with thephenotype.

If j 6= j′ are in the same group, cov(X.j , X.j′) = ρ elsecov(X.j , X.j′) = 0.

R2 = 0.2 the coefficient of determination.

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 14 / 23

Page 19: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Outline

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersUnder- and over-estimated clustering

4 Current works

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 15 / 23

Page 20: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

True number of clusters

ρ = 0, block-level ρ = 0.1, block-level ρ = 0.2, block-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.00.

00.

20.

40.

60.

81.

0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

ρ = 0, SNP-level ρ = 0.1, SNP-level ρ = 0.2, SNP-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.00.

00.

20.

40.

60.

81.

0

FPR

TP

R

groupLassolassoelasticNet2

Figure: The number of clusters is set to 9.ALIA DEHMAN (Statistique et Génome) June 7th, 2013 16 / 23

Page 21: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Under-estimated clustering

ρ = 0.1, block-level ρ = 0.2, block-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

ρ = 0.1, SNP-level ρ = 0.2, SNP-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

Figure: The number of clusters is set to 5.ALIA DEHMAN (Statistique et Génome) June 7th, 2013 17 / 23

Page 22: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Over-estimated clustering

ρ = 0.1, block-level ρ = 0.2, block-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

ρ = 0.1, SNP-level ρ = 0.2, SNP-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

Figure: The number of clusters is set to 13.ALIA DEHMAN (Statistique et Génome) June 7th, 2013 18 / 23

Page 23: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Outline

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersUnder- and over-estimated clustering

4 Current works

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 19 / 23

Page 24: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Memory requirement of the clustering

rioja cWardType of entry p × p dissimilarity

matrixthe n × p designmatrix

Time complexity O(np2) O(np2)

Memory complexity O(p2) O(np)

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 20 / 23

Page 25: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Automatic model selection

Inferring the number of clusters :maximal jump (Bühlmann et. al., 2012, arXiv :1209.5908v1)BIC criterionGap Statistic (Tibshirani et. al., 2001, JRSSB)

Tree-Group Lasso

β̂Tree

= argminβ

∑i

(yi −Xi.β)2 + λ

d∑h=0

Gh∑g=1

ωhg ||βhg ||2.

ALIA DEHMAN (Statistique et Génome) June 7th, 2013 21 / 23

Page 26: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Automatic model selection

ρ = 0, block-level ρ = 0.1, block-level ρ = 0.2, block-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

0.0 0.2 0.4 0.6 0.8 1.00.

00.

20.

40.

60.

81.

0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

ρ = 0, SNP-level ρ = 0.1, SNP-level ρ = 0.2, SNP-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

0.0 0.2 0.4 0.6 0.8 1.00.

00.

20.

40.

60.

81.

0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

Figure: 1 cluster is inferred for ρ = 0, 5 clusters for ρ = 0.1 and 6 clusters forρ = 0.2ALIA DEHMAN (Statistique et Génome) June 7th, 2013 22 / 23

Page 27: Incorporating linkage disequilibrium blocks in Genome-Wide ...€¦ · 4 Currentworks ALIA DEHMAN (Statistique et Génome) June 7th, 2013 2 / 23. Outline 1 Genome-Wide Association

Current works

ρ = 0.5, block-level ρ = 0.7, block-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

ρ = 0.5, SNP-level ρ = 0.7, SNP-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

Figure: 7 clusters are inferred for ρ = 0.5 and 8 clusters for ρ = 0.7.ALIA DEHMAN (Statistique et Génome) June 7th, 2013 23 / 23