incorporating linkage disequilibrium blocks in genome-wide

26
Incorporating linkage disequilibrium blocks in Genome-Wide Association Studies Alia Dehman Christophe Ambroise Pierre Neuvial Laboratoire Statistique et Génome July 2 nd , 2013 ALIA DEHMAN (Statistique et Génome) July 2 nd , 2013 1 / 22

Upload: others

Post on 22-Jun-2022

15 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Incorporating linkage disequilibrium blocks in Genome-Wide

Incorporating linkage disequilibrium blocks inGenome-Wide Association Studies

Alia DehmanChristophe Ambroise

Pierre Neuvial

Laboratoire Statistique et Génome

July 2nd, 2013

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 1 / 22

Page 2: Incorporating linkage disequilibrium blocks in Genome-Wide

Outline

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersMisspecified number of clusters

4 Current works

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 2 / 22

Page 3: Incorporating linkage disequilibrium blocks in Genome-Wide

Outline

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersMisspecified number of clusters

4 Current works

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 2 / 22

Page 4: Incorporating linkage disequilibrium blocks in Genome-Wide

Outline

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersMisspecified number of clusters

4 Current works

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 2 / 22

Page 5: Incorporating linkage disequilibrium blocks in Genome-Wide

Outline

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersMisspecified number of clusters

4 Current works

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 2 / 22

Page 6: Incorporating linkage disequilibrium blocks in Genome-Wide

Genome-Wide Association Studies

Outline

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersMisspecified number of clusters

4 Current works

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 3 / 22

Page 7: Incorporating linkage disequilibrium blocks in Genome-Wide

Genome-Wide Association Studies The regression model

The regression model

To identify genetic markers that are significantly associated with aphenotype of interest.

Phenotypic trait : qualitative or quantitativeGenetic markers : Single Nucleotide Polymorphisms (SNP)

The regression model

Yi = β0 +

p∑j=1

Xijβj + εi , i = 1, . . . , n

I n : number of individualsI p : number of covariatesI Yi : response for the individual iI X.j : observations for covariate j (coded in 0, 1 or 2)

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 4 / 22

Page 8: Incorporating linkage disequilibrium blocks in Genome-Wide

Genome-Wide Association Studies Sparsity and high-dimension contexts

Sparsity and high-dimension contexts

Sparsity : Only a subset of SNPs is significantly associated with thephenotype.

Card{j, βj 6= 0} � p

High-dimension : Many thousands of markers vs a few hundredobservations.

p� n

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 5 / 22

Page 9: Incorporating linkage disequilibrium blocks in Genome-Wide

Genome-Wide Association Studies Biological context : LD

The LD measures

Linkage Disequilibrium (or Gametic Disequilibrium) : Is thenon-random association of alleles at two or more loci. Its amount dependson the difference between observed allelic frequencies and those expectedfrom a homogenous, randomly distributed model.

Zj the indicator of the presence of minor allele for SNP j.Zj ∼ B(pj)

D(j, k) = cov(Zj , Zk)

r2(j, k) = corr(Zj , Zk)

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 6 / 22

Page 10: Incorporating linkage disequilibrium blocks in Genome-Wide

Genome-Wide Association Studies Biological context : LD

How to estimate it ?

snp vv vV VVuu a b cuU d e fUU g h i

snp v Vu α β

U γ δ

Only the genotype data table is observed

α, β, γ, δ are estimateda system of equations. e.g : α = 2a+ b+ d+ pe

with p the « probability » of the haplotype (uv, UV ).

⇒ estimating p, then (α, β, γ, δ) and finally D = pUV − pUpV .

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 7 / 22

Page 11: Incorporating linkage disequilibrium blocks in Genome-Wide

Genome-Wide Association Studies Biological context : LD

The LD-block structure

the r2 coefficients amongthe 50 first SNP of theChromosome 22(Dalmasso et al. 2008)

LD structured in blocks

Dimensions: 50 x 50Column

Row

10

20

30

40

10 20 30 40

0.0

0.2

0.4

0.6

0.8

1.0

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 8 / 22

Page 12: Incorporating linkage disequilibrium blocks in Genome-Wide

Genome-Wide Association Studies Biological context : LD

The LD-block structure

the r2 coefficients amongthe 50 first SNP of theChromosome 22(Dalmasso et al. 2008)

LD structured in blocks

Dimensions: 50 x 50Column

Row

10

20

30

40

10 20 30 40

0.0

0.2

0.4

0.6

0.8

1.0

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 8 / 22

Page 13: Incorporating linkage disequilibrium blocks in Genome-Wide

Taking the group structure into account

Outline

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersMisspecified number of clusters

4 Current works

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 9 / 22

Page 14: Incorporating linkage disequilibrium blocks in Genome-Wide

Taking the group structure into account Classical approach

Classical approach : tag-SNP

To deal with high-dimensional problems and dependence among SNP :

based on LDselection of « representative » SNP of each LD-block : tagging

Loss of information

Loss of power : tag-SNP not necessarly the causal SNP.

A different approach :a block-selection

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 10 / 22

Page 15: Incorporating linkage disequilibrium blocks in Genome-Wide

Taking the group structure into account A Two-Step Approach

A Two-Step Approach

Inference of blocks

only the genotype data X are used.a p× p matrix LD pairwise measures is calculated.Ward Constrained Hierarchichal Clustering (R package rioja)

Selection of blocks associated with phenotype

The Group Lasso : well-adapted to group-structured variables

β̂λ = argminβ

∑i

(yi −Xi.β)2+ λ

G∑g=1

√pg||βg||2).

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 11 / 22

Page 16: Incorporating linkage disequilibrium blocks in Genome-Wide

Taking the group structure into account Competing methods

Competing methods

Lasso

β̂l1= argmin

β

∑i

(yi −Xi.β)2 + λ||β||1,

Elastic-Net

β̂EN

= argminβ

∑i

(yi −Xi.β)2 + λ1||β||1 + λ2||β||22,

with λ, λ1 and λ2 three regularization parameters.(R package quadrupen)

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 12 / 22

Page 17: Incorporating linkage disequilibrium blocks in Genome-Wide

Taking the group structure into account Competing methods

Evaluation

Parametersn = 200, p = 512, K = 9 groups of sizes(2, 2, 4, 8, 16, 32, 64, 128, 256).The first 2 SNPs of groups of sizes 2, 2, 4, 8 are associated with thephenotype.cov(X.j , X.j′) = ρ 1j=j′ .Coefficient of determination : R2 = 0.2.

Definition of associated SNPs

SNP-level Block-levelALIA DEHMAN (Statistique et Génome) July 2nd, 2013 13 / 22

Page 18: Incorporating linkage disequilibrium blocks in Genome-Wide

Results

Outline

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersMisspecified number of clusters

4 Current works

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 14 / 22

Page 19: Incorporating linkage disequilibrium blocks in Genome-Wide

Results True number of clusters

True number of clusters

ρ = 0, block-level ρ = 0.1, block-level ρ = 0.2, block-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.00.

00.

20.

40.

60.

81.

0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

ρ = 0, SNP-level ρ = 0.1, SNP-level ρ = 0.2, SNP-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.00.

00.

20.

40.

60.

81.

0

FPR

TP

R

groupLassolassoelasticNet2

Figure: The number of clusters is set to 9.ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 15 / 22

Page 20: Incorporating linkage disequilibrium blocks in Genome-Wide

Results Misspecified number of clusters

Misspecified number of clusters : too few

ρ = 0.1, block-level ρ = 0.2, block-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

ρ = 0.1, SNP-level ρ = 0.2, SNP-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

Figure: The number of clusters is set to 5.ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 16 / 22

Page 21: Incorporating linkage disequilibrium blocks in Genome-Wide

Results Misspecified number of clusters

Misspecified number of clusters : too many

ρ = 0.1, block-level ρ = 0.2, block-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

ρ = 0.1, SNP-level ρ = 0.2, SNP-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2

Figure: The number of clusters is set to 13.ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 17 / 22

Page 22: Incorporating linkage disequilibrium blocks in Genome-Wide

Current works

Outline

1 Genome-Wide Association StudiesThe regression modelSparsity and high-dimension contextsBiological context : LD

2 Taking the group structure into accountClassical approachA Two-Step ApproachCompeting methods

3 ResultsTrue number of clustersMisspecified number of clusters

4 Current works

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 18 / 22

Page 23: Incorporating linkage disequilibrium blocks in Genome-Wide

Current works

Memory requirement of the clustering

Ward Constrained Hierarchichal Clustering

d(A,B) =nAnBnA + nB

(1

n2ASA,A +

1

n2BSB,B −

2

nAnBSA,B

)

rioja cWardType of entry p × p dissimilarity

matrixthe n × p designmatrix

Time complexity O(np2) O(np2)

Memory complexity O(p2) O(np)

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 19 / 22

Page 24: Incorporating linkage disequilibrium blocks in Genome-Wide

Current works

Automatic model selection

Inferring the number of clusters :maximal gap (Bühlmann et. al., 2012, arXiv :1209.5908v1)BIC criterionGap Statistic (Tibshirani et. al., 2001, JRSSB)

Tree-Group Lasso

β̂Tree

= argminβ

∑i

(yi −Xi.β)2 + λ

d∑h=0

Gh∑g=1

ωhg ||βhg ||2.

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 20 / 22

Page 25: Incorporating linkage disequilibrium blocks in Genome-Wide

Current works

Automatic model selection

ρ = 0, block-level ρ = 0.1, block-level ρ = 0.2, block-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

0.0 0.2 0.4 0.6 0.8 1.00.

00.

20.

40.

60.

81.

0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

ρ = 0, SNP-level ρ = 0.1, SNP-level ρ = 0.2, SNP-level

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

0.0 0.2 0.4 0.6 0.8 1.00.

00.

20.

40.

60.

81.

0

FPR

TP

R

groupLassolassoelasticNet2groupLasso.Gaptree−lasso

Figure: ρ = 0 : 1 cluster, ρ = 0.1 : 5 clusters, ρ = 0.2 : 6 clustersALIA DEHMAN (Statistique et Génome) July 2nd, 2013 21 / 22

Page 26: Incorporating linkage disequilibrium blocks in Genome-Wide

Thank you for your attention !

ALIA DEHMAN (Statistique et Génome) July 2nd, 2013 22 / 22