Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated with High-Density Lipoprotein Cholesterol in Two Asian Cohorts Silver M, Chen P, Li R, Cheng C-Y, Wong T-Y, et al. In PLOS Genetics, 2013

Upload: so-yeon-kim

Post on 14-Apr-2017


TRANSCRIPT

Page 1

Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated with High-Density Lipoprotein Cholesterol in Two Asian Cohorts

Silver M, Chen P, Li R, Cheng C-Y, Wong T-Y, et al.

In PLOS Genetics, 2013

Page 2

Introduction
• Genes do not act in isolation, but interact in complex networks or pathways
• Rather than univariate approaches, a joint modelling approach is proposed: a dual-level, sparse regression model
  • it can simultaneously identify pathways and genes, combining pathway selection with gene selection
• Pathways-driven gene selection is applied in a search for pathways and genes associated with variation in high-density lipoprotein cholesterol (HDLC)

Page 3

Sparse group lasso model
• N individuals, P SNPs, (N x P) genotype matrix X, L pathways
• Assumptions
  • All P SNPs may be mapped to L groups or pathways
  • Pathways are disjoint (non-overlapping)
• Sparsity is imposed at two levels: a pathway-level constraint (selecting causal pathways) and a SNP-level constraint (selecting causal SNPs)
• α controls how the sparsity constraint is distributed between the two penalties
• λ controls the degree of sparsity in β
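For reference, the objective described on this slide can be written as below. This is the standard sparse group lasso form, assumed here because the slide's own equation did not survive in the transcript; the pathway weights w_l and the exact placement of α and λ are assumptions chosen to agree with the two annotations above.

```latex
\hat{\beta}^{\mathrm{SGL}}
  = \arg\min_{\beta}
    \left\{
      \tfrac{1}{2}\,\lVert y - X\beta \rVert_2^2
      + (1-\alpha)\,\lambda \sum_{l=1}^{L} w_l\,\lVert \beta_l \rVert_2
      + \alpha\,\lambda\,\lVert \beta \rVert_1
    \right\}
```

In this parametrisation α = 0 gives the group lasso (pathway-level penalty only) and α = 1 gives the ordinary lasso (SNP-level penalty only).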

Page 4

SGL model estimation
• To estimate the coefficient vector β, use a block, or group-wise, coordinate gradient descent (BCGD) algorithm
  • Select a pathway
  • Select SNPs in the selected pathway
• Pathway and SNP partial residuals
  • Regress out the current estimated effects of all other pathways and SNPs
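The two-level loop above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes disjoint pathways, the standard SGL objective with weights defaulting to the square root of pathway size, and it swaps the exact coordinate-wise updates of BCGD for a simpler block proximal-gradient step inside each selected pathway. The function and parameter names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def sgl_block_descent(X, y, groups, lam, alpha, weights=None, n_sweeps=200):
    """Block descent for the sparse group lasso with disjoint groups.

    groups  : list of integer index arrays, one per pathway (assumed disjoint)
    weights : pathway weights; sqrt(group size) is an assumed default
    """
    beta = np.zeros(X.shape[1])
    if weights is None:
        weights = [np.sqrt(len(g)) for g in groups]
    # Step-size constant per block: largest eigenvalue of X_g^T X_g
    lipschitz = [np.linalg.norm(X[:, g], 2) ** 2 for g in groups]

    for _ in range(n_sweeps):
        for g, w, L in zip(groups, weights, lipschitz):
            Xg = X[:, g]
            # Pathway partial residual: remove the effects of all other pathways
            r_g = y - X @ beta + Xg @ beta[g]
            # Pathway-level check: the whole block is zero if it fails the threshold
            score = np.linalg.norm(soft_threshold(Xg.T @ r_g, alpha * lam))
            if score <= (1 - alpha) * lam * w:
                beta[g] = 0.0
                continue
            # SNP-level update inside the selected pathway (proximal-gradient step)
            z = beta[g] + Xg.T @ (r_g - Xg @ beta[g]) / L
            u = soft_threshold(z, alpha * lam / L)
            norm_u = np.linalg.norm(u)
            if norm_u == 0.0:
                beta[g] = 0.0
            else:
                beta[g] = max(0.0, 1.0 - (1 - alpha) * lam * w / (L * norm_u)) * u
    return beta
```

The pathway-level check plays the role of "select a pathway", and the soft-thresholding inside the block plays the role of "select SNPs in the selected pathway".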

Page 5

SGL simulation study 1
• Hypothesis: where causal SNPs are enriched in a given pathway, pathway-driven SNP selection using SGL will outperform simple lasso selection
• Randomly select 5 causal SNPs, either from a single pathway or from all 2500 SNPs (without pathway information)

Page 6

The problem of overlapping pathways
• Genes and SNPs may map to multiple pathways
  • The optimization is no longer separable into groups (pathways)
  • Pathways can no longer be selected independently
• By duplicating SNP predictors, SNPs belonging to more than one pathway can enter the model separately
• SNPs are selected in each pathway whose joint effects pass a pathway selection threshold, irrespective of overlaps between pathways
• Pathways are treated as independent
  • they do not compete in the model estimation process

Partially overlapping causal SNPs
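The predictor-duplication idea can be sketched as below. It is an illustration only; the function name and the return layout are hypothetical. Each pathway receives its own copy of every SNP column it contains, so the expanded design matrix has disjoint groups again.

```python
import numpy as np

def expand_overlapping(X, pathways):
    """Duplicate SNP columns so that every pathway owns its own copy.

    X        : (N x P) genotype matrix
    pathways : list of lists of column indices into X (may overlap)
    Returns the expanded matrix, the disjoint groups over its columns,
    and an array mapping each expanded column back to its original SNP.
    """
    snp_of_column = np.concatenate([np.asarray(p, dtype=int) for p in pathways])
    X_expanded = X[:, snp_of_column]
    groups, start = [], 0
    for p in pathways:
        groups.append(np.arange(start, start + len(p)))
        start += len(p)
    return X_expanded, groups, snp_of_column
```

Keeping `snp_of_column` makes it possible to report selected SNPs by their original index, whichever pathway copy was selected.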

Page 7

The problem of overlapping pathways
• Each pathway is regressed against the phenotype vector y
• Coordinate gradient descent is needed only within each selected pathway (SGL-CGD)
• Under the independence assumption, the estimation of each pathway's coefficients doesn't depend on the other estimates
• Need only record the set of selected SNPs in each selected pathway
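A minimal sketch of pathway selection under the independence assumption follows. It assumes the standard SGL group-level threshold and weights equal to the square root of pathway size; neither is confirmed by the slide, and the names are hypothetical. Each pathway is tested against y itself rather than against a partial residual, so overlapping pathways do not compete.

```python
import numpy as np

def soft_threshold(z, t):
    """Element-wise soft-thresholding operator."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def select_pathways_independently(X, y, pathways, lam, alpha, weights=None):
    """Return the indices of pathways that pass the selection threshold
    when each pathway is regressed against y on its own."""
    selected = []
    for l, p in enumerate(pathways):
        # Assumed default weight: square root of pathway size
        w = np.sqrt(len(p)) if weights is None else weights[l]
        score = np.linalg.norm(soft_threshold(X[:, p].T @ y, alpha * lam))
        if score > (1 - alpha) * lam * w:
            selected.append(l)
    return selected
```

SNP selection would then run only inside the returned pathways, and only the selected SNPs per pathway need to be stored.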

Page 8

SGL simulation study 2

Figure 5. SGL simulation study with overlapping pathways

Table 1. Mean number of pathways and SNPs selected by each model at each effect size, γ, across 2000 MC simulations

• SNPs are mapped to 50 overlapping pathways, each containing 30 SNPs
• Each pathway overlaps any adjacent pathway by 10 SNPs
• The number of selected pathways or SNPs increases with decreasing effect size, as the number of pathways close to the selection threshold set
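The pathway layout described above can be generated with a few lines of Python. The sketch assumes the pathways form a simple chain (pathway l overlaps only pathways l-1 and l+1), which the slide does not state explicitly; the function name is hypothetical.

```python
def chained_pathways(n_pathways=50, size=30, overlap=10):
    """Build pathways as consecutive windows of SNP indices.

    Adjacent windows share `overlap` SNPs; the chain layout is an assumption.
    """
    stride = size - overlap
    return [list(range(l * stride, l * stride + size)) for l in range(n_pathways)]
```

Under this chain assumption the 50 pathways cover 1010 distinct SNP indices.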

Page 9

SGL simulation study 2
• Pathway and SNP selection power and false positive rates (FPR) at MC simulation z
• SGL-CGD consistently outperforms SGL-BCGD, both in terms of pathway selection sensitivity and control of false positives
• SGL-BCGD typically has a higher FPR than SGL-CGD, since more SNPs are selected from non-causal pathways
• SGL-CGD is more often able to select both causal pathways, and to select additional causal SNPs that are missed by SGL-BCGD

Figure 6. SGL-CGD vs SGL-BCGD performance
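Power and FPR for a single simulation can be computed as in the sketch below. The definitions used here (power as the fraction of causal variables selected, FPR as the fraction of non-causal variables selected) are the usual ones and are an assumption about how the slide's quantities are defined; the function name is hypothetical.

```python
def power_and_fpr(selected, causal, n_total):
    """Selection power and false positive rate for one simulation.

    selected : indices of selected variables (pathways or SNPs)
    causal   : indices of the true causal variables
    n_total  : total number of candidate variables
    """
    selected, causal = set(selected), set(causal)
    true_pos = len(selected & causal)
    false_pos = len(selected - causal)
    power = true_pos / len(causal)
    fpr = false_pos / (n_total - len(causal))
    return power, fpr
```

Averaging these two numbers over all MC simulations gives the kind of summary compared between SGL-CGD and SGL-BCGD.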

Page 10

Pathway and SNP selection bias
• Biasing factors
  • pathway size, varying patterns of SNP-SNP correlations, and gene sizes
• An adaptive weight-tuning strategy to reduce selection bias
  • tune the pathway weight vector so that each pathway has an equal chance of being selected

Page 11

Ranking variables
• A resampling strategy
  • calculate pathway, gene and SNP selection frequencies by repeatedly fitting the model over B subsamples of the data, at fixed values for α and λ
  • exploit knowledge of finite sample variability obtained by subsampling, to achieve better estimates of a variable's importance
  • can rank pathways, genes and SNPs in order of their strength of association with the phenotype
• Pathways, SNPs and genes are ranked in order of their selection probabilities
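The resampling strategy can be sketched as follows. The code is an illustration under stated assumptions: subsamples of size N/2 drawn without replacement, and a user-supplied `select` callable (hypothetical) that fits the model on a subsample at fixed α and λ and returns the indices of the selected variables.

```python
import numpy as np

def selection_frequencies(X, y, select, n_subsamples=100, seed=0):
    """Estimate selection frequencies over random subsamples.

    select : callable (X_sub, y_sub) -> iterable of selected column indices;
             a stand-in for the fitted model, not part of the original slides
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_subsamples):
        # Subsample half of the individuals without replacement (assumed size)
        idx = rng.choice(n, size=n // 2, replace=False)
        chosen = np.array(sorted(set(select(X[idx], y[idx]))), dtype=int)
        counts[chosen] += 1
    return counts / n_subsamples

def rank_by_frequency(freq):
    """Indices ordered from most to least frequently selected."""
    return np.argsort(-np.asarray(freq), kind="stable")
```

The same counting applies to pathways or genes if `select` returns pathway or gene indices instead of SNP indices.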

Page 12

Simulation study 3
• Evaluate ranking strategies
• Use real genotype and pathways data
  • genome-wide SNP dataset ‘SP2’
  • KEGG pathways database
• SNP ranking
  • TP: selected SNPs that tag at least one causal SNP
  • FP: selected SNPs that do not tag any causal SNP
• Gene ranking
  • TP: selected causal genes (genes that map to a true causal SNP)
  • FP: selected non-causal genes
• Compared with SNP and gene rankings using a univariate, regression-based quantitative trait test (QTT)

K: the number of causal SNPs
GV, TV: proportion of trait variance

Page 13

Simulation study 3

TPR: the proportion of subsamples in which the correct causal pathway is selected

Figure 7. A–F: SNP and gene ranking performance for the six different scenarios

Page 14

Pathway mapping
• Genes are mapped to pathways using information on gene-gene interactions.
• Many SNPs and genes do not map to any known pathway.
• Genes and SNPs may map to more than one pathway.
• Many SNPs cannot be mapped to a pathway since they do not map to a mapped gene.

Mapping workflow (from the slide diagram)
• Available SNPs: 492,639 SNPs (SP2); 515,503 SNPs (SiMES)
• Genes: GRCH36/hg18, 21,004 genes
• SNP to gene mapping: SNPs mapped to genes within 10 kbp
• Pathways: KEGG, 185 pathways containing 5,267 distinct genes
• SNP to pathway mapping: mapped to and 185 pathways
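The two mapping steps can be sketched in Python as below. The data structures, function names and toy identifiers are hypothetical; only the 10 kbp window comes from the slide. A SNP with no gene within the window, or whose genes belong to no pathway, is simply left unmapped.

```python
def map_snps_to_genes(snp_pos, genes, window=10_000):
    """Map each SNP to the genes lying within `window` base pairs.

    snp_pos : dict  snp id  -> (chromosome, position)
    genes   : dict  gene id -> (chromosome, start, end)
    """
    mapping = {}
    for snp, (chrom, pos) in snp_pos.items():
        hits = {g for g, (g_chrom, start, end) in genes.items()
                if g_chrom == chrom and start - window <= pos <= end + window}
        if hits:
            mapping[snp] = hits
    return mapping

def map_snps_to_pathways(snp_to_genes, pathway_genes):
    """Map each SNP to every pathway containing at least one of its genes.

    pathway_genes : dict pathway id -> iterable of gene ids
    """
    mapping = {}
    for snp, gene_set in snp_to_genes.items():
        hits = {pw for pw, members in pathway_genes.items() if gene_set & set(members)}
        if hits:
            mapping[snp] = hits
    return mapping
```

Because a gene can sit in several pathways, a single SNP may end up in more than one pathway, which is the overlap problem discussed earlier in the slides.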

Page 15

Results
• Pathways-driven SNP selection is run on the SP2 and SiMES datasets separately using SGL
• This is combined with the subsampling procedure to highlight pathways and genes associated with variation in HDLC
• Results from both datasets are compared

Page 16

Pathway and SNP selection results
• Compare the resulting pathway and SNP selection frequency distributions with null distributions
• Selecting a greater number of SNPs increases the number of selected pathways
• The number of selected SNPs may affect the resulting pathway and SNP rankings
• Optimal α = ?

Table 5. Separate combinations of regularisation parameters, λ and α, used for analysis of the SP2 dataset.

Pathway-level constraint, SNP-level constraint

Page 17

Pathway and SNP selection results

Figure 11. Empirical and null pathway selection frequency distributions for all 185 KEGG pathways with the SP2 dataset

Figure 12. Empirical and null SNP selection frequency distributions with the SP2 dataset

Figure 14. Empirical and null pathway (top) and SNP (bottom) selection frequency distributions for the SiMES dataset

𝛼=0.85

𝛼=0.95

clearer separation of empirical and null distributions

Biased empirical pathway and SNP selection frequency distributions

𝛼=0.95

Page 18

Pathway and SNP selection results

Figure 13. SP2 dataset: scatter plots comparing empirical and null selection frequencies presented in Figures 11 and 12

Figure 15. SiMES dataset: scatter plots comparing empirical and null pathway (left) and SNP (right) selection frequencies presented in Figure 14

Page 19

Pathway and SNP selection results
• Increased correlation between empirical and null selection frequency distributions at the lower α indicates increased bias in the empirical results
  • The selection of too many SNPs will add noise and bias

Table 6. SP2 dataset: Pearson correlation coefficients (r) and p-values for the data plotted in Figure 13

Table 9. SiMES dataset: Pearson correlation coefficients (r) and p-values for the data plotted in Figure 15.
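The correlation check between empirical and null selection frequencies can be sketched as below. The function name is hypothetical, and the sketch computes only the Pearson coefficient r; the p-values reported in the tables are not reproduced here.

```python
import numpy as np

def pearson_r(empirical, null):
    """Pearson correlation between two equal-length frequency vectors."""
    e = np.asarray(empirical, dtype=float)
    n = np.asarray(null, dtype=float)
    e = e - e.mean()
    n = n - n.mean()
    return float(e @ n / np.sqrt((e @ e) * (n @ n)))
```

A value of r close to 1 would mean the empirical ranking largely mirrors what is selected under the null, which is the kind of bias the slide warns about.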

Page 20

Top 30 pathways and genes

Table 7. SP2 dataset: Top 30 pathways, ranked by pathway selection frequency.

Table 8. SP2 and SiMES datasets: Top 30 genes ranked by gene selection frequency.

Page 21

Top 30 pathways

Table 10. SiMES dataset: Top 30 pathways, ranked by pathway selection frequency.

Page 22

Comparison of ranked pathway and gene lists
• Pathway rankings

Figure 16. Comparison of top-k SP2 and SiMES pathway rankings. Normalized Canberra distance (left), FDR q-values (right)

Table 11. Consensus set of important pathways for SP2 and SiMES datasets with k = 25.

closest agreement when k = 25

Page 23

Comparison of ranked pathway and gene lists
• Gene rankings

Figure 17. Comparison of top-k SP2 and SiMES gene rankings, for k = 1,…,500. Normalized Canberra distance (left), FDR q-values (right)

Table 13. Top 30 consensus genes ordered by their average rank

closest agreement when k = 244
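Ordering consensus genes by average rank can be sketched as follows. The sketch assumes the consensus set is the intersection of the two top-k lists, which is an interpretation of the slide rather than a confirmed definition; the function name and the gene labels in the usage note are hypothetical.

```python
def consensus_by_average_rank(ranking_a, ranking_b, k):
    """Genes present in both top-k lists, ordered by their average rank.

    ranking_a, ranking_b : lists of gene ids, best first
    Ties in average rank are broken alphabetically for reproducibility.
    """
    top_a = {g: r for r, g in enumerate(ranking_a[:k], start=1)}
    top_b = {g: r for r, g in enumerate(ranking_b[:k], start=1)}
    shared = top_a.keys() & top_b.keys()
    return sorted(shared, key=lambda g: ((top_a[g] + top_b[g]) / 2, g))
```

For example, with rankings `["G1", "G2", "G3", "G4"]` and `["G3", "G1", "G5", "G2"]` and k = 3, the shared genes are G1 (average rank 1.5) and G3 (average rank 2.0), in that order.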

Page 24

Discussion
• A method for the detection of pathways and genes associated with a quantitative trait
  • uses a sparse regression model, the sparse group lasso, that enforces sparsity at the pathway and SNP level
  • identifies important pathways and also maximizes the power to detect causal SNPs
• Simulation studies
  • SGL has greater SNP selection power than lasso
  • a modified SGL-CGD estimation algorithm that treats pathways as independent may offer greater sensitivity for the detection of causal SNPs and pathways
  • this is combined with a weight-tuning algorithm to reduce selection bias
  • a resampling technique is designed to provide a robust measure of variable importance

Page 25

Thank you

Q & A