
Page 1: Interaction-based Learning in Genomics

Interaction-based Learning in Genomics

Shaw-Hwa Lo, Tian Zheng & Herman Chernoff Columbia University Harvard University

Page 2: Interaction-based Learning in Genomics

• Other collaborators: Iuliana Ionita-Laza, Inchi Hu, Hongyu Zhao, Hui Wang, Xin Yan, Yuejing Ding, Chien-Hsun Huang, Bo Qiao, Ying Liu, Michael Agne, Ruixue Fan, Maggie Wang, Lei Cong, Hugh Arnolds, Jun Xie, Adeline Lo

Page 3: Interaction-based Learning in Genomics

Partition-Retention

We have n observations on a dependent variable Y and many discrete-valued explanatory variables X1, X2, . . . , XS. We wish:

1) to identify those explanatory variables which influence Y;

2) to predict Y, based on the findings of 1).

We assume that Y is influenced by a number of small groups of interacting variables (group sizes ~1 to 8, depending on sample sizes and effect sizes).

Page 4: Interaction-based Learning in Genomics

Marginal Effects: Causal and Observable

1. If Xi has an effect on Y, we expect Y to be correlated with Xi or some function of Xi. In that case Xi has a causal and observable marginal effect.

2. A variable Xi unrelated to (independent of) Y should be uncorrelated with Y except for random variation. But if S (the number of variables) is large and n moderate, some of the explanatory variables not influencing Y may have a substantial correlation with Y (a marginal observable effect). They are impostors.

3. Groups of important interacting influential variables may or may not have marginal observable effects (MOE). Therefore, methods that rely on the presence of strong observable marginal effects are unlikely to succeed when MOE are weak.

Page 5: Interaction-based Learning in Genomics

Ex. 1. X1 and X2 are independent with P(Xi = 1) = P(Xi = −1) = 1/2, and Y = X1X2. Then E(Y | X1) = E(Y | X2) = 0: Y is uncorrelated with X1 and X2, although the pair determines Y.

Ex. 2. Y = X1X2, with P(Xi = 1) = 3/4 and P(Xi = −1) = 1/4. Here Y is correlated with X1 and X2, and the sample will clearly show marginal observable effects (detectable by a t-test).

That is, the interaction of both X1 and X2 is needed to have an influence on Y. Conclusion:

To detect interacting influential variables, it is desirable and sometimes necessary to consider interactive effects. Impostors may present observable marginal effects when S is large and n is moderate.
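A minimal simulation sketch (ours, not from the slides) makes the contrast concrete: under Ex. 1 the sample correlation of Y with each Xi hovers near zero even though the pair determines Y exactly, while under Ex. 2 the marginal effect is plainly visible.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Ex. 1: P(Xi = 1) = P(Xi = -1) = 1/2, Y = X1 * X2.
x1 = rng.choice([-1, 1], size=n)
x2 = rng.choice([-1, 1], size=n)
y = x1 * x2
print(np.corrcoef(y, x1)[0, 1], np.corrcoef(y, x2)[0, 1])  # both ~ 0

# Ex. 2: P(Xi = 1) = 3/4, P(Xi = -1) = 1/4, Y = X1 * X2.
x1 = rng.choice([-1, 1], size=n, p=[0.25, 0.75])
x2 = rng.choice([-1, 1], size=n, p=[0.25, 0.75])
y = x1 * x2
print(np.corrcoef(y, x1)[0, 1])  # ~ 0.45: a visible marginal effect
```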

Page 6: Interaction-based Learning in Genomics

An ideal analytical tool should have the ability to:

• 1. handle an extremely large number of variables and their higher-order interactions.

• 2. detect "module effects": the phenomenon where a module C (a cluster of variables) holds predictive power but becomes useless for prediction if any variable is removed from it.

• 3. identify interaction effects: the effect of one variable on a response depends on the values of other variables.

• 4. detect and utilize nonlinear and non-additive effects.

Page 7: Interaction-based Learning in Genomics

A score with four features

• We need a sensible score that can be used to measure the influence of a group of variables.

• The score should measure on the same scale in different dimensions, so that an algorithm can remove noisy and non-informative variables while the dimension is altered.

• Given a cluster of variables, one can use the score to test the significance of its influence.

• A cluster with a high score (influential) automatically possesses predictive ability.

Page 8: Interaction-based Learning in Genomics

A Special Case of Influential Measure: Genotype-Trait Distortion

• In a case-control study, let $n_{i,d}$ and $n_{i,u}$ be the counts of cases and controls in genotype cell $i$ (partition element), and let $n_d$ and $n_u$ be the total numbers of cases and controls under study. A SNP has 3 genotypes (aa, ab, bb), so $k$ SNPs partition the sample into $3^k$ cells, and the score takes the form

$$\mathrm{GTD} = \sum_{i=1}^{3^{k}} \left( \frac{n_{i,d}}{n_d} - \frac{n_{i,u}}{n_u} \right)^{2}.$$
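A small sketch of how such a score could be computed from raw genotype data, following the form above; the function name and the 0/1/2 genotype coding are our own choices, not the authors' implementation.

```python
import numpy as np
from itertools import product

def gtd_score(cases, controls):
    """GTD score for k SNPs: sum over the 3^k genotype cells of the
    squared difference between case and control cell frequencies.
    Rows are subjects; columns are SNPs coded 0/1/2 for aa/ab/bb."""
    k = cases.shape[1]
    score = 0.0
    for cell in product(range(3), repeat=k):    # all 3^k genotype cells
        f_d = np.all(cases == cell, axis=1).mean()      # case frequency
        f_u = np.all(controls == cell, axis=1).mean()   # control frequency
        score += (f_d - f_u) ** 2
    return score

# Toy usage: 100 cases and 100 controls on k = 2 SNPs (no real signal).
rng = np.random.default_rng(1)
print(gtd_score(rng.integers(0, 3, (100, 2)), rng.integers(0, 3, (100, 2))))
```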

Page 9: Interaction-based Learning in Genomics

A general form

• Let Y be the disease status (1 for cases and 0 for controls). Then, for a genotype partition Π with cells j = 1, . . . , m, the score just discussed can be naturally defined as (following Chernoff, Lo and Zheng, 2009)

$$I_{\Pi} = \frac{\sum_{j \in \Pi} n_j^{2}\,(\bar{Y}_j - \bar{Y})^{2}}{\sum_{i=1}^{n} (Y_i - \bar{Y})^{2}},$$

where $n_j$ is the number of observations in cell $j$ and $\bar{Y}_j$ is the mean of Y within that cell.

Page 10: Interaction-based Learning in Genomics

Theorem: Under the null hypothesis that none of the variables has an influence, the null distribution of the score, when normalized, is asymptotically a weighted sum of independent chi-square variables.

This applies to both the null random and the null specified models, under the standard conditions for the applicability of the CLT. Case-control studies are a special case of specified models (Chernoff, Lo and Zheng, 2009).
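Since the weights in the chi-square mixture depend on the design, in practice the null distribution is often estimated by permutation, as is done for the p-values reported later in this deck. A hedged sketch: score_fn stands for any case-control influence score, such as the gtd_score sketch above.

```python
import numpy as np

def permutation_null(score_fn, genotypes, is_case, n_perm=1000, seed=0):
    """Estimate the null distribution of a case-control score by
    shuffling the labels, which destroys any genotype-trait link."""
    rng = np.random.default_rng(seed)
    labels = is_case.copy()          # boolean case/control indicator
    null = np.empty(n_perm)
    for b in range(n_perm):
        rng.shuffle(labels)
        null[b] = score_fn(genotypes[labels], genotypes[~labels])
    return null

# p-value: fraction of permuted scores at least as large as the observed
#   observed = score_fn(genotypes[is_case], genotypes[~is_case])
#   p = (permutation_null(score_fn, genotypes, is_case) >= observed).mean()
```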

Page 11: Interaction-based Learning in Genomics

Example

Page 12: Interaction-based Learning in Genomics

General Setting

• The main idea applies much more generally than to special genetic problems. A more general version is proposed to deal with the problem of detecting which of many potentially influential variables Xs have an effect on a dependent variable Y, using a sample of n observations on Z = (X, Y), where X = (X1, X2, . . . , XS).

• In the background is the assumption that Y may be slightly or negligibly influenced by each of a few variables Xs, but may be profoundly influenced by the confluence of appropriate values within one or a few small groups of these variables. At this stage the object is not to measure the overall effects of the influential variables, but to discover them efficiently.

Page 13: Interaction-based Learning in Genomics

Example

• We introduce the partition retention approach and related terminology and issues by considering a small artificial example.

• Suppose that an observed variable Y is normally distributed with mean X1X2 and variance 1, where X1 and X2 are two of S = 6 observed and potentially influential variables which can take on the values 0 and 1. Given the data on Y and X = (X1, . . . ,X6), for n = 200 subjects, the statistician, who does not know this model, desires to infer which of the six explanatory variables are causally related to Y. In our computation the Xi are selected independently to be 1 with probabilities 0.7, 0.7, 0.5, 0.5, 0.5, 0.5.
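The following sketch (our own illustration, not the authors' code) generates data from this toy model and computes the I-score of Page 9 for every pair of variables; the causal pair {X1, X2} should stand out.

```python
import numpy as np
from itertools import combinations

def i_score(y, x_cols):
    """I-score of a subset (form stated on Page 9): partition the sample
    by the joint values of the subset, sum n_j^2 (Ybar_j - Ybar)^2, and
    normalize by the total sum of squares of Y."""
    ybar = y.mean()
    num = 0.0
    for key in np.unique(x_cols, axis=0):       # each observed cell
        idx = np.all(x_cols == key, axis=1)
        num += idx.sum() ** 2 * (y[idx].mean() - ybar) ** 2
    return num / ((y - ybar) ** 2).sum()

rng = np.random.default_rng(2)
n, S = 200, 6
probs = [0.7, 0.7, 0.5, 0.5, 0.5, 0.5]
X = np.column_stack([rng.binomial(1, p, size=n) for p in probs])
Y = rng.normal(X[:, 0] * X[:, 1], 1.0)          # Y ~ N(X1 * X2, 1)

for pair in combinations(range(S), 2):
    print(pair, round(i_score(Y, X[:, list(pair)]), 3))
# The causal pair (0, 1), i.e. {X1, X2}, should give the largest I-score.
```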

Pages 14–19: Example (figures).

Page 20: Interaction-based Learning in Genomics

• A capable analytical tool should have the ability to surmount the following difficulties:

• (a) handle an extremely large number of variables (SNPs and other variables in the hundreds of thousands or millions) in the data.

• (b) detect the so-called "module effect", which refers to the phenomenon where removing one variable from the current module renders the module useless for prediction.

• (c) identify interaction (often higher-order) effects: the effect of one variable on a response variable depends on the values of other variables in the same module.

• (d) extract and utilize nonlinear effects (or non-additive effects).

Page 21: Interaction-based Learning in Genomics

• Let the response variable Y and the explanatory variables X (30 Xs, all independent) all be binary, taking values 0 or 1 with 50% chance each. We independently generate 200 observations, and Y is related to X through one of two causal variable modules.

The task is to predict Y based on the information in X.

We use 150 observations as the training set and 50 as the test set. This example has a 25% theoretical lower bound on the prediction error rate, since we do not know which of the two causal variable modules generates the response Y.
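The exact model equation did not survive this transcript, so the sketch below assumes a plausible form consistent with the stated 25% lower bound: each subject's response is generated by one of two XOR modules, chosen at random and unobserved.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 30
X = rng.integers(0, 2, size=(n, p))

# Assumed model (hypothetical; the slide's equation was an image):
# each subject's Y comes from one of two causal modules, chosen at
# random: module A gives Y = X1 XOR X2, module B gives Y = X3 XOR X4.
from_a = rng.random(n) < 0.5
Y = np.where(from_a, X[:, 0] ^ X[:, 1], X[:, 2] ^ X[:, 3])

# Even an oracle knowing both modules (but not which one fired) must
# guess whenever they disagree, which happens half the time: 25% error.
oracle = X[:, 0] ^ X[:, 1]                        # bet on module A
print("oracle-style error:", (oracle != Y).mean())  # close to 0.25
```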

Page 22: Interaction-based Learning in Genomics

Method        LDA        SVM        RF         LogicFS    LL         LASSO      Elastic net  Proposed
Train error   .14 ± .03  .00 ± .00  .00 ± .00  .13 ± .02  .23 ± .05  .27 ± .06  .27 ± .06    .21 ± .01
Test error    .47 ± .02  .50 ± .01  .44 ± .04  .34 ± .04  .45 ± .03  .48 ± .04  .48 ± .04    .24 ± .03

Table 1. Classification error rates for the toy example.

Page 23: Interaction-based Learning in Genomics

Diagrams of the conventional approach and the variable-module enabled approach.

Conventional approach: screening based on significance tests; modeling; significance evaluation; functional relevance.

Proposed variable-module enabled approach: Stage 1, variable module discovery (interaction-based, systems-oriented); Stage 2, dissection: predictive modeling and functional relevance.

Page 24: Interaction-based Learning in Genomics

Basic tool: the Backward Dropping Algorithm (BDA). BDA is a "greedy" algorithm that seeks the variable subset that maximizes the I-score through stepwise elimination of variables from an initial subset (k variables) sampled from the variable space (p variables), with k ≪ p.

Page 25: Interaction-based Learning in Genomics

• Training set: Consider a training set of n observations (Y, X), where X = (X1, . . . , Xp) is a p-dimensional vector of discrete variables. Typically p is very large (thousands).

• Sampling randomly from variable space: Select an initial subset of k explanatory variables, k ≪ p.

• Compute the I-score based on the k variables.

• Drop variables: Tentatively drop each variable and recalculate the I-score with one variable less. Then permanently drop the variable that results in the highest I-score when tentatively dropped.

• Return set: Continue the next round of dropping until only one variable is left. Keep the subset that yields the highest I-score over the whole dropping process. Refer to this subset as the return set.
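A compact sketch of BDA as just described (our own rendering, with a minimal I-score in the Page 9 form; run on unstructured random data purely to exercise the algorithm):

```python
import numpy as np

def i_score(y, x_cols):
    """I-score (Page 9 form), as in the earlier sketch."""
    ybar = y.mean()
    num = 0.0
    for key in np.unique(x_cols, axis=0):
        idx = np.all(x_cols == key, axis=1)
        num += idx.sum() ** 2 * (y[idx].mean() - ybar) ** 2
    return num / ((y - ybar) ** 2).sum()

def backward_dropping(y, X, init_vars):
    """BDA: from an initial subset, repeatedly drop the variable whose
    removal yields the highest I-score; return the best subset seen."""
    current = list(init_vars)
    best_set, best = list(current), i_score(y, X[:, current])
    while len(current) > 1:
        # Tentatively drop each variable; keep the most favorable drop.
        trials = [(i_score(y, X[:, [v for v in current if v != d]]), d)
                  for d in current]
        top, drop = max(trials)
        current.remove(drop)
        if top > best:
            best_set, best = list(current), top
    return best_set, best

# Demo: sample k of p variables at random, then run one BDA pass.
rng = np.random.default_rng(4)
X = rng.integers(0, 2, size=(200, 30))
y = rng.integers(0, 2, size=200).astype(float)
init = rng.choice(X.shape[1], size=5, replace=False)
print(backward_dropping(y, X, init))
```

In practice the sampling-plus-BDA pass is repeated over many random initial subsets, and the high-scoring return sets are retained as candidate modules.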

Page 26: Interaction-based Learning in Genomics

Figure 5. Change of I-Score

Page 27: Interaction-based Learning in Genomics

Structural diagram of the proposed methodology.

Discovery stage: Step 1, interaction-based variable screening (optional for studies with small numbers of genes); Step 2, variable module generation.

Dissection stage: predictive classifier dissection; functional relevance dissection.

Proposed research: rare variants (rG), continuous variables, and environmental variables (GxE, rare GxE); predictive modeling and classifier integration; network enrichment analysis of discovered modules.

Supporting infrastructure: automated analysis workflows (with case studies); core code in C/Java with R functions; database tools and R routines for data processing.

Page 28: Interaction-based Learning in Genomics

Classification based on van't Veer's data (2002).

The purpose of the original study was to predict breast cancer relapse using gene expression data. The original data contain the expression levels of 24,187 genes for 97 patients: 46 relapse (distant metastasis < 5 years) and 51 non-relapse (no distant metastasis ≥ 5 years). We used the 4,918 genes retained by Tibshirani and Efron (2002) for the classification task. 78 of the 97 cases were used as the training set (34 relapse and 44 non-relapse) and 19 (12 relapse and 7 non-relapse) as the test set. Applying the procedures described in the Discovery stage, we identified 18 influential modules with sizes ranging from 2 to 6. The best error rate (biased or not) reported on this particular test set in the literature is around 10% (2 errors); the proposed method yields a zero error rate (no errors) on the test set.

Page 29: Interaction-based Learning in Genomics

The CV error rates on the van't Veer data are typically around 30%. We ran the CV experiment by randomly partitioning the 97 patients into a training sample of size 87 and a test sample of size 10, repeating the experiment ten times. The proposed method yields an average error rate of 8% over the 10 randomly selected CV test samples, roughly a 73% reduction in error rate ((30% − 8%) / 30% ≈ 73%) compared with existing methods.

Page 30: Interaction-based Learning in Genomics

• In a case-control design with n cases and n controls, the score, suitably normalized, converges to a limit determined by the class probabilities (two classes, case vs. control). This expression is directly related to the correct prediction rate corresponding to the partition Π. Thus searching for clusters with a larger I-score has the automatic effect of seeking clusters with stronger predictive ability, a very desirable property.

Page 31: Interaction-based Learning in Genomics

Example Using Breast Cancer Data

• Case-control sporadic breast cancer data from the NCI Cancer Genetic Markers of Susceptibility (CGEMS) project

• 2,287 postmenopausal women: 1,145 cases and 1,142 controls

• 18 genes with 304 SNPs selected from the literature:

Gene      Locus        SNPs      Gene      Locus        SNPs
CASP8     2q33-q34     12        SLC22A18  11p15.5      16
TP53      17p13.1      6         BARD1     2q34-q35     27
ATM       11q22-q23    12        BRCA1     17q21        13
PIK3CA    3q26.3       8         KRAS2     12p12.1      28
PHB       17q21        10        ESR1      6q25         78
BRCA2     13q12.3      31        BRIP1     17q22-q24    19
RB1CC1    8q11         9         RAD51     15q15.1      4
PPM1D     17q23.2      2         TSG101    11p15        11
PALB2     16p12.1      7         CHEK2     22q12.1      11

Page 32: Interaction-based Learning in Genomics

P-values of the observed marginal effects, with the null distribution estimated by permutation.

Page 33: Interaction-based Learning in Genomics

Results

      Mean-Ratio Method                          Quantile-Ratio Method
Rank  Gene Pair           Curve p  Rank p        Gene Pair           Curve p  Rank p
1     ESR1 – BRCA1        0.017    ≤ 0.001       ESR1 – BRCA1        0.013    0.001
2     BRCA1 – PHB         0.026    0.040         BRCA1 – PHB         0.029    0.073
3     KRAS2 – BRCA1       0.002    0.006         KRAS2 – BRCA1       0.002    0.004
4     SLC22A18 – BRCA1    0.032    0.072         SLC22A18 – BRCA1    0.019    0.079
5     RAD51 – BRCA1       0.052    0.090         RAD51 – BRCA1       0.005    0.032
6     RB1CC1 – SLC22A18   0.024    0.026         ESR1 – SLC22A18     0.033    0.016
7     CASP8 – KRAS2       0.043    0.038         RB1CC1 – SLC22A18   0.009    0.008
8     CASP8 – SLC22A18    0.042    0.048         CASP8 – KRAS2       0.038    0.036
9     PIK3CA – BRCA1      0.030    0.048         CASP8 – SLC22A18    0.021    0.012
10    PIK3CA – ESR1       0.047    0.032         PIK3CA – BRCA1      0.014    0.049
11    PIK3CA – RB1CC1     0.047    0.051         PIK3CA – ESR1       0.021    0.005
12    PIK3CA – SLC22A18   0.025    0.036         PIK3CA – RB1CC1     0.044    0.053
13    BRCA1 – CHEK2       0.016    0.031         CASP8 – PIK3CA      0.007    0.009
14    BARD1 – BRCA1       0.032    0.057         BRCA1 – CHEK2       0.007    0.022
15    BARD1 – ESR1        0.044    0.025         BARD1 – BRCA1       0.003    0.015
16    BARD1 – TP53        0.019    0.019         BARD1 – ESR1        0.017    0.003
17                                               BARD1 – TP53        0.015    0.010
18                                               BARD1 – SLC22A18    0.056    0.063
–     CASP8 – ESR1        0.071    0.048         CASP8 – ESR1        0.066    0.031
–     BARD1 – KRAS2       0.055    0.036
–     ESR1 – KRAS2        0.145    ≤ 0.001       ESR1 – KRAS2        0.103    ≤ 0.001
–     ESR1 – PPM1D        0.252    0.021         ESR1 – PPM1D        0.348    ≤ 0.001

Page 34: Interaction-based Learning in Genomics

Two-way Interaction Networks

Pair-wise network based on 16 pairs of genes identified by the Mean-ratio method.

Pair-wise network based on 18 pairs of genes identified by the Quantile-ratio method.

Page 35: Interaction-based Learning in Genomics

Three-way Interaction Networks

3-way interaction network based on 10 genes identified by the Mean-ratio method.

3-way interaction network based on 8 genes identified by the Quantile-ratio method.

Page 36: Interaction-based Learning in Genomics

Pairwise Interaction

(M, R)-plane: observed data and permutation quantiles. Pairs: 1 ESR1–BRCA1, 2 BRCA1–PHB, 3 KRAS2–BRCA1, 4 SLC22A18–BRCA1, 5 RAD51–BRCA1, 6 RB1CC1–SLC22A18, 7 CASP8–KRAS2, 8 CASP8–SLC22A18, 9 PIK3CA–BRCA1, 10 PIK3CA–ESR1, 11 PIK3CA–RB1CC1, 12 PIK3CA–SLC22A18, 13 BRCA1–CHEK2, 14 BARD1–BRCA1, 15 BARD1–ESR1, 16 BARD1–TP53.

(M, Q)-plane: observed data and permutation quantiles. Pairs: 1 ESR1–BRCA1, 2 BRCA1–PHB, 3 KRAS2–BRCA1, 4 SLC22A18–BRCA1, 5 RAD51–BRCA1, 6 ESR1–SLC22A18, 7 RB1CC1–SLC22A18, 8 CASP8–KRAS2, 9 CASP8–SLC22A18, 10 PIK3CA–BRCA1, 11 PIK3CA–ESR1, 12 PIK3CA–RB1CC1, 13 CASP8–PIK3CA, 14 BRCA1–CHEK2, 15 BARD1–BRCA1, 16 BARD1–ESR1, 17 BARD1–TP53, 18 BARD1–SLC22A18.

Page 37: Interaction-based Learning in Genomics

Remarks

• One limitation of marginal approaches is that only a fraction of the information in the data is used.

• The proposed approach is intended to draw on more of the relevant information, improving prediction.

• Additional scientific findings are likely if data already collected are suitably reanalyzed.

• The proposed approach is particularly useful when a large number of dense markers becomes available.

• Information about gene-gene interactions and their disease networks can be derived and constructed.

Page 38: Interaction-based Learning in Genomics

Collaborators

• Herman Chernoff, Tian Zheng, Iuliana Ionita-Laza, Inchi Hu, Hongyu Zhao, Hui Wang, Xin Yan, Yuejing Ding, Chien-Hsun Huang, Bo Qiao, Ying Liu, Michael Agne, Ruixue Fan, Maggie Wang, Lei Cong, Hugh Arnolds, Jun Xie, Kjell Doksum

Page 39: Interaction-based Learning in Genomics

Key References

• Lo SH, Zheng T (2002) Backward haplotype transmission association (BHTA) algorithm: a fast multiple-marker screening method. Human Heredity 53(4): 197–215.

• Lo SH, Zheng T (2004) A demonstration and findings of a statistical approach through reanalysis of inflammatory bowel disease data. PNAS 101(28): 10386–10391.

• Lo SH, Chernoff H, Cong L, Ding Y, Zheng T (2008) Discovering interactions among BRCA1 and other candidate genes involved in sporadic breast cancer. PNAS 105: 12387–12392.

• Chernoff H, Lo SH, Zheng T (2009) Discovering influential variables: a method of partitions. Annals of Applied Statistics 3(4): 1335–1369.

• Wang H, Lo SH, Zheng T, Hu I (2012) Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics 28(21): 2834–2842.