Interaction-based Learning in Genomics
Shaw-Hwa Lo & Tian Zheng (Columbia University), Herman Chernoff (Harvard University)
• Other Collaborators: Iulian Ionita-Laza, Inchi Hu, Hongyu Zhao, Hui Wang, Xin Yan, Yuejing Ding, Chien-Hsun Huang, Bo Qiao, Ying Liu, Michael Agne, Ruixue Fan, Maggie Wang, Lei Cong, Hugh Arnolds, Jun Xie, Adeline Lo
Partition-Retention
We have n observations on a dependent variable Y and many discrete-valued explanatory variables X1, X2, . . . , XS. We wish:
1) to identify those explanatory variables which influence Y;
2) to predict Y based on the findings of 1).
We assume that Y is influenced by a number of small groups of interacting variables (group sizes ~ 1 to 8, depending on sample sizes and effects).
Marginal Effects: Causal and Observable
1. If Xi has an effect on Y, we expect Y to be correlated with Xi or some function of Xi. In that case Xi has a causal and observable marginal effect.
2. A variable Xi unrelated to (independent of) Y should be uncorrelated with Y except for random variation. But if S (the number of variables) is large and n moderate, some of the explanatory variables not influencing Y may have a substantial correlation (i.e., marginal observable effects) with Y. They are impostors.
3. A group of important interacting influential variables may or may not have marginal observable effects (MOE). Therefore, methods relying on the presence of strong observable marginal effects are unlikely to succeed when MOE are weak.
Ex. 1. X1 and X2 are independent with P(Xi = 1) = P(Xi = −1) = 1/2, and Y = X1X2. Then E(Y|X1) = E(Y|X2) = 0: Y is uncorrelated with X1 and with X2, although the pair determines Y.
Ex. 2. Y = X1X2 with P(Xi = 1) = 3/4 and P(Xi = −1) = 1/4. Here Y is correlated with X1 and X2, and the sample will clearly show marginal observable effects (detectable by a t-test).
In Ex. 1, the interaction of X1 and X2 is needed to influence Y, yet neither variable alone shows a marginal effect.
Conclusion: To detect interacting influential variables, it is desirable and sometimes necessary to consider interactive effects. Impostors may present observable marginal effects when S is large and n is moderate.
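These two examples are easy to check numerically. The sketch below (illustrative code, not from the slides) simulates both settings and estimates corr(X1, Y); the sample correlation is near zero when p = 1/2 (Ex. 1) and near 0.45 when p = 3/4 (Ex. 2, by a direct moment calculation):

```python
import random

random.seed(0)
n = 100_000

def draw(p):
    # X is +1 with probability p, otherwise -1
    return 1 if random.random() < p else -1

def corr(xs, ys):
    # plain sample Pearson correlation
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / m
    vx = sum((a - mx) ** 2 for a in xs) / m
    vy = sum((b - my) ** 2 for b in ys) / m
    return cov / (vx * vy) ** 0.5

results = {}
for p in (0.5, 0.75):  # Ex. 1 uses p = 1/2, Ex. 2 uses p = 3/4
    x1 = [draw(p) for _ in range(n)]
    x2 = [draw(p) for _ in range(n)]
    y = [a * b for a, b in zip(x1, x2)]  # Y = X1 * X2
    results[p] = corr(x1, y)

print(results)
```

For p = 3/4 the theoretical correlation is cov/sd = 0.375/sqrt(0.75 × 0.9375) ≈ 0.447, so a marginal test easily detects X1; for p = 1/2 it is exactly 0.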
An ideal analytical tool should be able to:
• 1. handle an extremely large number of variables and their higher-order interactions;
• 2. detect "module effects": the phenomenon where a module C (a cluster of variables) holds predictive power but becomes useless for prediction if any one variable is removed;
• 3. identify interaction effects: the effect of one variable on the response depends on the values of other variables;
• 4. detect and utilize nonlinear and non-additive effects.
A score with four features
• We need a sensible score that can measure the influence of a group of variables.
• The score should measure on the same scale across different dimensions, so that an algorithm can remove noisy and non-informative variables while the dimension changes.
• Given a cluster of variables, the score can be used to test the significance of its influence.
• A cluster with a high score (influential) automatically possesses predictive ability.
A Special Case of Influential Measure: Genotype-Trait Distortion
• In case-control studies, a group of k SNPs partitions the sample into its 3^k genotype combinations (a single SNP has 3 genotypes: aa, ab, bb), and the genotype-trait distortion score is

GTD = \sum_{i=1}^{3^k} \left( \frac{n_{i,d}}{n_d} - \frac{n_{i,u}}{n_u} \right)^2,

where n_{i,d} and n_{i,u} are the counts of cases and controls in genotype cell (partition element) i, and n_d and n_u are the total numbers of cases and controls under study.
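A minimal sketch of the GTD computation (the counts below are hypothetical, chosen only to exercise the formula):

```python
def gtd_score(case_counts, control_counts):
    """Genotype-trait distortion over a genotype partition.

    case_counts[i] and control_counts[i] are the numbers of cases and
    controls in genotype cell i; the totals n_d, n_u are their sums.
    """
    n_d, n_u = sum(case_counts), sum(control_counts)
    return sum((d / n_d - u / n_u) ** 2
               for d, u in zip(case_counts, control_counts))

# Hypothetical single-SNP counts over the 3 genotypes (aa, ab, bb):
print(gtd_score([30, 50, 20], [25, 50, 25]))
```

Here the score is (0.30 − 0.25)² + (0.50 − 0.50)² + (0.20 − 0.25)² = 0.005; identical case and control frequencies give a score of 0.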
A general form
• Let Y be the disease status (1 for cases and 0 for controls). Then, for a genotype partition Π, the score just discussed generalizes naturally to (in the normalized form used in Wang et al., 2012):

I_\Pi = \sum_{j \in \Pi} n_j^2 (\bar{Y}_j - \bar{Y})^2 \Big/ \sum_{i=1}^{n} (Y_i - \bar{Y})^2,

where n_j is the number of observations in partition element j, \bar{Y}_j is their average response, and \bar{Y} is the overall mean of Y.
Theorem: Under the null hypothesis that none of the variables has an influence, the null distribution of I_\Pi, when normalized, is asymptotically a weighted sum of independent chi-square variables.
This applies to both the null random and the null specified models, under the standard conditions for the applicability of the CLT. The case-control study is a special case of the specified model (Chernoff, Lo & Zheng, 2009).
General Setting
• The main idea applies much more generally than to special genetic problems. A more general version is proposed to deal with the problem of detecting which of many potentially influential variables Xs have an effect on a dependent variable Y, using a sample of n observations on Z = (X, Y), where X = (X1, X2, . . . , XS).
• In the background is the assumption that Y may be slightly or negligibly influenced by each of a few variables Xs, but may be profoundly influenced by the confluence of appropriate values within one or a few small groups of these variables. At this stage the object is not to measure the overall effects of the influential variables, but to discover them efficiently.
Example
• We introduce the partition retention approach and related terminology and issues by considering a small artificial example.
• Suppose that an observed variable Y is normally distributed with mean X1X2 and variance 1, where X1 and X2 are two of S = 6 observed and potentially influential variables, each taking the values 0 and 1. Given the data on Y and X = (X1, . . . , X6) for n = 200 subjects, the statistician, who does not know this model, wishes to infer which of the six explanatory variables are causally related to Y. In our computation the Xi are selected independently to be 1 with probabilities 0.7, 0.7, 0.5, 0.5, 0.5, 0.5.
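This artificial example can be reproduced in a few lines. The sketch below (illustrative, assuming the normalized influence score Σ_j n_j²(Ȳ_j − Ȳ)² / Σ_i (Y_i − Ȳ)²) generates the data and scores every variable pair; the causal pair {X1, X2} should come out on top:

```python
import random
from itertools import combinations

random.seed(1)
n, S = 200, 6
probs = [0.7, 0.7, 0.5, 0.5, 0.5, 0.5]
X = [[1 if random.random() < probs[s] else 0 for s in range(S)] for _ in range(n)]
Y = [random.gauss(X[i][0] * X[i][1], 1.0) for i in range(n)]  # mean X1*X2, variance 1

def i_score(pair):
    # Partition subjects by the joint values of the chosen pair, then score
    # sum_j n_j^2 (Ybar_j - Ybar)^2 / sum_i (Y_i - Ybar)^2.
    groups = {}
    for i in range(n):
        groups.setdefault((X[i][pair[0]], X[i][pair[1]]), []).append(Y[i])
    ybar = sum(Y) / n
    denom = sum((yi - ybar) ** 2 for yi in Y)
    num = sum(len(g) ** 2 * (sum(g) / len(g) - ybar) ** 2 for g in groups.values())
    return num / denom

scores = {pair: i_score(pair) for pair in combinations(range(S), 2)}
best = max(scores, key=scores.get)
print(best)  # the causal pair (0, 1) should come out on top
```

With these settings the causal pair's expected score is roughly an order of magnitude above that of any pair containing a noise variable, so the exhaustive pair search is reliable.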
• A capable analytical tool should have the ability to surmount the following difficulties:
• (a) handle an extremely large number of variables (SNPs and other variables, in the hundreds of thousands or millions) in the data;
• (b) detect the so-called "module effect", which refers to the phenomenon where removing one variable from the current module renders it useless in prediction;
• (c) identify interactions (often higher-order effects): the effect of one variable on a response variable depends on the values of other variables in the same module;
• (d) extract and utilize nonlinear effects (or non-additive effects).
• Let the response variable Y and the explanatory variables X (30 Xs, all independent) all be binary, taking values 0 or 1 with 50% chance each. We independently generate 200 observations, and Y is related to X via a model with two causal variable modules.
The task is to predict Y based on the information in X.
We use 150 observations as the training set and 50 as the test set. This example has a 25% theoretical lower bound on the prediction error rate, since we do not know which of the two causal variable modules generated the response Y.
Method        Train error   Test error
LDA           .14 ± .03     .47 ± .02
SVM           .00 ± .00     .50 ± .01
RF            .00 ± .00     .44 ± .04
LogicFS       .13 ± .02     .34 ± .04
LL            .23 ± .05     .45 ± .03
LASSO         .27 ± .06     .48 ± .04
Elastic net   .27 ± .06     .48 ± .04
Proposed      .21 ± .01     .24 ± .03

Table 1. Classification error rates for the toy example.
Diagrams of the conventional approach and the variable-module enabled approach.
• Conventional approach: screening based on significance tests, then modeling, then significance evaluation and functional relevance.
• Proposed variable-module enabled approach (interaction-based, systems-oriented): Stage 1, variable module discovery; Stage 2, dissection: predictive modeling and functional relevance.
Basic tool: the backward dropping algorithm (BDA). BDA is a "greedy" algorithm that seeks the variable subset maximizing the I-score through stepwise elimination of variables from an initial subset (k variables) sampled from the variable space (p variables), k ≪ p.
• Training set: consider a training set of n observations, where each observation X is a p-dimensional vector of discrete variables. Typically p is very large (thousands).
• Sample randomly from the variable space: select an initial subset of k explanatory variables, k ≪ p.
• Compute the I-score based on the k variables.
• Drop variables: tentatively drop each variable and recalculate the I-score with one variable less. Then permanently drop the variable that results in the highest I-score when tentatively dropped.
• Return set: continue the next round of dropping until only one variable is left. Keep the subset that yields the highest I-score in the whole dropping process. Refer to this subset as the return set.
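The steps above can be sketched as follows (a minimal illustration, assuming the normalized I-score form; the synthetic data at the bottom are hypothetical, with only variables 0 and 1 influential):

```python
import random

def i_score(X, y, subset):
    # Influence score of a variable subset: partition the observations by the
    # subset's joint values; score = sum_j n_j^2 (Ybar_j - Ybar)^2 / sum_i (Y_i - Ybar)^2
    groups = {}
    for row, yi in zip(X, y):
        groups.setdefault(tuple(row[v] for v in subset), []).append(yi)
    n = len(y)
    ybar = sum(y) / n
    denom = sum((yi - ybar) ** 2 for yi in y)
    num = sum(len(g) ** 2 * (sum(g) / len(g) - ybar) ** 2 for g in groups.values())
    return num / denom

def backward_dropping(X, y, initial_subset):
    # Greedy BDA: each round, tentatively drop every variable and make permanent
    # the drop giving the highest I-score; return the best subset seen anywhere
    # along the dropping path (the "return set").
    current = list(initial_subset)
    best_set, best = list(current), i_score(X, y, current)
    while len(current) > 1:
        score, victim = max(
            (i_score(X, y, [v for v in current if v != d]), d) for d in current)
        current.remove(victim)
        if score > best:
            best, best_set = score, list(current)
    return best_set, best

# Demo on synthetic data: y depends only on variables 0 and 1.
random.seed(2)
X = [[random.randint(0, 1) for _ in range(6)] for _ in range(200)]
y = [row[0] * row[1] for row in X]
subset, score = backward_dropping(X, y, [0, 1, 2, 3])
print(sorted(subset))
```

Dropping a noise variable coarsens the partition without losing signal, which raises the n_j² weights and hence the score; dropping a causal variable collapses informative cells and lowers it, so the dropping path naturally retains {0, 1}.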
Figure 5. Change of I-Score
Structural diagram of the proposed methodology.
• Discovery stage: Step 1, interaction-based variable screening*; Step 2, variable module generation. *Optional for studies with small numbers of genes.
• Dissection stage: predictive classifier dissection and functional relevance dissection.
• Proposed research: rare variants (rG), continuous variables, environmental variables (GxE, rare GxE); predictive modeling and classifier integration; network enrichment analysis of discovered modules.
• Data flow from data to results through automated analysis workflows (with case studies).
• Core code in C/Java and R functions; database tools and R routines for data processing.
Classification based on van’t Veer’s data (2002).
The purpose of the original study was to predict breast cancer relapse using gene expression data. The original data contain the expression levels of 24,187 genes for 97 patients: 46 relapse (distant metastasis < 5 years) and 51 non-relapse (no distant metastasis ≥ 5 years). We used the 4,918 genes retained by Tibshirani and Efron (2002) for the classification task. 78 of the 97 cases were used as the training set (34 relapse and 44 non-relapse) and 19 (12 relapse and 7 non-relapse) as the test set. Applying the procedures described in the Discovery stage, we identified 18 influential modules with sizes ranging from 2 to 6. The best error rates (biased or not) on this particular test set in the literature are around 10% (2 errors). The proposed method yields a zero error rate (no errors) on the test set.
The CV error rates on the van’t Veer data are typically around 30%. We ran the CV experiment by randomly partitioning the 97 patients into a training sample of size 87 and a test sample of size 10, repeating the experiment ten times. The proposed method yields an average error rate of 8% over the 10 randomly selected CV test samples, a 73% reduction in error rate ((30% − 8%) / 30% ≈ 73%) compared with existing methods.
• In a case-control design with n cases and n controls, the last line of the equations, divided by n, converges to an expression directly related to the correct predictive rate corresponding to the partition Π (two classes, case vs. control). Thus searching for clusters with a larger I-score has the automatic effect of seeking clusters with stronger predictive ability, a very desirable property.
Example Using Breast Cancer Data
Gene      Locus        SNPs    Gene      Locus        SNPs
CASP8     2q33-q34     12      SLC22A18  11p15.5      16
TP53      17p13.1       6      BARD1     2q34-q35     27
ATM       11q22-q23    12      BRCA1     17q21        13
PIK3CA    3q26.3        8      KRAS2     12p12.1      28
PHB       17q21        10      ESR1      6q25         78
BRCA2     13q12.3      31      BRIP1     17q22-q24    19
RB1CC1    8q11          9      RAD51     15q15.1       4
PPM1D     17q23.2       2      TSG101    11p15        11
PALB2     16p12.1       7      CHEK2     22q12.1      11
• Case-control sporadic breast cancer data from the NCI Cancer Genetic Markers of Susceptibility (CGEMS) study
• 2287 postmenopausal women: 1145 cases and 1142 controls
• 18 genes with 304 SNPs selected from the literature:
P-values of the observed marginal effects, with the null distribution estimated by permutations.
Results

        Mean-Ratio Method                         Quantile-Ratio Method
   Gene Pair          Curve p  Rank p-value   Gene Pair          Curve p  Rank p-value
 1 ESR1 – BRCA1        0.017   ≤ 0.001        ESR1 – BRCA1        0.013     0.001
 2 BRCA1 – PHB         0.026     0.040        BRCA1 – PHB         0.029     0.073
 3 KRAS2 – BRCA1       0.002     0.006        KRAS2 – BRCA1       0.002     0.004
 4 SLC22A18 – BRCA1    0.032     0.072        SLC22A18 – BRCA1    0.019     0.079
 5 RAD51 – BRCA1       0.052     0.090        RAD51 – BRCA1       0.005     0.032
 6 RB1CC1 – SLC22A18   0.024     0.026        ESR1 – SLC22A18     0.033     0.016
 7 CASP8 – KRAS2       0.043     0.038        RB1CC1 – SLC22A18   0.009     0.008
 8 CASP8 – SLC22A18    0.042     0.048        CASP8 – KRAS2       0.038     0.036
 9 PIK3CA – BRCA1      0.030     0.048        CASP8 – SLC22A18    0.021     0.012
10 PIK3CA – ESR1       0.047     0.032        PIK3CA – BRCA1      0.014     0.049
11 PIK3CA – RB1CC1     0.047     0.051        PIK3CA – ESR1       0.021     0.005
12 PIK3CA – SLC22A18   0.025     0.036        PIK3CA – RB1CC1     0.044     0.053
13 BRCA1 – CHEK2       0.016     0.031        CASP8 – PIK3CA      0.007     0.009
14 BARD1 – BRCA1       0.032     0.057        BRCA1 – CHEK2       0.007     0.022
15 BARD1 – ESR1        0.044     0.025        BARD1 – BRCA1       0.003     0.015
16 BARD1 – TP53        0.019     0.019        BARD1 – ESR1        0.017     0.003
17                                            BARD1 – TP53        0.015     0.010
18                                            BARD1 – SLC22A18    0.056     0.063
   CASP8 – ESR1        0.071     0.048        CASP8 – ESR1        0.066     0.031
                                              BARD1 – KRAS2       0.055     0.036
   ESR1 – KRAS2        0.145   ≤ 0.001        ESR1 – KRAS2        0.103   ≤ 0.001
   ESR1 – PPM1D        0.252     0.021        ESR1 – PPM1D        0.348   ≤ 0.001
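The permutation-based significance evaluation behind these p-values can be sketched as follows (a toy illustration using the GTD score as the test statistic; the data at the bottom are hypothetical, constructed to show a strong association):

```python
import random

def gtd(case_counts, control_counts):
    # genotype-trait distortion over genotype cells
    n_d, n_u = sum(case_counts), sum(control_counts)
    return sum((d / n_d - u / n_u) ** 2
               for d, u in zip(case_counts, control_counts))

def permutation_pvalue(genotypes, labels, n_perm=999, seed=0):
    # genotypes: genotype cell index per subject; labels: 1 = case, 0 = control.
    # Case/control labels are shuffled to simulate the null of no association.
    rng = random.Random(seed)
    cells = sorted(set(genotypes))

    def score(lab):
        case = [sum(1 for g, l in zip(genotypes, lab) if g == c and l == 1)
                for c in cells]
        ctrl = [sum(1 for g, l in zip(genotypes, lab) if g == c and l == 0)
                for c in cells]
        return gtd(case, ctrl)

    observed = score(labels)
    lab = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(lab)
        if score(lab) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction for a valid p-value

# Hypothetical strongly associated marker: cases pile up in cell 1, controls in cell 0.
p = permutation_pvalue([0] * 50 + [1] * 50, [0] * 50 + [1] * 50)
print(p)
```

With perfect separation no permutation matches the observed score, so the p-value is 1/(999 + 1) = 0.001, the smallest value this number of permutations can resolve.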
Two-way Interaction Networks
Pair-wise network based on the 16 pairs of genes identified by the Mean-Ratio method.
Pair-wise network based on the 18 pairs of genes identified by the Quantile-Ratio method.
Three-way Interaction Networks
3-way interaction network based on the 10 genes identified by the Mean-Ratio method.
3-way interaction network based on the 8 genes identified by the Quantile-Ratio method.
Pairwise Interaction
(M, R)-plane: observed data and permutation quantiles. Pairs: 1 ESR1–BRCA1, 2 BRCA1–PHB, 3 KRAS2–BRCA1, 4 SLC22A18–BRCA1, 5 RAD51–BRCA1, 6 RB1CC1–SLC22A18, 7 CASP8–KRAS2, 8 CASP8–SLC22A18, 9 PIK3CA–BRCA1, 10 PIK3CA–ESR1, 11 PIK3CA–RB1CC1, 12 PIK3CA–SLC22A18, 13 BRCA1–CHEK2, 14 BARD1–BRCA1, 15 BARD1–ESR1, 16 BARD1–TP53.
(M, Q)-plane: observed data and permutation quantiles. Pairs: 1 ESR1–BRCA1, 2 BRCA1–PHB, 3 KRAS2–BRCA1, 4 SLC22A18–BRCA1, 5 RAD51–BRCA1, 6 ESR1–SLC22A18, 7 RB1CC1–SLC22A18, 8 CASP8–KRAS2, 9 CASP8–SLC22A18, 10 PIK3CA–BRCA1, 11 PIK3CA–ESR1, 12 PIK3CA–RB1CC1, 13 CASP8–PIK3CA, 14 BRCA1–CHEK2, 15 BARD1–BRCA1, 16 BARD1–ESR1, 17 BARD1–TP53, 18 BARD1–SLC22A18.
Remarks
• One limitation of marginal approaches is that only a fraction of the information in the data is used.
• The proposed approach draws on more of the relevant information, improving prediction.
• Additional scientific findings are likely if already-collected data are suitably reanalyzed.
• The proposed approach is particularly useful when a large number of dense markers becomes available.
• Information about gene-gene interactions and their disease networks can be derived and constructed.
Collaborators
• Herman Chernoff, Tian Zheng, Iulian Ionita-Laza, Inchi Hu, Hongyu Zhao, Hui Wang, Xin Yan, Yuejing Ding, Chien-Hsun Huang, Bo Qiao, Ying Liu, Michael Agne, Ruixue Fan, Maggie Wang, Lei Cong, Hugh Arnolds, Jun Xie, Kjell Doksum
Key References
• Lo SH, Zheng T (2002) Backward haplotype transmission association (BHTA) algorithm: a fast multiple-marker screening method. Human Heredity 53(4): 197-215.
• Lo SH, Zheng T (2004) A demonstration and findings of a statistical approach through reanalysis of inflammatory bowel disease data. Proc Natl Acad Sci USA 101(28): 10386-10391.
• Lo SH, Chernoff H, Cong L, Ding Y, Zheng T (2008) Discovering interactions among BRCA1 and other candidate genes involved in sporadic breast cancer. Proc Natl Acad Sci USA 105: 12387-12392.
• Chernoff H, Lo SH, Zheng T (2009) Discovering influential variables: a method of partitions. Annals of Applied Statistics 3(4): 1335-1369.
• Wang H, Lo SH, Zheng T, Hu I (2012) Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics 28(21): 2834-2842.