Strong Heredity Models in High Dimensional Data

A Model for Interpretable High Dimensional Interactions. Sahir Rai Bhatnagar. Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood. McGill University. sahirbhatnagar.com

Upload: sahirbhatnagar

Post on 16-Apr-2017


Page 1: Strong Heredity Models in High Dimensional Data

A Model for Interpretable High Dimensional Interactions.

Sahir Rai Bhatnagar
Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood

McGill University
sahirbhatnagar.com

Page 2: Strong Heredity Models in High Dimensional Data

Motivation.

Page 3: Strong Heredity Models in High Dimensional Data

one predictor variable at a time

[Figure: diagram with nodes labeled Predictor Variable and Phenotype; tests labeled Test 1 through Test 5.]

1/25


Page 5: Strong Heredity Models in High Dimensional Data

a network based view

[Figure: diagram with nodes labeled Predictor Variable and Phenotype; a single test labeled Test 1.]

2/25


Page 8: Strong Heredity Models in High Dimensional Data

system level changes due to environment

[Figure: diagram with nodes labeled Predictor Variable, Phenotype and Environment; elements labeled A and B; a single test labeled Test 1.]

3/25


Page 10: Strong Heredity Models in High Dimensional Data

Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke)

• Environment: Gestational Diabetes
• Large Data: Child's epigenome (p ≈ 450k)
• Phenotype: Obesity measures

4/25

Page 11: Strong Heredity Models in High Dimensional Data

Differential Correlation between environments

[Figure panels: (a) Gestational diabetes affected pregnancy; (b) Controls]

5/25

Page 12: Strong Heredity Models in High Dimensional Data

Differential Networking

6/25

Page 13: Strong Heredity Models in High Dimensional Data

formal statement of initial problem

• n: number of subjects
• p: number of predictor variables
• X_{n×p}: high dimensional data set (p >> n)
• Y_{n×1}: phenotype
• E_{n×1}: environmental factor that has widespread effect on X

Objective

• Which elements of X that are associated with Y depend on E?

7/25


Page 19: Strong Heredity Models in High Dimensional Data

Methods.

Page 20: Strong Heredity Models in High Dimensional Data

ECLUST - our proposed method: 3 phases

[Figure: flow diagram starting from the Original Data, shown separately for E = 0 and E = 1, with three labeled phases:]

1) Gene Similarity
2) Cluster Representation (n × 1 vectors)
3) Penalized Regression (Y_{n×1}, with "+" and "× E" terms)

8/25


Page 26: Strong Heredity Models in High Dimensional Data

the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data.
- Sir R. A. Fisher, 1922

8/25

Page 27: Strong Heredity Models in High Dimensional Data

Underlying model

$$Y = \beta_0 + \beta_1 U + \beta_2 U \cdot E + \varepsilon \qquad (1)$$

$$X \sim F(\alpha_0 + \alpha_1 U, \Sigma_E) \qquad (2)$$

• U: unobserved latent variable
• X: observed data which is a function of U
• Σ_E: environment sensitive correlation matrix
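A minimal simulation sketch of equations (1) and (2) may help fix ideas. It is not the authors' simulation design. It assumes that F is Gaussian, that Σ_E is an equicorrelation matrix whose common correlation depends on E, and that subjects alternate between unexposed and exposed. The function names and all parameter values are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def simulate(n=200, p=5, rho=(0.1, 0.8), beta=(0.0, 1.0, 2.0),
             alpha=(0.0, 0.5), seed=1):
    """Draw (X, Y, E) from equations (1) and (2), taking F to be Gaussian.

    rho[e] is the assumed common correlation of the noise in X when E = e,
    so Sigma_E is an equicorrelation matrix that changes with exposure.
    """
    rng = random.Random(seed)
    b0, b1, b2 = beta
    a0, a1 = alpha
    X, Y, E = [], [], []
    for i in range(n):
        e = i % 2                    # alternate unexposed (0) and exposed (1)
        u = rng.gauss(0.0, 1.0)      # latent variable U, never returned
        z = rng.gauss(0.0, 1.0)      # shared factor that induces correlation
        r = rho[e]
        X.append([a0 + a1 * u + math.sqrt(r) * z
                  + math.sqrt(1.0 - r) * rng.gauss(0.0, 1.0)
                  for _ in range(p)])
        Y.append(b0 + b1 * u + b2 * u * e + rng.gauss(0.0, 1.0))
        E.append(e)
    return X, Y, E


def group_correlation(X, E, e, j=0, k=1):
    """Correlation between columns j and k among subjects with E == e."""
    rows = [x for x, ei in zip(X, E) if ei == e]
    return pearson([r[j] for r in rows], [r[k] for r in rows])
```

Comparing `group_correlation(X, E, 1)` with `group_correlation(X, E, 0)` shows the environment-dependent correlation that the later clustering step is meant to exploit.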

9/25

Page 28: Strong Heredity Models in High Dimensional Data

Measure of similarity: topological overlap matrix (TOM)

10/25

Page 29: Strong Heredity Models in High Dimensional Data

Method to detect gene clusters

Table 1: Method to detect gene clusters

General Approach    Formula
TOM Scoring         |TOM_{E=1} − TOM_{E=0}|
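The TOM score in Table 1 can be sketched as follows. The slides do not spell out the TOM formula, so this sketch assumes the standard unsigned topological overlap from weighted correlation network analysis, with adjacency a_ij = |cor(x_i, x_j)|^power. The soft-threshold `power=6` and all function names are assumptions for illustration, not the eclust API.

```python
import math


def pearson(a, b):
    """Sample Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / math.sqrt(saa * sbb)


def adjacency(X, power=6):
    """Unsigned adjacency |cor|^power between columns of X, zero diagonal."""
    p = len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    A = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            A[i][j] = A[j][i] = abs(pearson(cols[i], cols[j])) ** power
    return A


def tom(X, power=6):
    """Topological overlap matrix of the columns of X (rows are subjects)."""
    A = adjacency(X, power)
    p = len(A)
    k = [sum(row) for row in A]              # connectivity of each variable
    T = [[1.0] * p for _ in range(p)]        # diagonal is 1 by convention
    for i in range(p):
        for j in range(i + 1, p):
            shared = sum(A[i][u] * A[u][j] for u in range(p))
            T[i][j] = T[j][i] = (shared + A[i][j]) / (
                min(k[i], k[j]) + 1.0 - A[i][j])
    return T


def tom_difference(X, E, power=6):
    """|TOM(X_{E=1}) - TOM(X_{E=0})|, computed entry by entry."""
    T1 = tom([x for x, e in zip(X, E) if e == 1], power)
    T0 = tom([x for x, e in zip(X, E) if e == 0], power)
    p = len(T1)
    return [[abs(T1[i][j] - T0[i][j]) for j in range(p)] for i in range(p)]
```

Large entries of the difference matrix flag pairs of variables whose co-regulation changes with exposure; a clustering algorithm would then be run on this matrix.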

11/25

Page 30: Strong Heredity Models in High Dimensional Data

Cluster Representation

Table 2: Methods to create cluster representations

General Approach    Type
Unsupervised        average
                    1st principal component
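The two unsupervised representations in Table 2 can be sketched as below. Each reduces a cluster of columns to a single n × 1 vector. The first principal component is approximated here by power iteration on the centered cluster columns, which is one of several ways to obtain it; the function names and the `cluster` argument (a list of column indices) are hypothetical.

```python
import math


def cluster_average(X, cluster):
    """Per-subject mean of the columns listed in `cluster`."""
    return [sum(row[j] for j in cluster) / len(cluster) for row in X]


def first_principal_component(X, cluster, n_iter=200):
    """Per-subject scores on the first principal component of a cluster.

    Uses power iteration on Z^T Z, where Z holds the centered cluster
    columns. Assumes the cluster has non-zero variance and that the
    starting vector is not orthogonal to the leading direction.
    """
    n = len(X)
    m = len(cluster)
    means = [sum(row[j] for row in X) / n for j in cluster]
    Z = [[row[j] - mu for j, mu in zip(cluster, means)] for row in X]
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(n_iter):
        scores = [sum(z * w for z, w in zip(row, v)) for row in Z]
        v = [sum(Z[i][c] * scores[i] for i in range(n)) for c in range(m)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return [sum(z * w for z, w in zip(row, v)) for row in Z]
```

Either vector can then stand in for the whole cluster in the regression phase, which is where the dimension reduction comes from.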

12/25

Page 31: Strong Heredity Models in High Dimensional Data

Model

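$$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$$

Reparametrization [1]: $\alpha_{jE} = \gamma_{jE}\beta_j\beta_E$.

Strong heredity principle [2]:

$$\hat{\alpha}_{jE} \neq 0 \Rightarrow \hat{\beta}_j \neq 0 \text{ and } \hat{\beta}_E \neq 0$$

[1] Choi et al. 2010, JASA
[2] Chipman 1996, Canadian Journal of Statistics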
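A short sketch shows why the reparametrization enforces strong heredity: because each interaction coefficient is the product γ_jE · β_j · β_E, it is zero whenever either main effect is zero. The function names are hypothetical, and the linear predictor is written for a single subject.

```python
def interaction_coefficients(beta, beta_E, gamma):
    """alpha_jE = gamma_jE * beta_j * beta_E for every predictor j."""
    return [g * b * beta_E for g, b in zip(gamma, beta)]


def linear_predictor(x, e, beta0, beta, beta_E, gamma):
    """g(mu) for one subject: main effects plus reparametrized interactions."""
    alpha = interaction_coefficients(beta, beta_E, gamma)
    main = beta0 + sum(b * xj for b, xj in zip(beta, x)) + beta_E * e
    inter = sum(a * xj * e for a, xj in zip(alpha, x))
    return main + inter
```

For example, with `beta = [1.5, 0.0, -2.0]` the second interaction coefficient is zero whatever value `gamma[1]` takes, and setting `beta_E = 0.0` removes every interaction at once.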

13/25


Page 34: Strong Heredity Models in High Dimensional Data

Strong Heredity Model with Penalization

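$$\underset{\beta_0, \beta, \gamma}{\arg\min} \; \frac{1}{2}\lVert Y - g(\mu)\rVert^2 + \lambda_\beta\left(w_1|\beta_1| + \cdots + w_q|\beta_q| + w_E|\beta_E|\right) + \lambda_\gamma\left(w_{1E}|\gamma_{1E}| + \cdots + w_{qE}|\gamma_{qE}|\right)$$

$$w_j = \left|\frac{1}{\hat{\beta}_j}\right|, \qquad w_{jE} = \left|\frac{\hat{\beta}_j\hat{\beta}_E}{\hat{\alpha}_{jE}}\right|$$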
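The penalized criterion can be evaluated with a few lines of code. This sketch only computes the objective value for given parameters; it does not perform the minimization. It assumes an identity link for g, a lasso-type penalty on the absolute values of the coefficients, and that w_E is formed like the other main-effect weights, as |1/β̂_E|. The initial estimates must be non-zero for the weights to be defined. All function names are hypothetical.

```python
def adaptive_weights(beta_hat, beta_E_hat, alpha_hat):
    """Adaptive weights from initial (non-zero) coefficient estimates."""
    w = [abs(1.0 / b) for b in beta_hat]
    w_E = abs(1.0 / beta_E_hat)              # assumed to mirror w_j
    w_int = [abs(b * beta_E_hat / a) for b, a in zip(beta_hat, alpha_hat)]
    return w, w_E, w_int


def objective(Y, X, E, beta0, beta, beta_E, gamma,
              w, w_E, w_int, lam_beta, lam_gamma):
    """Penalized least-squares criterion under the strong heredity model."""
    fit = 0.0
    for y, x, e in zip(Y, X, E):
        # Interaction coefficient for predictor j is gamma_j * beta_j * beta_E.
        mu = beta0 + beta_E * e + sum(
            b * xj + g * b * beta_E * xj * e
            for b, g, xj in zip(beta, gamma, x))
        fit += (y - mu) ** 2
    pen_beta = sum(wj * abs(b) for wj, b in zip(w, beta)) + w_E * abs(beta_E)
    pen_gamma = sum(wj * abs(g) for wj, g in zip(w_int, gamma))
    return 0.5 * fit + lam_beta * pen_beta + lam_gamma * pen_gamma
```

The two tuning parameters `lam_beta` and `lam_gamma` control sparsity of the main effects and of the interactions separately, which is why both need to be tuned.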

14/25

Page 35: Strong Heredity Models in High Dimensional Data

Results.

Page 36: Strong Heredity Models in High Dimensional Data

Simulation Study

15/25

Page 37: Strong Heredity Models in High Dimensional Data

TOM based on all subjects

(a) TOM(X_all)

16/25

Page 38: Strong Heredity Models in High Dimensional Data

TOM based on unexposed subjects

(a) TOM(X_{E=0})

17/25

Page 39: Strong Heredity Models in High Dimensional Data

TOM based on exposed subjects

(a) TOM(X_{E=1})

18/25

Page 40: Strong Heredity Models in High Dimensional Data

Difference of TOMs

(a) |TOM(X_{E=1}) − TOM(X_{E=0})|

19/25

Page 41: Strong Heredity Models in High Dimensional Data

Results: Test set MSE

20/25

Page 42: Strong Heredity Models in High Dimensional Data

Results: Variable Selection

21/25

Page 43: Strong Heredity Models in High Dimensional Data

Open source software

• Software implementation in R: http://sahirbhatnagar.com/eclust/
• Allows user specified interaction terms
• Automatically determines the optimal tuning parameters through cross validation
• Can also be applied to genetic data

22/25

Page 44: Strong Heredity Models in High Dimensional Data

Conclusions.

Page 45: Strong Heredity Models in High Dimensional Data

Conclusions and Contributions

• Large system-wide changes are observed in many environments
• Dimension reduction is achieved through leveraging the environmental-class-conditional correlations
• R software: http://sahirbhatnagar.com/eclust/

23/25


Page 48: Strong Heredity Models in High Dimensional Data

Limitations

• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
• Two tuning parameters
• Cautionary note on simulation studies
• Need more samples . . . Got data?

24/25


Page 53: Strong Heredity Models in High Dimensional Data

acknowledgements

• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, André Anne Houde
• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest
• Greg Voisin, Dr. Forgetta, Dr. Klein
• Mothers and children from the study

25/25