strong heredity models in high dimensional data

A Model for Interpretable High Dimensional Interactions.

Sahir Rai Bhatnagar

Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood

McGill University

sahirbhatnagar.com

Motivation.

one predictor variable at a time

[Figure: each Predictor Variable is tested against the Phenotype separately (Test 1 to Test 5).]

a network based view

[Figure: a network of Predictor Variables is tested against the Phenotype in a single test (Test 1).]

system level changes due to environment

[Figure: Environment, two networks of Predictor Variables (labelled A and B), and Phenotype; a single test (Test 1).]

Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke)

• Environment: gestational diabetes
• Large data: child's epigenome ($p \approx 450$k)
• Phenotype: obesity measures

Differential Correlation between environments

[Figure: (a) gestational diabetes affected pregnancies; (b) controls.]

Differential Networking


formal statement of initial problem

• $n$: number of subjects
• $p$: number of predictor variables
• $X_{n \times p}$: high dimensional data set ($p \gg n$)
• $Y_{n \times 1}$: phenotype
• $E_{n \times 1}$: environmental factor that has a widespread effect on $X$

Objective

• Which elements of $X$ that are associated with $Y$ depend on $E$?


Methods.

ECLUST - our proposed method: 3 phases

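Original Data, split by environment ($E = 0$ and $E = 1$)

1) Gene Similarity
2) Cluster Representation: each cluster is summarized by an $n \times 1$ vector
3) Penalized Regression: $Y_{n \times 1} \sim$ cluster representations + cluster representations $\times E$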



the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data.
- Sir R. A. Fisher, 1922

Underlying model

$$Y = \beta_0 + \beta_1 U + \beta_2 U \cdot E + \varepsilon \tag{1}$$

$$X \sim F(\alpha_0 + \alpha_1 U, \Sigma_E) \tag{2}$$

• $U$: unobserved latent variable
• $X$: observed data which is a function of $U$
• $\Sigma_E$: environment sensitive correlation matrix
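To make equations (1) and (2) concrete, here is a minimal simulation sketch. The slides do not specify $F$, $\Sigma_E$, or any parameter values, so everything below is an assumption chosen for illustration: $F$ is Gaussian, $E$ is a balanced binary exposure, and $\Sigma_E$ is an exchangeable correlation matrix whose correlation `rho[E]` depends on the environment. The function names `simulate` and `pearson` are hypothetical.

```python
import random


def simulate(n=200, p=5, beta=(0.0, 1.0, 2.0), alpha=(0.0, 1.0),
             rho=(0.1, 0.8), seed=1):
    """Draw (Y, X, E) from equations (1) and (2).

    Assumptions (not from the slides): Gaussian F, binary E, and an
    exchangeable Sigma_E with correlation rho[E].
    """
    rng = random.Random(seed)
    Y, X, E = [], [], []
    for i in range(n):
        e = i % 2                   # exposure alternates 0/1 (balanced design)
        u = rng.gauss(0, 1)         # unobserved latent variable U
        shared = rng.gauss(0, 1)    # common noise factor; gives correlation rho[e]
        r = rho[e]
        row = [alpha[0] + alpha[1] * u
               + r ** 0.5 * shared + (1 - r) ** 0.5 * rng.gauss(0, 1)
               for _ in range(p)]
        # Equation (1): phenotype depends on U and on U x E
        y = beta[0] + beta[1] * u + beta[2] * u * e + rng.gauss(0, 1)
        Y.append(y)
        X.append(row)
        E.append(e)
    return Y, X, E


def pearson(a, b):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5
```

Under these assumed settings, the correlation between any two columns of `X` is higher among exposed subjects than among unexposed ones, which is the kind of environment-dependent correlation structure the model describes.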

Measure of similarity: topological overlap matrix (TOM)
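The slide names the TOM without giving its formula. As a sketch, the code below uses the standard definition from weighted gene co-expression network analysis, which is an assumption about what the talk uses: for a symmetric adjacency matrix $a$ with entries in $[0, 1]$, $\mathrm{TOM}_{ij} = (\sum_{u \neq i,j} a_{iu} a_{uj} + a_{ij}) / (\min(k_i, k_j) + 1 - a_{ij})$, where $k_i = \sum_{u \neq i} a_{iu}$. How the adjacency matrix is built (for example from correlations) is left open here, and the function name `tom` is hypothetical.

```python
def tom(adj):
    """Topological overlap matrix of a symmetric adjacency matrix.

    adj[i][j] is assumed to lie in [0, 1]; diagonal entries are ignored.
    """
    p = len(adj)
    # Work on a copy with a zero diagonal so self-links never contribute.
    a = [[0.0 if i == j else adj[i][j] for j in range(p)] for i in range(p)]
    k = [sum(row) for row in a]          # connectivity of each node
    out = [[1.0] * p for _ in range(p)]  # TOM_ii is set to 1 by convention
    for i in range(p):
        for j in range(i + 1, p):
            # Shared neighbours; u = i and u = j contribute 0 (zero diagonal).
            shared = sum(a[i][u] * a[u][j] for u in range(p))
            t = (shared + a[i][j]) / (min(k[i], k[j]) + 1.0 - a[i][j])
            out[i][j] = out[j][i] = t
    return out
```

Two nodes score highly when they are directly connected and also share many neighbours, so the TOM is smoother than the raw adjacency matrix.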


Method to detect gene clusters

Table 1: Method to detect gene clusters

General Approach    Formula
TOM Scoring         $|\mathrm{TOM}_{E=1} - \mathrm{TOM}_{E=0}|$

Cluster Representation

Table 2: Methods to create cluster representations

General Approach    Type
Unsupervised        average
                    1st principal component
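The two unsupervised representations in Table 2 can be sketched as follows. Each function reduces the columns of one cluster to a single $n \times 1$ vector. The names `cluster_average` and `cluster_pc1` are hypothetical, and the first principal component is computed with a plain power iteration purely to keep the example dependency-free; this is not necessarily how the authors' software computes it.

```python
def cluster_average(X, cols):
    """Per-subject mean of the columns belonging to one cluster."""
    return [sum(row[j] for j in cols) / len(cols) for row in X]


def cluster_pc1(X, cols, iters=500):
    """Per-subject score on the first principal component of a cluster.

    Uses power iteration on the sample covariance matrix. The sign of the
    component is arbitrary, and the uniform starting vector is assumed not
    to be orthogonal to the leading eigenvector.
    """
    n, q = len(X), len(cols)
    means = [sum(row[j] for row in X) / n for j in cols]
    # Centre each selected column.
    Z = [[row[j] - m for j, m in zip(cols, means)] for row in X]
    # q x q sample covariance matrix of the cluster.
    C = [[sum(z[a] * z[b] for z in Z) / (n - 1) for b in range(q)]
         for a in range(q)]
    v = [1.0 / q ** 0.5] * q
    for _ in range(iters):
        w = [sum(C[a][b] * v[b] for b in range(q)) for a in range(q)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    # Project the centred data onto the leading eigenvector.
    return [sum(z[a] * v[a] for a in range(q)) for z in Z]
```

The average weights every variable in the cluster equally, while the first principal component weights variables by how much they contribute to the dominant direction of variation.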

Model

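$$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$$

Reparametrization¹: $\alpha_{jE} = \gamma_{jE}\beta_j\beta_E$.

Strong heredity principle²:

$$\hat{\alpha}_{jE} \neq 0 \Rightarrow \hat{\beta}_j \neq 0 \text{ and } \hat{\beta}_E \neq 0$$

¹ Choi et al. 2010, JASA
² Chipman 1996, Canadian Journal of Statistics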
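A short sketch of why the reparametrization enforces strong heredity: because each interaction coefficient is the product $\gamma_{jE}\beta_j\beta_E$, it is zero whenever either main effect is zero. The function names below are hypothetical, and the linear predictor is written for a single subject.

```python
def interaction_coefficients(beta, beta_E, gamma):
    """alpha_jE = gamma_jE * beta_j * beta_E for every predictor j."""
    return [g * b * beta_E for g, b in zip(gamma, beta)]


def linear_predictor(x, e, beta0, beta, beta_E, gamma):
    """g(mu) for one subject with predictors x and environment e."""
    alpha = interaction_coefficients(beta, beta_E, gamma)
    # Main effects of the predictors and of the environment.
    main = beta0 + sum(b * xj for b, xj in zip(beta, x)) + beta_E * e
    # Interaction terms alpha_jE * (X_j * E).
    inter = sum(a * xj * e for a, xj in zip(alpha, x))
    return main + inter
```

For example, with `beta = [0.5, 0.0, -1.0]` the second interaction coefficient is zero for any value of `gamma`, and setting `beta_E = 0.0` removes every interaction at once.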

Strong Heredity Model with Penalization

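$$\underset{\beta_0, \beta, \gamma}{\arg\min} \; \frac{1}{2}\lVert Y - g(\mu)\rVert^2 + \lambda_\beta\left(w_1|\beta_1| + \cdots + w_q|\beta_q| + w_E|\beta_E|\right) + \lambda_\gamma\left(w_{1E}|\gamma_{1E}| + \cdots + w_{qE}|\gamma_{qE}|\right)$$

$$w_j = \left|\frac{1}{\hat{\beta}_j}\right|, \quad w_{jE} = \left|\frac{\hat{\beta}_j\hat{\beta}_E}{\hat{\alpha}_{jE}}\right|$$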
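The adaptive weights and the penalty term can be sketched as below. Several points are assumptions rather than statements from the slide: the penalty is applied to absolute values of the coefficients (a lasso-type penalty), the weight for the environment is taken as $w_E = |1/\hat{\beta}_E|$ by analogy with $w_j$, and the initial estimates are assumed to be nonzero (for instance from a ridge fit). The function names are hypothetical.

```python
def adaptive_weights(beta_hat, beta_E_hat, alpha_hat):
    """Weights from initial estimates, which are assumed to be nonzero."""
    w = [abs(1.0 / b) for b in beta_hat]            # w_j = |1 / beta_hat_j|
    w_E = abs(1.0 / beta_E_hat)                     # assumed analogue for E
    w_int = [abs(b * beta_E_hat / a)                # w_jE
             for b, a in zip(beta_hat, alpha_hat)]
    return w, w_E, w_int


def penalty(beta, beta_E, gamma, weights, lam_beta, lam_gamma):
    """Weighted L1 penalty on main effects and on interaction parameters."""
    w, w_E, w_int = weights
    main = sum(wj * abs(bj) for wj, bj in zip(w, beta)) + w_E * abs(beta_E)
    inter = sum(wj * abs(gj) for wj, gj in zip(w_int, gamma))
    return lam_beta * main + lam_gamma * inter
```

A large initial main-effect estimate gives a small weight, so that coefficient is penalized less, while the two tuning parameters `lam_beta` and `lam_gamma` control main effects and interactions separately.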

Results.

Simulation Study

TOM based on all subjects

[Figure: $\mathrm{TOM}(X_{\text{all}})$]

TOM based on unexposed subjects

[Figure: $\mathrm{TOM}(X_{E=0})$]

TOM based on exposed subjects

[Figure: $\mathrm{TOM}(X_{E=1})$]

Difference of TOMs

[Figure: $|\mathrm{TOM}(X_{E=1}) - \mathrm{TOM}(X_{E=0})|$]

Results: Test set MSE

Results: Variable Selection

Open source software

• Software implementation in R: http://sahirbhatnagar.com/eclust/
• Allows user specified interaction terms
• Automatically determines the optimal tuning parameters through cross validation
• Can also be applied to genetic data

Conclusions.

Conclusions and Contributions

• Large system-wide changes are observed in many environments
• Dimension reduction is achieved through leveraging the environmental-class-conditional correlations
• R software: http://sahirbhatnagar.com/eclust/

Limitations

• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
• Two tuning parameters
• Cautionary note on simulation studies
• Need more samples . . . Got data?


acknowledgements

• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, André Anne Houde
• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest
• Greg Voisin, Dr. Forgetta, Dr. Klein
• Mothers and children from the study
