strong heredity models in high dimensional data
A Model for Interpretable High-Dimensional Interactions
Sahir Rai Bhatnagar
Joint work with Yi Yang, Mathieu Blanchette and Celia Greenwood
McGill University
sahirbhatnagar.com
Motivation.
one predictor variable at a time
[Figure: predictor variables and the phenotype; each predictor variable is tested separately (Test 1, Test 2, Test 3, Test 4, Test 5).]
a network based view
[Figure: a network of predictor variables and the phenotype; a single test (Test 1).]
system level changes due to environment
[Figure: predictor variable network, phenotype and environment, with labels A and B; a single test (Test 1).]
Motivating Dataset: Newborn epigenetic adaptations to gestational diabetes exposure (Luigi Bouchard, Sherbrooke)
• Environment: gestational diabetes
• Large data: child's epigenome (p ≈ 450k)
• Phenotype: obesity measures
Differential Correlation between environments
(a) Gestational diabetes affected pregnancy (b) Controls
Differential Networking
formal statement of initial problem
• n: number of subjects
• p: number of predictor variables
• $X_{n \times p}$: high-dimensional data set ($p \gg n$)
• $Y_{n \times 1}$: phenotype
• $E_{n \times 1}$: environmental factor that has a widespread effect on X
Objective
• For which elements of X does the association with Y depend on E?
Methods.
ECLUST - our proposed method: 3 phases
Original Data, split by environment (E = 0 and E = 1)
1) Gene Similarity
2) Cluster Representation (each cluster summarized as an $n \times 1$ vector)
3) Penalized Regression: $Y_{n \times 1}$ ∼ cluster representations + cluster representations × E
the objective of statistical methods is the reduction of data. A quantity of data . . . is to be replaced by relatively few quantities which shall adequately represent . . . the relevant information contained in the original data.
- Sir R. A. Fisher, 1922
Underlying model
$Y = \beta_0 + \beta_1 U + \beta_2 U \cdot E + \varepsilon$ (1)
$X \sim F(\alpha_0 + \alpha_1 U, \Sigma_E)$ (2)
• $U$: unobserved latent variable
• $X$: observed data which is a function of $U$
• $\Sigma_E$: environment sensitive correlation matrix
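To make equations (1) and (2) concrete, here is a minimal simulation sketch. The slides leave the distribution F unspecified; this sketch assumes a multivariate normal with an exchangeable correlation matrix whose off-diagonal value depends on E. The function name, coefficient values and correlation levels are all illustrative assumptions, not values from the talk.

```python
import numpy as np

def simulate_underlying_model(n=200, p=10, rho_exposed=0.8,
                              rho_unexposed=0.1, seed=0):
    """Draw (Y, X, E) from equations (1)-(2), assuming F is multivariate normal."""
    rng = np.random.default_rng(seed)
    E = rng.integers(0, 2, size=n)        # binary environment indicator
    U = rng.normal(size=n)                # unobserved latent variable

    # Equation (1): the effect of U on Y is modified by E (illustrative values)
    beta0, beta1, beta2 = 0.0, 1.0, 2.0
    Y = beta0 + beta1 * U + beta2 * U * E + rng.normal(size=n)

    # Equation (2): X has mean alpha0 + alpha1 * U and correlation Sigma_E
    alpha0, alpha1 = 0.0, 1.0
    X = np.empty((n, p))
    for e, rho in ((0, rho_unexposed), (1, rho_exposed)):
        idx = np.flatnonzero(E == e)
        # Exchangeable Sigma_E: ones on the diagonal, rho elsewhere
        sigma = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)
        noise = rng.multivariate_normal(np.zeros(p), sigma, size=idx.size)
        X[idx] = (alpha0 + alpha1 * U[idx])[:, None] + noise
    return Y, X, E
```

Only Y, X and E are returned, mirroring the setting in which U is never observed.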
Measure of similarity: topological overlap matrix (TOM)
Method to detect gene clusters
Table 1: Method to detect gene clusters
General Approach    Formula
TOM Scoring         $|\mathrm{TOM}_{E=1} - \mathrm{TOM}_{E=0}|$
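The TOM scoring formula can be sketched as follows. The slides do not say how the TOM itself is computed; this sketch assumes the commonly used unsigned weighted-network definition, with adjacency $a_{ij} = |\mathrm{cor}(x_i, x_j)|^{\text{power}}$ and $\mathrm{TOM}_{ij} = (\sum_u a_{iu} a_{uj} + a_{ij}) / (\min(k_i, k_j) + 1 - a_{ij})$. The function names and the soft-threshold power of 6 are assumptions made for illustration.

```python
import numpy as np

def tom(X, power=6):
    """Unsigned topological overlap matrix from an n x p data matrix."""
    # Soft-thresholded adjacency from absolute Pearson correlations
    A = np.abs(np.corrcoef(X, rowvar=False)) ** power
    np.fill_diagonal(A, 0.0)
    k = A.sum(axis=1)          # connectivity of each variable
    shared = A @ A             # sum over u of a_iu * a_uj
    T = (shared + A) / (np.minimum.outer(k, k) + 1.0 - A)
    np.fill_diagonal(T, 1.0)
    return T

def tom_difference(X, E, power=6):
    """|TOM(X[E == 1]) - TOM(X[E == 0])|, the score in Table 1."""
    return np.abs(tom(X[E == 1], power) - tom(X[E == 0], power))
```

The resulting p × p matrix could then be passed to any clustering routine as a similarity measure; which routine to use is left open here.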
Cluster Representation
Table 2: Methods to create cluster representations
General Approach    Type
Unsupervised        average
Unsupervised        1st principal component
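Both unsupervised representations in Table 2 reduce a cluster of columns to a single n × 1 vector. A minimal sketch, assuming `members` is a list of column indices belonging to one cluster (the function names are hypothetical):

```python
import numpy as np

def cluster_average(X, members):
    """Represent a cluster by the row-wise mean of its member columns."""
    return X[:, members].mean(axis=1)

def cluster_first_pc(X, members):
    """Represent a cluster by its scores on the first principal component."""
    block = X[:, members]
    centered = block - block.mean(axis=0)
    # SVD of the centered block: U[:, 0] * s[0] are the first PC scores
    U, s, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, 0] * s[0]
```

Note that the sign of a principal component is arbitrary, so the first-PC representation is only defined up to a sign flip.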
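Model
$g(\mu) = \underbrace{\beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \beta_E E}_{\text{main effects}} + \underbrace{\alpha_{1E}(X_1 E) + \cdots + \alpha_{pE}(X_p E)}_{\text{interactions}}$
Reparametrization¹: $\alpha_{jE} = \gamma_{jE} \beta_j \beta_E$.
Strong heredity principle²: $\hat{\alpha}_{jE} \neq 0 \Rightarrow \hat{\beta}_j \neq 0$ and $\hat{\beta}_E \neq 0$
¹ Choi et al. 2010, JASA
² Chipman 1996, Canadian Journal of Statistics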
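The reparametrization enforces strong heredity by construction: because each interaction coefficient is a product that includes both main-effect coefficients, it can only be nonzero when both of them are nonzero. A small sketch with an identity link (the function names and the identity link are assumptions for illustration):

```python
import numpy as np

def interaction_coefficients(beta, beta_E, gamma):
    """alpha_jE = gamma_jE * beta_j * beta_E for every predictor j."""
    return np.asarray(gamma, dtype=float) * np.asarray(beta, dtype=float) * beta_E

def linear_predictor(X, E, beta0, beta, beta_E, gamma):
    """Main effects plus X-by-E interactions under the reparametrization."""
    beta = np.asarray(beta, dtype=float)
    alpha = interaction_coefficients(beta, beta_E, gamma)
    return beta0 + X @ beta + beta_E * E + (X * E[:, None]) @ alpha
```

For example, with `beta = [1.5, 0.0, -2.0]`, `beta_E = 2.0` and `gamma = [0.5, 3.0, 0.0]`, only the first interaction coefficient is nonzero: the second is zeroed by its main effect and the third by its own gamma.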
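Strong Heredity Model with Penalization
$\arg\min_{\beta_0, \beta, \gamma} \; \frac{1}{2} \lVert Y - g(\mu) \rVert^2 + \lambda_\beta \left( w_1 |\beta_1| + \cdots + w_q |\beta_q| + w_E |\beta_E| \right) + \lambda_\gamma \left( w_{1E} |\gamma_{1E}| + \cdots + w_{qE} |\gamma_{qE}| \right)$
$w_j = \left| \dfrac{1}{\hat{\beta}_j} \right|, \quad w_{jE} = \left| \dfrac{\hat{\beta}_j \hat{\beta}_E}{\hat{\alpha}_{jE}} \right|$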
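The adaptive weights and the penalty term can be sketched as below. Two assumptions go beyond the slide: the penalty is applied to absolute values of the coefficients, as in a lasso-type penalty, and a small `eps` guards against division by zero when an initial estimate is exactly zero. The slides also do not say where the initial estimates come from, so they are plain inputs here. All function and argument names are hypothetical.

```python
import numpy as np

def adaptive_weights(beta_init, beta_E_init, alpha_init, eps=1e-8):
    """w_j = |1 / beta_j|, w_E = |1 / beta_E|, w_jE = |beta_j * beta_E / alpha_jE|."""
    beta_init = np.asarray(beta_init, dtype=float)
    alpha_init = np.asarray(alpha_init, dtype=float)
    # eps is an assumed safeguard, not part of the formulas on the slide
    w_main = 1.0 / np.maximum(np.abs(beta_init), eps)
    w_E = 1.0 / max(abs(beta_E_init), eps)
    w_inter = np.abs(beta_init * beta_E_init) / np.maximum(np.abs(alpha_init), eps)
    return w_main, w_E, w_inter

def penalty(beta, beta_E, gamma, weights, lam_beta, lam_gamma):
    """Weighted L1 penalty on main effects (lam_beta) and on gamma (lam_gamma)."""
    w_main, w_E, w_inter = weights
    main = np.sum(w_main * np.abs(beta)) + w_E * abs(beta_E)
    inter = np.sum(w_inter * np.abs(gamma))
    return lam_beta * main + lam_gamma * inter
```

Large initial main effects receive small weights and are therefore penalized less, while the two tuning parameters control the main-effect and interaction penalties separately.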
Results.
Simulation Study
TOM based on all subjects
(a) $\mathrm{TOM}(X_{\text{all}})$
TOM based on unexposed subjects
(a) $\mathrm{TOM}(X_{E=0})$
TOM based on exposed subjects
(a) $\mathrm{TOM}(X_{E=1})$
Difference of TOMs
(a) $|\mathrm{TOM}(X_{E=1}) - \mathrm{TOM}(X_{E=0})|$
Results: Test set MSE
Results: Variable Selection
Open source software
• Software implementation in R: http://sahirbhatnagar.com/eclust/
• Allows user specified interaction terms
• Automatically determines the optimal tuning parameters through cross validation
• Can also be applied to genetic data
Conclusions.
Conclusions and Contributions
• Large system-wide changes are observed in many environments
• Dimension reduction is achieved through leveraging the environmental-class-conditional correlations
• R software: http://sahirbhatnagar.com/eclust/
Limitations
• There must be a high-dimensional signature of the exposure
• Clustering is unsupervised
• Two tuning parameters
• Cautionary note on simulation studies
• Need more samples . . . Got data?
acknowledgements
• Dr. Celia Greenwood
• Dr. Blanchette and Dr. Yang
• Dr. Luigi Bouchard, André Anne Houde
• Dr. Steele, Dr. Kramer, Dr. Abrahamowicz
• Maxime Turgeon, Kevin McGregor, Lauren Mokry, Dr. Forest
• Greg Voisin, Dr. Forgetta, Dr. Klein
• Mothers and children from the study