Interaction‐based predictive learning for the genetic etiology of complex diseases
TIAN ZHENG
DEPARTMENT OF STATISTICS, DATA SCIENCE INSTITUTE
COLUMBIA UNIVERSITY
Workshop on Perspectives & Analysis Methods for Personalized Medicine
Institute for Mathematical Sciences, National University of Singapore
July 10-14, 2017
July 13, 2017, Workshop on Perspectives & Analysis Methods for Personalized Medicine, Institute for Mathematical Sciences, National University of Singapore
The Conference on Statistical Learning and Data Science / Nonparametric Statistics
June 4-6, 2018, Columbia University
Keynote speakers:
• Michael I. Jordan
• Liza Levina
• David Madigan
Banquet speaker: Cathy O'Neil
Program chairs: Annie Qu ([email protected]) and Cynthia Rudin ([email protected])
Local chair: Tian Zheng ([email protected])
Acknowledgements
Joint work with
Shaw-Hwa Lo, Columbia University
Herman Chernoff, Columbia University
Adeline Lo, Princeton University
Funding from NSF
Complex genetic diseases
“… are caused by multiple genes interacting with each other and with environmental factors to create a gradient of genetic susceptibility to disease.” (Weeks and Lathrop 1995)
Gene‐gene interactions play an important role in common human disorders, in both disease risks and responses to treatments.
Ganapathiraju, Madhavi K., et al. "Schizophrenia interactome with 504 novel protein–protein interactions." npj Schizophrenia 2 (2016): 16012.
Gene x Gene Interactions
“Interaction” is both a biological term and a statistical term.
• Biology: two or more genes jointly affect an outcome of interest.
• Statistics: defined under a specific model.
Identify potentially interacting genes
• via association mapping: sets of genetic loci with statistically significant association with the disease outcome.
• via predictive learning: sets of genetic loci that are predictive of the disease outcome.
Significance versus predictivity
Lo, A., Chernoff, H., Zheng, T., & Lo, S. H. (2015). Proceedings of the National Academy of Sciences, 112(45), 13892-13897.
[Figure: two pairs of densities.
First panel: N(0, 1) versus N(3, 3^2). Median significance level: s_X = 0.0014. Prediction rate: 1 - e_X = 0.83.
Second panel: N(0, 1) versus N(0, 0.05^2). Median significance level: s_Y = 0.5. Prediction rate: 1 - e_Y = 0.94.]
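The two prediction rates quoted for these density pairs can be checked numerically. The following is a minimal R sketch, assuming equal class priors and taking the prediction rate to be the Bayes rate, one half of the integral of the pointwise maximum of the two densities. The function names `bayes_rate` and `median_sig` are hypothetical, and reading the "median significance level" as the one-sided p-value under N(0, 1) at the median of the alternative is an assumption.

```r
# Riemann-sum approximation of the Bayes prediction rate,
# 0.5 * integral of max(f0, f1), assuming equal class priors.
bayes_rate <- function(mean1, sd1, lower = -30, upper = 30, step = 1e-3) {
  x  <- seq(lower, upper, by = step)
  f0 <- dnorm(x, mean = 0, sd = 1)        # first class: N(0, 1)
  f1 <- dnorm(x, mean = mean1, sd = sd1)  # second class
  0.5 * sum(pmax(f0, f1)) * step
}

# One-sided p-value under N(0, 1) at the median of the second class.
# This reading of "median significance level" is an assumption.
median_sig <- function(mean1) 1 - pnorm(mean1)

rate_x <- bayes_rate(mean1 = 3, sd1 = 3)     # N(0, 1) vs N(3, 3^2)
rate_y <- bayes_rate(mean1 = 0, sd1 = 0.05)  # N(0, 1) vs N(0, 0.05^2)
sig_x  <- median_sig(3)
sig_y  <- median_sig(0)
print(round(c(rate_x = rate_x, rate_y = rate_y, sig_x = sig_x, sig_y = sig_y), 4))
```

Under these assumptions the rates come out near 0.83 and 0.94: the variable that looks highly significant is the weaker predictor, and the one with a median p-value of 0.5 is the stronger one.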
[Figure: predictive variable sets (VS) versus significant VS; horizontal axis ticks from 10^-1 to 10^-7, vertical axis from 0.0 to 0.4.]
https://github.com/tz33cu/PartitionRetention
[Figure: predictive variable sets (VS) versus significant VS; horizontal axis ticks from 10^-1 to 10^-12, vertical axis from 0.0 to 0.4.]
https://github.com/tz33cu/PartitionRetention
[Figure: true prediction rate as a function of odds ratio (OR) and minor allele frequency (MAF); color scale from 0.54 to 0.68.]
[Figure, three panels: prediction rate in training set as a function of OR and MAF; color scale from 0.64 to 0.74.]
https://github.com/tz33cu/PartitionRetention
[Figure, three panels: chi-square test p-value (-log scale) as a function of OR and MAF; color scale from 0 to 35.]
https://github.com/tz33cu/PartitionRetention
Significant set by a test using a small sample
Significant set by a test using a large sample
Set of variable modules with predictive power above certain threshold
Significant set by a test using a huge sample
Prediction‐oriented measure
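Predictivity of a variable set
Notation
o $X = (X_1, \ldots, X_m)$ is a set of dichotomous variables under evaluation.
o $\Pi_X$ is the partition based on $X$.
o $Y \in \{d, u\}$ is the disease outcome of interest.
o $p_d(x)$ is the conditional distribution of $X$ given $Y = d$.
o $p_u(x)$ is the conditional distribution of $X$ given $Y = u$.
Assume $P(Y = d) = 0.5$.
Bayes rate for predicting Y:
$$\theta(X) = \frac{1}{2} \sum_{x \in \Pi_X} \max\{p_d(x),\, p_u(x)\}$$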
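The step from the max form of the Bayes rate to the absolute-difference form can be spelled out. This is a short derivation sketch, writing $p_d(x)$ and $p_u(x)$ for the conditional distributions of the variable set among cases and controls and $\Pi_X$ for the partition (this notation is assumed here), and using $\max\{a, b\} = (a + b)/2 + |a - b|/2$ together with the fact that each conditional distribution sums to one over the partition.

```latex
\begin{aligned}
\frac{1}{2} \sum_{x \in \Pi_X} \max\{p_d(x),\, p_u(x)\}
  &= \frac{1}{2} \sum_{x \in \Pi_X}
     \left[ \frac{p_d(x) + p_u(x)}{2} + \frac{|p_d(x) - p_u(x)|}{2} \right] \\
  &= \frac{1}{2} + \frac{1}{4} \sum_{x \in \Pi_X} \left| p_d(x) - p_u(x) \right| .
\end{aligned}
```

The rate is therefore 1/2 (random guessing) plus a quarter of the total absolute difference between the two conditional distributions.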
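$$\theta(X) = \frac{1}{2} + \frac{1}{4} \sum_{x \in \Pi_X} \left| p_d(x) - p_u(x) \right|$$
Maximal potential ability to predict.
Sample estimate?
o $n_d$, number of cases; $n_u$, number of controls
o $n_{d,x}$, number of cases with $X = x$
o $n_{u,x}$, number of controls with $X = x$
o $\hat{p}_d(x) = n_{d,x} / n_d$, $\hat{p}_u(x) = n_{u,x} / n_u$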
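$$\hat{\theta}(X) = \frac{1}{2} + \frac{1}{4} \sum_{x \in \Pi_X} \left| \hat{p}_d(x) - \hat{p}_u(x) \right|$$
This is the naïve training prediction rate based on $\Pi_X$.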
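Why the naïve estimate is too optimistic can be seen on a toy example. The sketch below is a minimal R illustration, assuming the outcome is coded 1 for cases and 0 for controls; the function name `plugin_rate` and the toy data are hypothetical. It computes the plug-in rate, one half plus a quarter of the summed absolute differences in cell frequencies, for a coarse and a very fine partition of the same pure-noise data.

```r
# Plug-in estimate 1/2 + 1/4 * sum |n_dx / n_d - n_ux / n_u| over partition cells.
# `y` is 1 for cases and 0 for controls; `cells` labels each subject's cell.
plugin_rate <- function(y, cells) {
  tab <- table(cells, factor(y, levels = c(0, 1)))
  p_u <- tab[, "0"] / sum(tab[, "0"])  # cell frequencies among controls
  p_d <- tab[, "1"] / sum(tab[, "1"])  # cell frequencies among cases
  0.5 + 0.25 * sum(abs(p_d - p_u))
}

# Hypothetical toy data: 4 cases, 4 controls, partition unrelated to y.
y      <- c(1, 1, 1, 1, 0, 0, 0, 0)
coarse <- c("a", "a", "b", "b", "a", "a", "b", "b")  # 2 cells
fine   <- c("a", "b", "c", "d", "e", "f", "g", "h")  # 8 cells, one subject each

rate_coarse <- plugin_rate(y, coarse)
rate_fine   <- plugin_rate(y, fine)
print(c(coarse = rate_coarse, fine = rate_fine))
```

With one subject per cell the plug-in rate reaches 1 even though the partition carries no information, which is the overfitting a better sample measure has to avoid.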
A better sample measure for predictivity?
Chernoff, Herman, Shaw-Hwa Lo, and Tian Zheng. “Discovering influential variables: a method of partitions.” The Annals of Applied Statistics (2009): 1335-1369.
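$$
\begin{aligned}
I_{\Pi} &= \sum_{j \in \Pi} n_j^2 \left( \bar{Y}_j - \bar{Y} \right)^2 \\
&= \sum_{i=1}^{3^m} (n_{d,i} + n_{u,i})^2 \left( \frac{n_{d,i}}{n_{d,i} + n_{u,i}} - \frac{n_d}{n_d + n_u} \right)^2 \\
&= \left( \frac{n_d n_u}{n_d + n_u} \right)^2 \sum_{i=1}^{3^m} \left( \frac{n_{d,i}}{n_d} - \frac{n_{u,i}}{n_u} \right)^2 .
\end{aligned}
$$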
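The equality of the first and last forms of the I-score can be checked numerically. The following is a minimal R sketch, assuming the outcome is coded 1 for cases and 0 for controls; the function names `i_score` and `i_score_counts` and the toy data are hypothetical.

```r
# First form: sum over cells of n_j^2 * (cell mean of Y - overall mean of Y)^2.
i_score <- function(y, cells) {
  n_j    <- tapply(y, cells, length)
  ybar_j <- tapply(y, cells, mean)
  sum(n_j^2 * (ybar_j - mean(y))^2)
}

# Last form: scaled squared differences of case and control cell frequencies.
i_score_counts <- function(y, cells) {
  n_dj <- tapply(y == 1, cells, sum)  # cases per cell
  n_uj <- tapply(y == 0, cells, sum)  # controls per cell
  n_d  <- sum(y == 1)
  n_u  <- sum(y == 0)
  (n_d * n_u / (n_d + n_u))^2 * sum((n_dj / n_d - n_uj / n_u)^2)
}

# Hypothetical toy data: 5 cases and 5 controls in 3 cells.
y     <- c(1, 1, 1, 0, 1, 0, 0, 0, 1, 0)
cells <- c("a", "a", "a", "a", "b", "b", "b", "b", "c", "c")

score_means  <- i_score(y, cells)
score_counts <- i_score_counts(y, cells)
print(c(means = score_means, counts = score_counts))
```

On this toy data both forms give 2: cells "a" and "b" each contribute 16 × 0.0625 = 1, and the balanced cell "c" contributes nothing.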
A lower bound for predictivity
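$$\theta(X) = \frac{1}{2} + \frac{1}{4} \sum_{x \in \Pi_X} \left| p_d(x) - p_u(x) \right|$$
is bounded below by an expression of the form 1/2 + 1/4 × (a term involving a limit of the I-score); the exact expression is given in Lo et al. (2016).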
Lo, A., Chernoff, H., Zheng, T., & Lo, S. H. (2016). Framework for making better predictions by directly estimating variables’ predictivity. Proceedings of the National Academy of Sciences, 113(50), 14277-14282.
Example 1: Variable sets as partitions
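Scenario 1: $X_1$ and $X_2$ are independent with
• $P(X_i = 1) = P(X_i = -1) = 1/2$,
• $Y = X_1 X_2$.

            X2 = 1     X2 = -1
X1 = 1      Y = 1      Y = -1
X1 = -1     Y = -1     Y = 1

Overall mean of Y = 0

Scenario 2: $X_1$, $X_2$, $X_3$ are independent with
• $P(X_i = 1) = P(X_i = -1) = 1/2$,
• $Y = X_1 X_2$.

X3 = 1:
            X2 = 1     X2 = -1
X1 = 1      Y = 1      Y = -1
X1 = -1     Y = -1     Y = 1

X3 = -1:
            X2 = 1     X2 = -1
X1 = 1      Y = 1      Y = -1
X1 = -1     Y = -1     Y = 1

Overall mean of Y = 0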
A data set of 50 observations
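nn.use <- 50                        # sample size, per the slide title
x1 = 1 - 2*rbinom(nn.use, 1, 0.5)   # each x is -1 or 1 with probability 1/2
x2 = 1 - 2*rbinom(nn.use, 1, 0.5)
x3 = 1 - 2*rbinom(nn.use, 1, 0.5)   # x3 does not enter the outcome
yy = x1*x2 + rnorm(nn.use, 0, 1)    # signal x1*x2 plus N(0, 1) noise
yy = 1*(yy > 0)                     # dichotomize the outcome

> ftable(yy, x1, x2, x3)
         x3 -1  1
yy x1 x2
0  -1 -1     1  1
       1     2  8
    1 -1     3  4
       1     3  1
1  -1 -1     8  2
       1     1  0
    1 -1     0  2
       1     7  7

yy
 0  1
22 28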
Influence on Y
                      X3 = -1                   X3 = 1
X1 = -1, X2 = -1      Y=0: 1/22, Y=1: 8/28      Y=0: 1/22, Y=1: 2/28
X1 = -1, X2 =  1      Y=0: 2/22, Y=1: 1/28      Y=0: 8/22, Y=1: 0/28
X1 =  1, X2 = -1      Y=0: 3/22, Y=1: 0/28      Y=0: 4/22, Y=1: 2/28
X1 =  1, X2 =  1      Y=0: 3/22, Y=1: 7/28      Y=0: 1/22, Y=1: 7/28
[Figure, panel A: true Bayes rate versus variable module size (k, from 1 to 10), for modules with and without influential Xs; variables x1 to x10 are marked as influential or noise.]
[Figure, panel B: three plots against the size of the variable module (k).]
• PR's I score: distributions of estimated prediction using PR's I, reflecting different rates between scenarios (1) and (2), with the largest difference at k = 5.
• Chi-square test statistic: due to the small sample size, the chi-square test does not have power for detecting influential Xs.
• Training rate: due to the small sample size, the empirical training rate does not reflect the true prediction rate.
Simulation studies
A six‐gene network
Multiplicative odds ratio
[Figure, top row: prediction rate for 250 cases and 250 controls, 500 cases and 500 controls, and 1000 cases and 1000 controls. Legend: Ref: theoretical Bayes rate; Ref: out-of-sample prediction rate; lower bound based on I score.]
[Figure, bottom row: prediction rate by variable set, for sets built from x1, x2, x3 and extended with x7 through x12. Legend: Ref: theoretical Bayes rate; Ref: out-of-sample prediction rate; training set prediction rate.]
[Figure, top row: prediction rate for 250 cases and 250 controls, 500 cases and 500 controls, and 1000 cases and 1000 controls. Legend: Ref: theoretical Bayes rate; Ref: out-of-sample prediction rate; lower bound based on I score.]
[Figure, bottom row: prediction rate by variable set, for the set x1 to x6 extended with x7 through x12. Legend: Ref: theoretical Bayes rate; Ref: out-of-sample prediction rate; training set prediction rate.]
Application to genetic studies
Backward dropping algorithm (BDA)
a candidate set of k variables
Returned influential set
(possibly empty)
• A greedy backward screening based on I-score.
• At each step, the variable that leads to the most gain in I-score is removed.
• Stops when there is no more gain.
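The three steps above can be sketched in R. This is a minimal illustration under stated assumptions, not the published implementation: it uses the unnormalized I-score (sum over cells of n_j^2 times the squared deviation of the cell mean from the overall mean), stops as soon as no single removal raises the score, and does not apply any threshold that would return an empty set. The function names `i_score` and `backward_drop` and the toy data are hypothetical.

```r
# I-score of the partition induced by the columns `vars` of data frame `x`.
i_score <- function(y, x, vars) {
  cells  <- interaction(x[, vars, drop = FALSE], drop = TRUE)
  n_j    <- tapply(y, cells, length)
  ybar_j <- tapply(y, cells, mean)
  sum(n_j^2 * (ybar_j - mean(y))^2)
}

# Greedy backward dropping: remove the variable whose removal raises the
# I-score the most; stop when no removal gives a gain.
backward_drop <- function(y, x, vars) {
  current <- i_score(y, x, vars)
  while (length(vars) > 1) {
    scores <- sapply(vars, function(v) i_score(y, x, setdiff(vars, v)))
    if (max(scores) <= current) break
    vars    <- setdiff(vars, vars[which.max(scores)])
    current <- max(scores)
  }
  list(vars = vars, score = current)
}

# Hypothetical noise-free data: y depends on x1 * x2 only; x3 is irrelevant.
x <- expand.grid(x1 = c(-1, 1), x2 = c(-1, 1), x3 = c(-1, 1))
x <- x[rep(seq_len(nrow(x)), times = 5), ]  # 5 replicates of each pattern
y <- as.numeric(x$x1 * x$x2 > 0)

result <- backward_drop(y, x, c("x1", "x2", "x3"))
print(result)
```

On this balanced toy design the score rises from 50 to 100 when x3 is dropped, and dropping either x1 or x2 afterwards would reduce it to 0, so the search stops at the pair {x1, x2}.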
Example: Application to rheumatoid arthritis (RA)
• Rheumatoid arthritis (RA) is a heterogeneous disease that exhibits a complex genetic component.
• We studied 349 controls and 474 cases with genotypes on 5407 SNPs throughout the genome.
• We used a two-stage screening for this data set.
  o First stage: use standard BGTA screening and select approximately the top 20% of important markers.
  o Second stage: further screening to identify important marker clusters.
• Significant markers were selected based on FDR estimated using permutations.
• We identified 39 loci that showed strong association with RA, of which about 2/3 were found in the RA literature, and constructed an association network among them using association scores.
Relation to big data prediction
Feature selection
• Model-free evaluation of joint influence from multiple x variables on Y
Feature generation
• Jointly selected variable sets suggest interactions among the x variables.
• Interaction terms within each selected VS can be viewed as a feature in a predictor at the construction stage.
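One way to read the feature-generation point is that a selected variable set becomes a single categorical feature whose levels are the cells of its partition. The sketch below is a hypothetical R illustration of that reading, not the construction used in the cited work; the function name `cell_feature` and the toy data are assumptions.

```r
# Turn a selected variable set into one categorical feature whose levels are
# the cells of the partition, i.e. the joint patterns of the variables.
cell_feature <- function(x, vars) {
  interaction(x[, vars, drop = FALSE], drop = TRUE)
}

# Hypothetical toy data with three -1/1 variables.
x <- data.frame(x1 = c(-1, -1, 1, 1),
                x2 = c(-1, 1, -1, 1),
                x3 = c(1, 1, -1, -1))

f <- cell_feature(x, c("x1", "x2"))  # feature built from the set {x1, x2}
print(levels(f))

# One-hot design matrix that a downstream classifier could consume.
design <- model.matrix(~ f - 1)
print(dim(design))
```

Encoding the whole cell pattern, rather than the variables one at a time, lets a downstream predictor use the joint effect that the variable set was selected for.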
Predictive learning workflow
raw data
→ Prediction-oriented variable set generation (VSG)
→ Predictive modeling (PM)
→ Outcome prediction
Predictive Gene Set identified for breast cancer (Wang et al 2012)
Wang, H., Lo, S. H., Zheng, T., & Hu, I. (2012). Interaction-based feature selection and classification for high-dimensional biological data. Bioinformatics, 28(21), 2834-2842.
van’t Veer et al (2002) data set
Conclusion
• Significance does not automatically mean high predictivity.
• Predictivity of variable sets can be treated as a parameter of interest.
• We propose a potential sample lower bound for predictivity.
• A better measure of predictivity can lead to more reliable findings, especially for interactions.
Thanks!