Large-Scale Causal Inference of Gene Regulatory Relationships
Ioan Gabriel Bucur, Tom Claassen, Tom Heskes
Institute for Computing and Information Sciences, Radboud University Nijmegen, Nijmegen, the Netherlands
Motivation
- Gene regulatory networks (GRNs) play a crucial role in controlling an organism’s biological processes.
- If we knew the causal structure of a GRN, we could intervene in the developmental process of the organism.
- We wish to find reliable causal regulatory relationships among a very large number of candidate gene pairs.
Idea
- We look for local causal patterns of the form Lk → Ti → Tj, where Lk is genetic marker k, Ti is the expression of gene i, and Tj is the expression of gene j.
- For triplets, there are five canonical patterns to consider:

Independence model     Name          Zeros in covariance matrix    Zeros in precision matrix
X1 ⊥̸⊥ X2 ⊥̸⊥ X3        Full          none                          none
X1 ⊥⊥ X3               Acausal       (1,3), (3,1)                  none
X1 ⊥⊥ X3 | X2          Causal        none                          (1,3), (3,1)
(X1,X3) ⊥⊥ X2          Independent   (1,2), (2,1), (2,3), (3,2)    (1,2), (2,1), (2,3), (3,2)
X1 ⊥⊥ X2 ⊥⊥ X3         Empty         all off-diagonal entries      all off-diagonal entries

[The PAG diagram of the Markov equivalence class for each pattern is not reproduced.]
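The difference between the ‘Acausal’ and ‘Causal’ patterns can be checked numerically. The sketch below is only an illustration, not part of the poster: it assumes a linear-Gaussian chain X1 → X2 → X3 with unit noise variances and hypothetical edge weights `a` and `b`. In this chain X1 ⊥⊥ X3 | X2 holds, so the (1,3) entry of the precision matrix vanishes while the corresponding covariance entry does not.

```python
import numpy as np

# Hypothetical edge weights for the chain X1 -> X2 -> X3 (illustrative values).
a, b = 0.8, 0.5

# Structural coefficients: row = child, column = parent.
B = np.array([[0.0, 0.0, 0.0],
              [a,   0.0, 0.0],
              [0.0, b,   0.0]])

# With unit-variance independent noise, X = (I - B)^{-1} e,
# so the covariance is A A^T with A = (I - B)^{-1}.
A = np.linalg.inv(np.eye(3) - B)
cov = A @ A.T
precision = np.linalg.inv(cov)

print("covariance:\n", np.round(cov, 3))
print("precision:\n", np.round(precision, 3))
```

The covariance between X1 and X3 equals `a * b`, while the matching precision entry is zero up to floating-point error, which is the zero pattern of the ‘Causal’ row.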
Bayes Factors of Covariance Structures
- We assume a (latent) linear-Gaussian model on triplets, which is completely characterized by the covariance matrix Σ.
- We put an inverse Wishart prior on the covariance matrix.
- To facilitate computation, we derive the Bayes factors of each independence model relative to a reference model ($X_1 \not\perp\!\!\!\perp X_2 \not\perp\!\!\!\perp X_3$):

$$
\begin{aligned}
B(X_1 \perp\!\!\!\perp X_2 \perp\!\!\!\perp X_3) &= f(n,\nu)\, g(n,\nu)\, |C|^{\frac{n+\nu}{2}} \\
B(X_3 \perp\!\!\!\perp (X_1, X_2)) &= f(n,\nu) \left( \frac{|C|}{1 - C_{12}^2} \right)^{\frac{n+\nu}{2}} \\
B(X_1 \perp\!\!\!\perp X_2 \mid X_3) &= g(n,\nu) \left( \frac{|C|}{(1 - C_{13}^2)(1 - C_{23}^2)} \right)^{\frac{n+\nu}{2}} \\
B(X_1 \perp\!\!\!\perp X_2) &= \frac{f(n,\nu)}{g(n,\nu)} \left( 1 - C_{12}^2 \right)^{\frac{n+\nu-1}{2}},
\end{aligned}
$$

where $f(n,\nu) = \frac{n+\nu-2}{\nu-2}$, $g(n,\nu) \approx \left( \frac{2n+2\nu-3}{2\nu-3} \right)^{\frac{1}{2}}$, $C$ is the sample correlation matrix, $n$ is the number of observations, and $\nu$ is the number of degrees of freedom for the inverse Wishart prior.
- By defining simple priors and by using background knowledge, we get the posterior probabilities of local causal structures.
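As a minimal sketch of how the four Bayes factors could be evaluated for one triplet, the following Python function takes a 3×3 sample correlation matrix `C`, the sample size `n` and the prior degrees of freedom `nu`. The function name, the dictionary keys and the example value `nu=4` are assumptions made for illustration; the poster does not specify an implementation or a value for ν, and this sketch uses the approximate form of g(n, ν) as an equality.

```python
import numpy as np

def f(n, nu):
    # f(n, nu) = (n + nu - 2) / (nu - 2)
    return (n + nu - 2) / (nu - 2)

def g(n, nu):
    # Approximation stated on the poster: ((2n + 2nu - 3) / (2nu - 3))^(1/2)
    return np.sqrt((2 * n + 2 * nu - 3) / (2 * nu - 3))

def bayes_factors(C, n, nu):
    """Bayes factors of four independence models relative to the full model.

    C is the 3x3 sample correlation matrix of (X1, X2, X3).
    For very large n, working with logarithms would avoid underflow;
    this sketch keeps the direct form for readability.
    """
    if nu <= 2:
        raise ValueError("nu must exceed 2 so that f(n, nu) is defined")
    C = np.asarray(C, dtype=float)
    det = np.linalg.det(C)
    c12, c13, c23 = C[0, 1], C[0, 2], C[1, 2]
    half = (n + nu) / 2
    return {
        # X1, X2, X3 mutually independent
        "empty": f(n, nu) * g(n, nu) * det ** half,
        # X3 independent of (X1, X2)
        "x3_indep_x1x2": f(n, nu) * (det / (1 - c12 ** 2)) ** half,
        # X1 independent of X2 given X3
        "x1_indep_x2_given_x3": g(n, nu)
        * (det / ((1 - c13 ** 2) * (1 - c23 ** 2))) ** half,
        # X1 marginally independent of X2
        "x1_indep_x2": f(n, nu) / g(n, nu)
        * (1 - c12 ** 2) ** ((n + nu - 1) / 2),
    }

# Example: correlations consistent with X1 and X2 being independent given X3.
C_example = [[1.0, 0.64, 0.8],
             [0.64, 1.0, 0.8],
             [0.8, 0.8, 1.0]]
print(bayes_factors(C_example, n=100, nu=4))
```

In the example, C12 = C13 · C23, so the determinant ratio in the conditional-independence factor is 1 and that Bayes factor reduces to g(n, ν), while the other three factors are driven toward zero by the strong correlations.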
Ranking Regulatory Relationships with BFCS
Algorithm 1: Running BFCS on a data set from an experiment on yeast
Input: Yeast data set (3244 markers, 6216 gene expression measurements)
for all expression traits Ti do
    for all expression traits Tj, j ≠ i do
        for all genetic markers Lk do
            Compute the Bayes factors for the triplet (Lk, Ti, Tj)
            Derive the posterior probability of the structure Lk → Ti → Tj
        end for
        Save max_k p(Lk → Ti → Tj) as the probability of i regulating j
    end for
end for
Output: Matrix of regulation probabilities
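The loop structure of Algorithm 1 can be sketched in Python as follows. This is only an outline under stated assumptions: `rank_regulators` and `posterior_fn` are hypothetical names, and `posterior_fn` stands in for the step that turns the Bayes factors and priors of a triplet into the posterior probability of Lk → Ti → Tj, which the poster does not spell out.

```python
import numpy as np

def rank_regulators(markers, traits, posterior_fn):
    """Return a matrix P where P[i, j] is the probability that gene i regulates gene j.

    markers: array of shape (n_samples, n_markers)
    traits:  array of shape (n_samples, n_traits)
    posterior_fn(l, t_i, t_j): hypothetical callable returning the posterior
        probability of the structure L -> T_i -> T_j for one triplet.
    """
    n_markers = markers.shape[1]
    n_traits = traits.shape[1]
    probs = np.zeros((n_traits, n_traits))
    for i in range(n_traits):
        for j in range(n_traits):
            if i == j:
                continue  # a gene is not paired with itself
            # Keep the best-supporting marker for the pair (i, j).
            probs[i, j] = max(
                posterior_fn(markers[:, k], traits[:, i], traits[:, j])
                for k in range(n_markers)
            )
    return probs
```

Each (i, j) entry depends only on its own triplets, so the two outer loops can be distributed across workers without any shared state.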
Rank  Gene    Chen et al.  trigger  BFCS
1     MDM35   0.973        0.999    0.678
2     CBP6    0.968        0.997    0.683
3     QRI5    0.960        0.985    0.678
4     RSM18   0.959        0.984    0.672
5     RSM7    0.953        0.977    0.684
6     MRPL11  0.924        0.999    0.670
(a) Genes regulated by NAM9, sorted by ‘Chen et al.’.

Rank  Gene    Chen et al.  trigger  BFCS
1     FMP39   0.176        0.401    0.691
2     DIA4    0.493        0.987    0.691
3     MRP4    0.099        0.260    0.691
4     MNP1    0.473        0.999    0.691
5     MRPS18  0.527        0.974    0.690
6     MTG2    0.000        0.000    0.690
(b) Genes regulated by NAM9, sorted by ‘BFCS’.
Table 1: The column ‘Chen et al.’ shows the original results of applying the Trigger algorithm to an experiment on
yeast, as reported in Chen et al. (2007). The ‘trigger’ column contains the probabilities we obtained when running
the algorithm from the Bioconductor trigger package (Chen et al., 2017) on the entire yeast data set with default
parameters. The column ‘BFCS’ contains the output of running Algorithm 1 on the yeast data set, for which we
took a uniform prior over directed MAGs (maximal ancestral graphs without undirected edges).
Performance on Simulated Data
[Figure 1(a) panels: ROC (sensitivity vs. 1 − specificity), PRC (precision vs. recall), and calibration (observed event percentage vs. bin midpoint average estimated percentage). Methods shown: trigger, BFCS DAG, BFCS DMAG, BFCS loclink, BGe.]
(a) We generated 100 (left column) and 1000 (right column) samples from a sparser GRN consisting of 51 regulatory relationships.
[Figure 1(b) panels: ROC (sensitivity vs. 1 − specificity), PRC (precision vs. recall), and calibration (observed event percentage vs. bin midpoint average estimated percentage), for the same five methods.]
(b) We generated 100 (left column) and 1000 (right column) samples from a denser GRN consisting of 247 regulatory relationships.
Figure 1: Evaluating the performance of BFCS in detecting (direct or indirect) causal regulatory relationships in
terms of ROC (top row), precision-recall (middle row), and calibration (bottom row). We ran Trigger three
times on the simulated data and averaged the results (‘trigger’) to account for differences when sampling the null
statistics. We report the results of three BFCS versions: two described in Algorithm 1 for which we take a uniform
prior over DAGs (‘BFCS DAG’) and DMAGs (‘BFCS DMAG’), respectively, and one (‘BFCS loclink’) in which we
use the Trigger local-linkage strategy for pre-selecting genetic markers. For reference, we also show the performance
of an equivalent method that uses the Bayesian Gaussian equivalent score (‘BGe’) of Geiger and Heckerman (1994).
Conclusions
- We propose a novel Bayesian approach for estimating the probability of local causal structures from observational data.
- Our method is simple, efficient, and inherently parallel, which makes it applicable to very large data sets.
- We use the posterior probabilities to obtain a well-calibrated ranking of the most meaningful causal regulatory relationships.
- The inferred causal relationships can then be used to (partially) reconstruct the underlying GRN structure.
References
Chen, L. S., F. Emmert-Streib, and J. D. Storey. 2007. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology, 8:R219.
Chen, L. S., D. P. Sangurdekar, and J. D. Storey. 2017. trigger.
Geiger, D. and D. Heckerman. 1994. Learning Gaussian Networks. UAI’94, pp. 235–243, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.