

Large-Scale Causal Inference of Gene Regulatory Relationships

Ioan Gabriel Bucur, Tom Claassen, Tom Heskes

Institute for Computing and Information Sciences, Radboud University Nijmegen, Nijmegen, the Netherlands

Motivation

- Gene regulatory networks (GRNs) play a crucial role in controlling an organism's biological processes.

- If we knew the causal structure of a GRN, we could intervene in the developmental process of the organism.

- We wish to find reliable causal regulatory relationships among a very large number of candidate gene pairs.

Idea

- We look for local causal patterns of the form L_k → T_i → T_j, where L_k is genetic marker k, T_i is the expression of gene i, and T_j is the expression of gene j.

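- For triplets, there are five canonical patterns to consider:

| Independence Model | Markov Equivalence Class (PAG) | Covariance Matrix | Precision Matrix |
| --- | --- | --- | --- |
| $X_1 \not\perp\!\!\!\perp X_2 \not\perp\!\!\!\perp X_3$ (Full) | [graph] | no zeros | no zeros |
| $X_1 \perp\!\!\!\perp X_3$ (Acausal) | [graph] | zeros at (1,3), (3,1) | no zeros |
| $X_1 \perp\!\!\!\perp X_3 \mid X_2$ (Causal) | [graph] | no zeros | zeros at (1,3), (3,1) |
| $(X_1, X_3) \perp\!\!\!\perp X_2$ (Independent) | [graph] | zeros at (1,2), (2,1), (2,3), (3,2) | zeros at (1,2), (2,1), (2,3), (3,2) |
| $X_1 \perp\!\!\!\perp X_2 \perp\!\!\!\perp X_3$ (Empty) | [graph] | zeros at all off-diagonal entries | zeros at all off-diagonal entries |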

Bayes Factors of Covariance Structures

- We assume a (latent) linear-Gaussian model on triplets, which is completely characterized by the covariance matrix Σ.

- We put an inverse Wishart prior on the covariance matrix.

- To facilitate computation, we derive the Bayes factors of each independence model relative to a reference model ($X_1 \not\perp\!\!\!\perp X_2 \not\perp\!\!\!\perp X_3$):

$$
\begin{aligned}
B(X_1 \perp\!\!\!\perp X_2 \perp\!\!\!\perp X_3) &= f(n,\nu)\, g(n,\nu)\, |C|^{\frac{n+\nu}{2}} \\
B(X_3 \perp\!\!\!\perp (X_1, X_2)) &= f(n,\nu) \left( \frac{|C|}{1 - C_{12}^2} \right)^{\frac{n+\nu}{2}} \\
B(X_1 \perp\!\!\!\perp X_2 \mid X_3) &= g(n,\nu) \left( \frac{|C|}{(1 - C_{13}^2)(1 - C_{23}^2)} \right)^{\frac{n+\nu}{2}} \\
B(X_1 \perp\!\!\!\perp X_2) &= \frac{f(n,\nu)}{g(n,\nu)} \left(1 - C_{12}^2\right)^{\frac{n+\nu-1}{2}},
\end{aligned}
$$

where $f(n,\nu) = \frac{n+\nu-2}{\nu-2}$, $g(n,\nu) \approx \left( \frac{2n+2\nu-3}{2\nu-3} \right)^{\frac{1}{2}}$, $C$ is the sample correlation matrix, $n$ is the number of observations, and $\nu$ is the number of degrees of freedom for the inverse Wishart prior.

- By defining simple priors and by using background knowledge, we get the posterior probabilities of local causal structures.
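As a rough illustration of how the Bayes factor expressions above could be evaluated, the following Python sketch computes them from the three off-diagonal entries of a 3×3 sample correlation matrix and then normalizes prior-weighted Bayes factors into posterior probabilities. The function names, the dictionary keys, and the uniform prior are assumptions made for this example only; the poster does not specify an implementation, and a full treatment would also cover the same patterns under permutations of the three variables.

```python
import math


def f(n, nu):
    """f(n, nu) = (n + nu - 2) / (nu - 2), as defined in the text."""
    return (n + nu - 2) / (nu - 2)


def g(n, nu):
    """g(n, nu) ~ sqrt((2n + 2nu - 3) / (2nu - 3)), the stated approximation."""
    return math.sqrt((2 * n + 2 * nu - 3) / (2 * nu - 3))


def bayes_factors(c12, c13, c23, n, nu):
    """Bayes factors of four independence models relative to the full model.

    c12, c13, c23: off-diagonal entries of the sample correlation matrix C.
    n: number of observations; nu: prior degrees of freedom (nu > 2).
    The dictionary keys are hypothetical labels chosen for this sketch.
    """
    # Determinant of a 3x3 correlation matrix with unit diagonal.
    det_c = 1 + 2 * c12 * c13 * c23 - c12**2 - c13**2 - c23**2
    half = (n + nu) / 2
    return {
        "full": 1.0,  # reference model
        "empty": f(n, nu) * g(n, nu) * det_c**half,
        "x3_indep_x1x2": f(n, nu) * (det_c / (1 - c12**2)) ** half,
        "x1_indep_x2_given_x3": g(n, nu)
        * (det_c / ((1 - c13**2) * (1 - c23**2))) ** half,
        "x1_indep_x2": f(n, nu) / g(n, nu)
        * (1 - c12**2) ** ((n + nu - 1) / 2),
    }


def posterior(bf, prior):
    """Normalize prior * Bayes factor over the models listed in `bf`."""
    weights = {model: prior[model] * bf[model] for model in bf}
    total = sum(weights.values())
    return {model: w / total for model, w in weights.items()}


# Example: X1 and X2 correlated, X3 uncorrelated with both.
bf = bayes_factors(c12=0.5, c13=0.0, c23=0.0, n=100, nu=4)
uniform = {model: 1 / len(bf) for model in bf}  # illustrative prior only
post = posterior(bf, uniform)
```

For large n and strong correlations the powers can underflow, so a practical version would probably work with log Bayes factors instead.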

Ranking Regulatory Relationships with BFCS

Algorithm 1: Running BFCS on a data set from an experiment on yeast

    Input: Yeast data set (3244 markers, 6216 gene expression measurements)
    for all expression traits T_i do
        for all expression traits T_j, j ≠ i do
            for all genetic markers L_k do
                Compute the Bayes factors for the triplet (L_k, T_i, T_j)
                Derive the posterior probability of the structure L_k → T_i → T_j
            end for
            Save max_k p(L_k → T_i → T_j) as the probability of i regulating j
        end for
    end for
    Output: Matrix of regulation probabilities
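The loop structure of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data layout (lists of columns), the helper `pearson`, and the callable `structure_posterior` are hypothetical names introduced here, and the toy posterior at the bottom is a placeholder rather than the BFCS posterior.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def regulation_probabilities(markers, traits, structure_posterior):
    """Triple loop of Algorithm 1.

    markers, traits: lists of columns, each column a list of n observations.
    structure_posterior(c12, c13, c23, n): hypothetical callable returning
    p(L_k -> T_i -> T_j) from the correlations of the triplet
    (X1, X2, X3) = (L_k, T_i, T_j).
    """
    n = len(traits[0])
    t = len(traits)
    probs = [[0.0] * t for _ in range(t)]
    for i in range(t):
        for j in range(t):
            if i == j:
                continue
            c_ij = pearson(traits[i], traits[j])
            best = 0.0
            for marker in markers:
                c_ki = pearson(marker, traits[i])
                c_kj = pearson(marker, traits[j])
                # Keep the maximum over markers, as in the "Save max_k" step.
                best = max(best, structure_posterior(c_ki, c_kj, c_ij, n))
            probs[i][j] = best
    return probs


# Toy data and a placeholder posterior, for illustration only.
markers = [[0, 1, 0, 1, 1, 0]]
traits = [
    [0.1, 0.9, 0.2, 1.1, 0.8, 0.0],
    [0.3, 1.0, 0.1, 0.9, 1.2, 0.2],
    [0.5, 0.4, 0.6, 0.5, 0.3, 0.7],
]
probs = regulation_probabilities(
    markers, traits, lambda c12, c13, c23, n: abs(c12 * c23)
)
```

Each (i, j) pair is processed independently of the others, so the two outer loops could be distributed across workers without coordination.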

| Rank | Gene | Chen et al. | trigger | BFCS |
| --- | --- | --- | --- | --- |
| 1 | MDM35 | 0.973 | 0.999 | 0.678 |
| 2 | CBP6 | 0.968 | 0.997 | 0.683 |
| 3 | QRI5 | 0.960 | 0.985 | 0.678 |
| 4 | RSM18 | 0.959 | 0.984 | 0.672 |
| 5 | RSM7 | 0.953 | 0.977 | 0.684 |
| 6 | MRPL11 | 0.924 | 0.999 | 0.670 |

(a) Genes regulated by NAM9, sorted by ‘Chen et al.’.

| Rank | Gene | Chen et al. | trigger | BFCS |
| --- | --- | --- | --- | --- |
| 1 | FMP39 | 0.176 | 0.401 | 0.691 |
| 2 | DIA4 | 0.493 | 0.987 | 0.691 |
| 3 | MRP4 | 0.099 | 0.260 | 0.691 |
| 4 | MNP1 | 0.473 | 0.999 | 0.691 |
| 5 | MRPS18 | 0.527 | 0.974 | 0.690 |
| 6 | MTG2 | 0.000 | 0.000 | 0.690 |

(b) Genes regulated by NAM9, sorted by ‘BFCS’.

Table 1: The column ‘Chen et al.’ shows the original results of applying the Trigger algorithm to an experiment on yeast, as reported in Chen et al. (2007). The ‘trigger’ column contains the probabilities we obtained when running the algorithm from the Bioconductor trigger package (Chen et al., 2017) on the entire yeast data set with default parameters. The column ‘BFCS’ contains the output of running Algorithm 1 on the yeast data set, for which we took a uniform prior over directed MAGs (maximal ancestral graphs without undirected edges).

Performance on Simulated Data

[Figure 1, panels (a) and (b). Each panel is a grid of plots with three rows and two columns. Rows: ROC (Sensitivity vs. 1 − Specificity), PRC (Precision vs. Recall), and Calibration (Observed Event Percentage vs. Bin Midpoint Average Estimated Percentage). Legend: trigger, BFCS DAG, BFCS DMAG, BFCS loclink, BGe.]

(a) We generated 100 (left column) and 1000 (right column) samples from a sparser GRN consisting of 51 regulatory relationships.

(b) We generated 100 (left column) and 1000 (right column) samples from a denser GRN consisting of 247 regulatory relationships.

Figure 1: Evaluating the performance of BFCS in detecting (direct or indirect) causal regulatory relationships in terms of ROC (top row), precision-recall (middle row), and calibration (bottom row). We ran Trigger three times on the simulated data and averaged the results (‘trigger’) to account for differences when sampling the null statistics. We report the results of three BFCS versions: two described in Algorithm 1, for which we take a uniform prior over DAGs (‘BFCS DAG’) and DMAGs (‘BFCS DMAG’), respectively, and one (‘BFCS loclink’) in which we use the Trigger local-linkage strategy for pre-selecting genetic markers. For reference, we also show the performance of an equivalent method that uses the Bayesian Gaussian equivalent score (‘BGe’) of Geiger and Heckerman (1994).
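To make the calibration row of Figure 1 concrete, here is a small Python sketch of one common way to build such a curve: predictions are grouped into equal-width probability bins, and each bin's midpoint is paired with the observed fraction of true regulatory relationships in that bin. The function name, the binning scheme, and the toy inputs are assumptions for illustration; the poster does not state how its calibration plots were computed.

```python
def calibration_curve(probs, labels, n_bins=4):
    """Pair each non-empty bin's midpoint with its observed event rate.

    probs: predicted probabilities in [0, 1].
    labels: 1 if the relationship is truly present, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        # A probability of exactly 1.0 falls into the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append(y)
    curve = []
    for b, ys in enumerate(bins):
        if ys:
            midpoint = (b + 0.5) / n_bins
            curve.append((midpoint, sum(ys) / len(ys)))
    return curve


# Toy example with two bins: low scores are all negatives, high scores all positives.
curve = calibration_curve([0.1, 0.2, 0.9, 0.8], [0, 0, 1, 1], n_bins=2)
# curve == [(0.25, 0.0), (0.75, 1.0)]
```

A well-calibrated method produces points close to the diagonal, meaning the observed event rate in each bin matches the bin's estimated probability.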

Conclusions

- We propose a novel Bayesian approach for estimating the probability of local causal structures from observational data.

- Our method is simple, efficient, and inherently parallel, which makes it applicable to very large data sets.

- We use the posterior probabilities to obtain a well-calibrated ranking of the most meaningful causal regulatory relationships.

- The inferred causal relationships can then be used to (partially) reconstruct the underlying GRN structure.

Chen, L. S., F. Emmert-Streib, and J. D. Storey. 2007. Harnessing naturally randomized transcription to infer regulatory relationships among genes. Genome Biology, 8:R219.

Chen, L. S., D. P. Sangurdekar, and J. D. Storey. 2017. trigger.

Geiger, D. and D. Heckerman. 1994. Learning Gaussian Networks. UAI'94, pp. 235–243, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc.