Reverse engineering gene regulatory networks
Dirk Husmeier
Adriano Werhli
Marco Grzegorczyk
Systems biology
Learning signalling pathways and regulatory networks from
postgenomic data
[Figure: unknown network → high-throughput experiments → postgenomic data → machine learning, statistical methods → extracted network, to be compared with the true network]
Does the extracted network provide a good prediction of the true interactions?
Reverse Engineering of Regulatory Networks
• Can we learn the network structure from postgenomic data themselves?
• Statistical methods to distinguish between
  – Direct interactions
  – Indirect interactions
• Challenge: Distinguish between
  – Correlations
  – Causal interactions
• Breaking symmetries with active interventions:
  – Gene knockouts (VIGS, RNAi)
[Figure: direct interaction; common regulator (co-regulation); indirect interaction]
• Relevance networks
• Graphical Gaussian models
• Bayesian networks
Relevance networks (Butte and Kohane, 2000)
1. Choose a measure of association A(.,.)
2. Define a threshold value tA
3. For all pairs of domain variables (X,Y) compute their association A(X,Y)
4. Connect those variables (X,Y) by an undirected edge whose association A(X,Y) exceeds the predefined threshold value tA
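The four steps above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the absolute Pearson correlation as the association measure A(.,.), and the function name `relevance_network`, the toy data and the threshold value 0.5 are all hypothetical.

```python
import numpy as np

def relevance_network(data, threshold):
    """Return undirected edges (i, j) whose association exceeds the threshold.

    data: array of shape (n_observations, n_variables).
    """
    # Step 1: association measure A(X, Y). Absolute Pearson correlation is an
    # assumption made for this sketch; any pairwise measure could be used.
    assoc = np.abs(np.corrcoef(data, rowvar=False))
    n_vars = assoc.shape[0]
    edges = []
    # Steps 3 and 4: score every pair and keep those above the threshold t_A.
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            if assoc[i, j] > threshold:
                edges.append((i, j))
    return edges

# Toy data: variable 1 is a noisy copy of variable 0, variable 2 is independent.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = x + 0.1 * rng.normal(size=500)
z = rng.normal(size=500)
data = np.column_stack([x, y, z])

edges = relevance_network(data, threshold=0.5)  # step 2: t_A = 0.5
print(edges)
```

Because every pair is scored in isolation, the sketch cannot tell whether an edge reflects a direct interaction or a shared regulator.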
Association scores
[Figure: three configurations of the nodes 1, 2 and X ('direct interaction', 'common regulator', 'indirect interaction'), each giving a strong correlation σ12 between nodes 1 and 2]
Pairwise associations without taking the context of the system into consideration
• Relevance networks
• Graphical Gaussian models
• Bayesian networks
Graphical Gaussian Models
\pi_{ij} = -\frac{(\Sigma^{-1})_{ij}}{\sqrt{(\Sigma^{-1})_{ii}\,(\Sigma^{-1})_{jj}}}
[Figure: nodes 1 and 2 linked by a direct interaction, with strong partial correlation π12]
Partial correlation, i.e. correlation conditional on all other domain variables: Corr(X1,X2|X3,…,Xn)
[Figure: direct interaction; common regulator (co-regulation); indirect interaction]
Distinguish between direct and indirect interactions
A and B have a low partial correlation
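A small simulation can make this concrete. The sketch below assumes a hypothetical linear Gaussian chain A → C → B (the intermediate node C and all coefficients are made up for illustration) and computes partial correlations from the inverse of the empirical covariance matrix.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlation matrix from the inverse empirical covariance."""
    cov = np.cov(data, rowvar=False)
    prec = np.linalg.inv(cov)              # inverse covariance (precision) matrix
    d = np.sqrt(np.diag(prec))
    pcorr = -prec / np.outer(d, d)         # pi_ij = -prec_ij / sqrt(prec_ii * prec_jj)
    np.fill_diagonal(pcorr, 1.0)
    return pcorr

# Hypothetical indirect interaction: A -> C -> B.
rng = np.random.default_rng(1)
n = 2000
a = rng.normal(size=n)
c = a + 0.5 * rng.normal(size=n)
b = c + 0.5 * rng.normal(size=n)
data = np.column_stack([a, b, c])          # column order: A, B, C

corr = np.corrcoef(data, rowvar=False)
pcorr = partial_correlations(data)
print("correlation(A, B):        ", round(corr[0, 1], 2))
print("partial correlation(A, B):", round(pcorr[0, 1], 2))
```

The plain correlation between A and B is strong, while the partial correlation given C is close to zero, which is why a graphical Gaussian model would not draw the A–B edge.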
Graphical Gaussian Models
Problem: #observations < #variables, so the empirical covariance matrix is singular and cannot be inverted directly.
Shrinkage estimation and the lemma of Ledoit-Wolf
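The idea of shrinkage can be illustrated as follows. This is only a sketch: it shrinks the empirical covariance towards its diagonal with a hand-picked intensity (the hypothetical parameter `intensity`), whereas the Ledoit-Wolf lemma provides an analytic choice of that intensity, which is not reproduced here.

```python
import numpy as np

def shrinkage_covariance(data, intensity):
    """Convex combination of the empirical covariance and its diagonal.

    intensity in (0, 1] is a hypothetical, hand-picked value in this sketch.
    """
    emp = np.cov(data, rowvar=False)
    target = np.diag(np.diag(emp))         # diagonal shrinkage target
    return (1.0 - intensity) * emp + intensity * target

rng = np.random.default_rng(2)
data = rng.normal(size=(5, 10))            # 5 observations, 10 variables

emp = np.cov(data, rowvar=False)
rank_emp = np.linalg.matrix_rank(emp)      # rank deficient: cannot be inverted
shrunk = shrinkage_covariance(data, intensity=0.2)

# The shrunk estimate is positive definite, so partial correlations exist.
prec = np.linalg.inv(shrunk)
d = np.sqrt(np.diag(prec))
pcorr = -prec / np.outer(d, d)
np.fill_diagonal(pcorr, 1.0)
print("rank of empirical covariance:", rank_emp, "of", emp.shape[0])
```

With a positive intensity the estimate is always invertible, at the cost of some bias towards the diagonal target.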
Graphical Gaussian Models
[Figure: direct interaction; common regulator; indirect interaction]
P(A,B)=P(A)·P(B)
But: P(A,B|C)≠P(A|C)·P(B|C)
Undirected versus directed edges
• Relevance networks and Graphical Gaussian models can only extract undirected edges.
• Bayesian networks can extract directed edges.
• But can we trust these edge directions? It may be better to learn undirected edges than directed edges with false orientations.
• Relevance networks
• Graphical Gaussian models
• Bayesian networks
Bayesian networks
[Figure: directed acyclic graph with NODES A, B, C, D, E, F connected by directed EDGES]
• Marriage between graph theory and probability theory.
• Directed acyclic graph (DAG) representing conditional independence relations.
• It is possible to score a network in light of the data: P(D|M), D: data, M: network structure.
• We can infer how well a particular network explains the observed data.
P(A,B,C,D,E,F) = P(A)·P(B|A)·P(C|A)·P(D|B,C)·P(E|D)·P(F|C,D)
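As a sanity check of the factorisation P(A)·P(B|A)·P(C|A)·P(D|B,C)·P(E|D)·P(F|C,D), the sketch below builds the joint distribution from local conditional probabilities. It assumes binary variables and randomly drawn, purely hypothetical conditional probability tables.

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)

# Parent sets read off the factorisation of the six-node example DAG.
parents = {"A": [], "B": ["A"], "C": ["A"],
           "D": ["B", "C"], "E": ["D"], "F": ["C", "D"]}
nodes = list(parents)

# Hypothetical tables: cpt[node][parent_values] = P(node = 1 | parents).
cpt = {
    node: {pv: rng.uniform(0.1, 0.9)
           for pv in itertools.product([0, 1], repeat=len(pa))}
    for node, pa in parents.items()
}

def joint_probability(assignment):
    """Joint probability as the product of the local conditional terms."""
    prob = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

# Summing over all 2^6 configurations must give 1.
total = sum(
    joint_probability(dict(zip(nodes, values)))
    for values in itertools.product([0, 1], repeat=len(nodes))
)
print(round(total, 10))
```

Each node only needs a table over its own parents, which is what keeps the representation compact compared with a full joint table.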
Bayesian networks versus causal networks
Bayesian networks represent conditional (in)dependence relations - not necessarily causal interactions.
[Figure: true causal graph over the nodes A, B and C, and the corresponding graph when node A is unknown]
Bayesian networks versus causal networks
• Equivalence classes: networks with the same scores: P(D|M).
• Equivalent networks cannot be distinguished in light of the data.
[Figure: networks over the nodes A, B and C]
Equivalence classes of BNs
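A → C → B:
P(A,B,C) = P(A)·P(C|A)·P(B|C) = P(A)·P(C,A)·P(A)⁻¹·P(B,C)·P(C)⁻¹ = P(A|C)·P(B,C) = P(A|C)·P(B|C)·P(C)
A ← C ← B:
P(A,B,C) = P(A|C)·P(B)·P(C|B) = P(C|B)·P(B)·P(C)⁻¹·P(A|C)·P(C) = P(A|C)·P(B|C)·P(C)
A ← C → B:
P(A,B,C) = P(A|C)·P(B|C)·P(C)
A → C ← B:
P(A,B,C) = P(A)·P(B)·P(C|A,B)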
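Completed partially directed acyclic graphs (CPDAGs)
[Figure: networks over the nodes A, B and C]
v-structure A → C ← B:
P(A,B)=P(A)·P(B)
P(A,B|C)≠P(A|C)·P(B|C)
All other networks with edges between A and C and between C and B (chains and common cause):
P(A,B)≠P(A)·P(B)
P(A,B|C)=P(A|C)·P(B|C)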
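These (in)dependence statements can be checked numerically. The sketch below assumes binary variables and hypothetical probability tables: it shows that a chain A → C → B yields exactly the same joint distribution as the common-cause network A ← C → B, whereas a v-structure A → C ← B gives marginal independence of A and B but dependence given C.

```python
import numpy as np

rng = np.random.default_rng(4)

def random_table(*shape):
    """Random probability table, normalised over the last axis."""
    t = rng.uniform(0.1, 1.0, size=shape)
    return t / t.sum(axis=-1, keepdims=True)

# Chain A -> C -> B; joint indexed as [a, b, c].
p_a = random_table(2)                       # P(A)
p_c_given_a = random_table(2, 2)            # P(C|A), indexed [a, c]
p_b_given_c = random_table(2, 2)            # P(B|C), indexed [c, b]
chain = np.einsum("a,ac,cb->abc", p_a, p_c_given_a, p_b_given_c)

# Refactorise the same joint as the common cause A <- C -> B.
p_c = chain.sum(axis=(0, 1))                # P(C)
p_a_given_c = chain.sum(axis=1) / p_c       # P(A|C), indexed [a, c]
p_b_given_c_cc = chain.sum(axis=0) / p_c    # P(B|C), indexed [b, c]
common_cause = np.einsum("ac,bc,c->abc", p_a_given_c, p_b_given_c_cc, p_c)
same_joint = np.allclose(chain, common_cause)

# v-structure A -> C <- B with a fixed, hypothetical table for P(C=1|A,B).
p_b = random_table(2)
p_c1 = np.array([[0.1, 0.8], [0.8, 0.9]])
p_c_given_ab = np.stack([1.0 - p_c1, p_c1], axis=-1)   # indexed [a, b, c]
v = np.einsum("a,b,abc->abc", p_a, p_b, p_c_given_ab)

marginally_independent = np.allclose(v.sum(axis=2), np.outer(p_a, p_b))
p_ab_given_c = v / v.sum(axis=(0, 1))       # P(A,B|C), indexed [a, b, c]
product = np.einsum("ac,bc->abc", p_ab_given_c.sum(axis=1), p_ab_given_c.sum(axis=0))
conditionally_independent = np.allclose(p_ab_given_c, product)

print(same_joint, marginally_independent, conditionally_independent)
```

Only the v-structure leaves a distinct statistical footprint, which is why its edge directions survive in the CPDAG.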
Symmetry breaking
• Interventions
• Prior knowledge
[Figure: networks over the nodes A, B and C]
Interventional data
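[Figure: A and B are correlated, which is compatible with both A → B and A ← B. Under inhibition of A, A → B predicts down-regulation of B, whereas A ← B predicts no effect on B.]
P(D \mid M) = \prod_{i=1}^{n} P\left(D^{X_i} \mid D^{pa[X_i]}\right)
P(D \mid M) = \prod_{i=1}^{n} P\left(D^{X_i}_{\{i\}} \mid D^{pa[X_i]}_{\{i\}}\right)
The second form is the modification for interventional data: each factor is restricted to the measurements in which X_i was not itself the target of an intervention.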
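The symmetry-breaking effect of an intervention can be reproduced in a toy simulation. The sketch assumes a hypothetical two-gene linear Gaussian model and an ideal intervention that clamps A to zero; all parameter values are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000

def simulate(a_causes_b, inhibit_a):
    """Draw n samples of (A, B) from a hypothetical two-gene model."""
    if a_causes_b:                          # A -> B
        a = np.zeros(n) if inhibit_a else rng.normal(1.0, 0.3, size=n)
        b = a + rng.normal(0.0, 0.3, size=n)
    else:                                   # A <- B
        b = rng.normal(1.0, 0.42, size=n)
        a = np.zeros(n) if inhibit_a else b + rng.normal(0.0, 0.3, size=n)
    return a, b

results = {}
for label, a_causes_b in [("A -> B", True), ("A <- B", False)]:
    a_obs, b_obs = simulate(a_causes_b, inhibit_a=False)
    _, b_int = simulate(a_causes_b, inhibit_a=True)
    results[label] = (
        np.corrcoef(a_obs, b_obs)[0, 1],    # observational correlation
        b_obs.mean(),                       # mean of B without intervention
        b_int.mean(),                       # mean of B under inhibition of A
    )

for label, (corr, mean_obs, mean_int) in results.items():
    print(f"{label}: corr={corr:.2f}, mean B={mean_obs:.2f}, "
          f"mean B under inhibition of A={mean_int:.2f}")
```

Both models produce a clear observational correlation, but only under A → B does the inhibition of A lower the level of B.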
Learning Bayesian networks from data
P(M|D) = P(D|M) P(M) / Z
M: Network structure. D: Data
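For a handful of candidate structures the posterior can be computed by direct normalisation. The sketch below uses made-up log marginal likelihoods and assumes a uniform prior P(M); in realistic settings Z is a sum over all possible network structures and cannot be enumerated like this.

```python
import numpy as np

# Hypothetical values of log P(D|M) for three candidate structures.
log_likelihood = {
    "A -> B": -102.3,
    "A <- B": -102.3,          # equivalent networks receive the same score
    "A  B (no edge)": -110.0,
}

# Uniform prior over the candidates (an assumption of this sketch).
log_prior = -np.log(len(log_likelihood))
log_joint = np.array([ll + log_prior for ll in log_likelihood.values()])

# log Z via the log-sum-exp trick for numerical stability.
shift = log_joint.max()
log_z = shift + np.log(np.exp(log_joint - shift).sum())

# P(M|D) = P(D|M) P(M) / Z
posterior = dict(zip(log_likelihood, np.exp(log_joint - log_z)))
for structure, prob in posterior.items():
    print(f"{structure}: {prob:.4f}")
```

Working with logarithms avoids underflow, since marginal likelihoods of real data sets are far too small to represent directly.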
Evaluation
• On real experimental data, using the gold standard network from the literature
• On synthetic data simulated from the gold-standard network
From Sachs et al., Science 2005
Evaluation: Raf signalling pathway
• Cellular signalling network of 11 phosphorylated proteins and phospholipids in human immune system cells
• Deregulation → carcinogenesis
• Extensively studied in the literature → gold-standard network
Raf regulatory network
From Sachs et al Science 2005
Flow cytometry data
• Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins
• 5400 cells have been measured under 9 different cellular conditions (cues)
• Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments
Two types of experiments
Comparison with simulated data 1
Raf pathway
Comparison with simulated data 2
Steady-state approximation
Real versus simulated data
• Real biological data: full complexity of biological systems.
• The “gold-standard” only represents our current state of knowledge; it is not guaranteed to represent the true network.
• Simulated data: Simplifications that might be biologically unrealistic.
• We know the true network.
How can we evaluate the reconstruction accuracy?
[Figure: the extracted network is compared with the true network, or with biological knowledge (gold-standard network), to evaluate the learning performance]
Performance evaluation: ROC curves
• We use the Area Under the Receiver Operating Characteristic Curve (AUC).
[Figure: ROC curves with AUC=0.5 (random expectation), 0.5<AUC<1, and AUC=1 (perfect prediction)]
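A minimal sketch of how an AUC score can be obtained from edge scores and a gold-standard network is given below. It uses the rank-based formulation (the fraction of true-edge/non-edge pairs that are ranked correctly); the function name `roc_auc` and the example scores are hypothetical.

```python
import numpy as np

def roc_auc(scores, is_true_edge):
    """Area under the ROC curve from edge scores and gold-standard labels."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_true_edge, dtype=bool)
    pos = scores[labels]                    # scores of true edges
    neg = scores[~labels]                   # scores of non-edges
    # Count pairs where the true edge outranks the non-edge; ties count 1/2.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

truth = [1, 1, 0, 0, 0, 0]                  # two true edges, four non-edges
perfect = roc_auc([0.9, 0.8, 0.3, 0.2, 0.1, 0.05], truth)
mixed = roc_auc([0.9, 0.2, 0.3, 0.25, 0.1, 0.05], truth)
print(perfect, mixed)                       # 1.0 0.75
```

The score does not depend on any particular threshold, which makes it convenient for comparing methods whose edge scores live on different scales.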
Alternative performance evaluation: True positive (TP) scores
We set the threshold such that we obtain 5 spurious edges (5 FPs) and count the corresponding number of true edges (TP count).
[Figure: TP counts at 5 FP counts for BN, GGM and RN]
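The TP count at a fixed number of false positives can be sketched as follows. The function walks down the ranked list of edges until the FP budget is used up; the name `tp_at_fixed_fp`, the handling of tied scores (ignored here) and the example data are assumptions of this illustration.

```python
import numpy as np

def tp_at_fixed_fp(scores, is_true_edge, n_fp=5):
    """Number of true edges recovered when n_fp spurious edges are admitted."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_true_edge, dtype=bool)
    order = np.argsort(-scores, kind="stable")   # highest-scoring edges first
    tp = fp = 0
    for idx in order:
        if labels[idx]:
            tp += 1
        elif fp == n_fp:
            break                                # next spurious edge exceeds the budget
        else:
            fp += 1
    return tp

scores = np.linspace(1.0, 0.1, 10)               # already sorted, for readability
truth = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]
tp_count = tp_at_fixed_fp(scores, truth, n_fp=5)
print(tp_count)                                  # 3
```

Unlike the AUC, this score focuses on the top of the ranking, where predictions are most likely to be followed up experimentally.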
Directed graph evaluation - DGE
[Figure: edge scores learned from the data are thresholded to give concrete network predictions, which are compared with the true regulatory network. High threshold: TP 1/2, FP 0/4. Low threshold: TP 2/2, FP 1/4.]
Undirected graph evaluation - UGE
[Figure: undirected edge scores learned from the data are thresholded to give concrete network (skeleton) predictions, which are compared with the skeleton of the true regulatory network. High threshold: TP 1/2, FP 0/1. Low threshold: TP 2/2, FP 1/1.]
Synthetic data, observations
Synthetic data, interventions
Cytometry data, interventions
How can we explain the difference between synthetic and real data?
Simulated data are “simpler”.
No mismatch between models used for data generation and inference.
Complications with real data
Can we trust our gold-standard network?
Raf regulatory network
From Sachs et al Science 2005
Dougherty et al., Regulation of Raf-1 by Direct Feedback Phosphorylation. Molecular Cell, Vol. 17, 2005
Disputed structure of the gold-standard network
Stabilisation through negative feedback loops (inhibition)
Complications with real data
Interventions might not be “ideal” owing to negative feedback loops.
Conclusions 1
• BNs and GGMs outperform RNs, most notably on Gaussian data.
• No significant difference between BNs and GGMs on observational data.
• For interventional data, BNs clearly outperform GGMs and RNs, especially when taking the edge direction (DGE score) rather than just the skeleton (UGE score) into account.
Conclusions 2
Performance on synthetic data better than on real data.
• Real data: more complex
• Real interventions are not ideal
• Errors in the gold-standard network
How do we model feedback loops?
Unfolding in time