On Identifying Significant Edges in Graphical Models
Marco Scutari1 and Radhakrishnan Nagarajan2
1Genetics Institute, University College London
2Division of Biomedical Informatics, University of Arkansas for Medical Sciences
July 2, 2011
Marco Scutari and Radhakrishnan Nagarajan UCL & UAMS
Graphical Models: Definitions & Learning
Graphical Models
Graphical models are defined by two components:
• a network structure, either an undirected graph (Markov networks [2, 19], gene association networks [14], correlation networks [17], etc.) or a directed graph (Bayesian networks [7, 8]). Each node corresponds to a random variable;
• a global probability distribution, which can be factorised into a small set of local probability distributions according to the topology of the graph.
This combination allows a compact representation of the joint distribution of large numbers of random variables and simplifies inference on the parameters of the model.
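As a small worked illustration of this factorisation (a standard textbook example, not one from the talk): in a directed graph where a node A points to nodes B and C, the joint distribution decomposes into one local distribution per node, each conditioned only on its parents:

```latex
% Directed graph: A -> B, A -> C
% The joint factorises node-by-node, conditioning each node on its parents:
P(A, B, C) = P(A) \, P(B \mid A) \, P(C \mid A)
% For binary variables this needs 1 + 2 + 2 = 5 free parameters
% instead of the 2^3 - 1 = 7 of the unrestricted joint table.
```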
Structure and Parameter Learning
Likewise, learning a graphical model is a two-stage process:
1. structure learning: learning the structure of the network underlying the graphical model, i.e. estimating the dependencies present in the data and adding the associated edges to the model;
2. parameter learning: using the decomposition into local probabilities given by the network structure learned in the previous step to estimate the parameters of the local distributions.
Several approaches have been proposed for both steps [1, 7], covering all aspects of graphical model estimation.
Network Structure Validation
Model validation techniques have not been developed at a similar pace, particularly in the case of network structures:
• the few available measures of structural difference are completely descriptive in nature (e.g. the Hamming distance [6] or SHD [18]), and are difficult to interpret;
• unless the true global probability distribution is known, it is difficult to assess the quality of graphical models without ad-hoc solutions; this limits the study of the properties of network structures to a few reference data sets [3, 9].
A more systematic approach to model validation, and in particular to the problem of identifying statistically significant edges in a network, is required for graphical models learned from real data.
Identifying Significant Edges
Friedman’s Confidence
Friedman et al. [4] proposed an approach to model validation based on bootstrap resampling and model averaging:
1. For b = 1, 2, . . . , m:
   1.1 sample a new data set X*_b from the original data X using either parametric or nonparametric bootstrap;
   1.2 learn the structure of the graphical model G_b = (V, E_b) from X*_b.
2. Estimate the confidence that each possible edge e_i is present in the true network structure G_0 = (V, E_0) as

   \hat{p}_i = \hat{P}(e_i) = \frac{1}{m} \sum_{b=1}^{m} \mathbb{1}_{\{e_i \in E_b\}},

   where \mathbb{1}_{\{e_i \in E_b\}} is equal to 1 if e_i \in E_b and 0 otherwise.
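The two-step procedure above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the bnlearn implementation used later in the talk; `learn_structure` stands in for any structure learning algorithm and is assumed to return the learned edge set:

```python
import random

def edge_confidence(data, learn_structure, m=500):
    """Friedman-style bootstrap confidence: resample the data m times,
    relearn the structure, and record the fraction of bootstrap
    networks in which each edge appears."""
    counts = {}
    n = len(data)
    for _ in range(m):
        # nonparametric bootstrap: resample n rows with replacement
        boot = [data[random.randrange(n)] for _ in range(n)]
        # learn_structure(boot) -> set of edges, e.g. {("A", "B"), ...}
        for edge in learn_structure(boot):
            counts[edge] = counts.get(edge, 0) + 1
    # confidence p_i = (1 / m) * number of bootstrap networks with e_i
    return {edge: count / m for edge, count in counts.items()}
```

A parametric bootstrap would replace the resampling line with sampling from a model fitted to the original data; the averaging step is unchanged.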
Evaluating Confidence Values
• The confidence values \hat{p} = \{\hat{p}_i\} do not sum to one and are dependent on one another in a nontrivial way; the value of the confidence threshold (i.e. the minimum confidence for an edge to be accepted as an edge of G_0) is an unknown function of both the data and the structure learning algorithm.
• The ideal/asymptotic configuration \tilde{p} of confidence values would be

   \tilde{p}_i = \begin{cases} 1 & \text{if } e_i \in E_0 \\ 0 & \text{otherwise,} \end{cases}

   i.e. all the networks G_b have exactly the same structure.
• Therefore, identifying the configuration \tilde{p} "closest" to \hat{p} provides a statistically-motivated way of identifying significant edges and the confidence threshold.
The Confidence Threshold
Consider the order statistics \hat{p}_{(\cdot)} and \tilde{p}_{(\cdot)} and the cumulative distribution functions (CDFs) of their elements:

   F_{\hat{p}_{(\cdot)}}(x) = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}_{\{\hat{p}_{(i)} < x\}}

and

   F_{\tilde{p}_{(\cdot)}}(x; t) = \begin{cases} 0 & \text{if } x \in (-\infty, 0) \\ t & \text{if } x \in [0, 1) \\ 1 & \text{if } x \in [1, +\infty). \end{cases}

t corresponds to the fraction of elements of \tilde{p}_{(\cdot)} equal to zero and is a measure of the fraction of non-significant edges, and provides a threshold for separating the elements of \hat{p}_{(\cdot)}:

   e_{(i)} \in E_0 \iff \hat{p}_{(i)} > F^{-1}_{\hat{p}_{(\cdot)}}(t).
The CDFs F_{\hat{p}_{(\cdot)}}(x) and F_{\tilde{p}_{(\cdot)}}(x; t)
[Figure: three panels showing the empirical CDF F_{\hat{p}_{(\cdot)}}(x), the ideal CDF F_{\tilde{p}_{(\cdot)}}(x; t), and their overlay with the difference shaded.]
One possible estimate of t is the value \hat{t} that minimises some distance between F_{\hat{p}_{(\cdot)}}(x) and F_{\tilde{p}_{(\cdot)}}(x; t); an intuitive choice is using the L1 norm of their difference (i.e. the shaded area in the picture on the right).
An L1 Estimator for the Confidence Threshold
Since F_{\hat{p}_{(\cdot)}} is piecewise constant and F_{\tilde{p}_{(\cdot)}} is constant in [0, 1), the L1 norm of their difference simplifies to

   L_1(t; \hat{p}_{(\cdot)}) = \int \left| F_{\hat{p}_{(\cdot)}}(x) - F_{\tilde{p}_{(\cdot)}}(x; t) \right| \, dx
                            = \sum_{x_i \in \{0\} \cup \hat{p}_{(\cdot)} \cup \{1\}} \left| F_{\hat{p}_{(\cdot)}}(x_i) - t \right| (x_{i+1} - x_i).
This form has two important properties:
• it can be computed in linear time from \hat{p}_{(\cdot)};
• its minimisation is straightforward using linear programming [11].
Furthermore, the L1 norm does not place as much weight on large deviations as other norms (L2, L∞), making it robust against a wide variety of configurations of \hat{p}_{(\cdot)}.
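To make the estimator concrete, here is an illustrative Python sketch (not the bnlearn implementation). Instead of calling a linear programming solver, it exploits the fact that the L1 norm is piecewise linear and convex in t, so its minimum is attained at one of the CDF levels j/k:

```python
def l1_threshold(p):
    """L1 estimator of the confidence threshold: minimise the L1
    distance between the empirical CDF of the confidence values and
    the ideal two-step CDF, then invert the empirical CDF."""
    p = sorted(p)
    k = len(p)
    # support points of the empirical CDF, padded with 0 and 1
    xs = [0.0] + p + [1.0]
    # on the interval (xs[j], xs[j+1]) the empirical CDF equals j / k
    widths = [xs[j + 1] - xs[j] for j in range(k + 1)]
    levels = [j / k for j in range(k + 1)]

    def l1_norm(t):
        # sum over intervals of |F(x) - t| times the interval width
        return sum(w * abs(f - t) for w, f in zip(widths, levels))

    # the norm is piecewise linear and convex in t, so its minimum
    # is attained at one of the CDF levels j / k
    t_hat = min(levels, key=l1_norm)
    # F^{-1}(t_hat): the confidence value at which the empirical
    # CDF first reaches the level t_hat
    threshold = xs[levels.index(t_hat)]
    return t_hat, threshold
```

With the confidence values of the simple example later in the talk, `l1_threshold([0.0460, 0.2242, 0.3921, 0.7689, 0.8935, 0.9439])` returns `(0.5, 0.3921)`, matching the reported LP solution 0.4999816 up to numerical tolerance; the edges with confidence strictly above 0.3921 are deemed significant.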
A Simple Example

[Figure: the empirical CDF of the confidence values overlaid with the fitted ideal CDF, and the corresponding network.]
Consider a graph with 4 nodes and confidence values

   \hat{p}_{(\cdot)} = \{0.0460, 0.2242, 0.3921, 0.7689, 0.8935, 0.9439\}.
Then \hat{t} = \arg\min_t L_1(t; \hat{p}_{(\cdot)}) = 0.4999816 and F^{-1}_{\hat{p}_{(\cdot)}}(0.4999816) = 0.3921; only the three edges with confidence above 0.3921 are considered significant.
Applications to Gene Networks
Analysis of Functional Relationships
We measured the effectiveness of the proposed method on two gene networks from Nagarajan et al. [10] and Sachs et al. [13] using the bnlearn package [16, 15] for R [12].
• Functional relationships have been investigated using Bayesian networks, as in the original papers;
• 500 bootstrapped network structures G_b have been learned from each data set, with the same learning algorithms, scores and parameters as in the original papers;
• Following Imoto et al. [5], we consider the edges of the Bayesian networks disregarding their direction. Edges identified as significant are then oriented according to the direction observed with the highest frequency in the bootstrapped networks G_b.
Differentiation Potential of Aged Myogenic Progenitors
The clonal gene expression data in Nagarajan et al. [10] was generated (for 12 genes) from RNA isolated from 34 clones of myogenic progenitors obtained from 24-month-old mice. The objective was to study the interplay between crucial myogenic, adipogenic, and Wnt-related genes orchestrating aged myogenic progenitor differentiation.
In the same study, the authors estimated the significance threshold by randomly permuting the expression of each gene and learning Bayesian network structures from the resulting data sets. Model averaging of these networks provided the noise floor distribution for the edges; confidence values falling outside its range were deemed significant. This approach, however, is slower than just computing an L1 norm and may result in a large number of false positives on large data sets.
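The permutation scheme described above can be sketched as follows; this is an illustrative Python sketch of the idea (not the code used in Nagarajan et al. [10]), where `learn_structure` is again a hypothetical black-box structure learner returning an edge set:

```python
import random

def permutation_null_confidence(data, learn_structure, m=100):
    """Noise-floor confidences: permute each gene's expression
    (each column) independently to destroy all associations between
    genes, relearn the network, and average over m permutations."""
    counts = {}
    for _ in range(m):
        # split rows into columns, shuffle each column independently
        columns = list(zip(*data))
        permuted_cols = [random.sample(col, len(col)) for col in columns]
        permuted = list(zip(*permuted_cols))
        for edge in learn_structure(permuted):
            counts[edge] = counts.get(edge, 0) + 1
    # fraction of permuted data sets in which each edge was learned
    return {edge: count / m for edge, count in counts.items()}
```

Observed edge confidences lying outside the range of this null distribution would then be declared significant, which is the comparison the L1 estimator replaces.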
Differentiation Potential of Aged Myogenic Progenitors
[Figure: averaged network over PPARγ, Myogenin, Myo-D1, Myf-5, CEBPα, FoxC2, LRP5, Wnt5a and DDIT3, alongside the empirical CDF of the confidence values; estimated threshold = 0.504.]
All edges identified as significant in the earlier study are also identified by the proposed approach; the directionality of the edges is also revealed, unlike in the original network in Nagarajan et al. [10].
Protein Signalling in Flow Cytometry Data
Sachs et al. [13] used Bayesian networks as a tool for identifying causal influences in cellular signalling networks from simultaneous measurements of 11 phosphorylated proteins and phospholipids across single cells.
Significant edges were selected using model averaging but with an ad-hoc significance threshold of 0.85, first on 854 non-perturbed observations and then on several sets of perturbed data. This combination cannot be analysed with our approach, because each subset of the data follows a different probability distribution and there is therefore no single "true" network G_0; we limit ourselves to the unperturbed data.
Protein Signalling in Flow Cytometry Data
[Figure: averaged network over Erk, Mek, RAF, Akt, PKA, P38, PKC, JNK, PIP3, PIP2 and PLCγ, alongside the empirical CDF of the confidence values; estimated threshold = 0.93.]
Again, all edges identified as significant in the observational data are also identified by the proposed approach; the directionality of the edges is also revealed, unlike in the original network, and agrees with the network learned with the help of perturbed data in Sachs et al. [13].
Conclusions
• Model validation is often performed using ad-hoc thresholds for the identification of significant edges. Such ad-hoc approaches can have a pronounced effect on the resulting networks and biological conclusions.
• The minimisation of the L1 norm of the difference between the CDF of the observed confidence levels and the CDF of their ideal/asymptotic configuration provides a straightforward and statistically-motivated approach for identifying significant edges.
• The proposed approach is defined in a very general setting and can be applied to many classes of graphical models learned from any kind of data.
• The effectiveness of the proposed approach is demonstrated on two gene networks from different studies.
Thanks!
References
References I
[1] E. Castillo, J. M. Gutierrez, and A. S. Hadi. Expert Systems and Probabilistic Network Models. Springer, 1997.
[2] D. I. Edwards. Introduction to Graphical Modelling. Springer, 2nd edition, 2000.
[3] G. Elidan. Bayesian Network Repository, 2001.
[4] N. Friedman, M. Goldszmidt, and A. Wyner. Data Analysis with Bayesian Networks: A Bootstrap Approach. In Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 206–215. Morgan Kaufmann, 1999.
[5] S. Imoto, S. Y. Kim, H. Shimodaira, S. Aburatani, K. Tashiro, S. Kuhara, and S. Miyano. Bootstrap Analysis of Gene Networks Based on Bayesian Networks and Nonparametric Regression. Genome Informatics, 13:369–370, 2002.
References II
[6] D. Jungnickel. Graphs, Networks and Algorithms. Springer-Verlag, 3rd edition, 2008.
[7] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[8] K. Korb and A. Nicholson. Bayesian Artificial Intelligence. Chapman & Hall, 2nd edition, 2010.
[9] P. Murphy and D. Aha. UCI Machine Learning Repository, 1995.
[10] R. Nagarajan, S. Datta, M. Scutari, M. L. Beggs, G. T. Nolen, and C. A. Peterson. Functional Relationships Between Genes Associated with Differentiation Potential of Aged Myogenic Progenitors. Frontiers in Physiology, 1(21):1–8, 2010.
References III
[11] J. Nocedal and S. J. Wright. Numerical Optimization. Springer-Verlag, 1999.
[12] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010.
[13] K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science, 308(5721):523–529, 2005.
[14] J. Schafer and K. Strimmer. An Empirical Bayes Approach to Inferring Large-Scale Gene Association Networks. Bioinformatics, 21:754–764, 2004.
[15] M. Scutari. Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, 35(3):1–22, 2010.
References IV
[16] M. Scutari. bnlearn: Bayesian Network Structure Learning, 2011. R package version 2.4.
[17] R. Steuer. On the Analysis and Interpretation of Correlations in Metabolomic Data. Briefings in Bioinformatics, 7(2):151–158, 2006.
[18] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65(1):31–78, 2006.
[19] J. Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley, 1990.