On Identifying Significant Edges in Graphical Models
Marco Scutari1 and Radhakrishnan Nagarajan2
1Genetics Institute, University College London
2Division of Biomedical Informatics, University of Arkansas for Medical Sciences
July 2, 2011
Marco Scutari and Radhakrishnan Nagarajan UCL & UAMS
Graphical Models: Definitions & Learning
Graphical Models
Graphical models are defined by two components:
• a network structure, either an undirected graph (Markov networks [2, 19], gene association networks [14], correlation networks [17], etc.) or a directed graph (Bayesian networks [7, 8]). Each node corresponds to a random variable;
• a global probability distribution, which can be factorised into a small set of local probability distributions according to the topology of the graph.
This combination allows a compact representation of the joint distribution of large numbers of random variables and simplifies inference on the parameters of the model.
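As a small worked illustration of this factorisation (a standard textbook example, not one from the talk): in a directed graph where a node A points to nodes B and C, the joint distribution decomposes into one local distribution per node, each conditioned only on its parents:

```latex
% Directed graph: A -> B, A -> C
% The joint factorises node-by-node, conditioning each node on its parents:
P(A, B, C) = P(A) \, P(B \mid A) \, P(C \mid A)
% For binary variables this needs 1 + 2 + 2 = 5 free parameters
% instead of the 2^3 - 1 = 7 of the unrestricted joint table.
```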
Structure and Parameter Learning
Likewise, learning a graphical model is a two-stage process:
1. structure learning: learning the structure of the network underlying the graphical model, i.e. estimating the dependencies present in the data and adding the associated edges to the model;
2. parameter learning: using the decomposition into local probabilities given by the network structure learned in the previous step to estimate the parameters of the local distributions.
Several approaches have been proposed for both steps [1, 7], covering all aspects of graphical model estimation.
Network Structure Validation
Model validation techniques have not been developed at a similar pace, particularly in the case of network structures:
• the few available measures of structural difference are completely descriptive in nature (e.g. the Hamming distance [6] or SHD [18]), and are difficult to interpret;
• unless the true global probability distribution is known, it is difficult to assess the quality of graphical models without ad-hoc solutions; this limits the study of the properties of network structures to a few reference data sets [3, 9].
A more systematic approach to model validation, and in particular to the problem of identifying statistically significant edges in a network, is required for graphical models learned from real data.
Identifying Significant Edges
Friedman’s Confidence
Friedman et al. [4] proposed an approach to model validation based on bootstrap resampling and model averaging:
1. For b = 1, 2, . . . , m:
   1.1 sample a new data set X*_b from the original data X using either parametric or nonparametric bootstrap;
   1.2 learn the structure of the graphical model G_b = (V, E_b) from X*_b.
2. Estimate the confidence that each possible edge e_i is present in the true network structure G_0 = (V, E_0) as

   \hat{p}_i = \hat{P}(e_i) = \frac{1}{m} \sum_{b=1}^{m} \mathbb{1}_{\{e_i \in E_b\}},

   where \mathbb{1}_{\{e_i \in E_b\}} is equal to 1 if e_i \in E_b and 0 otherwise.
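The two-step procedure above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the bnlearn implementation used later in the talk; `learn_structure` stands in for any structure learning algorithm and is assumed to return the learned edge set:

```python
import random

def edge_confidence(data, learn_structure, m=500):
    """Friedman-style bootstrap confidence: resample the data m times,
    relearn the structure, and record the fraction of bootstrap
    networks in which each edge appears."""
    counts = {}
    n = len(data)
    for _ in range(m):
        # nonparametric bootstrap: resample n rows with replacement
        boot = [data[random.randrange(n)] for _ in range(n)]
        # learn_structure(boot) -> set of edges, e.g. {("A", "B"), ...}
        for edge in learn_structure(boot):
            counts[edge] = counts.get(edge, 0) + 1
    # confidence p_i = (1 / m) * number of bootstrap networks with e_i
    return {edge: count / m for edge, count in counts.items()}
```

A parametric bootstrap would replace the resampling line with sampling from a model fitted to the original data; the averaging step is unchanged.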
Evaluating Confidence Values
• The confidence values \hat{p} = \{\hat{p}_i\} do not sum to one and are dependent on one another in a nontrivial way; the value of the confidence threshold (i.e. the minimum confidence for an edge to be accepted as an edge of G_0) is an unknown function of both the data and the structure learning algorithm.
• The ideal/asymptotic configuration \tilde{p} of confidence values would be

   \tilde{p}_i = \begin{cases} 1 & \text{if } e_i \in E_0 \\ 0 & \text{otherwise,} \end{cases}

   i.e. all the networks G_b have exactly the same structure.
• Therefore, identifying the configuration \tilde{p} "closest" to \hat{p} provides a statistically-motivated way of identifying significant edges and the confidence threshold.
The Confidence Threshold
Consider the order statistics \hat{p}_{(\cdot)} and \tilde{p}_{(\cdot)} and the cumulative distribution functions (CDFs) of their elements:

   F_{\hat{p}_{(\cdot)}}(x) = \frac{1}{k} \sum_{i=1}^{k} \mathbb{1}_{\{\hat{p}_{(i)} < x\}}

and

   F_{\tilde{p}_{(\cdot)}}(x; t) = \begin{cases} 0 & \text{if } x \in (-\infty, 0) \\ t & \text{if } x \in [0, 1) \\ 1 & \text{if } x \in [1, +\infty). \end{cases}

t corresponds to the fraction of elements of \tilde{p}_{(\cdot)} equal to zero and is a measure of the fraction of non-significant edges, and provides a threshold for separating the elements of \hat{p}_{(\cdot)}:

   e_{(i)} \in E_0 \iff \hat{p}_{(i)} > F^{-1}_{\hat{p}_{(\cdot)}}(t).
The CDFs F_{\hat{p}_{(\cdot)}}(x) and F_{\tilde{p}_{(\cdot)}}(x; t)
[Figure: three panels showing the empirical CDF F_{\hat{p}_{(\cdot)}}(x), the ideal CDF F_{\tilde{p}_{(\cdot)}}(x; t), and their overlay with the difference shaded.]
One possible estimate of t is the value \hat{t} that minimises some distance between F_{\hat{p}_{(\cdot)}}(x) and F_{\tilde{p}_{(\cdot)}}(x; t); an intuitive choice is using the L1 norm of their difference (i.e. the shaded area in the picture on the right).
An L1 Estimator for the Confidence Threshold
Since F_{\hat{p}_{(\cdot)}} is piecewise constant and F_{\tilde{p}_{(\cdot)}} is constant in [0, 1), the L1 norm of their difference simplifies to

   L_1(t; \hat{p}_{(\cdot)}) = \int \left| F_{\hat{p}_{(\cdot)}}(x) - F_{\tilde{p}_{(\cdot)}}(x; t) \right| \, dx
                            = \sum_{x_i \in \{0\} \cup \hat{p}_{(\cdot)} \cup \{1\}} \left| F_{\hat{p}_{(\cdot)}}(x_i) - t \right| (x_{i+1} - x_i).
This form has two important properties:
• it can be computed in linear time from \hat{p}_{(\cdot)};
• its minimisation is straightforward using linear programming [11].
Furthermore, the L1 norm does not place as much weight on large deviations as other norms (L2, L∞), making it robust against a wide variety of configurations of \hat{p}_{(\cdot)}.
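To make the estimator concrete, here is an illustrative Python sketch (not the bnlearn implementation). Instead of calling a linear programming solver, it exploits the fact that the L1 norm is piecewise linear and convex in t, so its minimum is attained at one of the CDF levels j/k:

```python
def l1_threshold(p):
    """L1 estimator of the confidence threshold: minimise the L1
    distance between the empirical CDF of the confidence values and
    the ideal two-step CDF, then invert the empirical CDF."""
    p = sorted(p)
    k = len(p)
    # support points of the empirical CDF, padded with 0 and 1
    xs = [0.0] + p + [1.0]
    # on the interval (xs[j], xs[j+1]) the empirical CDF equals j / k
    widths = [xs[j + 1] - xs[j] for j in range(k + 1)]
    levels = [j / k for j in range(k + 1)]

    def l1_norm(t):
        # sum over intervals of |F(x) - t| times the interval width
        return sum(w * abs(f - t) for w, f in zip(widths, levels))

    # the norm is piecewise linear and convex in t, so its minimum
    # is attained at one of the CDF levels j / k
    t_hat = min(levels, key=l1_norm)
    # F^{-1}(t_hat): the confidence value at which the empirical
    # CDF first reaches the level t_hat
    threshold = xs[levels.index(t_hat)]
    return t_hat, threshold
```

With the confidence values of the simple example later in the talk, `l1_threshold([0.0460, 0.2242, 0.3921, 0.7689, 0.8935, 0.9439])` returns `(0.5, 0.3921)`, matching the reported LP solution 0.4999816 up to numerical tolerance; the edges with confidence strictly above 0.3921 are deemed significant.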
A Simple Example

[Figure: the empirical CDF of the confidence values overlaid with the fitted ideal CDF, and the corresponding network.]
Consider a graph with 4 nodes and confidence values

   \hat{p}_{(\cdot)} = \{0.0460, 0.2242, 0.3921, 0.7689, 0.8935, 0.9439\}.
Then \hat{t} = \arg\min_t L_1(t; \hat{p}_{(\cdot)}) = 0.4999816 and F^{-1}_{\hat{p}_{(\cdot)}}(0.4999816) = 0.3921; only the three edges with confidence above 0.3921 are considered significant.
Applications to Gene Networks
Analysis of Functional Relationships
We measured the effectiveness of the proposed method on two gene networks from Nagarajan et al. [10] and Sachs et al. [13] using the bnlearn package [16, 15] for R [12].
• Functional relationships have been investigated using Bayesian networks, as in the original papers;
• 500 bootstrapped network structures G_b have been learned from each data set, with the same learning algorithms, scores and parameters as in the original papers;
• Following Imoto et al. [5], we consider the edges of the Bayesian networks disregarding their direction. Edges identified as significant are then oriented according to the direction observed with the highest frequency in the bootstrapped networks G_b.
Differentiation Potential of Aged Myogenic Progenitors
The clonal gene expression data in Nagarajan et al. [10] was generated (for 12 genes) from RNA isolated from 34 clones of myogenic progenitors obtained from 24-month-old mice. The objective was to study the interplay between crucial myogenic, adipogenic, and Wnt-related genes orchestrating aged myogenic progenitor differentiation.
In the same study, the authors estimated the significance threshold by randomly permuting the expression of each gene and learning Bayesian network structures from the resulting data sets. Model averaging of these networks provided the noise floor distribution for the edges; confidence values falling outside its range were deemed significant. This approach, however, is slower than just computing an L1 norm and may result in a large number of false positives on large data sets.
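The permutation scheme described above can be sketched as follows; this is an illustrative Python sketch of the idea (not the code used in Nagarajan et al. [10]), where `learn_structure` is again a hypothetical black-box structure learner returning an edge set:

```python
import random

def permutation_null_confidence(data, learn_structure, m=100):
    """Noise-floor confidences: permute each gene's expression
    (each column) independently to destroy all associations between
    genes, relearn the network, and average over m permutations."""
    counts = {}
    for _ in range(m):
        # split rows into columns, shuffle each column independently
        columns = list(zip(*data))
        permuted_cols = [random.sample(col, len(col)) for col in columns]
        permuted = list(zip(*permuted_cols))
        for edge in learn_structure(permuted):
            counts[edge] = counts.get(edge, 0) + 1
    # fraction of permuted data sets in which each edge was learned
    return {edge: count / m for edge, count in counts.items()}
```

Observed edge confidences lying outside the range of this null distribution would then be declared significant, which is the comparison the L1 estimator replaces.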
Differentiation Potential of Aged Myogenic Progenitors
[Figure: averaged network over PPARγ, Myogenin, Myo-D1, Myf-5, CEBPα, FoxC2, LRP5, Wnt5a and DDIT3, alongside the empirical CDF of the confidence values; estimated threshold = 0.504.]
All edges identified as significant in the earlier study are also identified by the proposed approach; the directionality of the edges is also revealed, unlike in the original network in Nagarajan et al. [10].
Protein Signalling in Flow Cytometry Data
Sachs et al. [13] used Bayesian networks as a tool for identifying causal influences in cellular signalling networks from simultaneous measurements of 11 phosphorylated proteins and phospholipids across single cells.
Significant edges were selected using model averaging but with an ad-hoc significance threshold of 0.85, first on 854 non-perturbed observations and then on several sets of perturbed data. This combination cannot be analysed with our approach, because each subset of the data follows a different probability distribution and there is therefore no single "true" network G_0; we limit ourselves to the unperturbed data.
Protein Signalling in Flow Cytometry Data
[Figure: averaged network over Erk, Mek, RAF, Akt, PKA, P38, PKC, JNK, PIP3, PIP2 and PLCγ, alongside the empirical CDF of the confidence values; estimated threshold = 0.93.]
Again, all edges identified as significant in the observational data are also identified by the proposed approach; the directionality of the edges is also revealed, unlike in the original network, and agrees with the network learned with the help of perturbed data in Sachs et al. [13].
Conclusions
• Model validation is often performed using ad-hoc thresholds for the identification of significant edges. Such ad-hoc approaches can have a pronounced effect on the resulting networks and biological conclusions.
• The minimisation of the L1 norm of the difference between the CDF of the observed confidence levels and the CDF of their ideal/asymptotic configuration provides a straightforward and statistically-motivated approach for identifying significant edges.
• The proposed approach is defined in a very general setting and can be applied to many classes of graphical models learned from any kind of data.
• The effectiveness of the proposed approach is demonstrated on two gene networks from different studies.
Thanks!
References
References I
[1] E. Castillo, J. M. Gutierrez, and A. S. Hadi. Expert Systems and Probabilistic Network Models. Springer, 1997.
[2] D. I. Edwards. Introduction to Graphical Modelling. Springer, 2nd edition, 2000.
[3] G. Elidan. Bayesian Network Repository, 2001.
[4] N. Friedman, M. Goldszmidt, and A. Wyner. Data Analysis with Bayesian Networks: A Bootstrap Approach. In Proceedings of the 15th Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 206–215. Morgan Kaufmann, 1999.
[5] S. Imoto, S. Y. Kim, H. Shimodaira, S. Aburatani, K. Tashiro, S. Kuhara, and S. Miyano. Bootstrap Analysis of Gene Networks Based on Bayesian Networks and Nonparametric Regression. Genome Informatics, 13:369–370, 2002.
References II
[6] D. Jungnickel. Graphs, Networks and Algorithms. Springer-Verlag, 3rd edition, 2008.
[7] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[8] K. Korb and A. Nicholson. Bayesian Artificial Intelligence. Chapman & Hall, 2nd edition, 2010.
[9] P. Murphy and D. Aha. UCI Machine Learning Repository, 1995.
[10] R. Nagarajan, S. Datta, M. Scutari, M. L. Beggs, G. T. Nolen, and C. A. Peterson. Functional Relationships Between Genes Associated with Differentiation Potential of Aged Myogenic Progenitors. Frontiers in Physiology, 1(21):1–8, 2010.
References III
[11] J. Nocedal and S. J. Wright. Numerical Optimization. Springer-Verlag, 1999.
[12] R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2010.
[13] K. Sachs, O. Perez, D. Pe’er, D. A. Lauffenburger, and G. P. Nolan. Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science, 308(5721):523–529, 2005.
[14] J. Schafer and K. Strimmer. An Empirical Bayes Approach to Inferring Large-Scale Gene Association Networks. Bioinformatics, 21:754–764, 2004.
[15] M. Scutari. Learning Bayesian Networks with the bnlearn R Package. Journal of Statistical Software, 35(3):1–22, 2010.
References IV
[16] M. Scutari. bnlearn: Bayesian Network Structure Learning, 2011. R package version 2.4.
[17] R. Steuer. On the Analysis and Interpretation of Correlations in Metabolomic Data. Briefings in Bioinformatics, 7(2):151–158, 2006.
[18] I. Tsamardinos, L. E. Brown, and C. F. Aliferis. The Max-Min Hill-Climbing Bayesian Network Structure Learning Algorithm. Machine Learning, 65(1):31–78, 2006.
[19] J. Whittaker. Graphical Models in Applied Multivariate Statistics. Wiley, 1990.