graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the...

15
doi: 10.1098/rsif.2012.0220 published online 2 May 2012 J. R. Soc. Interface Thomas Thorne and Michael P. H. Stumpf evolution Graph spectral analysis of protein interaction network References ref-list-1 http://rsif.royalsocietypublishing.org/content/early/2012/05/01/rsif.2012.0220.full.html# This article cites 31 articles, 12 of which can be accessed free P<P Published online 2 May 2012 in advance of the print journal. This article is free to access Subject collections (96 articles) systems biology (160 articles) computational biology (26 articles) bioinformatics Articles on similar topics can be found in the following collections Email alerting service here right-hand corner of the article or click Receive free email alerts when new articles cite this article - sign up in the box at the top publication. Citations to Advance online articles must include the digital object identifier (DOIs) and date of initial online articles are citable and establish publication priority; they are indexed by PubMed from initial publication. the paper journal (edited, typeset versions may be posted when available prior to final publication). Advance Advance online articles have been peer reviewed and accepted for publication but have not yet appeared in http://rsif.royalsocietypublishing.org/subscriptions go to: J. R. Soc. Interface To subscribe to on October 10, 2012 rsif.royalsocietypublishing.org Downloaded from

Upload: others

Post on 23-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

doi: 10.1098/rsif.2012.0220 published online 2 May 2012J. R. Soc. Interface

 Thomas Thorne and Michael P. H. Stumpf evolutionGraph spectral analysis of protein interaction network  

Referencesref-list-1http://rsif.royalsocietypublishing.org/content/early/2012/05/01/rsif.2012.0220.full.html#

This article cites 31 articles, 12 of which can be accessed free

P<P Published online 2 May 2012 in advance of the print journal.

This article is free to access

Subject collections

(96 articles)systems biology   � (160 articles)computational biology   �

(26 articles)bioinformatics   � Articles on similar topics can be found in the following collections

Email alerting service hereright-hand corner of the article or click Receive free email alerts when new articles cite this article - sign up in the box at the top

publication. Citations to Advance online articles must include the digital object identifier (DOIs) and date of initial online articles are citable and establish publication priority; they are indexed by PubMed from initial publication.the paper journal (edited, typeset versions may be posted when available prior to final publication). Advance Advance online articles have been peer reviewed and accepted for publication but have not yet appeared in

http://rsif.royalsocietypublishing.org/subscriptions go to: J. R. Soc. InterfaceTo subscribe to

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

Page 2: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

J. R. Soc. Interface

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

*Author for c

doi:10.1098/rsif.2012.0220Published online

Received 19 MAccepted 10 A

Graph spectral analysis of proteininteraction network evolution

Thomas Thorne* and Michael P. H. Stumpf

Centre of Integrative Systems Biology and Bioinformatics, Division of Molecular Biosciences,Imperial College London, London SW7 2AZ, UK

We present an analysis of protein interaction network data via the comparison of models ofnetwork evolution to the observed data. We take a Bayesian approach and perform posteriordensity estimation using an approximate Bayesian computation with sequential Monte Carlomethod. Our approach allows us to perform model selection over a selection of potential net-work growth models. The methodology we apply uses a distance defined in terms of graphspectra which captures the network data more naturally than previously used summary stat-istics such as the degree distribution. Furthermore, we include the effects of sampling into theanalysis, to properly correct for the incompleteness of existing datasets, and have analysedthe performance of our method under various degrees of sampling. We consider a numberof models focusing not only on the biologically relevant class of duplication models, butalso including models of scale-free network growth that have previously been claimed todescribe such data. We find a preference for a duplication-divergence with linear preferentialattachment model in the majority of the interaction datasets considered. We also illustratehow our method can be used to perform multi-model inference of network parameters to esti-mate properties of the full network from sampled data.

Keywords: protein interaction networks; graph spectra; approximate Bayesiancomputation; network evolution; sequential Monte Carlo

1. INTRODUCTION

Protein–protein interactions are one of the mechanismsby which biological organisms build complicated andflexible molecular machineries from relatively modestnumbers of protein-coding genes. Similar to the wayin which we can derive information about the evolutionof genes and genomes through currently available high-throughput genome sequencing data, the availability ofhigh-throughput protein interaction data from Yeast-2-Hybrid experiments and various other protocols givesus a snapshot of the evolutionary process by whichthe rich and complex structure of protein interactionsin the cell is formed.

The nature of current protein interaction network(PIN) data presents challenges in analysing the dataand performing inference that takes into account theglobal network structure. When considering evolution-ary models we are faced with the problem ofcomparing the network structure produced by themodel to that of the observed interaction network. Apossible way of overcoming this problem is to calculatesummary statistics describing some aspect of the dataand compare these with predictions from evolutionarymodels. Several previous studies in the literature haveapplied summary statistics to compare the fit of net-work models to observed data [1–4], shedding somelight on aspects of network evolution and organization.

orrespondence ([email protected]).

arch 2012pril 2012 1

Early studies suggested that the scale-free (SF) networkmodels [5] might fit the observed PIN data well [3,6],but there have since been several and statisticallyrobust challenges to this claim [7–9].

Considering more realistic biologically groundedmodels of network evolution has provided insightsinto potential mechanisms of PIN formation, and pro-vides more readily interpretable and applicable resultsthan those found by considering more general randomgraph models; in particular, it has become apparentthat it is important to consider models of networkgrowth (instead of static random graph models) eventhough they are vastly oversimplified compared withthe real process of network evolution. A number ofmodels have been proposed and analysed with respectto the observed data [1,2,10–12], all with the same gen-eral mechanism of node duplication, corresponding togene duplication and subsequent divergence in functionand of interactions.

Assessing the fit of various network growth models tothe Drosophila melanogaster protein interaction network,Middendorf et al. [13] found that a duplication model bestdescribes the data. A similar result was found in Ratmannet al. [4], where combining several different network stat-istics to compare the fit of models with the Treponemapallidum PIN, a model combining duplication divergencescheme with linear preferential attachment (LPA) wasfound to best explain the data. Plausible models shouldtherefore include aspects of duplication followed by theability of interactions to diverge and change with time.

This journal is q 2012 The Royal Society

Page 3: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

2 Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

Comparing models of network evolution—even ifthey are (by design) vastly oversimplified comparedwith the true process—holds the promise of allowingus to weigh up the relative contributions of differentprocesses. For example, we may assess the relative rolethat duplication of individual proteins might haveplayed in the evolution of natural systems. Ultimately,we would like to understand different processes andtheir roles in network evolution in a way that mirrorswhat is possible for sequence-based comparative ana-lyses. Here, too, models are oversimplified (even if lessseverely) but have allowed us to disentangle differentaspects affecting sequence evolution (codon usage,secondary structure constraints, etc.). More immedi-ately, however, such evolutionary models also allow usto apply the comparative method to networks moremeaningfully than mere lists of network characteristicswould be. Comparative biology predates the availabilityof sequence information, of course, and here we willdiscuss models of network evolution in a manner akinto that used in classical morphologically basedcomparative studies [14].

Evolutionary analysis at the level of network organiza-tion is fraught with considerable technical challenges: thedata are often noisy and incomplete; networks are notor-iously hard to describe in terms of summary statistics;and calibrating evolutionary models against the availabledata (or summary statistics) is also non-trivial. Here, wedevelop a flexible and robust inferential frameworkto deal with these three issues. Our approach is aimedat estimating the ‘effective’ parameters of models of net-work evolution against network data, and choosingbetween different plausible models of network evolutionwhenever possible. We employ a Bayesian frameworkthat allows us to deal with different candidate modelsand the uncertainties and problems inherent to thePIN data; and we use concepts from spectral graphtheory to describe the networks, rather than relying onsummary statistics.

Because the likelihood of general network growthmodels is computationally difficult to evaluate, weadopt an approximate Bayesian computation (ABC)approach; in ABC procedures the data (or summarystatistics thereof) of model simulations (with par-ameters, u, drawn from the prior) are compared withthe real data and if a suitable distance measure betweenthe data/summary statistics falls below a tolerancelevel, e, then u is accepted as a draw from the (ABC)posterior distribution. If e ! 0, then the ABC posteriorwill be in agreement with the exact posterior, as long asthe whole data are used. Use of summary statistics canbe problematic for parameter inference and model selec-tion if statistics are not sufficient. This is unlikely everto be the case for networks and therefore the spectralperspective taken here, which captures the wholedata, is particularly pertinent.

Below we outline the ABC framework employed hereand its use in parameter estimation, model selection andmodel averaging contexts. After discussing the spectralgraph measures, we outline different evolutionarymodels, and describe how we can analyse incompletenetwork datasets. We then illustrate our approachagainst simulated data before considering real

J. R. Soc. Interface

protein–protein interaction data. We conclude with adiscussion of the results and will make the case for thestatistically informed analysis of such simple modelsin the context of evolutionary systems biology.

2. METHODS

2.1. Approximate Bayesian computation andsequential Monte Carlo methods

Models of network evolution differ in their complexityand in the details of the evolutionary process thatthey capture. Statistical model selection techniquesare therefore required in order to compare their relativeability to capture the observed network data andexplain the underlying evolutionary mechanisms. Inparticular, such approaches allow us to strike a compro-mise between the complexity of a model, and its abilityto describe observed data. Here, we adopt a Bayesianframework, which treats the problems of parameter esti-mation and model selection analogously and does notrequire the post hoc use of, for example, an informationcriterion (in fact, the popular Bayesian information cri-terion, BIC, is an approximation to the conventionalBayesian model selection framework).

Given an observed protein interaction dataset, D,and a set of models mi, i ¼ 1, 2, . . ., M, the Bayesianapproach requires us to calculate the posterior prob-ability distributions of the different models and theirrespective parameter sets ui. We hence seek to evaluatePmiðuijDÞ, given by

PmiðuijDÞ ¼PmiðDjuiÞPmiðuiÞ

PmiðDÞ; ð2:1Þ

where PmiðDÞ is the evidence for the data undermodel mi

PmiðDÞ ¼ðVi

PðDjuiÞPðuiÞdui:

The complexity of the data and the models, however,makes evaluation of the likelihood terms, PðDjuÞ, diffi-cult and often impractical. To this end, ABC schemeshave recently gained in popularity, especially in thefields of population, evolutionary and systems biology.In ABC frameworks, we forego evaluation of the likeli-hood in favour of comparing simulation outputs,D 0 � miðu 0iÞ (for parameters sampled from the prior,u 0i � PðuiÞ), with the actual data, via a suitable distancemeasure, d(D 0, D). This allows us to approximate theposterior distribution as

PðujDÞ � PðujdðD0;DÞ � eÞ: ð2:2Þ

Here, it is important to note that the distance measured(D 0, D) can also be applied to summary statistics ofthe data, t(D), rather than the actual data. This isespecially attractive if the data are sufficiently complexsuch that the probability of observing the data is mark-edly reduced compared with observing the realizedvalue of the summary statistic. But if the statistic is notsufficient (in the sense that PðujDÞ ¼ PðujtðDÞÞ), thenparameter estimation and model selection becomeskewed compared with the full Bayesian approach.

Page 4: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

Algorithm 1. Basic ABC-SMC algorithm.N Number of particles;T Number of steps;for t 1 to T do

i 1;while i , ¼N do

if t ¼ 1 thenSample u00 � P(u);

elseSample u0 from ut21

i according to wt21i ;

Perturb u0 by K(u0) to u 00;endSimulate D0 from u 00 R times;s(u 00Þ 1

R

PRr I ðdðD0r ;DÞ , htÞ;

if s . 0 thenui

t u 00

if t ¼ 1 thenwi

1 sðu 00Þelse

wit

Pðu00Þsðu00ÞPNj¼1 wj

t�1Kðujt�1; u

00Þ;

endi i þ 1;

endendNormalize w;

end

Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf 3

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

Although it is possible to perform ABC using asimple rejection scheme, such a method will of coursenot be able to cope with models that have manyparameters; however, several improved computationalschemes exist and here we have chosen to apply thesequential Monte Carlo (SMC) method of Toni &Stumpf [15], which allows us to combine model selectionand posterior density estimation in a single framework.SMC methods [16,17] operate on a population ofweighted particles that correspond to points in theparameter space, with the particle weights set so thatthe empirical distribution of the weighted particles con-verges asymptotically to the desired target distributionas the number of particles N!1. The basic ABC-SMC approach taken from Toni et al. [18] is outlinedin algorithm 1. In brief, we proceed by constructing aset of intermediate distributions that start from theprior, P(ui), and converge towards the (ABC) posterior,equation (2.2). Each intermediate distribution Pt(u) ischaracterized by a population of particles which fulfilthe criterion

PtðujDÞ ¼ PðuÞ 1R

XRr

1ðdðD0r ;DÞ � etÞ; ð2:3Þ

where R is the number of repeated simulations for fixedparameters and e1 . e2 . . . . . eT ensures that succes-sive populations increasingly resemble the posterior (forwhich 1T has to be sufficiently small).

This sequence is generated in practice through asequential importance sampling procedure, whichweights the different particles appropriately. To startwith all particles are sampled independently from theprior, P(u), and accepted or rejected according towhether simulated datasets agree with the observeddata within the tolerance e1, giving an initial set of Nparticles u1

i for i [ f1; . . . ;Ng.In order to construct the next population of particles

(for tolerance et), we have to propose new particlesfrom the ut21

i making up population t 2 1. To doso, we resample particles from the population at stept 2 1 based on the particle weights, and then perturbthese particles using a kernel in order to explore par-ameter space and reduce the degeneracy of thesample. Since our model parameters all take continuousvalues, we can construct our kernel by simply displacingparticles by a distance drawn from a multivariateGaussian distribution with zero mean and an appropri-ately selected variance to perturb the population ofparticles between successive iterations, so that

u00 � N ðu 0;SÞ ð2:4Þ

for some diagonal bandwidth matrix S, where u 0 is aparticle drawn from the present population, and u 00 isa new proposal. Other transition kernels are alsopossible, however.

Having selected and perturbed a particle to give usthe proposed new parameters u 00 for the particle, themodel is simulated with the new parameters to generatea test dataset D0, and the distance between this simu-lated data and the observed data D is calculated,using the distance measure d(D0, D) that we describein §2.4. Then the proposed particle is accepted as

J. R. Soc. Interface

being representative of the desired distribution only ifit falls within a distance et of the observed data andwe can use equation (2.3) as an approximation of thelikelihood,

sðuÞ ¼ 1R

XRr

1ðdðD0r ;DÞ � etÞ: ð2:5Þ

If s(u) ¼ 0, the particle is rejected and instead anew particle from the ut21

i is sampled and a newperturbation proposed.

To calculate the weight of the perturbed particle themethod described in Del Moral et al. [17] is applied,whereby wt

i for the new parameters u 00 is calculatedusing our approximation of the likelihood s(u) fromequation (2.5) as

wit ¼

sðu00Þ if t ¼ 0;Pðu00Þsðu00ÞPN

j¼1 wjt�1Kðu

jt�1; u

00Þotherwise:

8><>: ð2:6Þ

This is repeated until the desired number of particleshave been sampled to give a new population ut

i and theprocess is repeated using a progressively strictersequence of distances e1 . e2 . � � � . eT at each step.This procedure is outlined in algorithm 1.

2.2. ABC-SMC model selection

As mentioned previously, in order to include the differ-ent models under consideration into the inferenceprocedure, we may simply treat models and parametersanalogously and we can encode the choice of the modelas a discrete parameter, following the methodology ofToni & Stumpf [15].

Page 5: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

Algorithm 2. ABC-SMC model selection algorithm.N Number of particles;T Number of steps;for t 1 to T do

i 1;while i , ¼N do

if t ¼ 1 thenSample m00 � P0(m);Sample u00 � Pmi

(u);else

Sample m0 �Pt21 (m);Sample u0 from um0,t21

i according to wm0,t21i ;

Perturb m0 by KM(m) to m00;Perturb u0 by Ku(u) to u00;

endSimulate D0 from m00, u00 R times;

sðm00; u00Þ 1R

PRr

IðdðD0r ;DÞ , htÞ;if s . 0 then

mti m00;

uti u00;

if t ¼ 1 thenw0

i s(m00, u00);else

wit

Pðu00Þsðm00; u00ÞW ðm00; u00Þ ;

endi i þ 1;

endendNormalize w;

end

4 Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

In such a context, model selection is then performedby considering the posterior density of each modelmarginalized over the parameters,

PðmiÞ ¼ P0ðmiÞðui

PðDjuiÞPðuiÞ; ð2:7Þ

under some prior distribution P0(mi) over the modelsmi [ M. Taking this approach we can simply add anordinal parameter indicating the model mj of each par-ticle j [ f1, . . ., Ng used in the ABC-SMC algorithm,and doing so enables us to approximate equation(2.7), the marginal posterior probability distributionof the models for the population of particles at step t, as

PtðmÞ ¼X

ijmit¼m

wit : ð2:8Þ

Then the procedure outlined in algorithm 1 is modi-fied so that to generate a particle from population t,first a model is chosen according to its (marginal) prob-ability, Pt21(m), before one of the correspondingparticles is chosen. To perturb the resampled particle,two separate kernels, KM and Ku, are used; the first isused to propose a new model and the second to perturbthe model parameters. Here, for our kernel on the choiceof model, we propose to move to a new model chosenuniformly at random with probability p, or to staywith the current model with probability 1 2 p,although other choices of kernel are again possible.The update of the particle weights then also takes themodel parameter into account,

wii ¼

sðu00Þ if t ¼ 0;Pðu00Þsðu00ÞW ðm00; u00Þ otherwise;

8<: ð2:9Þ

where

W ðm00; u00Þ ¼XM

j

Pt�1ðmjÞKM ðm00;mjÞ

�X

kjmkt�1¼m00

wkt�1Kuðu00; ukÞ

Ptðm00Þ: ð2:10Þ

This procedure is outlined in algorithm 2, anddescribed in more detail in Toni & Stumpf [15]. Afterperforming the inference procedure, the final populationof particles at step T can then be used with equation(2.8) to derive the posterior model probabilities.

2.3. Model averaging

In many circumstances, it is not possible to decide infavour of any particular model; in such cases, the pos-terior probability of several candidates is appreciableand comparable and analysis should proceed by poolingthe results/predictions from these models, weightedby the relative evidence in their favour. This is preciselythe aim of the Bayesian model averaging. As had pre-viously been explored in Stumpf & Thorne [19], whenfitting several different models to interaction networkdata, it is possible to improve the accuracy of predictionsby averaging inferred statistics over all of the models [20].

J. R. Soc. Interface

Our method gives us the posterior probabilities foreach model under consideration as

PðMÞ ¼X

xjmðxÞ¼M

wðxÞ; ð2:11Þ

and so we can easily average an inferred parameter orstatistic over all of the models by simply taking theweighted average of the value given by each model

uav ¼Xm

PðMÞum: ð2:12Þ

It has to be borne in mind, however, that the roleof parameters (such as the rate of duplication) candiffer quite considerably between models dependingon the other factors considered by different models.Nevertheless even then predictions (e.g. for the totalnumber of interactions in a network, or any aspects ofthe graph spectrum) can improve under such a modelaveraging scheme.

2.4. Network distance measure

Given a graph G comprising a set of nodes N and edges(i,j) [ E with i,j [ N, the adjacency matrix A of thegraph is defined as the jNj.jNj matrix having entries

ai;j ¼1 if ði;jÞ [ E;0 otherwise:

�ð2:13Þ

The adjacency matrix captures all aspects of thenetwork structure and is therefore a complete represen-tation of the observed data, rather than a summary

Page 6: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf 5

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

statistic (such as degree distribution, clustering or cen-trality measures, motifs or graphlets). Here, we onlyconsider undirected networks and so taking edges (i,j)as an unordered pair the adjacency matrix A will be areal symmetric matrix. Clearly, the structure of the adja-cency matrix depends on some ordering of the nodes Nand will not be unique for an unlabelled graph. Thus, iso-morphic graphs may not necessarily have identicaladjacency matrices, even though their network structuresare the same. Of course, a simple relabelling of nodes willlead to identical adjacency matrices.

A simple distance measure between graphs havingadjacency matrices A and B, known as the edit dis-tance, is to enumerate the number of edges that arenot shared by both graphs,

DðA;BÞ ¼X

i

Xj

ðai;j � bi;jÞ2: ð2:14Þ

However, for unlabelled graphs, we are interested insome mapping h from i [ NA to i0 [ NB that minimizesthe distance

D0hðA;BÞ ¼X

i

Xj

ðai;j � bhðiÞ;hð jÞÞ2 ð2:15Þ

over all possible mappings of nodes between the twographs, since there is no fixed correspondence betweenthe unlabelled nodes. This mapping can be formulatedby applying some permutation matrix P to the matrixB. Then we seek to evaluate D 0P for P equal to the(unknown) optimal permutation matrix P, correspond-ing to the mapping that minimizes the distance D 0P,

D 0PðA;BÞ ¼ jjA� PB PT jj2 ¼ min

PD 0PðA;BÞ: ð2:16Þ

Since the evolutionary models we consider produceunlabelled graphs (although we could label them, any suchlabelling would necessarily be arbitrary and this wouldimpose an undesirable loss of generality in the models), werequire the latter form of the distance measure, D 0P .

Considering all possible permutations for pairs of net-works of some 5000 nodes, such as the Saccharomycescerevisae PIN, would be prohibitively expensive, butfortunately it is possible to approximate an optimal per-mutation. Considering the distance measure D 0P we canthen apply the theorem of Umeyama [21], which givesus an approximate lower bound on the edit distancebetween two graphs as

D 0PðA;BÞ �X

i

ðai � biÞ2; ð2:17Þ

where it is assumed that both A and B are Hermitianmatrices and ai and bi are the ordered eigenvalues of Aand B. Although this distance measure only gives us alower bound on the edit distance between the graphs, ithas been shown in Wilson & Zhu [22] that this distancemeasure is an excellent approximation and accuratelyreflects the edit distance measure between two graphs.The distance given by equation (2.17) is an approxi-mation of the distance between the complete data andnot summary statistics of the network as had been usedpreviously in composite likelihood [19] and ABC analyses[4] of networks.

J. R. Soc. Interface

The matrix eigenvalue calculation can itself be com-putationally expensive, and so in our implementation,we have used highly optimized commercial LAPACKroutines running on GPGPU hardware that providesperformance several orders of magnitude faster than aregular CPU for problems of the size we consider here.

2.5. Network growth models

Many random network models have been proposed inthe literature, and here we have chosen models with apreference for those with some biological relevance, inthe hope that they may help us to elucidate the pro-cesses of network evolution. It would entirely bepossible to also consider static network models, suchas Erdos–Renyi [23] or geometric graphs [24], butthese provide no insights into the generative mechan-isms underlying the evolution of biological networks.

We have chosen to take the prior probabilities P(u)of the model parameters and the prior P(m) of themodels themselves to be uniform over some appropriaterange, since in the absence of any prior knowledgedirectly corresponding to the model parameters or aconcrete preference for any particular model thisseems to be the most parsimonious approach. Below,we will discuss the models in the necessary detailrequired to understand our results and discussion.

2.5.1. Duplication modelsWe have considered two different duplication divergencemodels based on those proposed in the literature. The sim-plest model we examine is the duplication–divergence–heterodimerization model [2,12], allowing for interactionsto form between the original and the duplicated node, cor-responding to heterodimerization. This model, which wewill refer to as a duplication attachment (DA), illustratedinfigure 1, selects a nodeuniformlyat random fromthenet-work and duplicates the node, keeping each edgewith someprobability 1 2 d or diverging and losing the interactionwith probability d, always leaving he edges of the originalnode intact. Furthermore, an edge between the origi-nal and duplicated nodes is added with probability a,corresponding to the duplication of a heterodimer.

We also consider a DA preserving complementarity(DAC) model, similar in construction to those describedin Ispolatov et al. [12] and Vazquez et al. [2], where thecomplementarity of edges is preserved, but allowingedges to be lost from both the original and duplicatednode when divergence occurs, rather than only asymme-trically from the original node. As shown in figure 1,either the edge of the original node and its counterpartfrom the duplicated node are both kept with probability1 2 d, or in the case that divergence occurs (with prob-ability d) one of the edges is selected at random anddeleted, so that at least one of the pair is always kept.Again an edge is also added between the original andduplicated nodes with probability a.

2.5.2. Scale-free modelsA network is considered SF if the degree distribution ofthe nodes follows a power-law distribution in the limitof infinite network size, so that for node degrees k,P(k) � k2a, for some scaling coefficient a. We consider

Page 7: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

wi

wj

(a) (b)

(c) (d )

Figure 1. (a) Duplication attachment (DA) model. A node (red) is chosen to be duplicated and the duplicated node inherits theinteractions of the original with probability 1 2 d, or diverges and loses the interaction with probability d. An edge between theoriginal and duplicated nodes is added with probability a, modelling the possibility of a self-interaction that is preserved. (b)Duplication attachment with complementarity (DAC) model. The model proceeds as the DA model, except that at least oneedge in each of the green/blue pairs will be kept in the case of a divergence event, but either the interaction of the original orthe duplicated node may be lost. (c) Linear preferential attachment (LPA) model [5]. At each time step, a new node is addedto the network, and edges formed to the existing nodes with a probability proportional to their degree, so that edges are prefer-entially added to existing nodes of high degree. (d) General SF model [25]. The general SF model is a more sophisticatedpreferential attachment scheme whereby the scaling coefficient of the resulting degree distribution can be altered by themodel parameters. Edges begin with weight 1, and as new nodes are added to the network at each time step, a random existingedge is chosen based on the edge weights, and a node at one end of the edge selected at random is chosen for the new node to beconnected to. The edge weight of the chosen edge is then increased by the parameter m.

6 Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

both the LPA scheme of Barabasi & Albert [5] and themore complex scheme of Dorogovtsev & Mendes [25]that allows for the specification of the scalingcoefficient as a model parameter.

The LPA model of Barabasi & Albert [5] grows thenetwork by adding a single node at each time step,and attaching edges from this node to those in the exist-ing network with a probability proportional to theirdegree, and is illustrated in figure 1. Thus for a nodein the network of degree k the probability of attach-ment is k/2M, where M is the total number of edgesin the network. We also allow for multiple edges to beadded at each step by sampling the number of edgesto add from a Poisson distribution with mean m. Ifwe did not do so, then it would not be possible togrow networks with a ratio of nodes to edges otherthan 1:1; the Poisson distribution is a convenient wayof ensuring that the number of nodes and edges in thereal network can be achieved in the simulated data.

Such a scheme will produce a network with a degreedistribution whose scaling coefficient is always 3 [5],

J. R. Soc. Interface

whereas the generalized SF method of Dorogovtsev &Mendes [25] allows us to further parametrize themodel to vary the scaling coefficient of the degree distri-bution. We omit the details but describe the modelbriefly below, and illustrate the growth step infigure 1. Edges in the network are assigned weights,all of which are initially set to 1. At each time step, anew node is added to the network, and again anumber of edges sampled from Pois(m) is added fromthis node. Rather than adding edges preferentiallybased on the degree of nodes, the edge weights areused to select an edge, and the new node is attachedto a randomly selected end of the chosen edge. Finally,the weight of the selected edge is increased by (the par-ameter) v. Such a scheme generates a network withscaling coefficient 2 þ (1/(1 þ 2v)) [25].

2.5.3. Generalized modelsWe also consider two alternative models that allow forboth duplication-divergence dynamics, as well as

Page 8: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

Algorithm 3. Network growth model simulation withsampling.Input: model m, parameters u, NT proteins in organism,

NS proteins in interactome data of organismOutput: Sampled network of NS nodes, grown from

model m with parameters uGrow network a with NT nodes, according to model mand parameters u;Create empty network b;for i 1 to NS do

Sample node x from Nodes(a) \ Nodes(b);Add node x to b;

endfor (x,y) [ Edges(a) do

if x [ Nodes(b) and y [ Nodes(b) thenAdd edge (x,y) to b;

endendreturn b

Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf 7

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

random addition of edges by either a uniform randomattachment [26] or a preferential attachment scheme[5]. Both models employ a parameter p, the probabilityof performing a duplication move at each step, while arandom edge addition move is performed with prob-ability 1 2 p. Again, we allow for multiple edges to beadded during each step for the random edge additionmoves, with the number of edges to be attacheddrawn from a Poisson distribution with parameter m.

The first such model combines the DA preserving com-plementarity scheme described above with a simplerandom addition of edges (DACR). During a randomedge addition step, the new node is added to the network,and then a number of edges is sampled according toPois(m), and each edge is assigned to two nodes chosenuniformly at random from the network.

The second model again uses the duplication diver-gence preserving complementarity scheme but usesLPA for the edge addition steps (DACL). Thus,during a random edge addition step the new node isadded to the network, a number of edges is sampledaccording to Pois(m), and we attach each edge fromthe new node to the existing nodes with a probabilityproportional to their degree.

2.6. Sampling

It has previously been reported [19,27,28] that theeffects of sampling on network data can bias inferencesmade under the assumption that the network structureof the subsample is representative of the structuralproperties of the full network. Since the network datawe are using are in fact only a subnetwork of the inter-action network existing in the organism, we include thisfact in our model to prevent the effects of sampling frombiasing the results. Currently available interactiondatasets only include a subset of the genes known toexist in the respective organism, and so we apply asimple model of a sampling scheme to attempt toinclude the incompleteness of the data in the analysis.In the absence of more detailed information on theexperimental sampling applied in generating the data,we take the parsimonious approach of assuming thateach protein in the full interactome is sampled uni-formly to yield the observed interaction data.

The sampling model is incorporated into our methodby growing networks up to the size of the number ofgenes known to exist in the organism in questionrather than the number of proteins in the interactiondataset. A random subset of the nodes of the networkis then taken to reduce it in size to the same numberof nodes as the observed interaction data and the sub-network induced by these nodes is then used in theanalysis in place of the larger network. While this meth-odology will allow for our induced subnetworks toinclude nodes of degree zero, which are absent in theobserved data, in the absence of any tractable alterna-tive methodology, we feel such an approximation is asuitable trade-off in allowing us to consider the effectof sampling on the inference. Thus, in our inference ofdegree distributions described in §3.2.2, we correct forthe fact that the available data never contain nodes ofdegree zero.

J. R. Soc. Interface

In order to simulate network data for a particularmodel in the ABC-SMC algorithm, we apply themethod outlined in algorithm 3. This gives us a networkof the same number of nodes as the sampled interac-tome data being used, but allows us to infer theparameters of the full network by growing our simu-lated models to the size of the full proteome of theorganism in question.

2.7. Implementation

The method described was implemented in a mixture ofPYTHON (www.python.org) and C þþ code [29], withthe framework of the SMC method implemented inPYTHON, while using C þþ to improve the performanceof the network growth model simulation code. Softwareis publicly available from www.theosysbio.org. As men-tioned previously the matrix eigenvalue computationsbecome prohibitively expensive for networks of thesize considered here (e.g. around 5000 nodes). There-fore, we have used the CULAtools GPU linear algebralibrary (www.culatools.com) to perform the matrix cal-culations on GPGPU hardware, greatly increasing thespeed of the calculations compared to a conventionalCPU implementation. Using the CULAtools GPGPULAPACK library implementation gives approximatelya four times speed-up compared with a CPU optimizedLAPACK implementation. Even with such optimiz-ations producing posterior estimates can be costly,and takes around 12 h using an NVIDIA Tesla C2050GPU and a 6 core 3.3 GHz Intel Core i7 CPU.

3. RESULTS

Network evolution is a highly complex and contingentprocess; by design, the models considered here are vastlyoversimplified compared with the true evolutionary pro-cess. Because of the correlated nature of the data it isnot expected that we can always unambiguously identifythe true data-generating process. To investigate and illus-trate this point—generic to reverse engineering problems

Page 9: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

0

0.5

1.0

1.5

dens

ity

0

0.2

0.4

0.6

d

0

0.2

0.4

0.6

d

0

0.2

0.4

0.6

d

0

0.4

0.8

a

0

0.1

0.2

0.3

0.4de

nsity

0

0.4

0.8

a

0

0.4

0.8

a

0

0.4

0.8

p

0

0.4

0.8

p

0

0.1

0.2

0.3

0.4

dens

ity

0

0.4

0.8

p

0 0.2 0.4 0.6

0

2

4

6

8

10

d

m

0 0.4 0.8

0

2

4

6

8

10

a

m

0 0.4 0.8

0

2

4

6

8

10

p

m

0 2 4 6 8 10

0

0.04

0.08

0.12

m

dens

ity

Figure 2. Posterior densities of each of the four parameters of the DACR model used to generate our test network. The actualparameter values are shown as vertical bars in the marginal density plots for each parameter. Contour plots illustrating theposterior densities of pairs of parameters are shown in the off-diagonal blocks.

8 Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

[30,31]—we first consider synthetic data before ananalysis of real protein–protein interaction data.

3.1. Simulated data

To evaluate the ability of our method to effectivelyapproximate the posterior distribution of the modelparameters, we have performed two tests on simulateddata generated from a known model. We assess the per-formance of the method in estimating the parameters ofa single model, and in performing model selection. Sincesampling may have an impact on the ability to infer theposterior, as the data are deteriorated, we also test theperformance of our two test cases under varying degreesof sampling, discarding a fraction of nodes in thesimulated network.

Our simulated data are taken from the DACR model,with parameters d ¼ 0.4, a ¼ 0.25, p ¼ 0.7 and m ¼ 3,grown to a size of 5000 nodes and having 25 099 edges.

J. R. Soc. Interface

Attempting to infer the known model parametersusing the full dataset, we obtain the posterior densitiesshown in figure 2. It appears that the posterior prob-ability is centred around the correct values, andconsidering the inference is attempting to infer par-ameters of a stochastic model from a single sample,uncertainty in the resulting posterior distribution is tobe expected.

The posterior model probabilities illustrated infigure 3 show that while the model from which thedata were generated does not have the highest prob-ability on average, it is still the second highest andthe distributions of the values are close to overlapping,while the similar DACL model has the highest prob-ability, suggesting that the single sample of networkstructure from the model is not sufficient to correctlydiscriminate the two. Taking samples from the gener-ated data of 25, 50 and 75 per cent of the nodes, theposterior densities for the DACR model from whichthe data were generated shown in figure 4 reveal an

Page 10: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

model

mod

el p

roba

bilit

y

0 ●

DA DAC LPA SF DACL DACR

0.1

0.2

0.3

0.4

0.5

0.6

Figure 3. Distribution of posterior model probabilities for eachof the models for our test data (generated from the DACRmodel) over 10 simulated datasets having identical par-ameters. While the posterior probabilities do not correctlyidentify the model used to generate the data as the mostlikely, there is a large variance in the results and this simplyindicates that the data available are not sufficient to differen-tiate between the models.

Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf 9

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

interesting trend as the sampling fraction decreases. Forthe 75 per cent sample, the posterior distributionappears to be mostly centred around the same valuesas for the full network and almost as specific, whereasfor the 50 and 25 per cent samples, the posterior distri-butions become much broader (and potentially biased),and particularly in the case of a, less specific and spreadacross the parameter range.

These findings reflect the general problems encoun-tered in addressing the so-called inverse problems, andare not specific to the approach developed here. Theconsistency of statistical estimators is only an asympto-tic property and for small data samples (here, we haveonly a single network) inferences are always subject tothe variabilities of the data-generating process and theestimator. This explains also the need to considerBayesian model averaging approaches.

3.2. Protein interaction network data

We applied our method to publicly available PIN data-sets of varying completeness and size to allow us toexamine the results and performance of the inferenceon differing kinds of data. The data used are summar-ized in table 1, and were downloaded from theDatabase of Interacting Proteins [32] (DIP; http://dip.doe-mbi.ucla.edu/dip/). The S. cerevisae datasetis the most complete, with a large fraction of thegenes in the organism being included in the network,and a large number of edges. The D. melanogaster data-set is of a larger size, but represents a smaller fraction ofthe genes known to exist in the organism, while the

J. R. Soc. Interface

Helicobacter pylori and Escherichia coli datasets aremuch smaller, and again represent small samplingfractions of their respective PINs.

3.2.1. Model selection and model parametersApplying our method to the protein interaction datasummarized in table 1 we obtained the posteriormodel probabilities shown in figure 5. The resultsshow a strong preference for the DACL and DACRmodels in almost all cases, except for in S. cerevisaewhere the largest posterior model probability corre-sponds to the DA model. Interestingly, in all cases theLPA and SF models have near zero probabilities,suggesting that these models do not fit the data aswell as previously claimed. The majority of differencesbetween the species appear to be between S. cerevisaeand the other three species, with D. melanogaster,H. pylori and E. coli all exhibiting similar profiles. Inter-estingly, both D. melanogaster and H. pylori have asmall probability for the DAC model not seen in theother species, while E. coli has a larger preference forthe DACL model, and less so for the DAC model.

Looking at the posterior density plots for the differentspecies for the models where there were enough particlesavailable to calculate the densities shown in figure 6, wesee that the difference between the datasets is moreclear. The most striking aspects are the similarity ofthe posterior densities for the D. melanogaster andH. pylori data across all of the models, and the signifi-cantly different shape of the posterior in E. coli formany of the parameters.

In many cases the posterior densities of the commonparameters appear to be centred around similar valuesacross all of the models, for example, with the parameterd, common to all of the duplication models, the peaksare close and of a similar shape in most cases for theS. cerevisae, H. pylori and D. melanogaster data.

3.2.2. Model averaging of network statisticsTo evaluate the performance of our model averagingand sampling scheme, we attempted to infer thedegree distribution of the observed S. cerevisae PINdata from our posterior particles for samples of 25 and50 per cent of the nodes of the S. cerevisae PIN, aswell as the full data.

As can be seen in figure 7, both the degree distri-bution inferred from the full S. cerevisae network andthe 50 per cent sample appear to fit the data well,while the 25 per cent sample does not perform as well.

4. DISCUSSION

The performance of our method on the generated testdata (figure 2) illustrates the efficacy of our approachin reconstructing the model parameters for a given evol-utionary model. While it would be unrealistic to assumeour models correspond to the only mechanisms of net-work evolution at work and completely capture theirbehaviour, comparing such general mechanisms allowsus to distinguish between the probable evolutionaryprocesses at work. Although the model selection resultsin figure 3 show that we do not give the highest

Page 11: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

1.55

4

3

2

1

0

1.2

1.0

0.8

0.6

0.4

0.2

00

0.5

1.0

1.5

dens

ityde

nsity

dens

ityde

nsity

2.0

2.5

0.5

0

0 0.2 0.4 0.6 0.8 1.00 0.2 0.4 0.6 0.8 1.0

0 2 4 6 8 100 0.2 0.4 0.6 0.8 1.0p m

1.0

d a

Figure 4. Posterior densities under different degrees of sampling for the four parameters of the DACR model used to generate thetest network and samples. Sampled networks were generated by uniformly sampling a fraction of nodes and taking the inducedsubnetwork, for sample sizes of 75%, 50% and 25%. For samples of less than 50% of the nodes the posterior densities clearly differfrom the actual model parameters (vertical lines). Samples: orange solid line, full; green dashed line, 75%; blue dashed line, 50%;purple dashed line, 25%.

Table 1. Summary of the PIN data used in the study. Datasets of varying size and sampling fraction were chosen so as to allowus to evaluate the performance of the method on a selection of different kinds of protein interaction data, representative ofthose currently available.

species proteins interactions genome size sampling fraction

S. cerevisae 5035 22118 6532 0.77D. melanogaster 7506 22871 14076 0.53H. pylori 715 1423 1589 0.45E. coli 1888 7008 5416 0.35

10 Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

posterior probability to the correct model, the differ-ence in mechanisms between the two generalizedduplication models (DACL and DACR) is subtle andwe correctly assign much lower probabilities to theother models, so this simply suggests that it may notbe possible to tell them apart from a single networkstructure sampled from the stochastic DACR model.

As shown in figure 4, the effects of our sampling onthe inferred parameters for our test dataset show thatwhile the accuracy is reduced, we can in most cases

J. R. Soc. Interface

reconstruct network parameters of the full networkaccurately for samples of 75 per cent of the full network.As the sampling size is reduced to 25 per cent the pos-terior distribution for the parameter a becomes spreadacross the parameter range, reflecting the inadequacyof the sampled data to allow for inference of the trueparameter value. For the other parameters the resultsappear to be biased, suggesting that the samplingaffects the ability of our inference procedure to infereach of the parameters in different ways.

Page 12: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

model

mod

el p

roba

bilit

y

0

0.1

0.2

0.3

DA DAC LPA SF DACL DACR

0.5

0.4

Figure 5. Model probabilities for each of the six models in thefour species. There is a strong preference for the DACL orDACR models in all cases but for the S. cerevisae dataset,which is fit best by the DA model. Both of the SF modelshave near zero posterior probabilities, suggesting that theyprovide a poor fit to the data (orange, S. cerevisae; green,D. melanogaster; blue, H. pylori; purple, E. coli).

Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf 11

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

The results of the model selection for the four PINdatasets we considered (figure 5) show that both ofthe SF models are poor at explaining the observeddata in all cases. In agreement with the previous analy-sis of Ratmann et al. [4] and Middendorf et al. [13], themodels combining duplication mechanics with somerandom addition of edges (DACL, DACR) seem to pro-vide the best fit to the data, with only S. cerevisaeshowing a slight preference for the simplest DA model.

The posterior probabilities we infer for the modelparameters of the different PIN datasets in figure 6show an interesting pattern whereby the majority ofparameters show similar distributions across thespecies, as well as across the different models. Forexample, the divergence parameter d, describing theprobability of duplicated edges being lost, appears toshare a common value of around 0.4 across all of themodels and the majority of the species, and the par-ameter a shows a similar trend although the posteriordistributions are less specific for the DACL andDACR models. This may be due to the fact that infer-ence of a appears to become increasingly difficult as thesampling fraction decreases, and three of our four PINdatasets represent small sampling fractions of around50 per cent and less. The parameter p describing theprobability of performing a duplication step or arandom edge addition step appears to be unspecificexcept for in the case of E. coli where the posterior

J. R. Soc. Interface

density seems to be centred around 0.5, while thenumber of edges added in each random edge additionstep is around 3 for all the species.

The agreement between the posterior densities ofH. pylori and D. melanogaster is striking especiallydue to the extreme difference in the size andthe number of interactions of the datasets, with theH. pylori PIN being of a much smaller size. TheE. coli dataset has a similarly small number of nodesas the H. pylori data, but far more edges relative tothe number of nodes than any other datasets. Thismay go some way in explaining the differences apparentbetween the posterior parameter distributions forE. coli and the other species, or it may, on the otherhand, be due to the low sampling fraction of the data,most probably a combination of the two.

Looking at the degree distributions, we infer for theS. cerevisae PIN by applying model averaging to theposterior particles inferred based on some sampled sub-sets of the observed network in figure 7, it appears thatwe can accurately reconstruct the degree distribution ofthe observed network based only on a sample of 50 percent of the nodes. Then applying this to the four PINdatasets we have considered, we would expect that theS. cerevisae, D. melanogaster and H. pylori datawould allow us to accurately reconstruct the degree dis-tributions of the full networks from the sampled PINdata we have used. These results demonstrate the uti-lity of our methodology not only in elucidating theevolutionary processes at work, but also in inferringproperties of the as yet unknown full network structurefrom the sampled data by including the samplingprocess in our models.

5. CONCLUSION

We have demonstrated the ability of our method to cor-rectly infer network growth model parameters fromobserved network data and illustrated its applicationto existing PIN data. We feel that our method providesboth novel techniques and results that reveal insightsinto the evolutionary processes at work. As more com-plete and accurate protein interaction data becomeavailable in the future, we would expect these tech-niques to allow us to make progressively more precisepredictions and comparisons between species.

Simple models like the ones considered here arevastly idealized and oversimplified models of a muchmore complicated and contingent evolutionary process.On the one hand, we use such models to gain qualitativeinsights into the evolution of networks; some of ourmodels are somewhat more realistic compared with,for example, simple SF models, in the same sense inwhich Kimura’s 2-parameter (K2P) model for nucleo-tide substitutions is arguably more realistic than thesimpler Jukes–Cantor (JC) model. But neither theJC or the K2P, nor our models of network evolutionconsider functional, structural or indeed any biologicalconstraints on the evolutionary dynamics. This mayseem like a glaring omission, but is done out of neces-sity. First, we have no way of capturing these factorsreliably and without extraneous and difficult to justify

Page 13: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

DA

00

5

10

15

0

5

10

15

2

4

6

8

0

2

4

6

8

0

1

2

3

4

0

2

4

6

8

10

0

2

4

6

8

0

2

4

6

8

10

0

1

2

3

4

0

1

2

3

4

5

p

0

0.2

0.4

0.6

0.8

1.0

0

0.2

0.4

0.6

0.8

1.0

m0 0.4 0.8 0 0.4 0.8 0 0.4 0.8 0 2 4 6 8 10

d a

DA

CR

DA

CL

DA

C

Figure 6. Posterior densities for the model parameters where there were a suitable number of particles in the posterior sample tocalculate the distribution. Only the DA, DAC, DACL and DACR models had sufficient numbers of particles. The datasets are:red, S. cerevisae; orange, D. melanogaster; green, H. pylori; blue, E. coli.

12 Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

assumptions; second, these functionally ‘ignorant’models can be used as null models/hypotheses. Compar-ing and contrasting real networks with those generatedby simplified simulation models can highlight systematicdifferences; these can be caused by functional factors orby events such as whole genome duplications which arenot captured by these evolutionary models.

A slightly more pragmatic use of such modelsand model calibration is for predictive purposes andcomparison of large-scale network features betweenspecies. Bayesian model averaging has been shown topossess considerable predictive power even if the under-lying models are known to be oversimplified orinadequate. Pooling over predictions weighted bythe model fit to the data has the potential to yieldtestable and non-trivial predictions of the propertiesof complete networks (based on incomplete data).

An advantage of our approach is that spectralapproaches allow us to compare networks more

J. R. Soc. Interface

comprehensively than has previously been the case[26,33,34]. They incorporate implicitly the informationcontained in standard network summary statistics—degree distribution, clustering coefficient, distances,etc.—and also allow us a direct means of comparinggraphs (in an ABC framework) rather than resortingto the more coarse-grained summary statisticsthat had been considered in the past [35]. As exactlikelihood-based inferences are only possible for verysimple growth models [36], the use of ABC is not onlyjustified but also in fact unavoidable in model-basedanalysis of network evolution [4].

The statistical tools introduced here allow us to com-pare network data and models of their evolutionarydynamics; in principle, we can also choose to focus oneither quantitative or qualitative aspects of networkdata, depending on the quality of available data (orthe details captured by different models). Ultimately,there is no reason to be disappointed if all that we

Page 14: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

degree

coun

t

10−0.5

1

100.5

101.0

101.5

102.0

102.5

103.0

20 40 60 80

data

observedMMA, full

MMA, 75% sampleMMA, 50% sampleMMA, 25% sample

Figure 7. Degree distribution of the observed S. cerevisae PIN(orange), with plots of the model averaged distributioninferred from the posterior particles across all the models,using the full S. cerevisae dataset (olive), a 75% sample(green), a 50% sample (blue) and a 25% sample (purple).

Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf 13

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

achieve is to reveal the inadequacies of existing modelsof network evolution. Being able to do so will in itselfyield new insights into the evolutionary history ofthese networks.

T.T. and M.P.H.S. gratefully acknowledge support from theBBSRC (BB/F005210/2). M.P.H.S. is a Royal SocietyWolfson Research Merit Award holder.

REFERENCES

1 Sole, R., Satorras, R. P., Smith, E. & Kepler, T. 2002 Amodel of large-scale proteome evolution. Adv. ComplexSyst. 5, 43–54. (doi:10.1142/S021952590200047X)

2 Vazquez, A., Flammini, A., Maritan, A. & Vespignani, A.2003 Modeling of protein interaction networks. Complexus1, 38–44. (doi:10.1159/000067642)

3 Han, J.-D. J. et al. 2004 Evidence for dynamically orga-nized modularity in the yeast protein–protein interactionnetwork. Nature 430, 88–93. (doi:10.1038/nature02555)

4 Ratmann, O., Andrieu, C., Wiuf, C. & Richardson, S. 2009Model criticism based on likelihood-free inference, with anapplication to protein network evolution. Proc. Natl

J. R. Soc. Interface

Acad. Sci. USA 106, 10 576–10 581. (doi:10.1073/pnas.0807882106)

5 Barabasi, A. & Albert, R. 1999 Emergence of scaling inrandom networks. Science 286, 509–512. (doi:10.1126/science.286.5439.509)

6 Albert, R. & Barabasi, A. 2002 Statistical mechanics ofcomplex networks. Rev. Mod. Phys. 74, 47–97. (doi:10.1103/RevModPhys.74.47)

7 Stumpf, M. & Ingram, P. 2005 Probability models fordegree distributions of protein interaction networks. Euro-phys. Lett. 71, 152–158. (doi:10.1209/epl/i2004-10531-8)

8 Tanaka, R., Ti, T. & Doyle, J. 2005 Some protein inter-action data do not exhibit power law statistics. FEBSLett. 579, 5140–5144. (doi:10.1016/j.febslet.2005.08.024)

9 Khanin, R. & Wit, E. 2006 How scale-free are biologicalnetworks? J. Comp. Biol. 13, 810–818. (doi:10.1089/cmb.2006.13.810)

10 Ispolatov, I., Krapivsky, P. L. & Yuryev, A. 2005 Dupli-cation-divergence model of protein interaction network.Phys. Rev. E 71, 061911. (doi:10.1103/PhysRevE.71.061911)

11 Pastor-Satorras, R., Smith, E. & Sole, R. V. 2003 Evolvingprotein interaction networks through gene duplication. J.Theoret. Biol. 222, 199–210. (doi:10.1016/S0022-5193(03)00028-6)

12 Ispolatov, I., Krapivsky, P., Mazo, I. & Yuryev, A. 2005Cliques and duplication-divergence network growth. NewJ. Phys. 7, 145. (doi:10.1088/1367-2630/7/1/145)

13 Middendorf, M., Ziv, E. & Wiggins, C. H. 2005 Inferringnetwork mechanisms: the Drosophila melanogaster proteininteraction network. Proc. Natl Acad. Sci. USA 102,3192–3197. (doi:10.1073/pnas.0409515102)

14 Harvey, P. H. & Pagel, M. D. 1998 The comparativemethod in evolutionary biology. New York, NY: OxfordUniversity Press.

15 Toni, T. & Stumpf, M. P. H. 2010 Simulation-based modelselection for dynamical systems in systems and popula-tion biology. Bioinformatics 26, 104–110. (doi:10.1093/bioinformatics/btp619).

16 Doucet, A., Freitas, N. & Gordon, N. (eds) 2001 Sequen-tial Monte Carlo methods in practice. Berlin, Germany:Springer.

17 Del Moral, P., Doucet, A. & Jasra, A. 2006 SequentialMonte Carlo samplers. J. R. Statist. Soc. Ser. B 68,411–436. (doi:10.1111/j.1467-9868.2006.00553.x)

18 Toni, T., Welch, D., Strelkowa, N., Ipsen, A. & Stumpf,M. P. 2009 Approximate Bayesian computation schemefor parameter inference and model selection in dynamicalsystems. J. R. Soc. Interface 6, 187–202. (doi:10.1098/rsif.2008.0172)

19 Stumpf, M. & Thorne, T. 2006 Multi-model inference ofnetwork properties from incomplete data. J. Integr. Bioin-formatics 3, 32. (doi:10.2390/biecoll-jib-2006-32)

20 Burnham, K. P. & Anderson, D. R. 2002 Model selectionand multi-model inference: a practical information-theor-etic approach. Berlin, Germany: Springer.

21 Umeyama, S. 1988 An eigendecomposition approach toweighted graph matching problems. IEEE Trans. PatternAnal. Mach. Intell. 10, 695–703. (doi:10.1109/34.6778)

22 Wilson, R. C. & Zhu, P. 2008 A study of graph spectra forcomparing graphs and trees. Pattern Recogn. 41, 2833–2841. (doi:10.1016/j.patcog.2008.03.011)

23 Erdos, P. & Renyi, A. 1959 On random graphs I. Publ.Math. Debrecen 5, 290–297.

24 Higham, D. J., Raajski, M. & Prulj, N. 2008 Fitting ageometric graph to a protein–protein interaction network.Bioinformatics 24, 1093–1099. (doi:10.1093/bioinformatics/btn079)

Page 15: Graph spectral analysis of protein interaction network ...lliao/cis889f12/papers/... · to be the case for networks and therefore the spectral perspective taken here, which captures

14 Graph spectral analysis of protein T. Thorne and M. P. H. Stumpf

on October 10, 2012rsif.royalsocietypublishing.orgDownloaded from

25 Dorogovtsev, S. N. & Mendes, J. F. F. 2004 Minimalmodels of weighted scale-free networks. Preprint.(http://arxiv.org/abs/cond-mat/0408343)

26 Callaway, D. S., Hopcroft, J. E., Kleinberg, J. M.,Newman, M. E. & Strogatz, S. H. 2001 Are randomlygrown graphs really random? Phys. Rev. E 64, 041902.(doi:10.1103/PhysRevE.64.041902)

27 Stumpf, M., Wiuf, C. & May, R. 2005 Subnets of scale-freenetworks are not scale-free: the sampling properties of net-works. Proc. Natl Acad. Sci. USA 102, 4221–4224.(doi:10.1073/pnas.0501179102)

28 de Silva, E., Thorne, T., Ingram, P., Agrafioti, I., Swire, J.,Wiuf, C. & Stumpf, M. 2006 The effects of incompleteprotein interaction data on structural and evolutionaryinferences. BMC Biol. 4, 39. (doi:10.1186/1741-7007-4-39)

29 Stroustrup, B. 2000 The Cþþ programming language:special edition, 3rd edn. Reading, MA: Addison-WesleyProfessional.

30 Werhli, A., Grzegorczyk, M. & Husmeier, D. 2006 Com-parative evaluation of reverse engineering generegulatory networks with relevance networks, graphicalgaussian models and bayesian networks. Bioinformatics22, 2523–2531. (doi:10.1093/bioinformatics/btl391)

31 Opgen-Rhein, R. & Strimmer, K. 2007 Learning causalnetworks from systems biology time course data: an

J. R. Soc. Interface

effective model selection procedure for the vector autore-gressive process. BMC Bioinformatics 8(Suppl. 2), S3.(doi:10.1186/1471-2105-8-S2-S3)

32 Xenarios, I., Salwınski, L., Duan, X. J., Higney, P., Kim,S.-M. & Eisenberg, D. 2002 DIP, the database of interact-ing proteins: a research tool for studying cellular networksof protein interactions. Nucleic Acids Res. 30, 303–305.(doi:10.1093/nar/30.1.303)

33 Goh, K.-I., Oh, E., Jeong, H., Kahng, B. & Kim, D. 2002Classification of scale-free networks. Proc. Natl Acad. Sci.USA 99, 12 583–12 588. (doi:10.1073/pnas.202301299)

34 Higham, D. J., Rasajski, M. & Przulj, N. 2008 Fitting ageometric graph to a protein–protein interaction network.Bioinformatics 24, 1093–1099. (doi:10.1093/bioinformatics/btn079)

35 Ratmann, O., Jorgensen, O., Hinkley, T., Stumpf, M.,Richardson, S. & Wiuf, C. 2007 Using likelihood-freeinference to compare evolutionary dynamics of theprotein networks of H. pylori and P. falciparum. PLoSComput. Biol. 3, 2266–2278. (doi:10.1371/journal.pcbi.0030230)

36 Wiuf, C., Brameier, M., Hagberg, O. & Stumpf, M. 2006 Alikelihood approach to the analysis of network data. Proc.Natl Acad. Sci. USA 103, 7566–7570. (doi:10.1073/pnas.0600061103)