

A Unified Representation of Multiprotein Complex Data for Modeling Interaction Networks

Chris Ding,1* Xiaofeng He,1 Richard F. Meraz,2 and Stephen R. Holbrook2

1 Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, California
2 Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, California

ABSTRACT The protein interaction network presents one perspective for understanding cellular processes. Recent experiments employing high-throughput mass spectrometric characterizations have resulted in large data sets of physiologically relevant multiprotein complexes. We present a unified representation of such data sets based on an underlying bipartite graph model that is an advance over existing models of the network. Our unified representation allows for weighting of connections between proteins shared in more than one complex, as well as addressing the higher level organization that occurs when the network is viewed as consisting of protein complexes that share components. This representation also allows for the application of the rigorous MinMaxCut graph clustering algorithm for the determination of relevant protein modules in the networks. Statistically significant annotations of clusters in the protein–protein and complex–complex networks using terms from the Gene Ontology indicate that this method will be useful for posing hypotheses about uncharacterized components of protein complexes or uncharacterized relationships between protein complexes. Proteins 2004;57:99–108. © 2004 Wiley-Liss, Inc.

Key words: protein complex; supercomplex; gene ontology; bipartite graph; cluster analysis; network biology

INTRODUCTION

Proteins carry out most essential cellular processes in complex multiprotein assemblies. These protein complexes perform activities needed for metabolism, communication, growth, and structure. A systematic identification, characterization, and understanding of these molecular machines of life will provide an essential knowledge base and link proteome dynamics and architecture to cellular function and phenotype. A variety of experimental and computational approaches have been used to deduce the constituents of protein macromolecular complexes. Experimental approaches such as the yeast two-hybrid genetic screen yield binary interaction data. More recent high-throughput methods combine tagged “bait” proteins and protein complex purification schemes with mass spectrometric measurements to yield physiologically relevant data on intact multiprotein complexes.1–4 Taken together, data from these experiments approximate the network of interactions between proteins and protein complexes that govern most cellular processes.

The representation of functional relationships within the interaction network is important.5 Most studies have represented protein interaction data as a map of binary interactions with uniformly weighted connections between interacting proteins.6,7 When applied to multiprotein complex data, this binary model assumes a pairwise interaction between all constituents in a complex. This equal weighting, however, is an oversimplification, because physical interactions between constituents cannot be unambiguously described for all complexes without rigorous structural analysis. Some efforts have moved beyond the binary interaction model. The “spoke” model7 assumes pairwise interactions only between the purification “bait” and proteins that copurify in the complex. A hypergraph model extends the network structure to allow hyperedges corresponding to protein complexes to connect arbitrarily many proteins in the network.8 Models that search the network for more general topological structures have been used to delineate functional relationships between interacting proteins.9,10

The most important limitation of existing models of the protein interaction network is their inability to represent a higher order organization of the proteome that results from the consideration of network relationships between protein complexes. A recent review by Gavin, Superti-Furga, and coworkers4 discusses the major issues concerning protein complexes and proteome organization, and gives several examples of the modularity of protein complexes and their ability to share components and interact in complex cellular processes. Deshaies et al.11 have used the term megacomplex to describe complex protein assemblages that are distinguished by the diversity and number of interacting partners. A model of the protein interaction network that adequately deals with relationships between protein complexes would be an important step toward a framework for a systems-level understanding of cellular processes.

Grant sponsor: College of Science and Engineering Education at LBNL (to R. Meraz). Grant sponsor: U.S. Department of Energy, Office of Science (Office of Advanced Scientific Computational Research, MICS Division and a LBNL LDRD); Grant number: DE-AC03-76SF00098.

*Correspondence to: Chris Ding, Computational Research Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720. E-mail: [email protected]

Received 11 September 2003; Accepted 18 December 2003

Published online 22 June 2004 in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/prot.20147

PROTEINS: Structure, Function, and Bioinformatics 57:99–108 (2004)

© 2004 WILEY-LISS, INC.

We propose a novel representation of multiprotein complex data that treats proteins and protein complexes in a unified manner. This representation emphasizes the “duality” of the relationship: A protein complex is characterized by its constituent proteins, while the interaction between two proteins can be gauged by the number of protein complexes that contain these proteins. This duality is best captured by a bipartite graph (Fig. 1) specified by an adjacency matrix $B = (b_{ij})$, in which a protein complex is represented by a column and a protein is represented by a row. This bipartite representation of a multiprotein complex data set leads to a coherent framework for interaction networks:

1. The protein–protein (p-p) interaction network arises naturally. If we define the interaction strength between two proteins as the number of complexes that contain the two proteins, this interaction strength is given precisely by the adjacency matrix $BB^T$.

2. Importantly, a protein complex–protein complex (c-c) interaction network also arises from this representation. If we define the interaction strength between two protein complexes as the number of common proteins shared between them, then this interaction strength is given by the adjacency matrix $B^TB$.

These interaction networks form a unified framework that overcomes two shortcomings of previous work: (1) The c-c interaction network yields a higher level organization of cellular processes; and (2) the interaction strength of connections in the networks is more realistic than simple uniform weighting. See the Methods section for more details.

The more quantitative interaction strength of network connections in our dual representation allows for the application of a rigorous graph clustering algorithm.12 The goal of clustering the protein interaction network is to determine its component modules, their functional annotations, and the relationships between them. A module in a biological network is loosely defined as a functional unit separable from the rest of the network.5 In this context, the terms modules and computationally discovered clusters are interchangeable. Our hypothesis is that such computationally discovered clusters would encompass proteins related through physical and possibly temporal associations in functionally coincident macromolecular complexes (p-p network), or reveal diverse relationships among cellular processes composed of functionally related protein complexes (c-c network).

METHODS

Protein Complex Data Can Be Modeled as a Bipartite Graph

The representation of a multiprotein complex data set as a bipartite graph allows us to immediately infer a number of important quantities and to apply a large body of existing graph techniques.

A bipartite graph has two types of nodes: p-type nodes that denote proteins (or p-nodes) and c-type nodes that denote protein complexes (c-nodes). This graph structure only allows connections between p-nodes and c-nodes. Thus, a protein complex (c-node) has edges connecting to each of its constituent proteins (p-nodes) (Fig. 1). A bipartite graph is uniquely determined by its adjacency matrix $B = (b_{ij})$. Let $c_1, c_2, \ldots, c_n$ denote protein complexes and $p_1, p_2, \ldots, p_n$ denote constituent proteins. Define

$$b_{ij} = \begin{cases} 1 & \text{if protein } p_i \text{ is in complex } c_j \\ 0 & \text{otherwise;} \end{cases} \tag{1}$$

that is, a protein complex is represented by a column in $B$, where each entry is either 1 or 0, with a 1 indicating that the complex contains the protein of the corresponding row. Similarly, a protein is represented by a row in $B$. For consistency, we refer to the relationship between proteins and complexes represented by the bipartite graph as the p-c network. Starting from the p-c network, we can naturally obtain the following two networks.

Protein–Protein Interactions (p-p Network)

The interaction strength between two proteins $p_i, p_j$ is

$$(BB^T)_{ij} = \{\text{number of protein complexes containing both proteins } p_i, p_j\}. \tag{2}$$

Note that $(BB^T)_{ii} = \sum_j b_{ij}$ gives the number of protein complexes in which protein $p_i$ is contained. We call this the weight of protein $p_i$.

Complex–Complex Associations (c-c Network)

The interaction strength between two protein complexes $c_i, c_j$ is

$$(B^TB)_{ij} = \{\text{number of proteins shared by protein complexes } c_i, c_j\}. \tag{3}$$

Note that $(B^TB)_{jj} = \sum_i b_{ij}$ gives the number of proteins contained in complex $c_j$. We call this the weight of protein complex $c_j$.

Fig. 1. A bipartite graph representation of a hypothetical protein complex data set. The p-nodes represent proteins and c-nodes represent experimentally determined protein complexes. An edge between a p-node and a c-node indicates that the protein is contained in the protein complex.
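As a rough illustration of Eqs. (1) to (3), the following Python sketch builds $B$ from a small made-up membership list and derives the p-p and c-c networks from it. The protein and complex names are hypothetical, NumPy is an assumed dependency, and this is not the authors' own code.

```python
import numpy as np

# Hypothetical toy data (not from the TAP-MS set): complex -> member proteins.
complexes = {
    "c1": ["p1", "p2", "p3"],
    "c2": ["p2", "p3", "p4"],
    "c3": ["p4", "p5"],
}

proteins = sorted({p for members in complexes.values() for p in members})
complex_ids = list(complexes)
row = {p: i for i, p in enumerate(proteins)}

# Eq. (1): B[i, j] = 1 if protein i is in complex j, else 0.
B = np.zeros((len(proteins), len(complex_ids)), dtype=int)
for j, cid in enumerate(complex_ids):
    for p in complexes[cid]:
        B[row[p], j] = 1

pp = B @ B.T   # Eq. (2): complexes shared by each pair of proteins
cc = B.T @ B   # Eq. (3): proteins shared by each pair of complexes

print(pp)
print(cc)
```

In this toy case the diagonal of `pp` gives each protein's weight (how many complexes contain it) and the diagonal of `cc` gives each complex's size.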

MinMaxCut Clustering

The MinMaxCut graph clustering algorithm13 can be applied equally well to the p-p or c-c networks. Let the weight matrix $W = (w_{ij})$ denote the pairwise connection strength between proteins, or between protein complexes. We wish to partition the connection network $G$ into two subnetworks, $G_1, G_2$, based on a min–max clustering principle. The total connection strength between $G_1, G_2$ is

$$s(G_1, G_2) = \sum_{i \in G_1} \sum_{j \in G_2} w_{ij}. \tag{4}$$

The total connection strength within a cluster $G_1$ or $G_2$ is similarly defined. The clustering principle requires minimizing $s(G_1, G_2)$ (weak connections between different clusters), while simultaneously maximizing $s(G_1, G_1)$ and $s(G_2, G_2)$ (strong connections within each cluster). These requirements are satisfied by the objective function,

$$J(G_1, G_2) = \frac{s(G_1, G_2)}{s(G_1, G_1)} + \frac{s(G_1, G_2)}{s(G_2, G_2)}. \tag{5}$$
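To make Eqs. (4) and (5) concrete, here is a minimal Python sketch that evaluates the objective for a given two-way partition. The weight matrix is a made-up example (two tightly connected triangles joined by one weak edge) with a zero diagonal, which is an assumption of this sketch; the function names are hypothetical.

```python
import numpy as np

def strength(W, A, B_nodes):
    """Eq. (4): total connection strength between node sets A and B_nodes."""
    return W[np.ix_(A, B_nodes)].sum()

def minmaxcut_objective(W, G1, G2):
    """Eq. (5): between-cluster strength relative to each within-cluster strength.

    Assumes both clusters have nonzero internal strength.
    """
    cut = strength(W, G1, G2)
    return cut / strength(W, G1, G1) + cut / strength(W, G2, G2)

# Hypothetical weights: nodes 0-2 and 3-5 form two triangles; edge 0-3 is weak.
W = np.array([
    [0, 3, 3, 1, 0, 0],
    [3, 0, 3, 0, 0, 0],
    [3, 3, 0, 0, 0, 0],
    [1, 0, 0, 0, 3, 3],
    [0, 0, 0, 3, 0, 3],
    [0, 0, 0, 3, 3, 0],
], dtype=float)

natural = minmaxcut_objective(W, [0, 1, 2], [3, 4, 5])
unnatural = minmaxcut_objective(W, [0, 1], [2, 3, 4, 5])
print(natural, unnatural)
```

The partition that follows the two triangles gives a much smaller objective than one that cuts through a triangle, which is the behavior the min–max principle is meant to reward.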

The solution of the clustering problem is represented by an indicator vector $q$, where the $i$th entry of $q$ is

$$q(i) = \begin{cases} a & \text{if } i \in G_1 \\ -b & \text{if } i \in G_2, \end{cases} \tag{6}$$

where $a$ and $b$ ($0 < a, b \le 1$) are constants. One can prove that

$$\min_q J(G_1, G_2) \;\Rightarrow\; \min_q \frac{q^T (D - W) q}{q^T D q}, \tag{7}$$

where $D = (d_i)$ is a diagonal matrix, $d_i = \sum_j w_{ij}$. Now, relaxing $q(i)$ from a discrete indicator in Eq. (6) to continuous values in $[-1, 1]$, the solution $q$ of the minimization problem satisfies

$$(D - W) q = \lambda D q. \tag{8}$$

The desired solution is the eigenvector $q_2$ corresponding to the second smallest eigenvalue. From Eq. (6), we can recover clusters by the sign of $q_2$, that is, $G_1 = \{i \mid q_2(i) \ge 0\}$, $G_2 = \{i \mid q_2(i) < 0\}$. In general, the optimal dividing point could shift away from 0; we search the dividing point $q_2(i_{cut})$,

$$G_1 = \{i \mid q_2(i) \ge q_2(i_{cut})\}, \quad G_2 = \{i \mid q_2(i) < q_2(i_{cut})\},$$

($i_{cut} = 1, \ldots, n - 1$), such that $J(G_1, G_2)$ is minimized (the minimum value is $J_{opt}$). This gives the final clusters $G_1$ and $G_2$.
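The relaxation and linear search described above can be sketched in Python as follows. This is an illustrative reading of Eqs. (5) to (8), not the authors' implementation. It assumes NumPy, a symmetric weight matrix in which every node has positive total weight, and it solves the generalized eigenproblem through the equivalent symmetric form $I - D^{-1/2} W D^{-1/2}$. The function name and the toy matrix are hypothetical, and which group is called $G_1$ is arbitrary because the sign of an eigenvector is arbitrary.

```python
import numpy as np

def minmaxcut_bisect(W):
    """Split a weighted graph in two; returns (G1, G2, J_opt)."""
    W = np.asarray(W, dtype=float)
    n = W.shape[0]
    d = W.sum(axis=1)                       # d_i = sum_j w_ij
    d_inv_sqrt = 1.0 / np.sqrt(d)           # assumes every d_i > 0
    # (D - W) q = lambda D q  <=>  (I - D^-1/2 W D^-1/2) z = lambda z, q = D^-1/2 z
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)         # eigenvalues in ascending order
    q2 = d_inv_sqrt * vecs[:, 1]            # second smallest eigenvalue
    order = np.argsort(q2)

    best = (None, None, np.inf)
    for cut in range(1, n):                 # try every dividing point along q2
        g1, g2 = order[:cut], order[cut:]
        s12 = W[np.ix_(g1, g2)].sum()
        s11 = W[np.ix_(g1, g1)].sum()
        s22 = W[np.ix_(g2, g2)].sum()
        if s11 == 0 or s22 == 0:            # objective undefined; skip this cut
            continue
        J = s12 / s11 + s12 / s22
        if J < best[2]:
            best = (sorted(g1.tolist()), sorted(g2.tolist()), J)
    return best

# Hypothetical weights: two triangles (0-2 and 3-5) joined by one weak edge.
W = np.array([
    [0, 3, 3, 1, 0, 0],
    [3, 0, 3, 0, 0, 0],
    [3, 3, 0, 0, 0, 0],
    [1, 0, 0, 0, 3, 3],
    [0, 0, 0, 3, 0, 3],
    [0, 0, 0, 3, 3, 0],
], dtype=float)

G1, G2, J_opt = minmaxcut_bisect(W)
print(G1, G2, J_opt)
```

Skipping cuts where a side has zero internal weight is a simplification chosen here to keep the objective well defined; the source does not say how such cases are handled.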

Hierarchical Divisive Clustering

Divisive clustering starts from the top by treating the whole data set as a single initial cluster. It recursively splits the current cluster (a leaf node in a binary clustering tree) into two subclusters. Two important issues are (1) how to select the next candidate cluster to split, and (2) when to terminate the recursive process.

Given a current cluster $G_k$, we wish to decide whether to further split it into two subclusters. We apply MinMaxCut to $G_k$. If $J_{opt}$ is large, then the overlap between two resulting subclusters is large in comparison to the within-subcluster similarity and hence cluster $G_k$ should not be further split. Thus, the optimal value $J_{opt}$ is a measure of “cluster cohesion.”

At each cluster splitting in the divisive process, we compute the cluster cohesion for each of the subclusters. To select the next cluster to split, we select among all current clusters the one with the smallest cohesion. As the cluster splitting process continues, clusters with small cohesion are split and the cohesion of the resulting clusters increases. To terminate the divisive process, we set a threshold for cohesion $h = 0.6$ (i.e., clusters with cohesion greater than $h$ will not be further split). A greater cohesion threshold will lead to “tighter” clusters. The cohesion threshold is the only parameter in the MinMaxCut algorithm.
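The selection and stopping rules above might be organized as in the following Python sketch. It is a simplified illustration under stated assumptions, not the authors' code: `bisect` is a compact stand-in for a MinMaxCut bisection, clusters that cannot be split (too small, or containing a node with no internal weight) are given infinite cohesion, and all names and the toy matrix are hypothetical.

```python
import numpy as np

def bisect(W):
    """Compact bisection: returns (G1, G2, J_opt) as local index lists."""
    n = W.shape[0]
    d = W.sum(axis=1)
    if n < 2 or np.any(d <= 0):
        return None, None, np.inf          # treated as unsplittable here
    s = 1.0 / np.sqrt(d)
    _, vecs = np.linalg.eigh(np.eye(n) - s[:, None] * W * s[None, :])
    order = np.argsort(s * vecs[:, 1])
    best = (None, None, np.inf)
    for cut in range(1, n):
        g1, g2 = order[:cut], order[cut:]
        s12 = W[np.ix_(g1, g2)].sum()
        s11 = W[np.ix_(g1, g1)].sum()
        s22 = W[np.ix_(g2, g2)].sum()
        if s11 > 0 and s22 > 0:
            J = s12 / s11 + s12 / s22
            if J < best[2]:
                best = (g1.tolist(), g2.tolist(), J)
    return best

def divisive_clustering(W, h=0.6):
    """Repeatedly split the least cohesive cluster until all cohesions exceed h."""
    W = np.asarray(W, dtype=float)

    def evaluate(nodes):
        g1, g2, J = bisect(W[np.ix_(nodes, nodes)])
        return {"nodes": nodes, "J": J, "parts": (g1, g2)}

    leaves = [evaluate(list(range(W.shape[0])))]
    while True:
        k = min(range(len(leaves)), key=lambda i: leaves[i]["J"])
        if leaves[k]["J"] > h:             # every remaining cluster is cohesive
            break
        parent = leaves.pop(k)
        for part in parent["parts"]:
            leaves.append(evaluate([parent["nodes"][i] for i in part]))
    return sorted(sorted(c["nodes"]) for c in leaves)

# Hypothetical weights: two triangles (0-2 and 3-5) joined by one weak edge.
W = np.array([
    [0, 3, 3, 1, 0, 0],
    [3, 0, 3, 0, 0, 0],
    [3, 3, 0, 0, 0, 0],
    [1, 0, 0, 0, 3, 3],
    [0, 0, 0, 3, 0, 3],
    [0, 0, 0, 3, 3, 0],
], dtype=float)

clusters = divisive_clustering(W, h=0.6)
print(clusters)
```

With the toy matrix, the whole graph has a low $J_{opt}$ and is split once; the two triangles then have no admissible split, so the process stops.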

Resources

1. The February 2003 release of the Gene Ontology (GO) (http://www.geneontology.org) was used to obtain the annotated terms for yeast proteins from the TAP-MS data set.4

2. A freely distributed Perl library interface to the GO database was employed for all calculations relating to GO annotations.

3. A Perl library interface to the GraphViz package (http://www.cpan.org) was used to create the graph representations.

4. The primary sequences for all proteins analyzed were obtained from the Saccharomyces Genome Database.14

5. The EMBOSS toolkit15 was used for calculations of sequence properties.

6. The PsiPred program16 was used for secondary structure determination.

7. A website with additional results related to this article is located at http://frna.lbl.gov/complex.

RESULTS AND DISCUSSION

Multiprotein Complex Data Set

Two data sets summarizing high-throughput analysis of multiprotein complexes are available for the yeast Saccharomyces cerevisiae. Coupling different purification [immunoprecipitation and tandem affinity purification (TAP)] and labeling schemes with mass spectrometry (MS), both studies used bait proteins to identify physiologically intact protein complexes. A recent analysis used a maximum likelihood model and gene expression correlation coefficients to evaluate the reliability of various high-throughput protein–protein interaction data sets and concluded that the TAP-MS data set had the highest accuracy for predicting protein function.17 Another analysis compared the accuracy and coverage of protein interactions for several high-throughput data sets relative to trusted reference sets of manually annotated protein complexes from the Munich Information Center for Protein Sequences (MIPS) and the Yeast Proteome Database (YPD).18 This analysis also revealed a superior accuracy to coverage trade-off for the TAP-MS data relative to other methods. Hence, we have chosen this data set to illustrate our model. More information about the TAP-MS data set is available at http://yeast.cellzome.com.

We represent this data set as a bipartite graph with adjacency matrix $B$. The symmetric matrix $BB^T$ defines the interaction strength of the protein–protein interaction network from the underlying bipartite graph model. This p-p network shows a scale-free topology, indicating that proteins in the network have a wide range of connectivity (Fig. 2). Previous work has speculated that connectivity in the network might correlate with observable biological properties such as the rate of protein evolution.19
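The degree of a protein in the p-p network (the number of distinct proteins it shares at least one complex with) can be read off $BB^T$. The sketch below shows one way to tabulate it; the matrix is a hypothetical five-protein, three-complex example, far too small to show a power law, and is only meant to illustrate the bookkeeping.

```python
import numpy as np
from collections import Counter

# Hypothetical membership matrix: rows are proteins, columns are complexes.
B = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
])

pp = B @ B.T
# Two proteins are neighbors if they share at least one complex;
# the diagonal (self-interaction) is excluded.
neighbors = (pp > 0) & ~np.eye(pp.shape[0], dtype=bool)
degree = neighbors.sum(axis=1)

# Number of proteins observed at each degree.
degree_counts = Counter(degree.tolist())
print(degree.tolist(), dict(degree_counts))
```

On real data one would typically plot `degree_counts` on log-log axes to inspect how closely it follows a power law.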

Clusters in the p-p Interaction Network Define Modules

Given a network of protein interactions, one can computationally predict modules and annotate these modules with a biological context. A computationally predicted protein module is defined as a highly connected region or structure in the network. Previous work has employed “k-cores” and other density-based methods to partition the protein interaction network.7,20 Spirin and Mirny21 used a global optimization calculation based on the multibody structure of the network to find functionally consistent divisions. Vazquez et al.22 employ a global optimization calculation that minimizes the number of connections in the network that occur among different functional categories, thus using physical interactions and functional annotations to determine modules. In this article, we identify clusters in the protein interaction network using a graph clustering algorithm, MinMaxCut, which was shown to be effective for class discovery in the analysis of gene microarray data (see Methods section).12 We apply MinMaxCut to the protein interaction network specified by the adjacency matrix $BB^T$. The nonuniform interaction strength between proteins gives a more realistic characterization of the network. Below, we present an analysis of the p-p interaction network, highlighting only the main results. A comprehensive analysis of these results is deferred to a later article.

Figure 3 shows the interaction strength of the p-p network (the adjacency matrix $BB^T$) sorted after clustering. Several clusters exhibit high overall interaction strength and most encompass biologically meaningful complexes. To support our supposition that clusters in the p-p network encompass physiologically relevant protein complexes, we compared the discovered p-p clusters to the TAP-MS protein complexes that are the basis of the bipartite graph model. To quantify this correspondence, we define the match coefficient

$$\mu(P_k, c_j) = \frac{n(P_k, c_j)}{\min(|P_k|, |c_j|)},$$

where $|P_k|$ is the number of proteins in p-p cluster $P_k$, $|c_j|$ is the number of proteins in TAP-MS protein complex $c_j$, and $n(P_k, c_j)$ is the number of shared proteins between $P_k$ and $c_j$. The constituents of a protein cluster $P_k$ may all be contained in an experimental protein complex $c_j$ or, conversely, the constituents of $c_j$ may all be contained in $P_k$; both cases result in a perfect match with $\mu = 1$. Using this match coefficient and a threshold of 0.7, we found that 65 of 66 predicted p-p clusters match to at least one experimental protein complex (Fig. 4). This is strong evidence that clusters in the p-p network define modules of physiologically intact protein complexes and furthermore, that any clustered assemblies with uncharacterized constituents might correspond to novel interactions or functional relationships. Clearly those protein clusters that match two or more TAP-MS protein complexes are most interesting. For example, Figure 5 details how the largest cluster in the p-p network denoted P28 (labeled with Smd2 in Fig. 3) matches 6 TAP-MS protein complexes. These matching complexes are also shown as the 6 points in Figure 4, indicated by the arrow.
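The match coefficient is simple to compute from sets of protein identifiers. The following Python sketch uses made-up identifiers, and the helper `matching_complexes` and its inclusive threshold comparison are assumptions of this illustration rather than details given in the text.

```python
def match_coefficient(cluster, complex_members):
    """Shared proteins divided by the size of the smaller of the two sets."""
    cluster, complex_members = set(cluster), set(complex_members)
    shared = len(cluster & complex_members)
    return shared / min(len(cluster), len(complex_members))

def matching_complexes(cluster, complexes, threshold=0.7):
    """Names of complexes whose match coefficient reaches the threshold."""
    return [name for name, members in complexes.items()
            if match_coefficient(cluster, members) >= threshold]

# Hypothetical cluster and complexes.
cluster = {"a", "b", "c", "d"}
complexes = {
    "cA": {"a", "b", "c"},            # fully contained in the cluster
    "cB": {"c", "d", "e", "f", "g"},  # partial overlap
    "cC": {"x", "y"},                 # no overlap
}

print({name: match_coefficient(cluster, m) for name, m in complexes.items()})
print(matching_complexes(cluster, complexes))
```

Because the denominator is the smaller set, a small complex entirely inside a large cluster scores 1, which is the "perfect match" case described above.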

Modules in the p-p Network Have Characteristic Physical and Chemical Properties

Fig. 2. Distribution of the degree (number of proteins a given protein interacts with) in the protein–protein interaction network. This curve approximates a power-law distribution, indicating that it is a scale-free network topology.

Fig. 3. Predicted clusters of the p-p network. The color shows the normalized interaction strength of connections in the p-p network. Clusters with fewer than 20 proteins are not shown. The most highly connected protein in each cluster is shown by its protein name, and the number of TAP-MS protein complexes this cluster matches (with $\mu \ge 0.7$) is shown after the protein name. Axes correspond to proteins. For example, cluster P28 with protein Smd2 has 112 proteins and overlaps 6 TAP-MS protein complexes. A larger figure showing all clusters is available [see Methods section, Resources (item 7)].

Fig. 4. A summary of the overlap between the constituents of predicted p-p clusters and TAP-MS protein complexes. Match coefficients are indicated by the symbols. The solid line indicates where protein complexes and p-p clusters are of the same size. The arrow indicates the complexes that overlap a p-p cluster, designated P28, which is discussed in the text.

Fig. 5. Protein cluster P28 matches six TAP-MS protein complexes (labeled as originally published). All proteins in the cluster and matched protein complexes are listed. Proteins shared by the p-p cluster and at least one TAP-MS experimental protein complex are listed above the dividing line. Below the line are proteins not shared. The match coefficients are $\mu$(P28,c128) = 0.83, $\mu$(P28,c129) = 0.91, $\mu$(P28,c155) = 1, $\mu$(P28,c158) = 1, $\mu$(P28,c160) = 0.98, $\mu$(P28,c) = 0.98.

The assembly, thermodynamic stability, and functionality of protein complexes are controlled by various environmental conditions in the cell. Surface-accessible amino acid residues can be covalently modified to regulate the functional state of protein complexes. Noncovalent ligand binding can also modulate the functional state of protein complexes. Hence, we would expect that the proteins of discovered clusters in the p-p network would be distinguishable by intrinsic physical and chemical characteristics. We calculated an F statistic for protein physical–chemical properties and amino acid composition to see if protein clusters exhibit any significant trends that might suggest distinguishing features of their interactions. Given a particular property $f$ across $n$ proteins and $K$ clusters containing these proteins, the F statistic is defined as

$$F = \frac{\dfrac{1}{K - 1} \sum_{k=1}^{K} n_k (\bar{f}_k - \bar{f})^2}{\dfrac{1}{n - K} \sum_{k=1}^{K} (n_k - 1) \sigma_k^2},$$

where $\bar{f}$ is the average across all proteins, $\bar{f}_k$ and $\sigma_k^2$ are the average and variance within p-p cluster $P_k$, and $n_k$ is the size of cluster $P_k$. The magnitude of the F statistic is a measure of how well the given property distinguishes between clusters. The various properties and their F statistics are listed in Table I. To assess the statistical significance, we compute the F statistic for the same data set when proteins are randomly assigned to clusters. The F statistic for randomly shuffled data is approximately 16 ± 8 across these quantities. Thus, a value above 30 is significant.
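A minimal Python sketch of the F statistic and of a label-shuffling baseline is given below. It assumes NumPy, takes the within-cluster variance to be the sample variance (so that $(n_k - 1)\sigma_k^2$ is the within-cluster sum of squares), and uses made-up property values. The function names are hypothetical, and the toy numbers are not expected to reproduce the values reported in the text.

```python
import numpy as np

def f_statistic(values, labels):
    """Between-cluster variance over within-cluster variance for one property."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    groups = [values[labels == k] for k in np.unique(labels)]
    n, K = values.size, len(groups)
    grand_mean = values.mean()
    between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups) / (K - 1)
    # (n_k - 1) * sample variance equals the within-cluster sum of squares.
    within = sum(((g - g.mean()) ** 2).sum() for g in groups) / (n - K)
    return between / within

def shuffled_baseline(values, labels, n_shuffles=200, seed=0):
    """Mean and standard deviation of F when cluster labels are permuted."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fs = [f_statistic(values, rng.permutation(labels)) for _ in range(n_shuffles)]
    return float(np.mean(fs)), float(np.std(fs))

# Hypothetical property values for six proteins in two clusters.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = [0, 0, 0, 1, 1, 1]

observed = f_statistic(values, labels)
baseline_mean, baseline_std = shuffled_baseline(values, labels)
print(observed, baseline_mean, baseline_std)
```

Comparing `observed` with the shuffled mean and spread mirrors the significance check described above, where a property is considered informative when its F statistic clearly exceeds the shuffled range.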

Protein complexes can be characterized as nonobligate (temporary) or permanent where the native state is oligomeric. The surfaces that mediate the interactions in these two types of complexes necessarily differ in structural and physical properties.23 Since using different values for the cluster cohesion parameter (see Methods section) of the MinMaxCut clustering algorithm is likely to result in discovered protein clusters that encompass differing ratios of these two types of complexes, we would expect that the calculated physical properties would be somewhere intermediate between those expected for the two types of complexes. Indeed, this seems to be the case if we consider the statistics for amino acid composition. Interactions in temporary protein complexes that function dynamically in cellular processes are often tuned by the effects of polar groups (Lys, Arg, Gln, Asn, Asp) that define a complementary electrostatic surface, hydrogen bonding (Arg), and stabilizing hydrophobic interactions (proline). Methylation of Arg and Lys, and acetylation of Lys, are well-known covalent modifications of surface amino acids that could influence complex formation. These residues are also abundant in nucleic acid binding proteins, which are abundantly represented in the TAP-MS data set. Cys participates in the formation of disulfide bridges that can stabilize more permanent complexes, as well as more dynamic interactions.23–25 Finally, studies have shown that secondary structural features are often uniformly distributed at protein interaction interfaces, which is consistent with their relative unimportance in the above calculations.25

Supercomplexes Encompass Modules From the p-p Network

In previous analyses of protein complex data, only the resulting pairwise interaction network has been examined.1,7,9,18,21,22 The pairwise interaction network, however, yields an incomplete and noisy version of proteomic organization. As evidenced by recent high-throughput experiments for determining protein complexes, protein complexes are apt to share components and hence define a network of interconnected cellular processes.3,11,26 No computational study to date has adequately represented the higher order organization of this network. In our dual representation of the data, the adjacency matrix $B^TB$ defines the connectivity between protein complexes where the connection is weighted by the number of shared proteins. Figure 6 shows the result of a MinMaxCut clustering of this network. Clusters are labeled with the most frequently occurring proteins, as well as the number of TAP-MS protein complexes corresponding to a particular biological process.4 We introduce the terminology supercomplex to denote a cluster in the complex–complex association network.

Since we expect supercomplexes to represent the diversity of interconnected cellular processes, it would be consistent if each supercomplex showed high match coefficients with various modules from the p-p interaction network. Figure 7 summarizes the overlap between predicted supercomplexes and predicted clusters in the p-p interaction network. Most supercomplexes show overlap with several predicted p-p clusters and, in some instances, the same predicted p-p cluster occurs in multiple supercomplexes. In one instance, the TAP-MS complexes overlapping a cluster in the p-p network and the TAP-MS complexes contained in a supercomplex are in one-to-one correspondence (P28 listed in Fig. 5).

Computationally Discovered Modules Are Biologically Consistent

TABLE I. F Statistics of Amino Acid Composition (Top) and Physical Properties (Bottom) Across All Clusters in p-p Interaction Network

Lys       100    Asn     56    Val      30    Ile   24
Asp        89    Gln     50    Tyr      29    Ser   23
Arg        73    Cys     39    Met      29    Leu   22
Pro        70    His     33    Trp      28    Gly   21
Glu        66    Ala     31    Thr      28    Phe   21

pI        169    Basic  149    Acidic   97    MW    60
Aromatic   30    Helix   37    β-Sheet  33    Coil  27

We provide here preliminary evidence that computationally discovered modules in the dual representation are biologically consistent. To determine a biological context, we used a set of controlled vocabularies defined by the GO for which most of the proteins in our data set have been annotated with at least one term. The GO consists of three orthogonal ontologies: biological process, molecular function, and cellular component.27 Given that p-p clusters are defined by the proteins sharing maximal membership within the same experimentally determined protein complexes and c-c clusters capture relationships between protein complexes, we would expect the cellular component and biological process ontology to give the most coherent annotations. We map each protein in a p-p cluster to the most specific ontological term assigned to it. For c-c clusters, we determine a nonredundant union of all protein constituents and map these to their most specific annotated terms. The GO is organized as a set of directed-acyclic graphs. This data structure allows us to ascend each graph from more specific terms to determine the set of common “parent” terms that describe a predicted cluster’s functional categories. We approximate the significance of that annotation by calculating the probability that n or more proteins would be assigned to that term if we assigned proteins randomly to the cluster. This probability is calculated as

$$P = \sum_{n \le j \le N} \binom{N}{j}\, p^{j} (1 - p)^{N - j},$$

where p is the fraction of proteins in the genome annotated to the given term, n is the number of proteins in the cluster annotated to the term, and N is the number of proteins in the cluster. This P value allows us to rank annotations according to significance and to reason about the cellular roles for a given cluster. If a subgraph composed from the significant terms is biologically consistent, then we may infer the validity of the computationally determined module.
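The annotation procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the code used in this work: the ontology fragment, the protein identifiers, and the background fraction p = 0.05 are hypothetical toy values, and the helper names (`ancestors`, `term_counts`, `binomial_tail`) are introduced here only for the example. The sketch ascends a small directed acyclic graph to count proteins annotated directly or indirectly to each term, and then evaluates the binomial tail probability P for one term.

```python
from math import comb

# Hypothetical fragment of an ontology: child term -> list of parent terms.
PARENTS = {
    "commitment complex": ["spliceosome complex"],
    "spliceosome complex": ["ribonucleoprotein complex"],
    "mitochondrial ribosome": ["ribonucleoprotein complex"],
    "ribonucleoprotein complex": [],
}


def ancestors(term, parents=PARENTS):
    """Return the term plus every term reachable by ascending the graph."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def term_counts(cluster_annotations, parents=PARENTS):
    """Count proteins annotated directly or indirectly to each term."""
    counts = {}
    for term in cluster_annotations.values():
        for t in ancestors(term, parents):
            counts[t] = counts.get(t, 0) + 1
    return counts


def binomial_tail(n, N, p):
    """P = sum over j = n..N of C(N, j) * p**j * (1 - p)**(N - j)."""
    return sum(comb(N, j) * p**j * (1 - p) ** (N - j) for j in range(n, N + 1))


# Hypothetical cluster: protein -> most specific annotated term.
cluster = {
    "protA": "commitment complex",
    "protB": "spliceosome complex",
    "protC": "mitochondrial ribosome",
    "protD": "commitment complex",
}

counts = term_counts(cluster)
# Assumed background fraction of the genome annotated to the term.
p_value = binomial_tail(counts["spliceosome complex"], len(cluster), 0.05)
```

Ranking the terms of a cluster by this probability, smallest first, gives the ordering by significance described in the text.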

We briefly present two examples: the largest cluster in the p-p network, denoted P28, and the largest cluster in the c-c network, denoted C47. Since these large clusters encompass proteins with a range of connection weights in the respective networks and hence probably encompass proteins and complexes of diverse function, we believe that they are difficult cases in which to assign biological significance, and therefore adequately demonstrate the robustness of our method. P28 contains 112 proteins, as depicted in Figure 5. Figure 8 shows the most significant ontological terms from the cellular component ontology corresponding to the proteins in this cluster. Annotations to the general terms nucleus (76 proteins) and ribonucleoprotein (RNP) complex (81 proteins), as well as more specific terms such as spliceosome complex (48 proteins), major (U2-dependent) spliceosome (22 proteins), and commitment complex (12 proteins), clearly indicate that these proteins are components of the pre-mRNA splicing machinery. It is known that the transcriptional machinery consists of several coupled multiprotein machines that carry out separate steps in gene expression coordinated via interactions with the carboxy terminal domain of the RNA polymerase II large subunit.28

The predicted protein cluster P28 is also the only p-p cluster that corresponds exactly with a supercomplex. While most of the proteins in the cluster have been accounted for in stable complexes, there are also some more hypothetical relationships suggested by the GO annotations. For example, 10 proteins are annotated to be associated with the mitochondrial ribosome. Constituents of the mitochondrial ribosome are encoded in both the nuclear and mitochondrial genomes. A mechanism that coordinates the expression of these constituents has been hypothesized, given that the stoichiometric synthesis of all mitochondrial ribosomal components is likely to be regulated to avoid wasting metabolic energy.29 Hence, the clustering of these proteins suggests a possible coupling between gene expression in the nucleus and mitochondria. Additionally, there is evidence that splicing can enhance export of mRNA from the nucleus,30 and that combinatorial binding of heterogeneous ribonucleoproteins to mRNA may regulate posttranscriptional events such as nuclear export, mRNA stability, and nonsense-mediated decay.31 That many of our proteins are annotated to these terms (commitment complex, mRNA-nucleus export, translation initiation, polysome, cytoplasmic transport, mRNA splicing) at least suggests these relationships and their interdependence. See the Methods section, Resources (item 7), for a complete list of annotations.

Fig. 6. Predicted protein supercomplexes (clusters of the c-c network). Several large supercomplexes are shown. Each supercomplex is labeled with the most frequently occurring proteins, the number of nonredundant constituent proteins, and the relevant biological processes inferred from the participating TAP-MS experimental protein complexes. Axes correspond to TAP-MS complexes. Color represents normalized connection strength. A larger figure showing all clusters is available [see Methods section, Resources (item 7)].

Fig. 7. Overlap between computed supercomplexes (clusters of c-c network) and predicted clusters in the p-p network. The match coefficients defined by shared protein constituents are indicated.

The largest supercomplex C47 illustrates how diverse cellular processes can be coupled via a nexus of interconnected protein complexes. Figure 9 shows the most significant GO-process annotations for this supercomplex (210 proteins). The GO-process annotations suggest that this supercomplex encompasses complexes involved in chromatin dynamics and transcriptional regulation and initiation, as well as cell cycle control, DNA replication and repair, and signal transduction (for clarity, only a subset of the significant annotations is shown in Fig. 9). See the Methods section, Resources (item 7), for a complete annotation. We determined a list of protein complexes from the MIPS Catalog (http://mips.gsf.de) that are highly represented in the supercomplex.32 A subset of this list is shown in Table II. Several of these complexes are known chromatin regulators in yeast and are necessary for such processes as transcriptional initiation, certain types of DNA repair and silencing mechanisms, and cell cycle progression.33–35

Fig. 8. Subgraph of the gene ontology (Component) corresponding to a subset of the most prevalent annotations of proteins in p-p cluster P28. Significant nodes are labeled with the number of proteins annotated directly or indirectly to that term and the P-value for the term.

Fig. 9. Subgraphs of the gene ontology (Process) corresponding to a subset of the most prevalent annotations of proteins in supercomplex C47. Significant nodes are labeled with the number of proteins annotated directly or indirectly to that term and the P-value for the term.

CONCLUSIONS

In this article, we propose a dual representation that unifies three interaction networks under a single framework: the protein–protein complex (p-c) network, the protein–protein interaction (p-p) network, and the protein complex–protein complex (c-c) network. The resulting protein–protein and complex–complex interaction networks have more realistic interaction strengths compared to the conventional binary interaction networks with equal weighting. This results in a coherent framework for computational detection of modules that occur as clusters or densely connected regions in the dual representations. We apply the rigorous MinMaxCut graph clustering algorithm to find these modules. Basic statistical analysis revealed that differences between modules in the protein interaction network are reflected by characteristic physical and chemical properties of the protein interactions. We emphasize the protein complex–protein complex (c-c) network as revealing a higher order organization of the proteome. The largest supercomplex has 210 nonredundant constituent proteins and is involved in a number of cellular processes. Use of the GO revealed that the biological annotations of computationally discovered modules are statistically significant and that this method can facilitate the functional annotation of uncharacterized constituents in future multiprotein complex data sets, as well as the discernment of novel functional relationships between protein complexes. As more and higher quality protein complex data become available, we expect this unified representation of interaction networks and associated clustering methodology to evolve into a useful framework for studying this aspect of systems biology.
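The relationship among the three networks can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the implementation used in this work: the incidence matrix B is hypothetical toy data, and the p-p and c-c weights are taken to be raw co-membership counts (B multiplied by its transpose, and the transpose of B multiplied by B), without any normalization that the full method may apply.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


# Hypothetical p-c incidence matrix: rows are proteins p1..p3,
# columns are complexes c1 and c2; an entry of 1 means membership.
B = [
    [1, 0],  # p1 belongs to c1
    [1, 1],  # p2 belongs to c1 and c2
    [0, 1],  # p3 belongs to c2
]

# p-p network: entry (i, j) counts the complexes shared by proteins i and j.
pp = matmul(B, transpose(B))

# c-c network: entry (i, j) counts the proteins shared by complexes i and j.
cc = matmul(transpose(B), B)
```

In this toy case the off-diagonal entries of `pp` and `cc` give weighted rather than binary connections, and the diagonal entries give the number of complexes per protein and the number of proteins per complex, respectively.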

ACKNOWLEDGMENTS

We thank the anonymous reviewers for their helpful comments.

REFERENCES

1. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM. A comprehensive analysis of protein–protein interactions in Saccharomyces cerevisiae. Nature 2000;403:623–627.

2. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci USA 2001;98:4569–4574.

3. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 2002;415:180–183.

4. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 2002;415:141–147.

5. Alm E, Arkin AP. Biological networks. Curr Opin Struct Biol 2003;13:193–202.

6. Schwikowski B, Uetz P, Fields S. A network of protein–protein interactions in yeast. Nat Biotechnol 2000;18:1257–1261.

7. Bader GD, Hogue CW. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 2003;4:2.

8. Pothen A. Graph and hypergraph models of protein interaction networks. SIAM Conf Comput Sci Eng 2003.

9. Bu D, Zhao Y, Cai L, Xue H, Zhu X, Lu H, Zhang J, Sun S, Ling L, Zhang N, Li G, Chen R. Topological structure analysis of the protein–protein interaction network in budding yeast. Nucleic Acids Res 2003;31:2443–2450.

10. Krause R, Von Mering C, Bork P. A comprehensive set of protein complexes in yeast: mining large scale protein–protein interaction screens. Bioinformatics 2003;19:1901–1908.

11. Deshaies RJ, Seol JH, McDonald WH, Cope G, Lyapina S, Shevchenko A, Verma R, Yates JR III. Charting the protein complexome in yeast by mass spectrometry. Mol Cell Proteomics 2002;1:3–10.

12. Ding C. Analysis of gene expression profiles: class discovery and leaf node ordering. Proc 6th Intl Conf Comp Mol Bio (RECOMB) 2002;6:127–136.

TABLE II. A Sample of Known Protein Complexes From the Curated MIPS Catalog, Which Have Many Constituents in Supercomplex C47

MIPS Listing                          # ORFs    # ORFs in Cluster
RNA Pol II holoenzyme                     35                   23
Kornberg’s mediator                       21                   21
Other transcription                       73                   17
HAT A                                     15                   14
TFIID                                     13                   13
SAGA                                      14                   13
Ada-Spt                                   14                   13
TAFIIs                                    12                   12
DNA repair                                33                    9
RSC                                       10                    6
ADA                                        6                    6
Replication fork                          30                    6
DNA mismatch repair                        5                    5
Cytoplasmic translation initiation        27                    4
SAGA-like                                  5                    4
Nucleotide excision repairosome           16                    3
RNA Polymerase III                        13                    3
Replication factor A                       3                    3
Actin-associated motor proteins            7                    3
MSH2/MSH3                                  3                    3
Srb10p                                     4                    3
NEF4                                       2                    2
eIF4A                                      2                    2
NuA4                                       2                    2
Nuclear pore                              24                    2
Sir                                        2                    2

Listed are the name of the complex, the number of known open reading frames (ORFs) in the complex, and the number of ORFs from the complex present in C47. Rows containing complexes implicated in chromatin dynamics are shaded in the published table.


13. Ding C, He X, Zha H, Gu M, Simon H. A min–max cut algorithm for graph partitioning and data clustering. Proc IEEE Intl Conf Data Mining (ICDM) 2001:107–114.

14. Dwight SS, Harris MA, Dolinski K, Ball CA, Binkley G, Christie KR, Fisk DG, Issel-Tarver L, Schroeder M, Sherlock G, Sethuraman A, Weng S, Botstein D, Cherry JM. Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res 2002;30:69–72.

15. Rice P, Longden I, Bleasby A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 2000;16:276–277.

16. Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999;292:195–202.

17. Deng M, Sun F, Chen T. Assessment of the reliability of protein–protein interactions and protein function prediction. Pac Symp Biocomput 2003:140–151.

18. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P. Comparative assessment of large-scale data sets of protein–protein interactions. Nature 2002;417:399–403.

19. Fraser HB, Hirsh AE, Steinmetz LM, Scharfe C, Feldman MW. Evolutionary rate in the protein interaction network. Science 2002;296:750–752.

20. Bader GD, Hogue CW. Analyzing yeast protein–protein interaction data obtained from different sources. Nat Biotechnol 2002;20:991–997.

21. Spirin V, Mirny LA. Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci USA 2003;100:12123–12128.

22. Vazquez A, Flammini A, Maritan A, Vespignani A. Global protein function prediction from protein–protein interaction networks. Nat Biotechnol 2003;21:697–700.

23. Nooren IM, Thornton JM. Diversity of protein–protein interactions. EMBO J 2003;22:3486–3492.

24. Veselovsky AV, Ivanov YD, Ivanov AS, Archakov AI, Lewi P, Janssen P. Protein–protein interactions: mechanisms and modification by drugs. J Mol Recognit 2002;15:405–422.

25. Jones S, Thornton JM. Principles of protein–protein interactions. Proc Natl Acad Sci USA 1996;93:13–20.

26. Gavin AC, Superti-Furga G. Protein complexes and proteome organization from yeast to man. Curr Opin Chem Biol 2003;7:21–27.

27. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000;25:25–29.

28. Maniatis T, Reed R. An extensive network of coupling among gene expression machines. Nature 2002;416:499–506.

29. Graack HR, Wittmann-Liebold B. Mitochondrial ribosomal proteins (MRPs) of yeast. Biochem J 1998;329:433–448.

30. Reed R, Hurt E. A conserved mRNA export machinery coupled to pre-mRNA splicing. Cell 2002;108:523–531.

31. Keene JD. Ribonucleoprotein infrastructure regulating the flow of genetic information between the genome and the proteome. Proc Natl Acad Sci USA 2001;98:7018–7024.

32. Mewes HW, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B. MIPS: a database for genomes and protein sequences. Nucleic Acids Res 2002;30:31–34.

33. Roth SY, Denu JM, Allis CD. Histone acetyltransferases. Annu Rev Biochem 2001;70:81–120.

34. Green CM, Almouzni G. When repair meets chromatin. First in series on chromatin dynamics. EMBO Rep 2002;3:28–33.

35. Peterson CL. Chromatin remodeling enzymes: taming the machines. Third in review series on chromatin dynamics. EMBO Rep 2002;3:319–322.
