hkc:analgorithmtopredictproteincomplexesin protein ... · hkc(g). it is the central most densely...

Hindawi Publishing CorporationJournal of Biomedicine and BiotechnologyVolume 2011, Article ID 480294, 14 pagesdoi:10.1155/2011/480294

Research Article

HKC: An Algorithm to Predict Protein Complexes inProtein-Protein Interaction Networks

Xiaomin Wang,1 Zhengzhi Wang,1 and Jun Ye2, 3

1 Institute of Mechanical Engineering and Automation, National University of Defense Technology, Changsha 410073, China2 Department of Software Engineering, Jiangnan Institute of Computing Technology, Wuxi 214083, China3 School of Computer, National University of Defense Technology, Changsha 410073, China

Correspondence should be addressed to Xiaomin Wang, [email protected]

Received 23 May 2011; Accepted 24 August 2011

Academic Editor: Paul W. Doetsch

Copyright © 2011 Xiaomin Wang et al. This is an open access article distributed under the Creative Commons Attribution License,which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

With the availability of more and more genome-scale protein-protein interaction (PPI) networks, research interests gradually shiftto Systematic Analysis on these large data sets. A key topic is to predict protein complexes in PPI networks by identifying clustersthat are densely connected within themselves but sparsely connected with the rest of the network. In this paper, we present a newtopology-based algorithm, HKC, to detect protein complexes in genome-scale PPI networks. HKC mainly uses the concepts ofhighest k-core and cohesion to predict protein complexes by identifying overlapping clusters. The experiments on two data setsand two benchmarks show that our algorithm has relatively high F-measure and exhibits better performance compared with someother methods.

1. Introduction

With the development of high-throughput methods (such asmass spectrometry [1] and two-hybrid [2]), more and moregenome-scale protein-protein interaction (PPI) networks arenow available, enabling us to systematically analyze thebehaviors and properties of biological molecules. Among thevarious researches on PPI networks, a key topic is to predictprotein complexes in PPI networks. A protein complex is agroup of proteins that interact with each other at the sametime and place, forming a single multimolecular machine[3]. These complexes are a cornerstone of many biologicalprocesses. PPI networks are modular [3, 4] and containmodules that are densely connected within themselves butsparsely connected with the rest of the network. Thesemodules are called cluster and they may represent proteincomplexes [5, 6]. Thus, protein complexes can be detectedby identifying clusters from PPI networks. Detecting proteincomplexes in PPI networks is of vital importance to theunderstanding of the structural and functional properties ofPPI networks and can also help to predict the function ofunknown proteins.

Due to the high level of noise as well as the topologicalfeatures of PPI networks, traditional clustering techniques ina metric space cannot successfully predict protein complexesin PPI networks [3], thus various graph analysis approacheshave been proposed to solve this problem. These approachescan be classified into the following three types. (1) Agglom-erative methods: Bader and Hogue [6] proposed a graphtheoretic clustering algorithm to detect molecular complexesin PPI networks. The method was based on vertex weightingby local neighborhood density and outward traversal froma locally dense seed protein to isolate the dense regionsaccording to given parameters. The problem of this methodis that using only vertex weighting, it would find sparselyconnected subgraphs (such as a rope-like subgraph or amixed subgraph consisting of several connected clusters)instead of dense clusters. (2) Graph partition methods: Kinget al. [7] partitioned networks and found an approximateoptimal solution in the space of partitions using a cost-basedlocal search algorithm. This technique was nondeterministicand got different result in each run, and it consumedhuge space. Chen and Yuan [8] extended a betweenness-based partition algorithm (Girvan-Newman algorithm [9],

2 Journal of Biomedicine and Biotechnology

GN for short) and used it to partition PPI networks intosubgraphs and then obtained function modules by filteringthese subgraphs. The chief drawback of GN algorithm is itis time consuming. Generally, graph partition methods canonly find nonoverlapping clusters, while protein complexestend to overlap with each other. (3) Methods based onclique (see Section 2.2 Concepts): Palla et al. [10] suggestedclique percolation method (CPM) to find overlapping com-munities. The core concept of CPM is k-clique communitywhich is defined as the union of all k-cliques (completesubgraphs of size k) that can be reached from each otherthrough a series of adjacent k-cliques (where adjacencymeans sharing k-1 nodes). Some other researchers [11, 12]predicted protein complexes by identifying cliques or near-cliques in PPI networks. Clique-based approaches were toostringent with the topological structure and cannot detectprotein complexes with other types of topological structure.

In this paper, we present a new topology-based algo-rithm, HKC (as it mainly uses two important concepts,highest k-core and cohesion), to detect protein complexes ingenome-scale PPI networks. Our algorithm uses the conceptsof highest k-core, and cohesion to predict protein complexesby identifying overlapping clusters. It first calculates the scorefor each node in the PPI network based on the concept ofhighest k-core then uses the nodes with high scores and highdegrees as seeds; from each seed, gets a core according tothe corresponding highest k-core of the direct neighborhoodof the seed and expands this core to include nodes whichare highly possible to form a cluster based on the criteriaof node score and cohesion; finally, protein complexes canbe predicted by filtering all the found clusters accordingto predefined features. We apply HKC on two data sets(MIPS and SGD-MC data set) and evaluate the results ontwo benchmarks (complexcat and Gavin benchmarks). Theexperiments show that our algorithm has relatively highprecision and recall, that is, most of the predicted clustersmatch well with known protein complexes, and at thesame time most of the known protein complexes have beenrecalled. HKC exhibits better performance compared withsome other methods, and besides, it is general and can beused in any type of biological interaction networks and evenin nonbiological networks.

2. Materials and Method

2.1. Materials. Protein-protein interactions of the modelorganism Saccharomyces cerevisiae (yeast) has been studiedthoroughly, and the data of yeast protein complexes is themost comprehensive so far, so we test HKC on the yeast PPIdata and use yeast complex data to validate its effectiveness.

We use the following two data sets as the input network:(1) MIPS data set, it is processed based on the MunichInformation Center for Protein Sequences (MIPS) [13] yeastPPI data set (containing 15,456 records) downloaded fromthe Comprehensive Yeast Genome Database (CYGD) [14];after removing those repetitive interactions and taking noaccount of the edge direction, the final input PPI networkcontains 4,554 proteins and 12,526 interactions. (2) SGD-MC data set, which is downloaded from the Literature

Curation Data in SGD database [15]; the original data setcontains 252247 records, including two types of interactions,high-throughput and manually curated interactions, and weonly use the manually curated interactions. After deletingall repetitive interactions and taking no account of the edgedirection, the final SGD-MC data set contains 4,448 proteinsand 29,068 interactions.

To evaluate the algorithm, we collect two data sets asbenchmarks: (1) complexcat benchmark, it is obtained byprocessing the MIPS yeast protein complex catalog in CYGD[14]. The MIPS yeast protein complex catalog was lastmodified in 2006, and has been manually curated fromthe literature; therefore, it is more realistic than other dataobtained by high-throughput methods and has been used inmany researches because of its quality [6, 7, 16]. However, itis not proper to use this data set directly as the benchmark,since it contains many complexes composed of a singleprotein which do not fit the definition of clusters and alsocontains the complexes by Systematic Analysis [17, 18] whichare not so reliable as results by small-scale experiments.After removing those complexes by Systematic Analysis (thetype of 550) and the complexes consisting of only oneprotein, the final obtained complexcat benchmark contains217 protein complexes. (2) Gavin benchmark, it is processedfrom the experiment results by Gavin et al. [19]. Theyuse affinity purification and mass spectrometry to get 491complexes that differentially combine with additional attach-ment proteins or protein modules to enable a diversificationof potential functions. We adopt the core proteins (thosepresent in 2/3 of the isoforms) of the 491 complexes and36 additional known complexes, and after removing those ofsize less than 3, finally we get 204 protein complexes in Gavinbenchmark.

2.2. Concepts

2.2.1. PPI Network. PPI networks can be intuitively modeledas a static graph G = (V ,E), where V is the set of nodes (pro-teins), and E is the set of edges (protein-protein interactions).

An undirected edge is drawn between each pair of nodesfor which there is evidence of a protein-protein interaction.

2.2.2. Basic Concepts in Graph Theory

Degree. The degree of the node n is the number of edgesattached to it and is denoted as deg(n).

Graph Density. There is no standard graph theory definitionof density, but definitions are normally based on the con-nectivity level of a graph [6]. For undirected simple graphs,the graph density is defined as the quotient of the number ofreal edges in the graph divided by the number of all possibleedges:

den(G) = |E||V | ∗ (|V | − 1)/2 , (1)

where den(G) is the density of graph G, |E| is the number ofreal edges in G, and |V | is the number of nodes in G.

Journal of Biomedicine and Biotechnology 3

For undirected graph with loops (a loop is the edge thatconnects a node and the node itself), the number of allpossible edges is |V | ∗ |V |/2, thus the density of graph G isdefined as

den(G) = |E||V | ∗ |V |/2 . (2)

So, the range of graph density is between 0 and 1.

Clique. A clique is a fully connected subgraph, that is, a set ofnodes that are all neighbors of each other [20]. For instance,Figure 1(c) is a clique consisting of 4 nodes.

k-Core. A k-core of a graph is a maximal subgraph such thateach node in the subgraph has at least degree k and is denotedas kc(G). For example, in Figure 1, the graph in (b) is a 2-coreof the graph in (a).

Highest k-Core. The highest k-core of a graph is the one withthe maximal k value among all the k-cores and is denoted ashkc(G). It is the central most densely connected subgraph.The highest k-core of G can be found in the following way[6]: suppose the lowest degree of nodes in G is l, deleteall nodes with degree l, if all remaining nodes have a leastdegree l1 (l1 > l), we will get the l1-core of G; if some ofthe remaining nodes have degree lower than or equals tol, continue delete all nodes with the least degree until allremaining nodes have a degree higher than l, or until allnodes have been deleted. In this way we can find all k-coresand the one with the maximal k value is the highest k-core.For example, Figure 1(a) shows a graph G, if we delete thenode of degree 1 (i.e., node 6), we can obtain a subgraph inwhich each node has a least degree 2, that is, the 2-core ofG, as shown in Figure 1(b); if we continue delete the node ofdegree 2 (node 1), we can get a subgraph in which each nodehas a least degree 3, the 3-core of G, as shown in Figure 1(c),and it is also the highest k-core of graph G.

2.2.3. Cohesion. During the expansion of a core, only basedon the score information, we cannot efficiently decidewhether a node should be included into the core. We define anew concept cohesion to measure the connectivity between anode n and an existing cluster c, and we denote it as co(n, c).Cohesion is calculated in the following way:

co(n, c) = EncNc

, (3)

where Enc is the number of edges between node n and clusterc, and Nc is the number of nodes in cluster c. co(n, c) isa real number between 0 and 1. The larger co(n, c) is, thetighter node n connects to cluster c, and the more likely theybelong to a larger cluster. Therefore, cohesion can be usedas a criterion during the core expansion process. A node canonly be included into a core when the cohesion between thisnode and the core is greater than a specified threshold.

2.3. The Algorithm. The algorithm HKC consists of the fol-lowing three steps: scoring, cluster finding, and filtering.

2.3.1. Scoring. The first step of HKC is to score all nodes inthe PPI network. For each node, firstly we find the highest k-core of its direct neighborhood (the subgraph consisting ofall nodes connecting to the node, including the node itself),which we denote as H. Then we score H using the propertiesof highest k-core. Larger and denser cores will get higherscores, and it is computed in the following way:

score(H) = NH ∗ den(H)∗ kmax, (4)where NH denotes the number of nodes in H , den(H)refers to the density of H , kmax is the maximal k valuecorresponding to the highest k-core H , and the larger thesethree values are, the larger and denser the correspondinghighest k-core is.

As highest k-core is the most densely connected centralcore in the local area, in order to make sure that the nodescore gives better reflection of the connectivity in the localarea, we assign score (H) to each node in H . For node n, itmay be contained in more than one highest k-core, thus itmay be given more than one score, and the final score of thisnode is defined as the maximal score of all highest k-cores inwhich node n is contained (see the pseudocode in line 14–16in Pseudocode 1).

score(n) = max{score(Hi) | ∀Hi,n ∈ Hi}. (5)In this way, we can make sure that all nodes in densely

connected subgraphs will have high score and those withlittle neighbors will have low scores, and therefore we candistinguish the nodes in clusters from those not in clusterswith the help of node score.

2.3.2. Cluster Finding. The second step of this algorithm isto find clusters in the scored graph. The process of findinga cluster is as follows: first choose the seed, then obtain thecore based on the seed, and then expand the core to includethe noncore nodes. The final cluster obtained in this way isthus a circular subgraph with densely connected core andless densely connected noncore, as shown in Figure 2, whichfits our expectation of the topological structure of proteincomplexes.

The detailed process of finding a cluster can be dividedinto the following three steps.

(1) Choose the seed. As node scores represent the localdensity, if a node has very high score, it must havea very dense neighborhood; besides, for the nodeswith the same score value (e.g., according to ourscoring scheme, nodes in the same highest k-core mayhave the same score value), the one with the highestdegree can be deemed as the most densely connectednode among them. Therefore, among all the unseennodes with the highest score we choose the one withthe highest degree as the seed (see the pseudocodein line 2–4 in Pseudocode 2), and this way of seedselection can insure that the seed is in the center ofthe corresponding cluster.

(2) Get the core based on the selected seed. First get thehighest k-core of the direct neighborhood of the seed,


1

2

34

56

(a)

1

2

34

5

(b)

2

34

5

(c)

Figure 1: The illustration of k-cores of a graph. (a) Graph G; (b) 2-core of G; (c) 3-core of G, it is also the highest k-core of G.

Input:PPI network (an indirect simple graph): G = (V ,E)Node score threshold: T1 and T2Cohesion threshold: T3

Output:The predicted protein clusters: Clusters

Call ScoringCall ClusterFindingCall Filtering

// step1: scoringProcedure Scoring

for all node n in Vdoscore(n) = 0

end forfor all node n in Vdo

compute the degree of nN = neighborhood (n) // neighborhood returns the

// direct neighborhood of n (including n)H = hkc(N) // hkc returns the highest k-core of Nkmax = the maximal k value corresponding to HNH = the number of nodes in Hden(H) = the density of Hscore(H) = NH ∗ den(H)∗ kmaxfor all node m in H do

score(m) = max{score(H), score(m)} // compute the score of nodeend for

end forend procedure

Pseudocode 1: Procedure Scoring.

and after removing all nodes that have been alreadyseen and those with too low scores we can get the core(see the pseudocode in line 8–13 in Pseudocode 2).In order to avoid repetitive computation, all nodes inthe core (including the seed) would be marked seenand cannot be used as the core or the seed of anothercluster.

(3) Expand the core. As the node score indicates thelocal connectivity in the neighborhood of this node,it can be used as a criterion during core expanding.When expanding a core, in order to guarantee thelocal density, nodes with too low scores could notbe included into the core; on the other side, to avoidexcessively expanding a little core to include a denserand larger cluster, nodes with extortionate scores

could neither be included into the core. Furthermore,if we use node score threshold as the only criterionto decide whether a node should be included intothe core, connected nodes with similar scores willbe detected as one cluster when they do not actuallymake up a densely connected subgraph, such asthose with rope-like shape. So we adopt cohesionas another criterion. The process of expanding acore c is as follows (see the pseudocode in line 15–26 in Pseudocode 2): for any unexpanded node n,looking in its neighbors for nodes such that satisfythe following conditions: (1) the score of the node isgreater than or equal to score(n) ∗ T1 and less thanor equal to score(n) ∗ T2; (2) the cohesion betweenthe node and the core is greater than or equal to thethreshold T3. Where T1 and T2 are the node scores





// step2: Cluster findingProcedure ClusterFinding

S = Sort(V) // sort all nodes descendingly according to node score, for nodes// with the same score put the one with higher degree ahead

n = the first node in SClusters = a empty list of clusterwhilen is not the last node in S do

if n is not seen thenC = hkc(neighborhood(n))for all node q in C do

if q is seen or score(q) < score(n)∗T1thenremove q in C

end ifend formark all nodes in C seenfor all node m in C do

if m is expanded then continuefor all i in neighbors of m do

co = the cohesion between i and Cif score(i) >= score(m)∗T1 and

score(i) = T3 then

add i to Cend if

end formark m expanded

end foradd C to Clusters

end ifn = the next node of n in S

end whileend procedure

Pseudocode 2: Procedure Cluster Finding.

Seed

Core

Noncore

Figure 2: The illustration of a typical cluster with densely con-nected core and less densely connected non-core.

lower bound and upper bound, respectively, T1 is areal number between 0 and 1, and T2 is an integergreater than 1; T3 is the cohesion threshold, a realnumber ranging from 0 to 1. Add the found nodes

to the core and continue to expand the core until allnodes (including the new added nodes) in the corehave been expanded.

After finding a cluster using the above method, chooseanother seed and repeat the above cluster-finding processuntil no satisfying nodes can be considered as seed, in thisway we can get all clusters in the PPI network.

It is worth noting that while expanding a core, we neverconsider whether a node has been seen or not. As a result, anoncore part of a cluster can include nodes that have beenseen as the core of another cluster, that is to say the differentclusters detected by HKC can have overlaps between coresand noncores. In this way, we can find overlapping clusters,which better coincide with the fact that different proteincomplexes have overlaps.





// step3: FilteringProcedure Filtering

for all cluster c in Clusters doif the size of c is less than 3 then

remove c in Clustersend if

end forfor all cluster c in Clusters do

den(c) = the density of cs = size of cscore(c) = den(c)∗ s // compute the score of cluster

end forfor all cluster c1, c2 in Clusters do

o = the overlap ratio of c1 and c2if o > 0.95 then

if score(c1) > score(c2) then delete c2 in Clusterselse delete c1 in Clustersend if

end ifend forClusters = sort Clusters descendingly according to cluster scores

end procedure

Pseudocode 3: Procedure Filtering.

2.3.3. Filtering. The clusters found in step two contain manyclusters of size one or two, and these little clusters areinsignificant, since they can be obtained by randomly selectnodes in a PPI network. Thus we filter out the clusters thatcontain less than three nodes. Score all clusters using theproduct of cluster density and cluster size, and larger anddenser clusters will get higher scores (see the pseudocode inline 7–11 in Pseudocode 3). As the algorithm allows overlapsbetween cores and non-cores, the results in step two maycontain highly similar clusters, which must be filtered in thepostprocessing. We use overlap ratio (OR, see Section 3.1for more details) to measure the similarity between clusters;compare each two clusters, when their overlap ratio ishigher than 0.95, delete the one with lower score (see thepseudocode in line 12–19 in Pseudocode 3). Finally, rank allremaining clusters in descending order according to clusterscore.

2.3.4. Pseudocode

2.4. Implementation. The algorithm has been implementedin Java and we plan to convert it into a Cytoscape plug-in. Now the source code of the algorithm is available freelyfor noncommercial purposes upon request. All maps ofnetworks were performed by Cytoscape [21].

3. Results and Discussions

3.1. Evaluation of the Algorithm. To evaluate the performanceof our algorithm, we compare the predicted clusters with theprotein complexes in two different benchmarks: complexcatand Gavin benchmarks. Each of the predicted clusters iscompared with the benchmark complexes. The similaritybetween a predicted cluster and a benchmark complex ismeasured by overlap ratio (OR), which is defined as follows:

OR = 2∗ O(C1 + C2)

, (6)

where O is the number of proteins shared by a predictedcluster and a benchmark complex, C1 is the number ofproteins in the predicted cluster and C2 is the number ofproteins in the benchmark complex. The scope of OR isbetween 0 and 1. OR = 0 means the predicted clusterhas no proteins in common with the benchmark complex;OR = 1 means it is perfectly matched with the benchmarkcomplex. The higher OR is, the more biologically meaningfulthe detected cluster would be. A detected cluster can bedeemed as being matched with a benchmark complex onlywhen their overlap ratio is above a given threshold. And wecall a cluster an effective cluster as long as it has at least onebenchmark complex matching with it. In the same way, a


matched complex refers to the benchmark complex that hasa least one detected cluster that matching with it. A rationalOR threshold should ensure that the detected cluster sharesa large proportion of proteins with the matching benchmarkcomplex, and meanwhile it could not be too stringent. In thispaper, we adopt 0.4 as the OR threshold.

In [6] they use overlap score, defined as O2/(C1 ∗ C2),to determine how effectively a predicted cluster matched to aknown complex in the benchmark set, and it is assumed thata predicted cluster is more or less matches a known complexwhen is its overlap score is above 0.2. Here, we did not adoptthis scoring scheme, because it is biased, that is, it wouldget a relatively high score when a small predicted clustermatching with a large known complex or a large predictedcluster matching with a small complex. For example, whena predicted cluster of size 2 shares 2 proteins with a knowncomplex of size 10, its overlap score equals 0.2 and wouldthus be considered as matched with the known complex;actually, it is not so appropriate to deem such a small clusteras matching a known complex much larger that it. However,our definition of overlap ratio is more balanced and onlygives high score when the matching cluster and complexhave similar sizes. As for the above example, its overlap ratiois only 0.33, less than the threshold 0.4, and would not bedeemed as a match.

To compare the performance of different algorithms, wedefine three criteria: precision, recall, and F-measure, definedas the following formulas:

precision = ECAC

,

recall = MCBC

,

F-measure = 2× precision× recallprecision + recall

,

(7)

where EC is the number of effective clusters found by thealgorithm, AC is the number of all clusters predicted by thealgorithm, MC is the number of matched complexes in thebenchmark set, and BC is the total number of benchmarkcomplexes. Note that according to the overlap score thresh-old, EC may not equal MC, since one predicted cluster maymatch with several benchmark complexes as long as theiroverlap scores are higher than the given threshold, and inthe same way one benchmark complex may correspond withseveral predicted complexes. Precision describes the accuracyof the algorithm result; recall denotes the percentage ofbenchmark complexes that are recovered by the algorithm. F-measure, which is the harmonic mean of precision and recall,shows a good balance of precision and recall, and thus can beused to measure the overall performance of algorithms.

3.2. Experiments and Comparison. We, respectively, useMIPS and SGD-MC data sets as the input PPI network andrun HKC with 120 groups of parameter combination. Therange of T1 is between 0.3 and 0.7, with the step of 0.1, andthe range of T2 is between 5 and 20, with the step of 5, and therange of T3 is between 0.4 and 0.9, with the step of 0.1. We

evaluate the results using the two benchmarks: complexcatand Gavin benchmarks, and then choose the optimizedparameters which enable F-measure to get the highest value.The best result and the corresponding optimized parametersfor HKC are shown in Table 1.

To show the influence of different parameters on thealgorithm performance, we draw the plot of average F-measure versus T1, T2, and T3, respectively, as shown inFigure 3. Note that here the F-measure in y-axis is theaverage value of all F-measures with one parameter specifiedamong the 120 groups of experiment results evaluated bythe complexcat benchmark. From Figure 3, we can see thatamong the three parameters T3 has the greatest influenceon average F-measure, and for MIPS data set, the average F-measure gets the maximum value when T3 = 0.5, while forSGD-MC data set, the average F-measure gets the maximumvalue when T3 = 0.7. This is understandable, as the SGD-MC network (which contains 4,448 proteins and 29,068interactions) is much denser than the MIPS network (whichcontains 4,554 proteins and 12,526 interactions), and duringthe core expansion process the nodes in SGD-MC networkwould have higher cohesion with the core than the nodesin MIPS network. Therefore, for SGD-MC network whenexpanding the core, the cohesion threshold T3 should behigher than that for MIPS network. Furthermore, the figuresshow that a good range for T1 is in the middle, between 0.4and 0.6, the best value for T2 is 10, and for T3 the best rangeis between 0.5 and 0.8.

To show the performance of HKC, we compare it withMCODE [6], as shown in Table 1. We run the MCODEplugin in Cytoscape with 840 parameter combinations (thesame with that used in [6]) on MIPS and SGD-MC data setsrespectively, and then use the two benchmarks to evaluate theresults. The optimized parameters (see Table 1) are chosenbased on the highest F-measure. The result of HKC is thebest one in 120 groups of parameters, and the correspondingoptimized parameters are shown in Table 1. From this table,we can see that for MIPS data set, the recall of HKC is 0.429corresponding to the complexcat benchmark, considerablyhigher than that of MCODE (0.194); for SGD-MC data set,HKC can recall as high as 58% of protein complexes incomplexcat benchmark, notably higher than that of MCODE(around 22%). Experiment results show that whicheverbenchmark is adopted, for both MIPS and SGD-MC data set,the recall and F-measure of HKC are remarkably higher thanthat of MCODE, and the overall performance is substantiallyimproved.

As shown in Figure 4, whatever the OR threshold is, HKCcan extract much more effective clusters than MCODE inMIPS data set, and also the number of matched complexesby HKC is much higher than that by MCODE.

To show the overall performance improvement of ouralgorithm, we also plot precision versus recall for all resultswith different parameters in Figure 5. As can be seen from thefigure, for all four cases, the data points resulted by HKC arelocated in the upper right portion of the plot, correspondingto high values of F-measure, while most of the data pointsresulted by MCODE are located in the lower left part of theplot. The figure illustrates that both precision and recall of


0.3 0.4 0.5 0.6 0.70.34

0.345

0.35

0.355

0.36

0.365

0.37

0.375

0.38

0.385

T1

F-m

easu

re

(a)

0 5 10 15 20

T2

0.34

0.345

0.35

0.355

0.36

0.365

0.37

0.375

0.38

0.385

F-m

easu

re

(b)

0.3

0.32

0.34

0.36

0.38

0.4

0.3 0.4 0.5 0.6 0.7 0.8 0.9

T3

MIPSSGD-MC

F-m

easu

re

(c)

Figure 3: The influence of different parameters on algorithm performance. Note the F-measure in y-axis is the average value of all F-measureswith one parameter specified among the 120 groups of experiment results evaluated by the complexcat benchmark, triangles mark the resulton MIPS network, and circles mark the result on SGD-MC network.

Table 1: Comparison with MCODE. P, R, and F stand for precision, recall, and F-measure, respectively, and their definitions are givenin Section 3.1. MIPS data set contains 4,554 proteins and 12,526 interactions, and SGD-MC data set contains 4,448 proteins and 29,068interactions. AC is the number of all clusters predicted by the algorithm; EC is the number of effective clusters (with a least one matchingcomplex above overlap ratio 0.4) found by the algorithm; MC is the number of matched complexes in the benchmark set. The sizesof complexcat benchmark and Gavin benchmark are 217 and 204, respectively. For HKC the optimized parameters are T1, T2, and T3,respectively, and for MCODE the optimized parameters are NodeScoreCutoff, fluff (T for true, F for false), haircut (T for true, F for false),and other unspecified parameters adopt the default values.

Algorithm Data set Benchmark P R F AC EC MC Optimized parameters

MCODEMIPS

0.455 0.194 0.271 66 30 42 0.05, F, F

HKCcomplexcat

0.380 0.429 0.403 237 90 93 0.6, 10, 0.5

MCODESGD-MC

0.213 0.221 0.217 197 42 48 0.05, F, T

HKC 0.275 0.580 0.373 498 137 126 0.6, 10, 0.8

MCODEMIPS

0.303 0.098 0.148 66 20 20 0.05, F, T

HKCGavin

0.237 0.235 0.236 245 58 48 0.6, 20, 0.5

MCODESGD- MC

0.283 0.152 0.198 106 30 31 0, F, T

HKC 0.271 0.402 0.324 487 132 82 0.5, 5, 0.5


0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1OR threshold

020406080

100120140150

Nu

mbe

rof

mat

ched

com

plex

es

(a)

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

OR threshold

020406080

100120140160180

HKCMCODE

Nu

mbe

rof

effec

tive

clu

ster

s

(b)

Figure 4: The number of effective clusters and the number of matched complexes by HKC and MCODE with respect to different ORthresholds. The result is corresponding to MIPS data set and evaluated on complexcat benchmark. Triangles mark the results of HKC andcircles mark the results of MCODE.

HKC results for most parameter combinations are higherthan that of MCODE, showing the overall improvement ofthe algorithm performance. From this figure, we can alsosee that the data points resulted by HKC are much morecentralized than MCODE, indicating that our algorithm doesnot rely so severely on parameter selection.

3.3. Discussions. Among the 237 clusters found by HKCin MIPS data set, 8 clusters perfectly match with knownprotein complexes in complexcat benchmark. Figure 6(a)gives an example of one perfectly matched cluster: cluster23 (consisting of 11 proteins and 54 interactions) per-fectly matches with the TRAPP (transport protein particle)complex (catalog 260.60 in the complexcat benchmark),which plays an essential role in the vesicular transport fromendoplasmic reticulum to Golgi.

Figure 6(b) shows an example of a containment match.Cluster 12 (consisting of 14 proteins and 92 interactions)is totally contained in a known complex of size 16, SAGAcomplex (catalog 510.190.10.20.10 in the complexcat bench-mark), and their overlap ratio is 0.93. The two proteinsYCL010c and YGL066w that are not recovered by cluster12 have only one interaction with the cluster and do notexhibit good graph theoretic property. Actually, based onthe available information currently, we cannot assert thatYCL010c is contained in SAGA complex, and according to[22] it is only a probable subunit of SAGA complex.

Figure 6(c) gives an example of a well-matched clus-ter: cluster 103 matches with the complex of cytoplasmic

translation initiation factor 3 (eIF3, catalog 500.10.40 in thecomplexcat benchmark). Each of them contains 7 proteins,their overlap is 6 and their overlap ratio is 0.857. As shown inthe figure, protein YNL062c is contained in the benchmarkcomplex, but is not included in cluster 103 predicted byHKC. Furthermore, it has only one interaction with cluster103 and does not show an ideal topological property ofbelonging to a cluster. We searched it in Gene Ontology(GO) database [23] and found that according to the mostupdated GO annotation (release date 2011-05-14), YNL062cis not contained in the eIF3 complex (GO:0005852), but isa subunit of tRNA (1-methyladenosine) methyltransferasewith Gcd14p required for the modification of the adenine atposition 58 in tRNAs, especially tRNAi-Met. This indicatesthat the complexcat benchmark we use here may containerrors, because it was last modified in 2006 and many newprotein complexes have been identified through experimentssince then. In a way, it is possible to correct errors in thebenchmark by carefully examining the difference betweenpredicted clusters and their corresponding benchmark com-plexes with high overlap ratio.

Figure 6(d) shows a novel cluster (ranked 61) detectedby HKC, and it involves 5 proteins and 10 interactions.Cluster 61 does not match with any known protein complexin the complexcat benchmark, but is highly homogenousin the cellular component ontology and biological processontology. Search results on Gene Ontology database showthat protein YGL153w, YLR191w and YNL214w form thedocking complex that facilitates the import of peroxisomalmatrix proteins, and YGL153w is a central component of


0

0.1

0.2

0.3

0.4

0.5

0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6

Precision

Rec

all

(a)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6

Precision

Rec

all

(b)

(c)

0

0.1

0.2

0.3

0.4

0.5

0.6

Precision

Rec

all

WCODEHKC

0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 0.55 0.6

(d)

Figure 5: Precision versus Recall plots of all WCODE and HKC results with different parameters on various data sets. (a) Input data set:MIPS, benchmark: complexcat. (b) Input data set: MIPS, benchmark: Gavin. (c) Input data set: SGD-MC, benchmark: complexcat. (d) Inputdata set: SGD-MC, benchmark: Gavin. For all four cases, the data points resulted by HKC are located in the upper right portion of the plot,corresponding to high values of F-measure, while most of the data points resulted by MCODE are located in the lower left part of the plot.Furthermore, the data points resulted by HKC are much more centralized than MCODE.

the peroxisomal protein import machinery. The other twoproteins in this cluster, YDR244w, and YDR142c, are thePTS1 signal recognition factor and the PTS2 signal recogni-tion factor, respectively, and they also participate in the samebiological process protein docking during peroxisome matrixprotein import (GO: 0016560) as the docking complex.

The above illustrative examples show that HKC can notonly effectively detect protein complexes in genome-scalePPI networks, but also through the comparison of predictedclusters and their matching benchmark complexes, it mayhelp to correct the errors in the benchmark. Furthermore,HKC can discover novel protein complexes which can beused as candidates for experimental verification, and thusgreatly helps to reduce the time consumption and costof experiments. Among the 237 clusters resulted by ouralgorithm on MIPS data set, 147 clusters do not match withknown complexes in complexcat benchmark, and we give all49 clusters with score ≥4 and size ≥5 as novel predictions

in Table 2, which would be a starting point for experimentalvalidation in the future.

4. Conclusions and Future Work

A genome-scale PPI network is usually very large, consistingof thousands of proteins and tens of thousands of inter-actions, for example, the SGD-MC data set we use as theinput PPI network in this paper contains 4,448 proteinsand 29,068 interactions. It is a challenging task to extractprotein complexes in such a large and complicated network.To solve this problem, many computational methods havebeen proposed, including the graph theoretic clusteringalgorithm. In this paper, we presented a new topology-based algorithm, HKC, which mainly used the concepts ofhighest k-core and cohesion to predict protein complexesby identifying overlapping clusters. The experiments ontwo data sets and two benchmarks showed that HKC can


YDR118wYOR249c

YLR127c

YGL240w

YLR102c

YDL008wYNL172w

YHR166c

YKL022c

YBL084c

YFR036w

YMR146c YNL244c

YMR309c

YBR079cYPR041w

YNL062c

YOR361cYDR429c

YDR142c YDR244w

YNL214wYLR191w

YGL153w

(a)

(b)

(c)

(d)

YOL148c YBR198c

YBR081c

YHR099w

YLR055c

YDR392w

YMR236wYDR145w

YGR252w

YPL254w

YGL112c

YDR176w

YDR448w

YDR167wYCL010c

YGL066w

Figure 6: Examples of clusters predicted by HKC in MIPS data set. The nodes encircled by the red-dotted line are known complex in thecomplexcat benchmark, the nodes contained within the blue circle are clusters predicted by HKC, and the yellow node in each cluster denotesthe seed of the cluster. (a) A cluster of size 11 perfectly matches with the TRAPP complex. (b) A cluster of size 14 shares 14 proteins with theSAGA complex (size 16), and their overlap ratio is 0.93. The two proteins YCL010c and YGL066w that are not contained in the predictedcluster are isolated nodes with only one edge connecting with the cluster. (c) An example of a well-matched cluster, involving 7 proteins,among which 6 is in common with the complex of cytoplasmic translation initiation factor 3 (eIF3). (d) A novel cluster detected by HKC,which does not match with any known protein complexes in the complexcat benchmark, and the proteins in the black circle form the dockingcomplex that facilitates the import of peroxisomal matrix proteins according to GO annotation.

effectively extract protein complexes from genome-scale PPInetworks and exhibited better performance compared withsome other methods. Besides, HKC is general and can beused in any type of biological interaction networks.

There is huge amount of work to be done in PPI networkanalysis. As for protein complex prediction, there are also alot of researches to be done in the future. One of the problems

with the current PPI networks is that they are consisting ofinteractions that do not necessarily happen at the same timeand space. Instead, the interactions in PPI networks maybe unstable, transient or conditional, and may also happenin different subcellular locations. However, by definition, aprotein complex is a group of proteins that interact with eachother at the same time and place. As a result, to increase


Ta

ble

2:A

lln

ovel

pred

icti

ons

byH

KC

onM

IPS

data

set.

Eac

hco

lum

ngi

ves

the

orig

inal

IDof

clu

ster

s(I

D),

clu

ster

scor

e(S

),n

um

ber

ofn

odes

(N),

nu

mbe

rof

edge

s(E

),an

dth

epr

otei

nn

ames

inea

chcl

ust

er.

IDS

NE

Pro

tein

nam

es

121

.538

406

YLR

200w

,YN

L153

c,Y

DR

318w

,YO

R34

9w,Y

HR

191c

,YE

L003

w,Y

JL01

3c,Y

LR08

5c,Y

PL2

69w

,YH

R12

9c,Y

OR

026w

,YM

R05

5c,Y

PL1

55c,

YC

L029

c,Y

MR

048w

,Y

ML0

94w

,YJL

030w

,YG

R18

8c,Y

CL0

16c,

YP

R13

5w,Y

ML1

24c,

YP

R14

1c,Y

ER

016w

,YO

R05

8c,Y

ER

007w

,YD

R15

0w,Y

GR

078c

,YJR

053w

,YE

L061

c,Y

MR

294w

,YM

R29

9c,Y

PL0

08w

,YM

R07

8c,Y

GL0

86w

,YP

L174

c,Y

FL03

7w,Y

OL0

12c,

YP

L241

c

220

.133

328

YP

L008

w,Y

MR

078c

,YG

R18

8c,Y

GL0

86w

,YC

L016

c,Y

PL2

41c,

YM

R29

4w,Y

ER

007w

,YO

R05

8c,Y

ER

016w

,YP

R14

1c,Y

PR

135w

,YJL

030w

,YP

L155

c,Y

OR

026w

,Y

HR

129c

,YP

L269

w,Y

JL01

3c,Y

DR

254w

,YD

R31

8w,Y

FL03

7w,Y

PL1

74c,

YE

L061

c,Y

GR

078c

,YLR

200w

,YM

L124

c,Y

ML0

94w

,YC

L029

c,Y

EL0

03w

,YO

R34

9w,

YN

L153

c,Y

MR

138w

,YO

R26

5w

319

.727

260

YG

L086

w,Y

MR

078c

,YC

L016

c,Y

GR

188c

,YM

R04

8w,Y

OR

026w

,YJL

013c

,YH

R19

1c,Y

OR

349w

,YD

R31

8w,Y

PL0

08w

,YE

L061

c,Y

GR

078c

,YLR

200w

,YE

R01

6w,

YP

R14

1c,Y

PR

135w

,YJL

030w

,YM

L094

w,Y

EL0

03w

,YN

L153

c,Y

DL0

03w

,YD

R25

4w,Y

JR13

5c,Y

PR

046w

,YP

L018

w,Y

LR38

1w

418

.9

2826

0Y

GR

188c

,YLR

381w

,YP

L018

w,Y

DR

254w

,YD

R31

8w,Y

OL0

12c,

YG

L086

w,Y

PL0

08w

,YG

R07

8c,Y

DL0

03w

,YLR

200w

,YM

L094

w,Y

MR

048w

,YLR

085c

,Y

EL0

03w

,YO

R34

9w,Y

NL1

53c,

YN

L298

w,Y

MR

078c

,YE

L061

c,Y

ER

016w

,YP

R14

1c,Y

PR

135w

,YC

L016

c,Y

JL03

0w,Y

HR

191c

,YO

R19

5w,Y

GL2

16w

515

.224

179

YN

L271

c,Y

NL3

22c,

YB

L061

c,Y

BR

023c

,YG

R22

9c,Y

BL0

07c,

YE

R15

5c,Y

LR33

0w,Y

HR

030c

,YN

L298

w,Y

DL0

29w

,YJR

075w

,YG

R07

8c,Y

LR20

0w,Y

LR33

7c,

YM

L094

w,Y

DR

388w

,YC

R00

9c,Y

BR

234c

,YE

L003

w,Y

JL09

5w,Y

NL1

53c,

YJL

020c

,YM

R10

9w

615

.225

185

YE

R15

5c,Y

EL0

03w

,YG

R07

8c,Y

ML0

94w

,YN

L271

c,Y

GR

229c

,YN

L298

w,Y

LR33

7c,Y

NL1

53c,

YLR

200w

,YH

R03

0c,Y

BL0

07c,

YC

R00

9c,Y

JL09

5w,Y

DR

245w

,Y

BL0

61c,

YH

R14

2w,Y

LR33

0w,Y

DL0

29w

,YD

R38

8w,Y

BR

023c

,YB

R23

4c,Y

JR07

5w,Y

NL3

22c,

YD

R12

9c

714

.827

196

YB

L007

c,Y

EL0

03w

,YG

R07

8c,Y

LR20

0w,Y

ML0

94w

,YN

L153

c,Y

NL2

98w

,YJL

095w

,YH

R11

1w,Y

GR

229c

,YB

R23

4c,Y

CR

009c

,YE

R11

1c,Y

BR

023c

,YD

R38

8w,

YH

R03

0c,Y

DL0

29w

,YLR

330w

,YH

R14

2w,Y

DR

424c

,YN

L271

c,Y

FR01

9w,Y

BL0

61c,

YP

L031

c,Y

BR

200w

,YE

R15

5c,Y

OR

326w

814

.625

178

YLR

337c

,YN

L322

c,Y

NL2

71c,

YB

L007

c,Y

LR20

0w,Y

BL0

61c,

YH

R14

2w,Y

LR33

0w,Y

ML0

94w

,YB

R02

3c,Y

NL2

33w

,YC

R00

9c,Y

JR07

5w,Y

NL1

53c,

YN

L298

w,

YD

L029

w,Y

HR

030c

,YJL

183w

,YD

R38

8w,Y

BR

234c

,YG

R22

9c,Y

LR34

2w,Y

JL09

5w,Y

JL09

9w,Y

JR11

8c

914

.321

146

YN

L250

w,Y

ER

016w

,YP

R14

1c,Y

HR

191c

,YG

L163

c,Y

MR

078c

,YM

R19

0c,Y

KL1

13c,

YO

R14

4c,Y

CL0

61c,

YE

R17

3w,Y

PR

135w

,YC

L016

c,Y

MR

048w

,YLR

234w

,Y

JL09

2w,Y

LR10

3c,Y

ML0

32c,

YP

L194

w,Y

NL2

73w

,YJR

043c

1014

.321

147

YM

L032

c,Y

ER

016w

,YJR

043c

,YM

R07

8c,Y

NL2

73w

,YLR

103c

,YP

R13

5w,Y

CL0

16c,

YM

R04

8w,Y

HR

191c

,YG

L163

c,Y

MR

190c

,YD

R00

4w,Y

OR

144c

,YE

R09

5w,

YC

L061

c,Y

DR

076w

,YE

R17

3w,Y

NL2

50w

,YJL

092w

,YK

L113

c

1114

.221

146

YN

L298

w,Y

HR

030c

,YJL

095w

,YM

R10

9w,Y

BR

023c

,YD

L029

w,Y

BR

234c

,YG

R07

8c,Y

LR20

0w,Y

BL0

61c,

YM

L094

w,Y

DR

388w

,YJL

020c

,YC

R00

9c,Y

GR

229c

,Y

EL0

03w

,YN

L153

c,Y

LR33

7c,Y

EL0

31w

,YB

L007

c,Y

JR07

5w

1313

.923

155

YLR

200w

,YN

L271

c,Y

NL1

53c,

YB

R23

4c,Y

NL3

22c,

YH

R14

2w,Y

ER

111c

,YB

L007

c,Y

CR

009c

,YG

R22

9c,Y

JL09

5w,Y

NL2

98w

,YLR

330w

,YB

R02

3c,Y

JR07

5w,

YB

L061

c,Y

DL0

29w

,YH

R03

0c,Y

JR11

8c,Y

DR

388w

,YB

L047

c,Y

NL2

33w

,YLR

342w

1413

.723

153

YH

R19

1c,Y

ML0

32c,

YH

R03

1c,Y

MR

078c

,YC

L061

c,Y

CL0

16c,

YM

R04

8w,Y

NL2

73w

,YN

L250

w,Y

PR

135w

,YM

R19

0c,Y

KL1

13c,

YLR

234w

,YJR

043c

,YO

R14

4c,

YLR

103c

,YJL

092w

,YO

L006

c,Y

PL0

24w

,YB

R09

8w,Y

DR

386w

,YLR

235c

,YD

R36

3w

1513

.623

154

YE

R17

3w,Y

KL1

13c,

YM

L032

c,Y

NL2

73w

,YO

R14

4c,Y

MR

048w

,YJR

043c

,YM

R07

8c,Y

CL0

61c,

YLR

103c

,YP

R13

5w,Y

CL0

16c,

YH

R19

1c,Y

MR

190c

,YP

L024

w,

YD

R05

2c,Y

NL2

50w

,YH

R03

1c,Y

LR23

5c,Y

LR23

4w,Y

JL09

2w,Y

DL0

17w

,YH

R15

4w

1613

.321

136

YP

L194

w,Y

ML0

32c,

YN

L250

w,Y

JL09

2w,Y

LR10

3c,Y

MR

078c

,YC

L016

c,Y

HR

191c

,YG

L163

c,Y

JR04

3c,Y

MR

190c

,YK

L113

c,Y

NL2

73w

,YC

L061

c,Y

ER

016w

,Y

ER

173w

,YP

R13

5w,Y

MR

048w

,YO

R14

4c,Y

DL0

13w

,YE

R11

6c

1712

.922

138

YE

R01

6w,Y

GL1

63c,

YH

R19

1c,Y

NL2

73w

,YM

L032

c,Y

LR10

3c,Y

MR

078c

,YK

L113

c,Y

OR

144c

,YC

L061

c,Y

PR

135w

,YN

L250

w,Y

CL0

16c,

YH

R03

1c,Y

MR

048w

,Y

LR23

4w,Y

JL09

2w,Y

JR04

3c,Y

MR

190c

,YD

R36

3w,Y

BR

228w

,YLR

135w

2010

.813

66Y

ER

146w

,YD

R37

8c,Y

NL1

47w

,YLR

438c

-a,Y

ER

112w

,YJL

124c

,YC

R07

7c,Y

BL0

26w

,YJR

022w

,YO

L149

w,Y

GL1

73c,

YN

L118

c,Y

EL0

15w

2110

.811

54Y

BL0

26w

,YE

R14

6w,Y

DR

378c

,YJR

022w

,YN

L147

w,Y

LR43

8c-a

,YE

R11

2w,Y

OL1

49w

,YJL

124c

,YC

R07

7c,Y

GL1

73c

2410

1782

YB

L061

c,Y

DR

388w

,YB

R02

3c,Y

JL09

5w,Y

NL2

98w

,YN

L271

c,Y

LR33

0w,Y

HR

030c

,YE

R11

1c,Y

GR

229c

,YE

R15

5c,Y

PL0

31c,

YM

R30

7w,Y

OR

008c

,YLR

342w

,Y

LR37

1w,Y

LR33

2w

259.

917

81Y

JL09

5w,Y

DL0

29w

,YC

R00

9c,Y

NL2

98w

,YB

L061

c,Y

DR

388w

,YE

R11

1c,Y

GR

229c

,YN

L322

c,Y

LR33

0w,Y

HR

030c

,YB

R02

3c,Y

MR

307w

,YG

L027

c,Y

ML1

15c,

YG

L200

c,Y

PR

159w

269.

517

78Y

NL2

98w

,YD

R38

8w,Y

ER

111c

,YN

L271

c,Y

LR33

0w,Y

HR

030c

,YB

L061

c,Y

BR

023c

,YN

L322

c,Y

MR

307w

,YLR

039c

,YLR

262c

,YG

R22

9c,Y

LR34

2w,Y

JR07

3c,

YJL

183w

,YK

L190

w27

8.6

1456

YLR

039c

,YP

L051

w,Y

LR26

2c,Y

JL15

4c,Y

DR

126w

,YG

L005

c,Y

OR

132w

,YO

R06

9w,Y

NL0

51w

,YH

L031

c,Y

NL0

41c,

YO

R07

0c,Y

ML0

71c,

YB

R16

4c


Ta

ble

2:C

onti

nu

ed.

IDS

NE

Pro

tein

nam

es28

8.5

1455

YO

L012

c,Y

LR03

9c,Y

LR26

2c,Y

LR08

5c,Y

LR41

8c,Y

AL0

11w

,YA

L013

w,Y

MR

263w

,YJL

168c

,YM

L041

c,Y

GL2

44w

,YP

L181

w,Y

DR

334w

,YO

R12

3c29

7.7

1346

YN

L051

w,Y

HL0

31c,

YP

L051

w,Y

OR

070c

,YN

L041

c,Y

LR26

2c,Y

ML0

71c,

YD

R12

6w,Y

KL1

90w

,YLR

039c

,YB

R16

4c,Y

KR

001c

,YM

R00

4w30

7.6

1139

YB

R02

3c,Y

KL1

90w

,YN

L322

c,Y

HR

030c

,YG

R22

9c,Y

MR

307w

,YJR

073c

,YLR

262c

,YD

R16

2c,Y

LR03

9c,Y

DL0

06w

317.

513

45Y

NL0

51w

,YM

L071

c,Y

PL0

51w

,YO

R07

0c,Y

NL0

41c,

YLR

262c

,YO

L018

c,Y

DR

126w

,YH

L031

c,Y

BR

164c

,YLR

039c

,YN

L238

w,Y

LL04

0c

396

1751

YB

R20

0w,Y

NL2

98w

,YO

R12

7w,Y

HR

061c

,YH

L007

c,Y

ER

149c

,YG

R15

2c,Y

LL02

1w,Y

PL1

15c,

YLR

319c

,YD

R30

9c,Y

BL0

85w

,YA

L041

w,Y

ER

114c

,YO

R18

8w,

YM

R27

3c,Y

LR22

9c40

68

25Y

CR

088w

,YO

R28

4w,Y

GR

268c

,YD

R38

8w,Y

BL0

07c,

YN

L094

w,Y

NL2

43w

,YH

R01

6c42

67

18Y

ER

155c

,YA

L040

c,Y

DL0

47w

,YK

R02

8w,Y

JL09

8w,Y

FR04

0w,Y

KR

072c

435.

810

27Y

MR

263w

,YO

R12

3c,Y

LR08

5c,Y

GL2

44w

,YM

L041

c,Y

JL16

8c,Y

AL0

13w

,YLR

418c

,YB

L008

w,Y

OR

038c

485.

58

26Y

IL14

4w,Y

DR

201w

,YFL

008w

,YO

L069

w,Y

FR03

1c,Y

EL0

43w

,YD

L074

c,Y

JL07

4c

505.

317

55Y

LR34

7c,Y

NL1

89w

,YM

L064

c,Y

LR08

2c,Y

LR24

5c,Y

DR

321w

,YP

L070

w,Y

BR

176w

,YN

L044

w,Y

NL3

31c,

YP

L124

w,Y

ER

023w

,YLR

423c

,YJR

056c

,YE

L066

w,

YJL

218w

,YN

R01

2w51

5.3

926

YO

R28

4w,Y

JL19

9c,Y

NL1

89w

,YM

L064

c,Y

BR

176w

,YLR

245c

,YP

L070

w,Y

LR29

1c,Y

HL0

18w

554.

98

18Y

HR

066w

,YP

L093

w,Y

ER

006w

,YP

L211

w,Y

MR

049c

,YM

R29

0c,Y

NL0

02c,

YG

R10

3w56

4.9

817

YG

L244

w,Y

LR03

9c,Y

LR26

2c,Y

OR

216c

,YLR

085c

,YLR

418c

,YIR

005w

,YG

L174

w

594.

517

49Y

ML0

64c,

YN

L189

w,Y

LR34

7c,Y

GR

010w

,YK

L067

w,Y

FR04

7c,Y

GL1

75c,

YO

R02

0c,Y

LR32

8w,Y

LR33

5w,Y

GL0

40c,

YG

L037

c,Y

NL3

33w

,YFL

059w

,YP

L111

w,

YG

R26

7c,Y

LR37

7c60

4.5

511

YFL

059w

,YN

L333

w,Y

MR

322c

,YM

R09

5c,Y

MR

096w

614.

55

10Y

NL2

14w

,YD

R24

4w,Y

GL1

53w

,YD

R14

2c,Y

LR19

1w65

4.4

612

YLR

310c

,YLL

016w

,YA

L024

c,Y

NL0

98c,

YAR

019c

,YO

R10

1w70

4.3

716

YM

L064

c,Y

PL1

24w

,YLR

423c

,YP

L070

w,Y

DR

148c

,YE

R08

6w,Y

DL2

39c

714.

37

14Y

DL2

26c,

YG

R17

2c,Y

LR32

4w,Y

GL1

98w

,YD

R42

5w,Y

GL1

61c,

YP

L095

c73

4.3

714

YLR

347c

,YM

R04

7c,Y

KL0

68w

,YB

R21

6c,Y

ML0

07w

,YE

R10

7c,Y

DL2

07w

744.

37

14Y

LR03

9c,Y

PL2

34c,

YK

L190

w,Y

LR26

2c,Y

ER

081w

,YH

R06

0w,Y

GR

020c

774

511

YE

R09

3c,Y

JL05

8c,Y

JR06

6w,Y

KL2

03c,

YN

L006

w78

45

12Y

KL2

03c,

YJR

066w

,YN

L006

w,Y

HR

186c

,YP

L180

w82

45

8Y

LR03

9c,Y

LR26

2c,Y

CR

020c

-a,Y

PR

051w

,YE

L053

c90

45

8YA

R01

9c,Y

GR

092w

,YD

L047

w,Y

PR

111w

,YIL

106w

934

58

YLR

039c

,YO

R07

0c,Y

BR

036c

,YM

R27

2c,Y

PL0

57c

944

58

YM

R19

7c,Y

HL0

31c,

YLR

026c

,YK

L006

c-a,

YK

L196

c


the precision and recall of complex prediction algorithm,further information about the time, space, or conditions ofinteractions should be taken into consideration.

Acknowledgments

This work was supported in part by the Grants fromthe National Natural Science Foundation of China (no.60835005) and the scientific research project of NationalUniversity of Defense Technology (jc09-03-04).

References

[1] Y. Ho, A. Gruhler, A. Heilbut et al., “Systematic identificationof protein complexes in Saccharomyces cerevisiae by massspectrometry,” Nature, vol. 415, no. 6868, pp. 180–183, 2002.

[2] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, andY. Sakaki, “A comprehensive two-hybrid analysis to explorethe yeast protein interactome,” Proceedings of the NationalAcademy of Sciences of the United States of America, vol. 98, no.8, pp. 4569–4574, 2001.

[3] V. Spirin and L. A. Mirny, “Protein complexes and functionalmodules in molecular networks,” Proceedings of the NationalAcademy of Sciences of the United States of America, vol. 100,no. 21, pp. 12123–12128, 2003.

[4] L. H. Hartwell, J. J. Hopfield, S. Leibler, and A. W. Murray,“From molecular to modular cell biology,” Nature, vol. 402,no. 6761, pp. C47–C52, 1999.

[5] M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, K. Kurokawa, and S.Kanaya, “Development and implementation of an algorithmfor detection of protein complexes in large interaction net-works,” BMC Bioinformatics, vol. 7, article 207, 2006.

[6] G. D. Bader and C. W. V. Hogue, “An automated methodfor finding molecular complexes in large protein interactionnetworks,” BMC Bioinformatics, vol. 4, article 2, 2003.

[7] A. D. King, N. Pržulj, and I. Jurisica, “Protein complexprediction via cost-based clustering,” Bioinformatics, vol. 20,no. 17, pp. 3013–3020, 2004.

[8] J. Chen and B. Yuan, “Detecting functional modules in theyeast protein-protein interaction network,” Bioinformatics,vol. 22, no. 18, pp. 2283–2290, 2006.

[9] M. Girvan and M. E. J. Newman, “Community structure insocial and biological networks,” Proceedings of the NationalAcademy of Sciences of the United States of America, vol. 99, no.12, pp. 7821–7826, 2002.

[10] G. Palla, I. Derényi, I. Farkas, and T. Vicsek, “Uncoveringthe overlapping community structure of complex networks innature and society,” Nature, vol. 435, no. 7043, pp. 814–818,2005.

[11] H. N. Chua, K. Ning, W. K. Sung, H. W. Leong, and L.Wong, “Using indirect protein-protein interactions for proteincomplex predication,” Life Sciences Society: ComputationalSystems Bioinformatics Conference, vol. 6, pp. 97–109, 2007.

[12] G. Cui, Y. Chen, D. S. Huang, and K. Han, “An algorithmfor finding functional modules and protein complexes inprotein-protein interaction networks,” Journal of Biomedicineand Biotechnology, vol. 2008, Article ID 860270, 10 pages,2008.

[13] H. W. Mewes, S. Dietmann, D. Frishman et al., “MIPS: analysisand annotation of genome information in 2007,” Nucleic AcidsResearch, vol. 36, no. 1, pp. D196–D201, 2008.

[14] U. Güldener, M. Münsterkötter, G. Kastenmüller et al.,“CYGD: the comprehensive yeast genome database,” NucleicAcids Research, vol. 33, pp. D364–D368, 2005.

[15] J. M. Cherry, C. Adler, C. Ball et al., “SGD: saccharomycesgenome database,” Nucleic Acids Research, vol. 26, no. 1, pp.73–79, 1998.

[16] S. Brohée and J. van Helden, “Evaluation of clusteringalgorithms for protein-protein interaction networks,” BMCBioinformatics, vol. 7, article 488, 2006.

[17] A. C. Gavin, M. Bösche, R. Krause et al., “Functionalorganization of the yeast proteome by systematic analysis ofprotein complexes,” Nature, vol. 415, no. 6868, pp. 141–147,2002.

[18] N. J. Krogan, W. T. Peng, G. Cagney et al., “High-definitionmacromolecular composition of yeast RNA-processing com-plexes,” Molecular Cell, vol. 13, no. 2, pp. 225–239, 2004.

[19] A. C. Gavin, P. Aloy, P. Grandi et al., “Proteome survey revealsmodularity of the yeast cell machinery,” Nature, vol. 440, no.7084, pp. 631–636, 2006.

[20] J. Gagneur, R. Krause, T. Bouwmeester, and G. Casari, “Mod-ular decomposition of protein-protein interaction networks,”Genome Biology, vol. 5, no. 8, p. R57, 2004.

[21] S. Killcoyne, G. W. Carter, J. Smith, and J. Boyle, “Cytoscape:a community-based framework for network modeling,” Meth-ods in Molecular Biology, vol. 563, pp. 219–239, 2009.

[22] S. L. Sanders, J. Jennings, A. Canutescu, A. J. Link, and P. A.Weil, “Proteomics of the eukaryotic transcription machinery:identification of proteins associated with components of yeastTFIID by multidimensional mass spectrometry,” Molecularand Cellular Biology, vol. 22, no. 13, pp. 4723–4738, 2002.

[23] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology:tool for the unification of biology,” Nature Genetics, vol. 25,no. 1, pp. 25–29, 2000.

Submit your manuscripts athttp://www.hindawi.com

Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014

Anatomy Research International

PeptidesInternational Journal of


Hindawi Publishing Corporation http://www.hindawi.com

International Journal of

Volume 2014

Zoology


Molecular Biology International

GenomicsInternational Journal of


The Scientific World JournalHindawi Publishing Corporation http://www.hindawi.com Volume 2014


BioinformaticsAdvances in

Marine BiologyJournal of



Signal TransductionJournal of


BioMed Research International

Evolutionary BiologyInternational Journal of



Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttp://www.hindawi.com Volume 2014


Genetics Research International


Advances in

Virolog y

Hindawi Publishing Corporationhttp://www.hindawi.com

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational



Enzyme Research


International Journal of

Microbiology

hkc:analgorithmtopredictproteincomplexesin protein ... · hkc(g). it is the central most densely...

Documents