  • Hindawi Publishing Corporation, Journal of Biomedicine and Biotechnology, Volume 2011, Article ID 480294, 14 pages, doi:10.1155/2011/480294

    Research Article

    HKC: An Algorithm to Predict Protein Complexes in Protein-Protein Interaction Networks

    Xiaomin Wang,1 Zhengzhi Wang,1 and Jun Ye2, 3

    1 Institute of Mechanical Engineering and Automation, National University of Defense Technology, Changsha 410073, China
    2 Department of Software Engineering, Jiangnan Institute of Computing Technology, Wuxi 214083, China
    3 School of Computer, National University of Defense Technology, Changsha 410073, China

    Correspondence should be addressed to Xiaomin Wang, [email protected]

    Received 23 May 2011; Accepted 24 August 2011

    Academic Editor: Paul W. Doetsch

    Copyright © 2011 Xiaomin Wang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    With the availability of more and more genome-scale protein-protein interaction (PPI) networks, research interest is gradually shifting to the systematic analysis of these large data sets. A key topic is to predict protein complexes in PPI networks by identifying clusters that are densely connected within themselves but sparsely connected with the rest of the network. In this paper, we present a new topology-based algorithm, HKC, to detect protein complexes in genome-scale PPI networks. HKC mainly uses the concepts of highest k-core and cohesion to predict protein complexes by identifying overlapping clusters. The experiments on two data sets and two benchmarks show that our algorithm has a relatively high F-measure and exhibits better performance compared with some other methods.

    1. Introduction

    With the development of high-throughput methods (such as mass spectrometry [1] and two-hybrid [2]), more and more genome-scale protein-protein interaction (PPI) networks are now available, enabling us to systematically analyze the behaviors and properties of biological molecules. Among the various lines of research on PPI networks, a key topic is the prediction of protein complexes. A protein complex is a group of proteins that interact with each other at the same time and place, forming a single multimolecular machine [3]. These complexes are a cornerstone of many biological processes. PPI networks are modular [3, 4] and contain modules that are densely connected within themselves but sparsely connected with the rest of the network. These modules are called clusters, and they may represent protein complexes [5, 6]. Thus, protein complexes can be detected by identifying clusters in PPI networks. Detecting protein complexes in PPI networks is of vital importance to the understanding of the structural and functional properties of PPI networks and can also help to predict the function of unknown proteins.

    Due to the high level of noise as well as the topological features of PPI networks, traditional clustering techniques in a metric space cannot successfully predict protein complexes in PPI networks [3]; thus, various graph analysis approaches have been proposed to solve this problem. These approaches can be classified into the following three types. (1) Agglomerative methods: Bader and Hogue [6] proposed a graph theoretic clustering algorithm to detect molecular complexes in PPI networks. The method was based on vertex weighting by local neighborhood density and outward traversal from a locally dense seed protein to isolate the dense regions according to given parameters. The problem with this method is that, using only vertex weighting, it can find sparsely connected subgraphs (such as a rope-like subgraph or a mixed subgraph consisting of several connected clusters) instead of dense clusters. (2) Graph partition methods: King et al. [7] partitioned networks and found an approximately optimal solution in the space of partitions using a cost-based local search algorithm. This technique was nondeterministic, giving a different result in each run, and it consumed a large amount of memory. Chen and Yuan [8] extended a betweenness-based partition algorithm (the Girvan-Newman algorithm [9], GN for short) and used it to partition PPI networks into subgraphs and then obtained functional modules by filtering these subgraphs. The chief drawback of the GN algorithm is that it is time consuming. Generally, graph partition methods can only find nonoverlapping clusters, while protein complexes tend to overlap with each other. (3) Methods based on cliques (see Section 2.2 Concepts): Palla et al. [10] suggested the clique percolation method (CPM) to find overlapping communities. The core concept of CPM is the k-clique community, which is defined as the union of all k-cliques (complete subgraphs of size k) that can be reached from each other through a series of adjacent k-cliques (where adjacency means sharing k-1 nodes). Some other researchers [11, 12] predicted protein complexes by identifying cliques or near-cliques in PPI networks. Clique-based approaches are too stringent about the topological structure and cannot detect protein complexes with other types of topological structure.

    In this paper, we present a new topology-based algorithm, HKC (named after the two concepts it mainly uses, highest k-core and cohesion), to detect protein complexes in genome-scale PPI networks. The algorithm predicts protein complexes by identifying overlapping clusters. It first calculates a score for each node in the PPI network based on the concept of highest k-core and then uses the nodes with high scores and high degrees as seeds. From each seed, it obtains a core according to the highest k-core of the direct neighborhood of the seed and expands this core to include nodes that are highly likely to form a cluster, based on the criteria of node score and cohesion. Finally, protein complexes are predicted by filtering all the found clusters according to predefined features. We apply HKC to two data sets (the MIPS and SGD-MC data sets) and evaluate the results on two benchmarks (the complexcat and Gavin benchmarks). The experiments show that our algorithm has relatively high precision and recall; that is, most of the predicted clusters match well with known protein complexes, and at the same time most of the known protein complexes are recalled. HKC exhibits better performance compared with some other methods; besides, it is general and can be used in any type of biological interaction network and even in nonbiological networks.

    2. Materials and Method

    2.1. Materials. Protein-protein interactions of the model organism Saccharomyces cerevisiae (yeast) have been studied thoroughly, and the data on yeast protein complexes are the most comprehensive so far, so we test HKC on the yeast PPI data and use yeast complex data to validate its effectiveness.

    We use the following two data sets as the input network: (1) The MIPS data set, which is derived from the Munich Information Center for Protein Sequences (MIPS) [13] yeast PPI data set (containing 15,456 records) downloaded from the Comprehensive Yeast Genome Database (CYGD) [14]; after removing repeated interactions and ignoring edge direction, the final input PPI network contains 4,554 proteins and 12,526 interactions. (2) The SGD-MC data set, which is downloaded from the Literature Curation Data in the SGD database [15]; the original data set contains 252,247 records, including two types of interactions, high-throughput and manually curated interactions, and we only use the manually curated interactions. After deleting all repeated interactions and ignoring edge direction, the final SGD-MC data set contains 4,448 proteins and 29,068 interactions.

    To evaluate the algorithm, we collect two data sets as benchmarks: (1) The complexcat benchmark, which is obtained by processing the MIPS yeast protein complex catalog in CYGD [14]. The MIPS yeast protein complex catalog was last modified in 2006 and has been manually curated from the literature; therefore, it is more realistic than other data obtained by high-throughput methods and has been used in many studies because of its quality [6, 7, 16]. However, it is not appropriate to use this data set directly as the benchmark, since it contains many complexes composed of a single protein, which do not fit the definition of clusters, and it also contains the complexes obtained by systematic analysis [17, 18], which are not as reliable as results from small-scale experiments. After removing the complexes obtained by systematic analysis (the type of 550) and the complexes consisting of only one protein, the final complexcat benchmark contains 217 protein complexes. (2) The Gavin benchmark, which is derived from the experimental results of Gavin et al. [19]. They used affinity purification and mass spectrometry to obtain 491 complexes that differentially combine with additional attachment proteins or protein modules to enable a diversification of potential functions. We adopt the core proteins (those present in 2/3 of the isoforms) of the 491 complexes and 36 additional known complexes, and after removing those of size less than 3, we finally obtain 204 protein complexes in the Gavin benchmark.

    2.2. Concepts

    2.2.1. PPI Network. PPI networks can be intuitively modeled as a static graph G = (V, E), where V is the set of nodes (proteins), and E is the set of edges (protein-protein interactions).

    An undirected edge is drawn between each pair of nodes for which there is evidence of a protein-protein interaction.

    2.2.2. Basic Concepts in Graph Theory

    Degree. The degree of the node n is the number of edges attached to it and is denoted as deg(n).

    Graph Density. There is no standard graph theory definition of density, but definitions are normally based on the connectivity level of a graph [6]. For undirected simple graphs, the graph density is defined as the quotient of the number of real edges in the graph divided by the number of all possible edges:

    $$ \mathrm{den}(G) = \frac{|E|}{|V| \, (|V| - 1)/2}, \tag{1} $$

    where den(G) is the density of graph G, |E| is the number of real edges in G, and |V| is the number of nodes in G.


    For an undirected graph with loops (a loop is an edge that connects a node to itself), the number of all possible edges is |V| ∗ |V|/2; thus the density of graph G is defined as

    $$ \mathrm{den}(G) = \frac{|E|}{|V| \, |V|/2}. \tag{2} $$

    So, the range of graph density is between 0 and 1.
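The two density definitions can be illustrated with a small sketch. The function name `graph_density` and the `allow_loops` flag are illustrative choices, not part of the original paper, and returning 0 for graphs too small to contain an edge is an assumption made here to avoid division by zero.

```python
def graph_density(num_nodes, num_edges, allow_loops=False):
    """Density of an undirected graph following Eq. (1) or Eq. (2).

    allow_loops=False uses |V|(|V|-1)/2 possible edges (Eq. 1);
    allow_loops=True uses |V|*|V|/2 possible edges (Eq. 2).
    """
    if allow_loops:
        possible = num_nodes * num_nodes / 2
    else:
        possible = num_nodes * (num_nodes - 1) / 2
    if possible == 0:
        # Assumption: a graph that cannot contain any edge has density 0.
        return 0.0
    return num_edges / possible
```

For instance, a clique of 4 nodes has 6 edges and 4 ∗ 3/2 = 6 possible edges, so Eq. (1) gives a density of 1.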

    Clique. A clique is a fully connected subgraph, that is, a set of nodes that are all neighbors of each other [20]. For instance, Figure 1(c) is a clique consisting of 4 nodes.

    k-Core. A k-core of a graph is a maximal subgraph in which each node has degree at least k; it is denoted as kc(G). For example, in Figure 1, the graph in (b) is a 2-core of the graph in (a).

    Highest k-Core. The highest k-core of a graph is the k-core with the maximal k value among all the k-cores and is denoted as hkc(G). It is the central, most densely connected subgraph. The highest k-core of G can be found in the following way [6]: suppose the lowest degree of the nodes in G is l, and delete all nodes with degree l. If all remaining nodes then have degree at least l1 (l1 > l), we obtain the l1-core of G; if some of the remaining nodes have degree lower than or equal to l, continue deleting the nodes with the lowest degree until all remaining nodes have degree higher than l, or until all nodes have been deleted. In this way we can find all k-cores, and the one with the maximal k value is the highest k-core. For example, Figure 1(a) shows a graph G. If we delete the node of degree 1 (i.e., node 6), we obtain a subgraph in which each node has degree at least 2, that is, the 2-core of G, as shown in Figure 1(b). If we then delete the node of degree 2 (node 1), we obtain a subgraph in which each node has degree at least 3, the 3-core of G, as shown in Figure 1(c), which is also the highest k-core of graph G.
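The peeling procedure described above can be sketched as follows. This is a minimal illustration, not the authors' Java implementation: the graph is assumed to be simple and stored as a dict mapping each node to the set of its neighbors, and `FIGURE_LIKE_ADJ` is a hypothetical edge list chosen only to be consistent with the description of Figure 1 (the actual edges of the figure are not recoverable from the text).

```python
def highest_k_core(adj):
    """Return (k_max, nodes) for the highest k-core of a simple undirected graph.

    adj maps each node to the set of its neighbors (no self-loops assumed).
    """
    graph = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    best_k, best_nodes = 0, set(graph)
    while graph:
        # The current graph has minimum degree k, so it is itself a k-core.
        k = min(len(vs) for vs in graph.values())
        best_k, best_nodes = k, set(graph)
        # Peel every node whose degree is <= k, cascading to neighbors.
        stack = [u for u, vs in graph.items() if len(vs) <= k]
        while stack:
            u = stack.pop()
            if u not in graph:
                continue
            for v in graph.pop(u):
                graph[v].discard(u)
                if len(graph[v]) <= k:
                    stack.append(v)
    return best_k, best_nodes


# Hypothetical graph consistent with the Figure 1 description:
# nodes 2-5 form a clique, node 1 attaches to nodes 2 and 3, node 6 to node 1.
FIGURE_LIKE_ADJ = {
    1: {2, 3, 6},
    2: {1, 3, 4, 5},
    3: {1, 2, 4, 5},
    4: {2, 3, 5},
    5: {2, 3, 4},
    6: {1},
}
```

Calling `highest_k_core(FIGURE_LIKE_ADJ)` removes node 6 first, then node 1, and stops at the 4-node clique, mirroring the worked example in the text.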

    2.2.3. Cohesion. During the expansion of a core, the score information alone is not enough to decide reliably whether a node should be included in the core. We define a new concept, cohesion, to measure the connectivity between a node n and an existing cluster c, and we denote it as co(n, c). Cohesion is calculated in the following way:

    $$ \mathrm{co}(n, c) = \frac{E_{nc}}{N_c}, \tag{3} $$

    where E_nc is the number of edges between node n and cluster c, and N_c is the number of nodes in cluster c. co(n, c) is a real number between 0 and 1. The larger co(n, c) is, the more tightly node n connects to cluster c, and the more likely they belong to a larger cluster. Therefore, cohesion can be used as a criterion during the core expansion process. A node can only be included in a core when the cohesion between this node and the core is greater than a specified threshold.
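Equation (3) translates almost directly into code. The sketch below assumes a simple graph stored as a dict of neighbor sets; the function name `cohesion` and the convention of returning 0 for an empty cluster are illustrative assumptions.

```python
def cohesion(adj, node, cluster):
    """co(n, c): edges between `node` and `cluster`, divided by the cluster size."""
    cluster = set(cluster)
    if not cluster:
        return 0.0  # assumption: cohesion with an empty cluster is 0
    edges_to_cluster = len(adj.get(node, set()) & cluster)
    return edges_to_cluster / len(cluster)
```

A node linked to 2 of the 3 members of a cluster has cohesion 2/3 with it, and a node linked to every member has cohesion 1.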

    2.3. The Algorithm. The algorithm HKC consists of the following three steps: scoring, cluster finding, and filtering.

    2.3.1. Scoring. The first step of HKC is to score all nodes in the PPI network. For each node, we first find the highest k-core of its direct neighborhood (the subgraph consisting of all nodes connected to the node, including the node itself), which we denote as H. Then we score H using the properties of the highest k-core. Larger and denser cores get higher scores, and the score is computed in the following way:

    $$ \mathrm{score}(H) = N_H \times \mathrm{den}(H) \times k_{\max}, \tag{4} $$

    where N_H denotes the number of nodes in H, den(H) refers to the density of H, and k_max is the maximal k value corresponding to the highest k-core H. The larger these three values are, the larger and denser the corresponding highest k-core is.

    As the highest k-core is the most densely connected central core in the local area, in order to make sure that the node score better reflects the connectivity in the local area, we assign score(H) to each node in H. A node n may be contained in more than one highest k-core and thus may be given more than one score; the final score of this node is defined as the maximal score of all highest k-cores in which node n is contained (see the pseudocode in line 14–16 in Pseudocode 1):

    $$ \mathrm{score}(n) = \max\{\mathrm{score}(H_i) \mid \forall H_i,\ n \in H_i\}. \tag{5} $$

    In this way, we can make sure that all nodes in densely connected subgraphs have high scores and those with few neighbors have low scores, and therefore we can distinguish the nodes in clusters from those not in clusters with the help of the node score.
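The scoring step (Equations (4) and (5)) can be sketched as below. This is an illustrative reading of the description rather than the authors' code: the graph is assumed to be simple and stored as a dict of neighbor sets, and the helper names (`highest_k_core`, `induced`, `density`, `score_nodes`) are hypothetical.

```python
def highest_k_core(adj):
    """Return (k_max, nodes) of the highest k-core by repeated peeling."""
    graph = {u: set(vs) for u, vs in adj.items()}
    best_k, best_nodes = 0, set(graph)
    while graph:
        k = min(len(vs) for vs in graph.values())
        best_k, best_nodes = k, set(graph)
        stack = [u for u, vs in graph.items() if len(vs) <= k]
        while stack:
            u = stack.pop()
            if u not in graph:
                continue
            for v in graph.pop(u):
                graph[v].discard(u)
                if len(graph[v]) <= k:
                    stack.append(v)
    return best_k, best_nodes


def induced(adj, nodes):
    """Subgraph of adj induced by the node set `nodes`."""
    return {u: adj[u] & nodes for u in nodes}


def density(sub):
    """Eq. (1) for a simple graph given as a dict of neighbor sets."""
    n = len(sub)
    if n < 2:
        return 0.0
    edges = sum(len(vs) for vs in sub.values()) / 2
    return edges / (n * (n - 1) / 2)


def score_nodes(adj):
    """Assign each node the maximal score(H) over all cores H containing it."""
    score = {u: 0.0 for u in adj}
    for n in adj:
        neighborhood = induced(adj, adj[n] | {n})   # direct neighborhood incl. n
        k_max, core = highest_k_core(neighborhood)
        s = len(core) * density(induced(adj, core)) * k_max  # Eq. (4)
        for m in core:
            score[m] = max(score[m], s)              # Eq. (5)
    return score
```

On a graph where nodes 2 to 5 form a clique, node 1 is linked to nodes 2 and 3, and node 6 is linked to node 1, the clique members receive 4 ∗ 1 ∗ 3 = 12, node 1 receives 6 (from the triangle 1-2-3), and node 6 receives 2.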

    2.3.2. Cluster Finding. The second step of the algorithm is to find clusters in the scored graph. The process of finding a cluster is as follows: first choose the seed, then obtain the core based on the seed, and then expand the core to include the noncore nodes. The final cluster obtained in this way is thus a circular subgraph with a densely connected core and a less densely connected noncore, as shown in Figure 2, which fits our expectation of the topological structure of protein complexes.

    The detailed process of finding a cluster can be divided into the following three steps.

    (1) Choose the seed. As node scores represent the local density, a node with a very high score must have a very dense neighborhood; besides, among nodes with the same score value (e.g., according to our scoring scheme, nodes in the same highest k-core may have the same score value), the one with the highest degree can be deemed the most densely connected node. Therefore, among all the unseen nodes with the highest score, we choose the one with the highest degree as the seed (see the pseudocode in line 2–4 in Pseudocode 2). This way of selecting the seed ensures that the seed is in the center of the corresponding cluster.

    (2) Get the core based on the selected seed. First get the highest k-core of the direct neighborhood of the seed; after removing all nodes that have already been seen and those with too low scores, we obtain the core (see the pseudocode in line 8–13 in Pseudocode 2). In order to avoid repeated computation, all nodes in the core (including the seed) are marked as seen and cannot be used as the core or the seed of another cluster.

    Figure 1: The illustration of k-cores of a graph. (a) Graph G (nodes 1 to 6); (b) the 2-core of G (nodes 1 to 5); (c) the 3-core of G (nodes 2 to 5), which is also the highest k-core of G.

    Input:
        PPI network (an undirected simple graph): G = (V, E)
        Node score thresholds: T1 and T2
        Cohesion threshold: T3
    Output:
        The predicted protein clusters: Clusters

    Call Scoring
    Call ClusterFinding
    Call Filtering

    // step 1: scoring
    Procedure Scoring
        for all node n in V do
            score(n) = 0
        end for
        for all node n in V do
            compute the degree of n
            N = neighborhood(n)  // neighborhood returns the
                                 // direct neighborhood of n (including n)
            H = hkc(N)           // hkc returns the highest k-core of N
            kmax = the maximal k value corresponding to H
            NH = the number of nodes in H
            den(H) = the density of H
            score(H) = NH ∗ den(H) ∗ kmax
            for all node m in H do
                score(m) = max{score(H), score(m)}  // compute the score of node
            end for
        end for
    end procedure

    Pseudocode 1: Procedure Scoring.

    (3) Expand the core. As the node score indicates the local connectivity in the neighborhood of a node, it can be used as a criterion during core expansion. When expanding a core, in order to guarantee the local density, nodes with too low scores cannot be included in the core; on the other hand, to avoid excessively expanding a small core to absorb a denser and larger cluster, nodes with excessively high scores cannot be included either. Furthermore, if we use the node score thresholds as the only criterion to decide whether a node should be included in the core, connected nodes with similar scores will be detected as one cluster even when they do not actually make up a densely connected subgraph, such as those with a rope-like shape. So we adopt cohesion as another criterion. The process of expanding a core c is as follows (see the pseudocode in line 15–26 in Pseudocode 2): for any unexpanded node n in the core, look among its neighbors for nodes that satisfy the following conditions: (1) the score of the node is greater than or equal to score(n) ∗ T1 and less than or equal to score(n) ∗ T2; (2) the cohesion between the node and the core is greater than or equal to the threshold T3. Here T1 and T2 are the lower and upper bounds of the node score, respectively; T1 is a real number between 0 and 1, and T2 is an integer greater than 1; T3 is the cohesion threshold, a real number ranging from 0 to 1. Add the found nodes to the core and continue to expand the core until all nodes in the core (including the newly added nodes) have been expanded.

    Input:
        PPI network (an undirected simple graph): G = (V, E)
        Node score thresholds: T1 and T2
        Cohesion threshold: T3
    Output:
        The predicted protein clusters: Clusters

    Call Scoring
    Call ClusterFinding
    Call Filtering

    // step 2: cluster finding
    Procedure ClusterFinding
        S = Sort(V)  // sort all nodes in descending order of node score; for nodes
                     // with the same score, put the one with higher degree ahead
        n = the first node in S
        Clusters = an empty list of clusters
        while n is not the last node in S do
            if n is not seen then
                C = hkc(neighborhood(n))
                for all node q in C do
                    if q is seen or score(q) < score(n) ∗ T1 then
                        remove q from C
                    end if
                end for
                mark all nodes in C seen
                for all node m in C do
                    if m is expanded then continue
                    for all i in neighbors of m do
                        co = the cohesion between i and C
                        if score(i) >= score(m) ∗ T1 and
                           score(i) <= score(m) ∗ T2 and co >= T3 then
                            add i to C
                        end if
                    end for
                    mark m expanded
                end for
                add C to Clusters
            end if
            n = the next node of n in S
        end while
    end procedure

    Pseudocode 2: Procedure Cluster Finding.

    Figure 2: The illustration of a typical cluster with a densely connected core and a less densely connected noncore (the seed lies inside the core, and the core is surrounded by the noncore).

    After finding a cluster using the above method, choose another seed and repeat the above cluster-finding process until no remaining node qualifies as a seed. In this way we obtain all clusters in the PPI network.

    It is worth noting that while expanding a core, we never consider whether a node has been seen or not. As a result, the noncore part of a cluster can include nodes that have been seen as part of the core of another cluster; that is to say, the different clusters detected by HKC can overlap between cores and noncores. In this way, we can find overlapping clusters, which better agrees with the fact that different protein complexes overlap.
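The core expansion step can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: node scores are taken as a precomputed dict, the graph is a dict of neighbor sets, and the function name `expand_core` is hypothetical. As in the description, the "seen" status of a candidate node is deliberately not checked, which is what allows clusters to overlap.

```python
def expand_core(adj, score, core, t1, t2, t3):
    """Grow `core` by adding neighbors that pass the score and cohesion tests.

    t1, t2 bound the candidate's score relative to the expanding node's score;
    t3 is the minimal cohesion between the candidate and the current cluster.
    """
    members = set(core)
    queue = list(core)          # nodes still to be expanded, in order
    position = 0
    while position < len(queue):
        m = queue[position]
        position += 1
        for i in adj[m]:
            if i in members:
                continue
            # Cohesion co(i, C) = edges between i and C, divided by |C|.
            co = len(adj[i] & members) / len(members)
            if t1 * score[m] <= score[i] <= t2 * score[m] and co >= t3:
                members.add(i)
                queue.append(i)  # newly added nodes are expanded later too
    return members
```

Because the cluster grows during expansion, the cohesion of a candidate is always measured against the current cluster, so a node rejected early may in principle be accepted later when reached from another member.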


    Input:
        PPI network (an undirected simple graph): G = (V, E)
        Node score thresholds: T1 and T2
        Cohesion threshold: T3
    Output:
        The predicted protein clusters: Clusters

    Call Scoring
    Call ClusterFinding
    Call Filtering

    // step 3: filtering
    Procedure Filtering
        for all cluster c in Clusters do
            if the size of c is less than 3 then
                remove c from Clusters
            end if
        end for
        for all cluster c in Clusters do
            den(c) = the density of c
            s = size of c
            score(c) = den(c) ∗ s  // compute the score of cluster
        end for
        for all cluster c1, c2 in Clusters do
            o = the overlap ratio of c1 and c2
            if o > 0.95 then
                if score(c1) > score(c2) then delete c2 from Clusters
                else delete c1 from Clusters
                end if
            end if
        end for
        Clusters = sort Clusters in descending order of cluster score
    end procedure

    Pseudocode 3: Procedure Filtering.

    2.3.3. Filtering. The clusters found in step two include many clusters of size one or two, and these small clusters are insignificant, since they can be obtained by randomly selecting nodes in a PPI network. Thus we filter out the clusters that contain fewer than three nodes. We then score all clusters using the product of cluster density and cluster size, so that larger and denser clusters get higher scores (see the pseudocode in line 7–11 in Pseudocode 3). As the algorithm allows overlaps between cores and noncores, the results of step two may contain highly similar clusters, which must be filtered in postprocessing. We use the overlap ratio (OR, see Section 3.1 for more details) to measure the similarity between clusters: we compare every pair of clusters, and when their overlap ratio is higher than 0.95, we delete the one with the lower score (see the pseudocode in line 12–19 in Pseudocode 3). Finally, we rank all remaining clusters in descending order of cluster score.
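The filtering step can be sketched as below. Note that this sketch swaps the pairwise deletion of the pseudocode for a greedy pass over clusters sorted by score: a cluster is dropped when its overlap ratio with an already kept, higher-scoring cluster exceeds the limit. The two approaches are intended to agree in typical cases, but tie handling may differ, and all names (`overlap_ratio`, `cluster_score`, `filter_clusters`, `min_size`, `max_overlap`) are illustrative assumptions.

```python
def overlap_ratio(a, b):
    """OR = 2 * |a ∩ b| / (|a| + |b|) for two non-empty node sets."""
    return 2 * len(a & b) / (len(a) + len(b))


def cluster_score(adj, cluster):
    """Cluster score = density of the induced subgraph times cluster size."""
    n = len(cluster)
    if n < 2:
        return 0.0
    edges = sum(len(adj[u] & cluster) for u in cluster) / 2
    return edges / (n * (n - 1) / 2) * n


def filter_clusters(adj, clusters, min_size=3, max_overlap=0.95):
    """Drop small clusters and near-duplicates, then rank by cluster score."""
    scored = [(cluster_score(adj, set(c)), set(c))
              for c in clusters if len(set(c)) >= min_size]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    kept = []
    for s, c in scored:
        # Keep c only if it is not nearly identical to a better cluster.
        if all(overlap_ratio(c, k) <= max_overlap for _, k in kept):
            kept.append((s, c))
    return [c for _, c in kept]
```

The returned list is already ordered from the highest to the lowest cluster score, which corresponds to the final ranking step of the procedure.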

    2.3.4. Pseudocode

    2.4. Implementation. The algorithm has been implemented in Java, and we plan to convert it into a Cytoscape plug-in. The source code of the algorithm is currently available free of charge for noncommercial purposes upon request. All network maps were drawn with Cytoscape [21].

    3. Results and Discussions

    3.1. Evaluation of the Algorithm. To evaluate the performance of our algorithm, we compare the predicted clusters with the protein complexes in two different benchmarks: the complexcat and Gavin benchmarks. Each of the predicted clusters is compared with the benchmark complexes. The similarity between a predicted cluster and a benchmark complex is measured by the overlap ratio (OR), which is defined as follows:

    $$ \mathrm{OR} = \frac{2 \times O}{C_1 + C_2}, \tag{6} $$

    where O is the number of proteins shared by a predicted cluster and a benchmark complex, C1 is the number of proteins in the predicted cluster, and C2 is the number of proteins in the benchmark complex. OR ranges between 0 and 1. OR = 0 means the predicted cluster has no proteins in common with the benchmark complex; OR = 1 means it matches the benchmark complex perfectly. The higher OR is, the more biologically meaningful the detected cluster is. A detected cluster is deemed to match a benchmark complex only when their overlap ratio is above a given threshold. We call a cluster an effective cluster if at least one benchmark complex matches it. In the same way, a matched complex is a benchmark complex that has at least one detected cluster matching it. A rational OR threshold should ensure that the detected cluster shares a large proportion of proteins with the matching benchmark complex, while not being too stringent. In this paper, we adopt 0.4 as the OR threshold.

    In [6], the overlap score, defined as O^2/(C1 ∗ C2), is used to determine how effectively a predicted cluster matches a known complex in the benchmark set, and a predicted cluster is assumed to more or less match a known complex when its overlap score is above 0.2. Here, we do not adopt this scoring scheme because it is biased; that is, it gives a relatively high score when a small predicted cluster matches a large known complex or a large predicted cluster matches a small complex. For example, when a predicted cluster of size 2 shares 2 proteins with a known complex of size 10, its overlap score equals 0.2, and it would thus be considered as matching the known complex; actually, it is not appropriate to deem such a small cluster as matching a known complex much larger than it. In contrast, our definition of overlap ratio is more balanced and only gives a high score when the matching cluster and complex have similar sizes. For the above example, the overlap ratio is only 0.33, less than the threshold 0.4, and the pair would not be deemed a match.
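The comparison of the two measures can be checked numerically. The sketch below simply evaluates both formulas on sizes rather than on protein sets; the function names are illustrative.

```python
def overlap_score(shared, cluster_size, complex_size):
    """Overlap score of [6]: O^2 / (C1 * C2)."""
    return shared ** 2 / (cluster_size * complex_size)


def overlap_ratio(shared, cluster_size, complex_size):
    """Overlap ratio of Eq. (6): 2 * O / (C1 + C2)."""
    return 2 * shared / (cluster_size + complex_size)


# The example from the text: a cluster of size 2 fully contained
# in a known complex of size 10.
example_score = overlap_score(2, 2, 10)   # 4 / 20
example_ratio = overlap_ratio(2, 2, 10)   # 4 / 12
```

Here `example_score` is 4/20 = 0.2 while `example_ratio` is 4/12, roughly 0.33, which falls below the 0.4 threshold used in this paper.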

    To compare the performance of different algorithms, we define three criteria: precision, recall, and F-measure, given by the following formulas:

    $$ \text{precision} = \frac{EC}{AC}, \qquad \text{recall} = \frac{MC}{BC}, \qquad \text{F-measure} = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}, \tag{7} $$

    where EC is the number of effective clusters found by the algorithm, AC is the number of all clusters predicted by the algorithm, MC is the number of matched complexes in the benchmark set, and BC is the total number of benchmark complexes. Note that, depending on the overlap ratio threshold, EC may not equal MC, since one predicted cluster may match several benchmark complexes as long as their overlap ratios are higher than the given threshold, and in the same way one benchmark complex may correspond to several predicted clusters. Precision describes the accuracy of the algorithm's result; recall denotes the percentage of benchmark complexes that are recovered by the algorithm. F-measure, which is the harmonic mean of precision and recall, balances precision and recall and thus can be used to measure the overall performance of algorithms.
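The three criteria can be computed as in the following sketch, which assumes clusters and complexes are given as sets of protein identifiers and treats "above the threshold" as a strict inequality. The function name `evaluate` and the convention of returning 0 when a denominator is 0 are assumptions made for illustration.

```python
def evaluate(predicted, benchmark, threshold=0.4):
    """Return (precision, recall, f_measure) following Eq. (7)."""
    predicted = [set(p) for p in predicted]
    benchmark = [set(b) for b in benchmark]

    def matches(p, b):
        # Overlap ratio of Eq. (6), compared strictly against the threshold.
        return 2 * len(p & b) / (len(p) + len(b)) > threshold

    # EC: predicted clusters matching at least one benchmark complex.
    ec = sum(any(matches(p, b) for b in benchmark) for p in predicted)
    # MC: benchmark complexes matched by at least one predicted cluster.
    mc = sum(any(matches(p, b) for p in predicted) for b in benchmark)

    precision = ec / len(predicted) if predicted else 0.0
    recall = mc / len(benchmark) if benchmark else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With three predicted clusters of which two are effective, and two benchmark complexes that are both matched, precision is 2/3, recall is 1, and the F-measure is 0.8.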

    3.2. Experiments and Comparison. We use the MIPS and SGD-MC data sets, respectively, as the input PPI network and run HKC with 120 parameter combinations. T1 ranges from 0.3 to 0.7 with a step of 0.1, T2 ranges from 5 to 20 with a step of 5, and T3 ranges from 0.4 to 0.9 with a step of 0.1. We evaluate the results using the two benchmarks, the complexcat and Gavin benchmarks, and then choose the optimized parameters that give the highest F-measure. The best result and the corresponding optimized parameters for HKC are shown in Table 1.
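The parameter sweep can be sketched as a plain grid search. The grid below reproduces the ranges stated in the text (5 × 4 × 6 = 120 combinations); the function `best_parameters` and its `f_measure_of` argument are hypothetical placeholders for running HKC and evaluating it against a benchmark, which are not shown here.

```python
from itertools import product

# Parameter ranges as stated in the text.
T1_VALUES = [round(0.3 + 0.1 * i, 1) for i in range(5)]   # 0.3 .. 0.7
T2_VALUES = list(range(5, 25, 5))                          # 5 .. 20
T3_VALUES = [round(0.4 + 0.1 * i, 1) for i in range(6)]   # 0.4 .. 0.9

PARAMETER_GRID = list(product(T1_VALUES, T2_VALUES, T3_VALUES))


def best_parameters(f_measure_of):
    """Return the (T1, T2, T3) combination with the highest F-measure.

    f_measure_of is a caller-supplied function (hypothetical here) that runs
    the algorithm with the given thresholds and returns its F-measure.
    """
    return max(PARAMETER_GRID, key=lambda params: f_measure_of(*params))
```

Rounding the floating-point steps keeps the T1 and T3 values at exactly one decimal place, so the grid contains the intended values such as 0.7 rather than 0.7000000000000001.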

    To show the influence of different parameters on thealgorithm performance, we draw the plot of average F-measure versus T1, T2, and T3, respectively, as shown inFigure 3. Note that here the F-measure in y-axis is theaverage value of all F-measures with one parameter specifiedamong the 120 groups of experiment results evaluated bythe complexcat benchmark. From Figure 3, we can see thatamong the three parameters T3 has the greatest influenceon average F-measure, and for MIPS data set, the average F-measure gets the maximum value when T3 = 0.5, while forSGD-MC data set, the average F-measure gets the maximumvalue when T3 = 0.7. This is understandable, as the SGD-MC network (which contains 4,448 proteins and 29,068interactions) is much denser than the MIPS network (whichcontains 4,554 proteins and 12,526 interactions), and duringthe core expansion process the nodes in SGD-MC networkwould have higher cohesion with the core than the nodesin MIPS network. Therefore, for SGD-MC network whenexpanding the core, the cohesion threshold T3 should behigher than that for MIPS network. Furthermore, the figuresshow that a good range for T1 is in the middle, between 0.4and 0.6, the best value for T2 is 10, and for T3 the best rangeis between 0.5 and 0.8.

    To show the performance of HKC, we compare it with MCODE [6], as shown in Table 1. We run the MCODE plugin in Cytoscape with 840 parameter combinations (the same as those used in [6]) on the MIPS and SGD-MC data sets, respectively, and then use the two benchmarks to evaluate the results. The optimized parameters (see Table 1) are chosen based on the highest F-measure. The result of HKC is the best one among the 120 parameter combinations, and the corresponding optimized parameters are shown in Table 1. From this table, we can see that for the MIPS data set, the recall of HKC is 0.429 on the complexcat benchmark, considerably higher than that of MCODE (0.194); for the SGD-MC data set, HKC recalls as many as 58% of the protein complexes in the complexcat benchmark, notably more than MCODE (around 22%). The experimental results show that, whichever benchmark is adopted, for both the MIPS and SGD-MC data sets the recall and F-measure of HKC are remarkably higher than those of MCODE, and the overall performance is substantially improved.

    As shown in Figure 4, whatever the OR threshold is, HKC extracts many more effective clusters than MCODE in the MIPS data set, and the number of complexes matched by HKC is also much higher than that matched by MCODE.

    To show the overall performance improvement of our algorithm, we also plot precision versus recall for all results with different parameters in Figure 5. As can be seen from the figure, for all four cases, the data points produced by HKC are located in the upper right portion of the plot, corresponding to high values of F-measure, while most of the data points produced by MCODE are located in the lower left part of the plot. The figure illustrates that both precision and recall of


    Figure 3: The influence of different parameters on algorithm performance: average F-measure versus (a) T1, (b) T2, and (c) T3. Note that the F-measure on the y-axis is the average value of all F-measures with one parameter specified among the 120 groups of experiment results evaluated by the complexcat benchmark; triangles mark the results on the MIPS network, and circles mark the results on the SGD-MC network.

    Table 1: Comparison with MCODE. P, R, and F stand for precision, recall, and F-measure, respectively, and their definitions are given in Section 3.1. The MIPS data set contains 4,554 proteins and 12,526 interactions, and the SGD-MC data set contains 4,448 proteins and 29,068 interactions. AC is the number of all clusters predicted by the algorithm; EC is the number of effective clusters (with at least one matching complex above overlap ratio 0.4) found by the algorithm; MC is the number of matched complexes in the benchmark set. The sizes of the complexcat benchmark and the Gavin benchmark are 217 and 204, respectively. For HKC the optimized parameters are T1, T2, and T3, respectively, and for MCODE the optimized parameters are NodeScoreCutoff, fluff (T for true, F for false), and haircut (T for true, F for false); other unspecified parameters adopt the default values.

    Algorithm  Data set  Benchmark   P      R      F      AC   EC   MC   Optimized parameters
    MCODE      MIPS      complexcat  0.455  0.194  0.271  66   30   42   0.05, F, F
    HKC        MIPS      complexcat  0.380  0.429  0.403  237  90   93   0.6, 10, 0.5
    MCODE      SGD-MC    complexcat  0.213  0.221  0.217  197  42   48   0.05, F, T
    HKC        SGD-MC    complexcat  0.275  0.580  0.373  498  137  126  0.6, 10, 0.8
    MCODE      MIPS      Gavin       0.303  0.098  0.148  66   20   20   0.05, F, T
    HKC        MIPS      Gavin       0.237  0.235  0.236  245  58   48   0.6, 20, 0.5
    MCODE      SGD-MC    Gavin       0.283  0.152  0.198  106  30   31   0, F, T
    HKC        SGD-MC    Gavin       0.271  0.402  0.324  487  132  82   0.5, 5, 0.5


    Figure 4: (a) The number of matched complexes and (b) the number of effective clusters found by HKC and MCODE with respect to different OR thresholds. The results correspond to the MIPS data set and are evaluated on the complexcat benchmark. Triangles mark the results of HKC and circles mark the results of MCODE.

    HKC results for most parameter combinations are higher than those of MCODE, showing the overall improvement in algorithm performance. From this figure, we can also see that the data points produced by HKC are much more concentrated than those of MCODE, indicating that our algorithm depends less heavily on parameter selection.

    3.3. Discussions. Among the 237 clusters found by HKC in the MIPS data set, 8 clusters perfectly match known protein complexes in the complexcat benchmark. Figure 6(a) gives an example of one perfectly matched cluster: cluster 23 (consisting of 11 proteins and 54 interactions) perfectly matches the TRAPP (transport protein particle) complex (catalog 260.60 in the complexcat benchmark), which plays an essential role in vesicular transport from the endoplasmic reticulum to the Golgi.

    Figure 6(b) shows an example of a containment match. Cluster 12 (consisting of 14 proteins and 92 interactions) is totally contained in a known complex of size 16, the SAGA complex (catalog 510.190.10.20.10 in the complexcat benchmark), and their overlap ratio is 0.93. The two proteins YCL010c and YGL066w that are not recovered by cluster 12 each have only one interaction with the cluster and do not exhibit good graph theoretic properties. In fact, based on currently available information, we cannot assert that YCL010c belongs to the SAGA complex; according to [22], it is only a probable subunit of the SAGA complex.

    Figure 6(c) gives an example of a well-matched cluster: cluster 103 matches the complex of cytoplasmic translation initiation factor 3 (eIF3, catalog 500.10.40 in the complexcat benchmark). Each of them contains 7 proteins, they share 6 proteins, and their overlap ratio is 0.857. As shown in the figure, protein YNL062c is contained in the benchmark complex but is not included in cluster 103 predicted by HKC. Furthermore, it has only one interaction with cluster 103 and does not show the topological properties expected of a cluster member. We searched for it in the Gene Ontology (GO) database [23] and found that, according to the most recent GO annotation (release date 2011-05-14), YNL062c is not contained in the eIF3 complex (GO:0005852), but is a subunit of tRNA (1-methyladenosine) methyltransferase, required together with Gcd14p for the modification of the adenine at position 58 in tRNAs, especially tRNAi-Met. This indicates that the complexcat benchmark we use here may contain errors, because it was last modified in 2006 and many new protein complexes have been identified through experiments since then. In a way, it is possible to correct errors in the benchmark by carefully examining the differences between predicted clusters and their corresponding benchmark complexes with high overlap ratio.
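    The overlap ratio is defined in Section 3.1 and is not restated here. As a hedged illustration only, the sketch below assumes OR = |A ∩ B| / sqrt(|A| · |B|), an assumption chosen because it reproduces the 0.857 reported for eIF3 (6 shared proteins, 7 in each set) and gives about 0.935 for the SAGA case (14 shared, sizes 14 and 16), close to the reported 0.93. The member names are placeholders; only the set sizes mirror the examples in the text.

```python
from math import sqrt


def overlap_ratio(cluster, benchmark_complex):
    """Assumed overlap ratio: shared members over the geometric mean of the sizes."""
    a, b = set(cluster), set(benchmark_complex)
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


# Placeholder member names; only the sizes follow the examples in the text.
eif3_complex = {f"p{i}" for i in range(7)}         # 7 proteins
cluster_103 = {f"p{i}" for i in range(6)} | {"x"}  # 7 proteins, 6 shared
saga_complex = {f"s{i}" for i in range(16)}        # 16 proteins
cluster_12 = {f"s{i}" for i in range(14)}          # 14 proteins, all shared

print(round(overlap_ratio(cluster_103, eif3_complex), 3))
print(round(overlap_ratio(cluster_12, saga_complex), 3))
```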

    Figure 6(d) shows a novel cluster (ranked 61) detected by HKC, which involves 5 proteins and 10 interactions. Cluster 61 does not match any known protein complex in the complexcat benchmark, but is highly homogeneous in the cellular component ontology and the biological process ontology. Search results from the Gene Ontology database show that proteins YGL153w, YLR191w, and YNL214w form the docking complex that facilitates the import of peroxisomal matrix proteins, and YGL153w is a central component of


    Figure 5: Precision versus recall plots (precision on the x-axis, recall on the y-axis) of all MCODE and HKC results with different parameters on various data sets. (a) Input data set: MIPS, benchmark: complexcat. (b) Input data set: MIPS, benchmark: Gavin. (c) Input data set: SGD-MC, benchmark: complexcat. (d) Input data set: SGD-MC, benchmark: Gavin. For all four cases, the data points produced by HKC are located in the upper right portion of the plot, corresponding to high values of F-measure, while most of the data points produced by MCODE are located in the lower left part of the plot. Furthermore, the data points produced by HKC are much more concentrated than those of MCODE.

    the peroxisomal protein import machinery. The other two proteins in this cluster, YDR244w and YDR142c, are the PTS1 signal recognition factor and the PTS2 signal recognition factor, respectively, and they participate in the same biological process as the docking complex: protein docking during peroxisome matrix protein import (GO:0016560).

    The above illustrative examples show that HKC can not only effectively detect protein complexes in genome-scale PPI networks, but also, through the comparison of predicted clusters with their matching benchmark complexes, help to correct errors in the benchmark. Furthermore, HKC can discover novel protein complexes that can be used as candidates for experimental verification, which helps to reduce the time and cost of experiments. Among the 237 clusters produced by our algorithm on the MIPS data set, 147 clusters do not match known complexes in the complexcat benchmark, and we give all 49 clusters with score ≥ 4 and size ≥ 5 as novel predictions in Table 2, which could serve as a starting point for future experimental validation.
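    The selection of novel predictions described above amounts to a simple filter over the unmatched clusters. The sketch below is illustrative only: the Cluster class and novel_predictions function are hypothetical names, cluster 61 uses the proteins and score given in the text and Table 2, and the remaining example entries (including their scores) are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    cluster_id: int
    score: float
    proteins: frozenset
    matched: bool  # True if the cluster matches any benchmark complex


def novel_predictions(clusters, min_score=4.0, min_size=5):
    """Keep unmatched clusters with score >= min_score and size >= min_size."""
    kept = (
        c for c in clusters
        if not c.matched and c.score >= min_score and len(c.proteins) >= min_size
    )
    return sorted(kept, key=lambda c: c.cluster_id)


examples = [
    # Cluster 61 as described in the text: unmatched, 5 proteins, score 4.5.
    Cluster(61, 4.5, frozenset({"YGL153w", "YLR191w", "YNL214w", "YDR244w", "YDR142c"}), False),
    # Invented entries: a matched cluster, a too-small cluster, a low-score cluster.
    Cluster(900, 9.0, frozenset({"a", "b", "c", "d", "e", "f"}), True),
    Cluster(901, 5.0, frozenset({"a", "b", "c"}), False),
    Cluster(902, 3.2, frozenset({"a", "b", "c", "d", "e"}), False),
]
print([c.cluster_id for c in novel_predictions(examples)])
```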

    4. Conclusions and Future Work

    A genome-scale PPI network is usually very large, consisting of thousands of proteins and tens of thousands of interactions; for example, the SGD-MC data set we use as the input PPI network in this paper contains 4,448 proteins and 29,068 interactions. It is a challenging task to extract protein complexes from such a large and complicated network. To solve this problem, many computational methods have been proposed, including graph theoretic clustering algorithms. In this paper, we presented a new topology-based algorithm, HKC, which mainly uses the concepts of highest k-core and cohesion to predict protein complexes by identifying overlapping clusters. The experiments on two data sets and two benchmarks showed that HKC can


    Figure 6: Examples of clusters predicted by HKC in the MIPS data set. The nodes encircled by the red dotted line are known complexes in the complexcat benchmark, the nodes contained within the blue circle are clusters predicted by HKC, and the yellow node in each cluster denotes the seed of the cluster. (a) A cluster of size 11 perfectly matches the TRAPP complex. (b) A cluster of size 14 shares 14 proteins with the SAGA complex (size 16), and their overlap ratio is 0.93. The two proteins YCL010c and YGL066w that are not contained in the predicted cluster are isolated nodes with only one edge connecting to the cluster. (c) An example of a well-matched cluster, involving 7 proteins, 6 of which are shared with the complex of cytoplasmic translation initiation factor 3 (eIF3). (d) A novel cluster detected by HKC, which does not match any known protein complex in the complexcat benchmark; the proteins in the black circle form the docking complex that facilitates the import of peroxisomal matrix proteins according to GO annotation.

    effectively extract protein complexes from genome-scale PPI networks and exhibits better performance than some other methods. In addition, HKC is general and can be applied to any type of biological interaction network.

    There is a huge amount of work to be done in PPI network analysis, and protein complex prediction in particular leaves much research for the future. One of the problems with current PPI networks is that they consist of interactions that do not necessarily happen at the same time and place. Instead, the interactions in PPI networks may be unstable, transient, or conditional, and may also happen in different subcellular locations. However, by definition, a protein complex is a group of proteins that interact with each other at the same time and place. As a result, to increase


    Table 2: All novel predictions by HKC on the MIPS data set. The columns give the original ID of the cluster (ID), cluster score (S), number of nodes (N), number of edges (E), and the protein names in each cluster.

    ID  S     N   E    Protein names
    1   21.5  38  406  YLR200w, YNL153c, YDR318w, YOR349w, YHR191c, YEL003w, YJL013c, YLR085c, YPL269w, YHR129c, YOR026w, YMR055c, YPL155c, YCL029c, YMR048w, YML094w, YJL030w, YGR188c, YCL016c, YPR135w, YML124c, YPR141c, YER016w, YOR058c, YER007w, YDR150w, YGR078c, YJR053w, YEL061c, YMR294w, YMR299c, YPL008w, YMR078c, YGL086w, YPL174c, YFL037w, YOL012c, YPL241c
    2   20.1  33  328  YPL008w, YMR078c, YGR188c, YGL086w, YCL016c, YPL241c, YMR294w, YER007w, YOR058c, YER016w, YPR141c, YPR135w, YJL030w, YPL155c, YOR026w, YHR129c, YPL269w, YJL013c, YDR254w, YDR318w, YFL037w, YPL174c, YEL061c, YGR078c, YLR200w, YML124c, YML094w, YCL029c, YEL003w, YOR349w, YNL153c, YMR138w, YOR265w
    3   19.7  27  260  YGL086w, YMR078c, YCL016c, YGR188c, YMR048w, YOR026w, YJL013c, YHR191c, YOR349w, YDR318w, YPL008w, YEL061c, YGR078c, YLR200w, YER016w, YPR141c, YPR135w, YJL030w, YML094w, YEL003w, YNL153c, YDL003w, YDR254w, YJR135c, YPR046w, YPL018w, YLR381w
    4   18.9  28  260  YGR188c, YLR381w, YPL018w, YDR254w, YDR318w, YOL012c, YGL086w, YPL008w, YGR078c, YDL003w, YLR200w, YML094w, YMR048w, YLR085c, YEL003w, YOR349w, YNL153c, YNL298w, YMR078c, YEL061c, YER016w, YPR141c, YPR135w, YCL016c, YJL030w, YHR191c, YOR195w, YGL216w
    5   15.2  24  179  YNL271c, YNL322c, YBL061c, YBR023c, YGR229c, YBL007c, YER155c, YLR330w, YHR030c, YNL298w, YDL029w, YJR075w, YGR078c, YLR200w, YLR337c, YML094w, YDR388w, YCR009c, YBR234c, YEL003w, YJL095w, YNL153c, YJL020c, YMR109w
    6   15.2  25  185  YER155c, YEL003w, YGR078c, YML094w, YNL271c, YGR229c, YNL298w, YLR337c, YNL153c, YLR200w, YHR030c, YBL007c, YCR009c, YJL095w, YDR245w, YBL061c, YHR142w, YLR330w, YDL029w, YDR388w, YBR023c, YBR234c, YJR075w, YNL322c, YDR129c
    7   14.8  27  196  YBL007c, YEL003w, YGR078c, YLR200w, YML094w, YNL153c, YNL298w, YJL095w, YHR111w, YGR229c, YBR234c, YCR009c, YER111c, YBR023c, YDR388w, YHR030c, YDL029w, YLR330w, YHR142w, YDR424c, YNL271c, YFR019w, YBL061c, YPL031c, YBR200w, YER155c, YOR326w
    8   14.6  25  178  YLR337c, YNL322c, YNL271c, YBL007c, YLR200w, YBL061c, YHR142w, YLR330w, YML094w, YBR023c, YNL233w, YCR009c, YJR075w, YNL153c, YNL298w, YDL029w, YHR030c, YJL183w, YDR388w, YBR234c, YGR229c, YLR342w, YJL095w, YJL099w, YJR118c
    9   14.3  21  146  YNL250w, YER016w, YPR141c, YHR191c, YGL163c, YMR078c, YMR190c, YKL113c, YOR144c, YCL061c, YER173w, YPR135w, YCL016c, YMR048w, YLR234w, YJL092w, YLR103c, YML032c, YPL194w, YNL273w, YJR043c
    10  14.3  21  147  YML032c, YER016w, YJR043c, YMR078c, YNL273w, YLR103c, YPR135w, YCL016c, YMR048w, YHR191c, YGL163c, YMR190c, YDR004w, YOR144c, YER095w, YCL061c, YDR076w, YER173w, YNL250w, YJL092w, YKL113c
    11  14.2  21  146  YNL298w, YHR030c, YJL095w, YMR109w, YBR023c, YDL029w, YBR234c, YGR078c, YLR200w, YBL061c, YML094w, YDR388w, YJL020c, YCR009c, YGR229c, YEL003w, YNL153c, YLR337c, YEL031w, YBL007c, YJR075w
    13  13.9  23  155  YLR200w, YNL271c, YNL153c, YBR234c, YNL322c, YHR142w, YER111c, YBL007c, YCR009c, YGR229c, YJL095w, YNL298w, YLR330w, YBR023c, YJR075w, YBL061c, YDL029w, YHR030c, YJR118c, YDR388w, YBL047c, YNL233w, YLR342w
    14  13.7  23  153  YHR191c, YML032c, YHR031c, YMR078c, YCL061c, YCL016c, YMR048w, YNL273w, YNL250w, YPR135w, YMR190c, YKL113c, YLR234w, YJR043c, YOR144c, YLR103c, YJL092w, YOL006c, YPL024w, YBR098w, YDR386w, YLR235c, YDR363w
    15  13.6  23  154  YER173w, YKL113c, YML032c, YNL273w, YOR144c, YMR048w, YJR043c, YMR078c, YCL061c, YLR103c, YPR135w, YCL016c, YHR191c, YMR190c, YPL024w, YDR052c, YNL250w, YHR031c, YLR235c, YLR234w, YJL092w, YDL017w, YHR154w
    16  13.3  21  136  YPL194w, YML032c, YNL250w, YJL092w, YLR103c, YMR078c, YCL016c, YHR191c, YGL163c, YJR043c, YMR190c, YKL113c, YNL273w, YCL061c, YER016w, YER173w, YPR135w, YMR048w, YOR144c, YDL013w, YER116c
    17  12.9  22  138  YER016w, YGL163c, YHR191c, YNL273w, YML032c, YLR103c, YMR078c, YKL113c, YOR144c, YCL061c, YPR135w, YNL250w, YCL016c, YHR031c, YMR048w, YLR234w, YJL092w, YJR043c, YMR190c, YDR363w, YBR228w, YLR135w
    20  10.8  13  66   YER146w, YDR378c, YNL147w, YLR438c-a, YER112w, YJL124c, YCR077c, YBL026w, YJR022w, YOL149w, YGL173c, YNL118c, YEL015w
    21  10.8  11  54   YBL026w, YER146w, YDR378c, YJR022w, YNL147w, YLR438c-a, YER112w, YOL149w, YJL124c, YCR077c, YGL173c
    24  10    17  82   YBL061c, YDR388w, YBR023c, YJL095w, YNL298w, YNL271c, YLR330w, YHR030c, YER111c, YGR229c, YER155c, YPL031c, YMR307w, YOR008c, YLR342w, YLR371w, YLR332w
    25  9.9   17  81   YJL095w, YDL029w, YCR009c, YNL298w, YBL061c, YDR388w, YER111c, YGR229c, YNL322c, YLR330w, YHR030c, YBR023c, YMR307w, YGL027c, YML115c, YGL200c, YPR159w
    26  9.5   17  78   YNL298w, YDR388w, YER111c, YNL271c, YLR330w, YHR030c, YBL061c, YBR023c, YNL322c, YMR307w, YLR039c, YLR262c, YGR229c, YLR342w, YJR073c, YJL183w, YKL190w
    27  8.6   14  56   YLR039c, YPL051w, YLR262c, YJL154c, YDR126w, YGL005c, YOR132w, YOR069w, YNL051w, YHL031c, YNL041c, YOR070c, YML071c, YBR164c
    28  8.5   14  55   YOL012c, YLR039c, YLR262c, YLR085c, YLR418c, YAL011w, YAL013w, YMR263w, YJL168c, YML041c, YGL244w, YPL181w, YDR334w, YOR123c
    29  7.7   13  46   YNL051w, YHL031c, YPL051w, YOR070c, YNL041c, YLR262c, YML071c, YDR126w, YKL190w, YLR039c, YBR164c, YKR001c, YMR004w
    30  7.6   11  39   YBR023c, YKL190w, YNL322c, YHR030c, YGR229c, YMR307w, YJR073c, YLR262c, YDR162c, YLR039c, YDL006w
    31  7.5   13  45   YNL051w, YML071c, YPL051w, YOR070c, YNL041c, YLR262c, YOL018c, YDR126w, YHL031c, YBR164c, YLR039c, YNL238w, YLL040c
    39  6     17  51   YBR200w, YNL298w, YOR127w, YHR061c, YHL007c, YER149c, YGR152c, YLL021w, YPL115c, YLR319c, YDR309c, YBL085w, YAL041w, YER114c, YOR188w, YMR273c, YLR229c
    40  6     8   25   YCR088w, YOR284w, YGR268c, YDR388w, YBL007c, YNL094w, YNL243w, YHR016c
    42  6     7   18   YER155c, YAL040c, YDL047w, YKR028w, YJL098w, YFR040w, YKR072c
    43  5.8   10  27   YMR263w, YOR123c, YLR085c, YGL244w, YML041c, YJL168c, YAL013w, YLR418c, YBL008w, YOR038c
    48  5.5   8   26   YIL144w, YDR201w, YFL008w, YOL069w, YFR031c, YEL043w, YDL074c, YJL074c
    50  5.3   17  55   YLR347c, YNL189w, YML064c, YLR082c, YLR245c, YDR321w, YPL070w, YBR176w, YNL044w, YNL331c, YPL124w, YER023w, YLR423c, YJR056c, YEL066w, YJL218w, YNR012w
    51  5.3   9   26   YOR284w, YJL199c, YNL189w, YML064c, YBR176w, YLR245c, YPL070w, YLR291c, YHL018w
    55  4.9   8   18   YHR066w, YPL093w, YER006w, YPL211w, YMR049c, YMR290c, YNL002c, YGR103w
    56  4.9   8   17   YGL244w, YLR039c, YLR262c, YOR216c, YLR085c, YLR418c, YIR005w, YGL174w
    59  4.5   17  49   YML064c, YNL189w, YLR347c, YGR010w, YKL067w, YFR047c, YGL175c, YOR020c, YLR328w, YLR335w, YGL040c, YGL037c, YNL333w, YFL059w, YPL111w, YGR267c, YLR377c
    60  4.5   5   11   YFL059w, YNL333w, YMR322c, YMR095c, YMR096w
    61  4.5   5   10   YNL214w, YDR244w, YGL153w, YDR142c, YLR191w
    65  4.4   6   12   YLR310c, YLL016w, YAL024c, YNL098c, YAR019c, YOR101w
    70  4.3   7   16   YML064c, YPL124w, YLR423c, YPL070w, YDR148c, YER086w, YDL239c
    71  4.3   7   14   YDL226c, YGR172c, YLR324w, YGL198w, YDR425w, YGL161c, YPL095c
    73  4.3   7   14   YLR347c, YMR047c, YKL068w, YBR216c, YML007w, YER107c, YDL207w
    74  4.3   7   14   YLR039c, YPL234c, YKL190w, YLR262c, YER081w, YHR060w, YGR020c
    77  4     5   11   YER093c, YJL058c, YJR066w, YKL203c, YNL006w
    78  4     5   12   YKL203c, YJR066w, YNL006w, YHR186c, YPL180w
    82  4     5   8    YLR039c, YLR262c, YCR020c-a, YPR051w, YEL053c
    90  4     5   8    YAR019c, YGR092w, YDL047w, YPR111w, YIL106w
    93  4     5   8    YLR039c, YOR070c, YBR036c, YMR272c, YPL057c
    94  4     5   8    YMR197c, YHL031c, YLR026c, YKL006c-a, YKL196c


    the precision and recall of complex prediction algorithms, further information about the time, location, or conditions of interactions should be taken into consideration.

    Acknowledgments

    This work was supported in part by grants from the National Natural Science Foundation of China (no. 60835005) and the scientific research project of National University of Defense Technology (jc09-03-04).

    References

    [1] Y. Ho, A. Gruhler, A. Heilbut et al., “Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry,” Nature, vol. 415, no. 6868, pp. 180–183, 2002.

    [2] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki, “A comprehensive two-hybrid analysis to explore the yeast protein interactome,” Proceedings of the National Academy of Sciences of the United States of America, vol. 98, no. 8, pp. 4569–4574, 2001.

    [3] V. Spirin and L. A. Mirny, “Protein complexes and functional modules in molecular networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 21, pp. 12123–12128, 2003.

    [4] L. H. Hartwell, J. J. Hopfield, S. Leibler, and A. W. Murray, “From molecular to modular cell biology,” Nature, vol. 402, no. 6761, pp. C47–C52, 1999.

    [5] M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, K. Kurokawa, and S. Kanaya, “Development and implementation of an algorithm for detection of protein complexes in large interaction networks,” BMC Bioinformatics, vol. 7, article 207, 2006.

    [6] G. D. Bader and C. W. V. Hogue, “An automated method for finding molecular complexes in large protein interaction networks,” BMC Bioinformatics, vol. 4, article 2, 2003.

    [7] A. D. King, N. Pržulj, and I. Jurisica, “Protein complex prediction via cost-based clustering,” Bioinformatics, vol. 20, no. 17, pp. 3013–3020, 2004.

    [8] J. Chen and B. Yuan, “Detecting functional modules in the yeast protein-protein interaction network,” Bioinformatics, vol. 22, no. 18, pp. 2283–2290, 2006.

    [9] M. Girvan and M. E. J. Newman, “Community structure in social and biological networks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 12, pp. 7821–7826, 2002.

    [10] G. Palla, I. Derényi, I. Farkas, and T. Vicsek, “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, vol. 435, no. 7043, pp. 814–818, 2005.

    [11] H. N. Chua, K. Ning, W. K. Sung, H. W. Leong, and L. Wong, “Using indirect protein-protein interactions for protein complex prediction,” Life Sciences Society: Computational Systems Bioinformatics Conference, vol. 6, pp. 97–109, 2007.

    [12] G. Cui, Y. Chen, D. S. Huang, and K. Han, “An algorithm for finding functional modules and protein complexes in protein-protein interaction networks,” Journal of Biomedicine and Biotechnology, vol. 2008, Article ID 860270, 10 pages, 2008.

    [13] H. W. Mewes, S. Dietmann, D. Frishman et al., “MIPS: analysis and annotation of genome information in 2007,” Nucleic Acids Research, vol. 36, no. 1, pp. D196–D201, 2008.

    [14] U. Güldener, M. Münsterkötter, G. Kastenmüller et al., “CYGD: the comprehensive yeast genome database,” Nucleic Acids Research, vol. 33, pp. D364–D368, 2005.

    [15] J. M. Cherry, C. Adler, C. Ball et al., “SGD: saccharomyces genome database,” Nucleic Acids Research, vol. 26, no. 1, pp. 73–79, 1998.

    [16] S. Brohée and J. van Helden, “Evaluation of clustering algorithms for protein-protein interaction networks,” BMC Bioinformatics, vol. 7, article 488, 2006.

    [17] A. C. Gavin, M. Bösche, R. Krause et al., “Functional organization of the yeast proteome by systematic analysis of protein complexes,” Nature, vol. 415, no. 6868, pp. 141–147, 2002.

    [18] N. J. Krogan, W. T. Peng, G. Cagney et al., “High-definition macromolecular composition of yeast RNA-processing complexes,” Molecular Cell, vol. 13, no. 2, pp. 225–239, 2004.

    [19] A. C. Gavin, P. Aloy, P. Grandi et al., “Proteome survey reveals modularity of the yeast cell machinery,” Nature, vol. 440, no. 7084, pp. 631–636, 2006.

    [20] J. Gagneur, R. Krause, T. Bouwmeester, and G. Casari, “Modular decomposition of protein-protein interaction networks,” Genome Biology, vol. 5, no. 8, p. R57, 2004.

    [21] S. Killcoyne, G. W. Carter, J. Smith, and J. Boyle, “Cytoscape: a community-based framework for network modeling,” Methods in Molecular Biology, vol. 563, pp. 219–239, 2009.

    [22] S. L. Sanders, J. Jennings, A. Canutescu, A. J. Link, and P. A. Weil, “Proteomics of the eukaryotic transcription machinery: identification of proteins associated with components of yeast TFIID by multidimensional mass spectrometry,” Molecular and Cellular Biology, vol. 22, no. 13, pp. 4723–4738, 2002.

    [23] M. Ashburner, C. A. Ball, J. A. Blake et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.
