Effective and Efficient Clustering Methods for Correlated Probabilistic Graphs

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 5, MAY 2014, p. 1117

Effective and Efficient Clustering Methods for Correlated Probabilistic Graphs

Yu Gu, Member, IEEE, Chunpeng Gao, Gao Cong, and Ge Yu, Member, IEEE

Abstract: Recently, probabilistic graphs have attracted significant interest from the data mining community. It has been observed that correlations may exist among adjacent edges in various probabilistic graphs. As one of the basic mining techniques, graph clustering is widely used in exploratory data analysis, such as data compression, information retrieval, and image segmentation. Graph clustering aims to divide data into clusters according to their similarities, and a number of algorithms have been proposed for clustering graphs, such as the pKwikCluster algorithm, spectral clustering, and k-path clustering. However, little research has been performed to develop efficient clustering algorithms for probabilistic graphs. In particular, it becomes more challenging to cluster probabilistic graphs efficiently when correlations are considered. In this paper, we define the problem of clustering correlated probabilistic graphs. To solve this challenging problem, we propose two algorithms, namely the PEEDR and the CPGS clustering algorithms. For each of the proposed algorithms, we develop several pruning techniques to further improve efficiency. We evaluate the effectiveness and efficiency of our algorithms and pruning methods through comprehensive experiments.

Index Terms: Clustering, correlated, probabilistic graphs, algorithm

1 INTRODUCTION

In recent years, graph mining has gained significant attention for a broad range of applications, such as social networks, protein-protein interaction networks, and road networks [1].
The data from such applications typically display inherent uncertainty and can reasonably be modeled as probabilistic graphs [2], [3], in which each edge $e_i$ is labeled with an existence probability that represents the uncertainty of the data. In a probabilistic graph, any two edges $e_i$ and $e_j$ are called conditionally independent if $p(e_i, e_j) = p(e_i)p(e_j)$, and conditionally dependent if $p(e_i, e_j) \neq p(e_i)p(e_j)$. In the standard probabilistic graph model, any two edges are conditionally independent of each other. However, some recent studies [4], [5] have pointed out that the existence probabilities of adjacent edges can be correlated. Typically, coexistence and mutual exclusion among adjacent edges are commonly observed in various graph-oriented applications. For example, in Fig. 1(a), given two edges $e_2$ and $e_3$ with a mutual exclusion constraint, only the joint probabilities $p(e_2, \bar{e}_3)$ and $p(\bar{e}_2, e_3)$ may exist, where $\bar{e}_i$ represents that $e_i$ does not exist. Obviously, $p(e_2, e_3) \neq p(e_2)p(e_3)$, and hence $e_2$ and $e_3$ are conditionally dependent on each other. Similarly, $e_1$ and $e_2$ are also conditionally dependent on each other due to a coexistence constraint. In this case, ignoring such correlations will lead to incorrect results.

Y. Gu, C. Gao, and G. Yu are with the College of Information Science and Engineering, Northeastern University, Shenyang 110816, China. E-mail: {guyu, yuge}@ise.neu.edu.cn; [email protected]. G. Cong is with Nanyang Technological University, Singapore 639798. E-mail: [email protected]. Manuscript received 17 Dec. 2012; revised 29 June 2013; accepted 14 July 2013. Date of publication 25 July 2013; date of current version 7 May 2014. Recommended for acceptance by J. Pei. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier 10.1109/TKDE.2013.123
In particular, according to statistical models in many real scenarios, the correlations among edges do not simply follow mutual exclusion or coexistence patterns, and more complicated dependencies may exist [6]. To model such correlations, a joint probability table is usually introduced to record the joint probabilities among adjacent edges (edges sharing a common vertex). We define probabilistic graphs containing correlated adjacent edges as correlated probabilistic graphs, as shown in Fig. 2(a).

As one of the basic data mining techniques, clustering is widely used in various graph analysis applications [7], such as community detection and index construction. This paper focuses on clustering correlated probabilistic graphs, which aims to partition the vertices into several disconnected clusters with high intra-cluster and low inter-cluster similarity, as illustrated in Fig. 2(b). Next, we motivate the problem of clustering correlated probabilistic graphs using several applications.

In Protein-Protein Interaction (PPI) networks, the interaction between two proteins is generally established with a probability due to the limitations of observation methods [2]. In addition, it has been verified that the interaction between proteins A and B can influence the interaction between protein A and another protein C if A, B, and C have some common features. It has also been verified that the probabilities of pairwise interactions and the correlations among edges can be derived from statistical models [6]. Clustering applied to such correlated probabilistic PPI network data is helpful for finding complexes and analyzing the structural properties of the PPI network.

1041-4347 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Consider another example in social networks. The edge probability is used to quantify the reliability of a link.

Obviously, links in social networks are correlated; i.e., a highly reliable link between nodes A and B may affect the reliability of the links between A and other nodes. Accordingly, clustering applied to social networks while considering the underlying probabilities and correlations could detect user communities more effectively.

[Fig. 1. Graph model: (a) Probabilistic graph with edge correlations. (b) Possible world graph.]

Compared with clustering probabilistic graphs without any correlation among edges, clustering correlated probabilistic graphs is subject to more constraints. For example, if the link between A and B is mutually exclusive with the link between B and C, then B and C cannot be placed in the same cluster once A and B are clustered together, irrespective of how reliable the link between B and C is.

George et al. [8] advance the state of the art by investigating the problem of clustering probabilistic graphs. Generally, a graph consisting of several disconnected clusters is viewed as a cluster graph. They propose algorithms to find the cluster graph and a measurement to evaluate the deviation of the output cluster graph from the probabilistic graph.

To cluster a correlated probabilistic graph $G$, a possible world graph $G_i$ of $G$ can be modeled as a deterministic instantiation sampled from the correlated probabilistic graph according to the joint probability distribution. The edit distance $D(G_i, Q)$ from $G_i$ to the cluster graph $Q$ is defined as the number of edges that need to be added or removed to transform $G_i$ into $Q$. By evaluating all the possible world graph instances, the expected edit distance, denoted as $D(G, Q)$, can be obtained and used as a measurement of the deviation of the cluster graph from the correlated probabilistic graph.
Hence, a smaller deviation implies a more precise result, and our objective becomes finding a cluster graph $Q$ that minimizes $D(G, Q)$. However, calculating the expected edit distance by considering all possible world graphs is extremely time-consuming. To solve this problem, we propose a novel estimation model which requires the dynamic generation of an edge access order when calculating conditional probabilities. The estimation model has provable error bounds.

Based on the estimation model, we first propose an efficient clustering algorithm named PEEDR (Partially Expected Edit Distance Reduction). In PEEDR, we initialize a cluster graph and then iteratively improve it by adding vertices to or removing vertices from some clusters if the estimated edit distance can be reduced. Note that in this process, instead of calculating $D(G, Q)$, we only need to judge whether $D(G, Q)$ is reduced by considering the change of the cluster graph. In addition, some tailored techniques are designed to speed up this process.

[Fig. 2. Correlated probabilistic graph and a cluster graph: (a) Correlated probabilistic graph. (b) Cluster graph.]

To further improve the clustering quality, we propose an alternative solution, named the CPGS (Correlated Probabilistic Graphs Spectral) clustering algorithm. Spectral clustering is effective for clustering deterministic graph data, but no available spectral clustering method can be directly applied to our scenario, where uncertainty and correlations exist. In the CPGS clustering algorithm, a model is proposed to transform each vertex in a correlated probabilistic graph into a point in a multi-dimensional space (based on spectral graph theory [9]), which can effectively handle the new features induced by the existence of edge probabilities and correlations. Then, these transformed points in the multi-dimensional space are iteratively clustered by the K-means algorithm.
In addition, we develop several optimization strategies to speed up the clustering process.

The major contributions of our work are summarized as follows.

• We formally define the problem of clustering correlated probabilistic graphs and investigate related properties.
• We propose a new algorithm, PEEDR, which is efficient for clustering correlated probabilistic graphs, together with several pruning methods for this algorithm.
• We develop an algorithm, CPGS, for clustering correlated probabilistic graphs based on the spectral clustering algorithm, which can produce better clustering results, although it is less efficient than PEEDR.

The rest of this paper is organized as follows. We discuss related work in Section 2. Section 3 provides the problem definition. In Section 4, we propose a new efficient clustering algorithm, PEEDR, for clustering correlated probabilistic graphs. In Section 5, we design an alternative algorithm, CPGS, which is more effective. The experimental results are presented in Section 6. Finally, we conclude the paper in Section 7.

2 RELATED WORK

2.1 Algorithms for Clustering Deterministic Graphs

Deterministic graph clustering has been extensively studied in data mining research, and a number of clustering algorithms have been developed. Ahmed et al. [1] provided a survey of graph clustering methods. They discussed the different categories of clustering algorithms and recent efforts to design clustering methods for various kinds of graph data. Jain et al. [7] presented a taxonomy of clustering techniques and some important applications of clustering algorithms. Furthermore, a number of different graph clustering algorithms were discussed in [10], [11]. As one of the most widely used graph clustering algorithms, spectral clustering has received increasing interest from researchers.
Spectral clustering relies on the eigen-structure of a graph Laplacian matrix to partition vertices into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity [9]. The rationale of the spectral clustering method was analyzed by Bach et al. [12]. They derived new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation. Furthermore, a number of optimizations for spectral clustering were proposed in [13], [14].

However, most of the existing algorithms are designed for clustering deterministic graphs. In particular, because correlations exist among edges, it is inappropriate to directly apply these algorithms to clustering correlated probabilistic graphs.

2.2 Querying and Mining of Probabilistic Graphs

Recently, querying and mining of probabilistic graphs have attracted growing attention from researchers. Many classical data mining problems have been redefined for probabilistic graphs, such as the reachability query, the shortest path query, and the K-NN query. Jin et al. [3] studied the Distance-constraint Reachability query and presented a sampling algorithm to answer this NP-hard problem. Michalis et al. [2] introduced an efficient algorithm for K-NN queries in probabilistic graphs based on the random walk method.

As an important preliminary work, George et al. [8] advanced the state of the art by exploring the problem of clustering probabilistic graphs. They proposed efficient algorithms to find a cluster graph, such as the pKwikCluster algorithm and the Furthest algorithm. Nevertheless, these algorithms do not consider the correlations among edges, and thus are not applicable to clustering correlated probabilistic graphs.

2.3 Querying and Mining the Probabilistic Data with Correlations

Recently, correlations among uncertain data have received increasing interest. Sen et al.
[15] proposed a framework to represent the correlations among probabilistic tuples. An efficient strategy was developed for query evaluation over such probabilistic databases by casting the query processing problem as an inference problem in an appropriately constructed probabilistic graphical model. Lian et al. [16] investigated the nearest neighbor query on uncertain data with local correlations. Furthermore, a novel filtering technique via offline pre-computations was developed to reduce the query search space.

There also exist studies on evaluating correlated probabilistic graphs. Hua et al. [5] defined the problem of probabilistic path queries in correlated probabilistic networks. They devised three effective heuristic evaluation functions to estimate in advance the conditional probability of each edge. Yuan et al. [4] proposed a method for subgraph similarity search over correlated probabilistic graphs based on possible world semantics. Tight lower and upper bounds of the subgraph similarity probability were developed to prune the search space. Compared to these queries, clustering over correlated probabilistic graphs is more complicated.

3 PROBLEM DEFINITIONS

We define the model of a correlated probabilistic graph as $G = \{V, E, P, F\}$, where $V$ is the set of vertices, $E$ is the set of edges, $P$ is the existence probability, and $F$ is the joint probability distribution of edges. Following the previous work on correlated probabilistic graphs [6], we assume that joint probabilities only exist among edges that share the same vertex. The output graph is modeled as a cluster graph, which is composed of several disconnected clusters, and each vertex in the graph belongs to only one cluster.

In this model, the joint probability distribution $F$ is of the form $f(e_1, e_2, \ldots, e_k) = p$, where $e_i$ denotes existence and $\bar{e}_i$ denotes nonexistence of edge $e_i \in E$, and $p$ is the value of the joint probability. Fig.
2(a) shows an example of a correlated probabilistic graph $G$ and the corresponding correlated probability distribution. For example, edges $e_6$, $e_7$, and $e_8$ share the same vertex $v_6$. The existence probability of edge $e_6$ is conditionally dependent on edge $e_7$; similarly, the existence probability of edge $e_7$ is conditionally dependent on edge $e_8$. The correlations among edges sharing vertex $v_6$ are shown in the joint probability table in Fig. 2(a). Fig. 2(b) shows a cluster graph of the correlated probabilistic graph.

A possible world graph serves as an efficient model for dealing with probabilistic graphs. For a correlated probabilistic graph $G = \{V, E, P, F\}$, a possible world graph $G_i = \{V', E'\}$ is an instantiation sampled from $G$, where $V' = V$ and $E' \subseteq E$. Additionally, we refer to $X_i(e_j)$ as the existence state of edge $e_j$ in $G_i$, i.e., if $e_j$ exists in $G_i$, $X_i(e_j) = e_j$; otherwise, $X_i(e_j) = \bar{e}_j$. Similarly, $X_Q(e_j)$ indicates the existence state of an edge $e_j$ in the cluster graph $Q$.

When calculating the sampling probability of a possible world graph $G_i$, an edge order (EO) is necessary for conditional probability calculation. Given an EO $O_k = \langle e_{O_1}, e_{O_2}, \ldots, e_{O_{|E|}} \rangle$, the edge conditional probability of $e_j$ in $G_i$ is

$$p_k(X_i(e_j)) = \begin{cases} p\left(e_j \mid X_i(e_{O_1}), X_i(e_{O_2}), \ldots, X_i(e_{O_{m-1}})\right) & \text{if } e_j \text{ exists in } G_i, \\ p\left(\bar{e}_j \mid X_i(e_{O_1}), X_i(e_{O_2}), \ldots, X_i(e_{O_{m-1}})\right) & \text{if } e_j \text{ does not exist in } G_i, \end{cases}$$

where $e_j = e_{O_m}$. Thus, the probability that the possible world graph $G_i$ is sampled from $G$ is defined as

$$p(G_i) = p_k(X_i(e_{O_1})) \, p_k(X_i(e_{O_2})) \cdots p_k(X_i(e_{O_{|E|}})),$$

where $p(G_i)$ does not change given different $O_k$.

Example 1. For the correlated probabilistic graph $G$ in Fig. 1(a) and a possible world graph $G_i$ of $G$ in Fig. 1(b), the probability that $G_i$ is sampled from $G$ is $p(G_i) = p(X_i(e_1), X_i(e_2), X_i(e_3))$. Given an EO $O_k = \langle e_1, e_2, e_3 \rangle$, we have $p(G_i) = p(X_i(e_1)) \, p(X_i(e_2) \mid X_i(e_1)) \, p(X_i(e_3) \mid X_i(e_2), X_i(e_1))$.
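The chain-rule product above can be illustrated with a short sketch. The joint probability table `JOINT` below is a hypothetical three-edge example (it is not the table from Fig. 1 or Fig. 2), and the helper names `marginal` and `world_probability` are introduced here for illustration only. The sketch evaluates $p(G_i)$ along every edge order and suggests that the product does not depend on the order chosen.

```python
from itertools import permutations

# Hypothetical joint table over edges (e1, e2, e3): 1 = exists, 0 = does not exist.
# e1 and e2 coexist; e2 and e3 are mutually exclusive. Values are made up.
JOINT = {
    (1, 1, 0): 0.5,
    (0, 0, 1): 0.3,
    (0, 0, 0): 0.2,
}


def marginal(assignment):
    """Probability that the edges in `assignment` (index -> state) take those states."""
    return sum(p for world, p in JOINT.items()
               if all(world[i] == s for i, s in assignment.items()))


def world_probability(world, order):
    """Multiply the edge conditional probabilities p_k(X_i(e)) along an edge order."""
    prob, seen = 1.0, {}
    for idx in order:
        denom = marginal(seen)       # probability of the states fixed so far
        seen[idx] = world[idx]
        numer = marginal(seen)       # probability after fixing the next edge
        if numer == 0.0:
            return 0.0               # this world cannot be sampled
        prob *= numer / denom
    return prob


if __name__ == "__main__":
    for order in permutations(range(3)):
        print(order, world_probability((1, 1, 0), order))
```

Each conditional probability is obtained as a ratio of two marginals of the joint table, so the factors telescope to the joint probability of the world regardless of the order.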
Next, we extend the definition of the edit distance from a probabilistic graph to a cluster graph proposed in [8] to accommodate the correlations. Given a correlated probabilistic graph $G = \{V, E, P, F\}$ and an output cluster graph $Q$, according to the possible world semantics, we define the expected edit distance from $G$ to $Q$ as $D(G, Q) = E_{G_i \sqsubseteq G}[D(G_i, Q)]$. If each edge in $G$ is conditionally independent of any other, we have

$$\begin{aligned} D(G, Q) &= E_{G_i \sqsubseteq G}[D(G_i, Q)] \\ &= E_{G_i \sqsubseteq G}\Big[\sum_{j=1}^{|E|} X_i(e_j) \oplus X_Q(e_j)\Big] \\ &= \sum_{j=1}^{|E|} E_{G_i \sqsubseteq G}\big[X_i(e_j) \oplus X_Q(e_j)\big] \\ &= \sum_{e \in E_Q} (1 - p(e)) + \sum_{e \notin E_Q} p(e), \end{aligned} \tag{1}$$

where, if edge $e_j$ exists in both $G_i$ and $Q$, or exists in neither $G_i$ nor $Q$, $X_i(e_j) \oplus X_Q(e_j) = 0$; otherwise, $X_i(e_j) \oplus X_Q(e_j) = 1$.

However, if $G$ is a correlated probabilistic graph, the expected edit distance $D(G, Q)$ cannot be transformed into a linear form, since the edges are conditionally dependent on each other and the third line of Equation 1 cannot be obtained. For efficiency, we develop a model to estimate the expected edit distance. To illustrate the estimation model, we first present several definitions.

Given a correlated probabilistic graph $G = \{V, E, P, F\}$, a cluster graph $Q$, and an EO $O_k$, we define $Q^j$ as the cluster graph obtained by removing edges $e_{O_1}, \ldots, e_{O_{j-1}}$ from $Q$, where $j$ refers to the $j$th edge according to the order $O_k$. Similarly, we define $G^j$ for $G$, and $G_i^j$ for a possible world graph $G_i$. Then, we have

$$D(G, Q) = E_{G_i \sqsubseteq G} D(G_i, Q) = E_{G_i \sqsubseteq G} D\big(X_i(e_{O_1}) \cup G_i^2,\; X_Q(e_{O_1}) \cup Q^2\big). \tag{2}$$

By ignoring the dependency of $e_{O_1}$ on the edges in $G_i^2$, and taking the dependencies of the edges in $G_i^2$ on $e_{O_1}$ into consideration, we have

$$\begin{aligned} D(G, Q) &= E_{G_i \sqsubseteq G} D\big(X_i(e_{O_1}) \cup G_i^2,\; X_Q(e_{O_1}) \cup Q^2\big) \\ &\approx E_{G_i \sqsubseteq G} D\big(X_i(e_{O_1}), X_Q(e_{O_1})\big) + E_{G_i \sqsubseteq G} D\big(G_i^2, Q^2\big) \\ &= 1 - p_k\big(X_Q(e_{O_1})\big) + E_{G_i \sqsubseteq G} D\big(G_i^2, Q^2\big), \end{aligned} \tag{3}$$

where $E_{G_i \sqsubseteq G} D(G_i^2, Q^2)$ is the estimated expected edit distance from $G_i^2$ to $Q^2$, and the sampling probability of each possible world graph $G_i^2$ is conditionally dependent on the existence state of $e_{O_1}$ in $Q$. Similarly, we have $E_{G_i \sqsubseteq G} D(G_i^2, Q^2) \approx 1 - p_k(X_Q(e_{O_2})) + E_{G_i \sqsubseteq G} D(G_i^3, Q^3)$.
By iteratively executing the process, we have

$$D(G, Q) \approx |E| - \sum_{j=1}^{|E|} p_k\big(X_Q(e_j)\big) = D^k(G, Q), \tag{4}$$

where $|E|$ denotes the number of edges in $G$, and $D^k(G, Q)$ is the objective function, referring to the estimated expected edit distance from $G$ to the output cluster graph $Q$ with respect to $O_k$. Note that because the conditional probability of an edge varies with $O_k$, different $O_k$ will yield different $D^k(G, Q)$. Although some correlations among edges are omitted when calculating $D^k(G, Q)$, most of the correlations have been taken into account, and therefore $D^k(G, Q)$ can be used to estimate $D(G, Q)$ effectively.

Given a correlated probabilistic graph $G$, this paper proposes two types of algorithms to find a cluster graph $Q$. The first algorithm, named PEEDR, assumes that the number of output clusters is not fixed, while the other one, named CPGS, focuses on a fixed number of clusters. The effectiveness of an output cluster graph $Q$ is evaluated by $D^k(G, Q)$.

Analysis of our problem: According to George et al. [8], the expected edit distance from any cluster graph $Q_k$ of $G$ to the optimal cluster graph of $G$ is bounded. Here, we denote the optimal cluster graph that minimizes the expected edit distance $D(G, Q)$ as $Q_{opt}$. Let $X_{Q_k}(e_i)$ be a random variable, and thus $p(X_{Q_k}(e_i) \in [0, 1]) = 1$. According to the Hoeffding inequality, we have

$$p\big(|D(G, Q_k) - D(G, Q_{opt})| \geq \varepsilon\big) \leq 2\exp\left(-\frac{2\varepsilon^2}{\sum_{i=1}^{|E|}(1 - 0)^2}\right) = 2\exp\left(-\frac{2\varepsilon^2}{|E|}\right). \tag{5}$$

Therefore, for various $Q_k$, the probability that $D(G, Q_k)$ deviates from $D(G, Q_{opt})$ is bounded.

According to Shamir et al. [17], it is NP-hard to cluster a deterministic graph via edit distance, which is known as the CLUSTEREDIT problem. Obviously, clustering correlated probabilistic graphs is an NP-hard problem, as it is a generalization of the CLUSTEREDIT problem. Therefore, we design approximate algorithms in this paper.

The main notations introduced in this paper are summarized in Table 1.
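As an illustration of the estimate in Equation 4, the following sketch evaluates $D^k(G, Q)$ on a small hypothetical joint probability table. The table `JOINT`, the helper names, and the convention of treating a conditional probability with a zero-probability condition as 0 are all assumptions made here for illustration; they are not taken from the paper. The cluster graph $Q$ is represented only by the existence states it assigns to the edges of $G$.

```python
# Hypothetical joint table over edges (e1, e2, e3): 1 = exists, 0 = does not exist.
JOINT = {
    (1, 1, 0): 0.5,
    (0, 0, 1): 0.3,
    (0, 0, 0): 0.2,
}


def marginal(assignment):
    """Probability that the edges in `assignment` (index -> state) take those states."""
    return sum(p for world, p in JOINT.items()
               if all(world[i] == s for i, s in assignment.items()))


def estimated_edit_distance(q_states, order):
    """D^k(G, Q) = |E| - sum_j p_k(X_Q(e_j)), evaluated along the edge order `order`."""
    total, seen = 0.0, {}
    for idx in order:
        denom = marginal(seen)               # probability of Q's states on earlier edges
        seen[idx] = q_states[idx]
        numer = marginal(seen)
        # Assumption: an undefined conditional (zero-probability condition) counts as 0.
        total += numer / denom if denom > 0 else 0.0
    return len(order) - total


if __name__ == "__main__":
    q = (1, 1, 0)  # Q keeps e1 and e2 and drops e3
    print(estimated_edit_distance(q, (0, 1, 2)))
    print(estimated_edit_distance(q, (2, 1, 0)))
```

On this toy table the two edge orders give different values, which matches the remark that different $O_k$ yield different $D^k(G, Q)$.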
4 PEEDR CLUSTERING ALGORITHM

4.1 Basic PEEDR Clustering Algorithm

In this subsection, we present a novel algorithm called Partially Expected Edit Distance Reduction (PEEDR) for clustering a correlated probabilistic graph $G$. To illustrate the PEEDR algorithm, we first present a definition.

Definition 1 (Adjacent Vertex to Cluster). Consider a graph $G = \{V, E, P, F\}$ and a cluster $C$ in the cluster graph $Q$ of $G$. Let $V_C$ denote the set of nodes in cluster $C$. For any vertex $v \in V \setminus V_C$, we say that vertex $v$ is adjacent to the cluster $C$ if it is adjacent to at least one vertex in $V_C$, denoted as $v \in Adj(C)$.

The PEEDR algorithm initializes a cluster with one vertex. Then each vertex that is adjacent to the cluster is moved into the cluster if doing so reduces the expected edit distance from $G$ to the current cluster graph. This step is applied iteratively until the cluster cannot be expanded. We next choose a vertex from the unclustered vertices and repeat the above procedure to generate another cluster. The procedure is repeated until all vertices of $G$ are grouped into clusters. Consequently, we get the final cluster graph.

One open problem in the above clustering procedure is which vertex to choose in each iteration. Motivated by the observation that vertices with higher degrees are more likely to be the centers of clusters, the vertices in $G$ are sorted in descending order of their degrees. We prioritize the vertices with higher degree in $Adj(C)$ when moving vertices to $C$ or creating new clusters. Let $O_0$ be the EO dynamically generated by this method, and $D^0(G, Q)$ be the estimated expected edit distance from the original correlated probabilistic graph to the cluster graph $Q$ generated with respect to $O_0$.

The pseudocode of the PEEDR algorithm is outlined in Algorithm 1.
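Separately from Algorithm 1, the greedy expansion described above can be illustrated with a simplified sketch. It is not the authors' implementation: to stay short it assumes independent edges, so the change in expected edit distance when a vertex joins a cluster follows Equation 1 rather than the correlated estimate $D^0(G, Q)$, and it omits the DPTC initialization and the pruning bounds. The function name `peedr_sketch` and the input format are hypothetical.

```python
def peedr_sketch(vertices, prob):
    """Greedy cluster growth in descending degree order.

    `prob` maps frozenset({u, v}) to an edge existence probability.
    Assumption: edges are independent, so moving v into a cluster changes the
    expected edit distance by sum over members u of (1 - 2 * p(u, v)).
    """
    degree = {v: sum(1 for edge in prob if v in edge) for v in vertices}
    unclustered = sorted(vertices, key=lambda v: (-degree[v], v))
    clusters = []
    while unclustered:
        cluster = [unclustered.pop(0)]          # highest-degree unclustered vertex
        expanded = True
        while expanded:
            expanded = False
            for v in list(unclustered):
                if not any(frozenset((u, v)) in prob for u in cluster):
                    continue                     # v is not adjacent to the cluster
                # Reduction in expected edit distance if v joins the cluster.
                gain = sum(2 * prob.get(frozenset((u, v)), 0.0) - 1 for u in cluster)
                if gain > 0:
                    cluster.append(v)
                    unclustered.remove(v)
                    expanded = True
        clusters.append(sorted(cluster))
    return clusters


if __name__ == "__main__":
    edges = {
        frozenset((1, 2)): 0.9,
        frozenset((1, 3)): 0.8,
        frozenset((2, 3)): 0.9,
        frozenset((3, 4)): 0.2,
    }
    print(peedr_sketch([1, 2, 3, 4], edges))
```

In the correlated setting, the `gain` computation would be replaced by a comparison of the estimated expected edit distances before and after the move.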
The algorithm first sorts all the vertices in descending order of their degrees (Line 1). It next initializes a virtual cluster $C'$, where $V_{C'} = \{v \mid v \in V\}$, which keeps all the unclustered vertices (Line 3). The algorithm finds the vertex with the maximum degree in $V_{C'}$, denoted as $v$ (Line 5). As an optimization of our algorithm (to be presented in Section 4.2.1), we build a Distance-Probability-Threshold Clique (DPTC) centered at $v$, denoted as $C_i$ (Line 6). The algorithm checks each vertex $v_j \in V_{C'}$ that is adjacent to $C_i$, and puts $v_j$ into $C_i$ if doing so reduces the objective function $D(G, Q)$ (Lines 9–15). This check is a key step of the algorithm, for which we invoke a function isReduceEdit (Algorithm 3) and several optimization techniques (Sections 4.2.2 and 4.2.3). Then we update the set $V_{C'} = V_{C'} \setminus V_{C_i}$ and create the next cluster from the vertices in the virtual cluster $C'$. The procedure is repeated until all the vertices in $C'$ are grouped into disconnected clusters.

[TABLE 1. Notations]

4.2 Optimizations for Clustering Process

4.2.1 Grouping Vertices into DPTCs (GVD)

Inspired by the concept of the maximum clique [18], [19], we propose to build a Distance-Probability-Threshold Clique (DPTC) centered at the singleton cluster. Before giving the definition of the DPTC, we present two notions. For two vertices $v_i$ and $v_j$ in a correlated probabilistic graph $G = \{V, E, P, F\}$, if there is a route $r_k$ from $v_i$ to $v_j$, then we define the distance from $v_i$ to $v_j$ as the number of hops in this route, denoted as $d_{r_k}(v_i, v_j)$, and define the similarity between $v_i$ and $v_j$ as the existence probability of this route, denoted as $sim_{r_k}(v_i, v_j)$.

Definition 2 (DPTC). Given a distance threshold $d_t$ and a probability threshold, we define a subgraph $C$ of $G$ as a Distance-Probability-Threshold Clique (DPTC) if, for any pair of vertices $v_i, v_j \in V_C$, there exists a route $r_k$ from $v_i$ to $v_j$ such that $d_{r_k}(v_i, v_j) \leq d_t$ and $sim_{r_k}(v_i, v_j)$ is no smaller than the probability threshold.
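The DPTC condition in Definition 2 can be checked with a small sketch. It makes a simplifying assumption that is not in the paper: edges are treated as independent, so the existence probability of a route is the product of its edge probabilities (with correlations, this would instead come from the joint probability tables). The function names and the adjacency-dictionary input are hypothetical.

```python
def best_route_probability(adj, src, dst, max_hops):
    """Highest route existence probability over simple routes with at most max_hops hops.

    `adj` maps a vertex to {neighbor: edge probability}.
    Assumption: independent edges, so a route's probability is a product.
    """
    best = 0.0
    stack = [(src, 1.0, (src,))]
    while stack:
        node, prob, path = stack.pop()
        if node == dst:
            best = max(best, prob)
            continue
        if len(path) - 1 == max_hops:
            continue                             # hop budget exhausted
        for nxt, p in adj.get(node, {}).items():
            if nxt not in path:                  # keep routes simple
                stack.append((nxt, prob * p, path + (nxt,)))
    return best


def is_dptc(adj, members, dist_threshold, prob_threshold):
    """True if every pair of members has a route within both thresholds."""
    members = list(members)
    return all(
        best_route_probability(adj, u, v, dist_threshold) >= prob_threshold
        for i, u in enumerate(members)
        for v in members[i + 1:]
    )


if __name__ == "__main__":
    adj = {
        "a": {"b": 0.9, "c": 0.5},
        "b": {"a": 0.9, "c": 0.8},
        "c": {"a": 0.5, "b": 0.8},
    }
    print(is_dptc(adj, ["a", "b", "c"], 2, 0.7))  # a-c qualifies via b
    print(is_dptc(adj, ["a", "b", "c"], 1, 0.7))  # direct a-c edge is too weak
```

The exhaustive route search is only meant for tiny examples; it does not reflect how the paper establishes a DPTC in Algorithm 2.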
The intuition behind the DPTC is that vertices with small distances and high similarities are likely to be grouped into the same cluster. Algorithm 2 describes how to establish a DPTC. Given a distance threshold $d_t$ and a probability threshold, we start from the original DPTC $C$, which contains only one vertex (Line 1). For each vertex $v_i \in V_{C'} \cap Adj(C)$, if for every vertex $v_j \in V_C$ there exists a route $r_k$ from $v_i$ to $v_j$ such that $d_{r_k}(v_i, v_j) \leq d_t$ and $sim_{r_k}(v_i, v_j)$ is no smaller than the probability threshold, then $v_i$ is added to the DPTC. These steps are executed iteratively until no vertex meets the requirement (Lines 3–17).

4.2.2 Pruning By Loose Bounds of Objective Function (PLB)

In Line 9 of Algorithm 1, for each vertex $v_j \in V_{C'} \cap Adj(C_i)$, we need to check whether $v_j$ should be moved from the virtual cluster $C'$ to $C_i$. This is determined by $D^0(G, Q_1) - D^0(G, Q_2)$, where $Q_1$ and $Q_2$ are cluster graphs, and $Q_2$ is obtained from $Q_1$ by moving $v_j$ from $C'$ to $C_i$. If $D^0(G, Q_1) - D^0(G, Q_2) > 0$, $v_j$ will be moved from $C'$ to $C_i$; otherwise, $v_j$ remains in $C'$. However, computing $D^0(G, Q_1)$ and $D^0(G, Q_2)$ according to the definition of the estimated expected edit distance is highly complex. Instead of calculating $D^0(G, Q_1)$ and $D^0(G, Q_2)$, we propose to estimate upper and lower bounds for them.

Only the existence states of the edges adjacent to $v_j$ are changed when generating $Q_2$ from $Q_1$. Assume that $m$ edges are changed and the order $O_0$ of those changed edges is $\langle e_1, e_2, \ldots, e_m \rangle$. The partial estimated expected edit distance from $G$ to a cluster graph $Q$ is defined by

$$D_p^0(G, Q) = m - \sum_{i=1}^{m} p_0\big(X_Q(e_i)\big). \tag{6}$$

Then, we have

$$\begin{aligned} D^0(G, Q_1) - D^0(G, Q_2) &= D_p^0(G, Q_1) - D_p^0(G, Q_2) \\ &= -\sum_{i=1}^{m} p_0\big(X_{Q_1}(e_i)\big) + \sum_{i=1}^{m} p_0\big(X_{Q_2}(e_i)\big). \end{aligned} \tag{7}$$

Yet calculating Equation 7 is still computationally expensive. Instead, we propose to calculate a lower and an upper bound of $D_p^0(G, Q)$.
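To achieve this, we next propose an approach to estimate an upper bound and a lower bound for $p_0(X_Q(e_i))$.

For an arbitrary edge $e_i \in E$, we refer to $SUE$ as the set of edges whose existence states do not change when generating $Q_2$ from $Q_1$. Let $E'$ denote the edge set $E' = \{e_1\} \cup \{e_2\} \cup \ldots \cup \{e_{i-1}\} \cup SUE$, and let $X_Q(E')$ denote the existence states of all the edges in $E'$. We have

$$p\big(X_Q(e_i) \mid X_Q(E')\big) = \frac{p\big(X_Q(e_i) X_Q(E')\big)}{p\big(X_Q(E')\big)}.$$

The correlation coefficient of $X_Q(e_i)$ and $X_Q(E')$ is

$$\rho\big(X_Q(e_i), X_Q(E')\big) = \frac{E\big(X_Q(e_i) X_Q(E')\big) - E\big(X_Q(e_i)\big) E\big(X_Q(E')\big)}{\sqrt{Var\big(X_Q(e_i)\big)} \sqrt{Var\big(X_Q(E')\big)}}, \tag{8}$$

where $Var(X_Q(e_i))$ is the variance of the variable $X_Q(e_i)$, and $Var(X_Q(e_i)) = p(X_Q(e_i))(1 - p(X_Q(e_i)))$. We define $Var(X_Q(E'))$ similarly. Then, we have

$$p\big(X_Q(e_i) X_Q(E')\big) = p\big(X_Q(e_i)\big) p\big(X_Q(E')\big) + \rho\big(X_Q(e_i), X_Q(E')\big) \sqrt{Var\big(X_Q(e_i)\big)} \sqrt{Var\big(X_Q(E')\big)}. \tag{9}$$

As $-1 \leq \rho(X_Q(e_i), X_Q(E')) \leq 1$, we have

$$\begin{aligned} p\big(X_Q(e_i)\big) p\big(X_Q(E')\big) - \sqrt{Var\big(X_Q(e_i)\big)} \sqrt{Var\big(X_Q(E')\big)} &\leq p\big(X_Q(e_i) X_Q(E')\big) \\ &\leq p\big(X_Q(e_i)\big) p\big(X_Q(E')\big) + \sqrt{Var\big(X_Q(e_i)\big)} \sqrt{Var\big(X_Q(E')\big)}. \end{aligned} \tag{10}$$

Consequently, dividing by $p(X_Q(E'))$ and using $\sqrt{Var(X_Q(E'))} / p(X_Q(E')) = \sqrt{1 / p(X_Q(E')) - 1}$,

$$\begin{aligned} p\big(X_Q(e_i)\big) - \sqrt{Var\big(X_Q(e_i)\big)} \sqrt{\frac{1}{p\big(X_Q(E')\big)} - 1} &= \frac{p\big(X_Q(e_i)\big) p\big(X_Q(E')\big) - \sqrt{Var\big(X_Q(e_i)\big)} \sqrt{Var\big(X_Q(E')\big)}}{p\big(X_Q(E')\big)} \\ &\leq p\big(X_Q(e_i) \mid X_Q(E')\big) \\ &\leq \frac{p\big(X_Q(e_i)\big) p\big(X_Q(E')\big) + \sqrt{Var\big(X_Q(e_i)\big)} \sqrt{Var\big(X_Q(E')\big)}}{p\big(X_Q(E')\big)} \\ &= p\big(X_Q(e_i)\big) + \sqrt{Var\big(X_Q(e_i)\big)} \sqrt{\frac{1}{p\big(X_Q(E')\big)} - 1}. \end{aligned} \tag{11}$$

Accordingly, the upper and lower bounds of each $p_0(X_Q(e_i))$ can be obtained based on Equation 11. The upper bound $U(D_p^0(G, Q))$ is determined by all the lower bounds of $p_0(X_Q(e_i))$, $i = 1, \ldots, m$, and the lower bound $L(D_p^0(G, Q))$ is determined by all the upper bounds of $p_0(X_Q(e_i))$, $i = 1, \ldots, m$. The upper bound $U(D^0(G, Q_1) - D^0(G, Q_2))$ is obtained from $U(D_p^0(G, Q_1))$ and $L(D_p^0(G, Q_2))$, and the lower bound $L(D^0(G, Q_1) - D^0(G, Q_2))$ is obtained from $L(D_p^0(G, Q_1))$ and $U(D_p^0(G, Q_2))$.

4.2.3 Pruning By Tight Upper Bounds of Objective Function (PTUB)

We propose a tighter upper bound of the partial expected edit distance in this section. Suppose the $O_0$ order of the edges that have changed from $Q_1$ to $Q_2$ is $\langle e_1, e_2, \ldots, e_m \rangle$. For any cluster graph $Q$, the partial expected edit distance $D_p^0(G, Q)$ is defined in Section 4.2.2. We next provide a tight upper bound of $D_p^0(G, Q)$ according to the following lemma.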
Lemma 1.

$$\begin{aligned} & p\big(X_Q(e_1) \mid X_Q(SUE)\big) + p\big(X_Q(e_2) \mid X_Q(SUE), X_Q(e_1)\big) + \cdots + p\big(X_Q(e_m) \mid X_Q(SUE), X_Q(e_1), \ldots, X_Q(e_{m-1})\big) \\ &\quad \geq m \, p^{\frac{1}{m}}\big(X_Q(e_1), \ldots, X_Q(e_m) \mid X_Q(SUE)\big). \end{aligned} \tag{12}$$

Proof. First, we prove

$$a_1 + a_2 + \cdots + a_m \geq m (a_1 a_2 \cdots a_m)^{\frac{1}{m}}. \tag{13}$$

Consider the function $g(x) = \ln x$. Its derivatives are $g'(x) = \frac{1}{x}$ and $g''(x) = -\frac{1}{x^2} < 0$. As a result, $g(x)$ is a concave function. Consequently,

$$g\left[\frac{a_1 + a_2 + \cdots + a_m}{m}\right] \geq \frac{g(a_1) + g(a_2) + \cdots + g(a_m)}{m}. \tag{14}$$

Besides, as $\ln(ab) = \ln a + \ln b$ and $\frac{\ln x}{n} = \ln(x^{\frac{1}{n}})$, we can get Equation 13. Accordingly, we can obtain

$$\begin{aligned} & p\big(X_Q(e_1) \mid X_Q(SUE)\big) + p\big(X_Q(e_2) \mid X_Q(SUE), X_Q(e_1)\big) + \cdots + p\big(X_Q(e_m) \mid X_Q(SUE), X_Q(e_1), \ldots, X_Q(e_{m-1})\big) \\ &\geq m \Big[ p\big(X_Q(e_1) \mid X_Q(SUE)\big) \, p\big(X_Q(e_2) \mid X_Q(SUE), X_Q(e_1)\big) \cdots p\big(X_Q(e_m) \mid X_Q(SUE), X_Q(e_1), \ldots, X_Q(e_{m-1})\big) \Big]^{\frac{1}{m}} \\ &= m \left[ \frac{p\big(X_Q(e_1), X_Q(SUE)\big)}{p\big(X_Q(SUE)\big)} \cdot \frac{p\big(X_Q(e_1), X_Q(e_2), X_Q(SUE)\big)}{p\big(X_Q(e_1), X_Q(SUE)\big)} \cdots \frac{p\big(X_Q(e_1), \ldots, X_Q(e_m), X_Q(SUE)\big)}{p\big(X_Q(e_1), \ldots, X_Q(e_{m-1}), X_Q(SUE)\big)} \right]^{\frac{1}{m}} \\ &= m \, p^{\frac{1}{m}}\big(X_Q(e_1), \ldots, X_Q(e_m) \mid X_Q(SUE)\big). \end{aligned} \tag{15}$$

Based on Lemma 1, we have

$$D_p^0(G, Q) \leq m - m \, p^{\frac{1}{m}}\big(X_Q(e_1), \ldots, X_Q(e_m) \mid X_Q(SUE)\big). \tag{16}$$

We denote the upper bound developed in Equation 16 as $TU(D_p^0(G, Q))$. As a consequence, the tight upper bound $TU(D^0(G, Q_1) - D^0(G, Q_2))$ is determined by $TU(D_p^0(G, Q_1))$ and $L(D_p^0(G, Q_2))$, and the tight lower bound $TL(D^0(G, Q_1) - D^0(G, Q_2))$ is obtained from $L(D_p^0(G, Q_1))$ and $TU(D_p^0(G, Q_2))$.

Note that calculating this tighter upper bound incurs higher computation costs, with a time complexity of $O(d_m 2^{d_m})$, where $d_m$ is the maximum degree of the vertices. On the other hand, using the loose bound proposed in Section 4.2.2 incurs only $O(2^{d_m})$ time complexity.
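The bound in Equation 16 can be checked numerically on a toy example. The sketch below assumes a hypothetical three-edge joint probability table with an empty $SUE$, so the conditioning on $X_Q(SUE)$ disappears; the table values and the function names are illustrative only.

```python
# Hypothetical joint table over the changed edges (e1, e2, e3); SUE is assumed empty.
JOINT = {
    (1, 1, 0): 0.5,
    (0, 0, 1): 0.3,
    (0, 0, 0): 0.2,
}


def marginal(assignment):
    """Probability that the edges in `assignment` (index -> state) take those states."""
    return sum(p for world, p in JOINT.items()
               if all(world[i] == s for i, s in assignment.items()))


def partial_distance(q_states):
    """Equation 6: m minus the sum of conditional probabilities along <e1, ..., em>.

    Assumes q_states has nonzero joint probability, so every condition is valid.
    """
    total, seen = 0.0, {}
    for idx, state in enumerate(q_states):
        denom = marginal(seen)
        seen[idx] = state
        total += marginal(seen) / denom
    return len(q_states) - total


def tight_upper_bound(q_states):
    """Equation 16: m - m * p(X_Q(e1), ..., X_Q(em)) ** (1 / m)."""
    m = len(q_states)
    joint = marginal(dict(enumerate(q_states)))
    return m - m * joint ** (1.0 / m)


if __name__ == "__main__":
    for q in JOINT:
        print(q, partial_distance(q), tight_upper_bound(q))
```

The bound needs only one joint probability instead of $m$ conditional probabilities, which is where the arithmetic-geometric mean inequality of Lemma 1 is used.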
Therefore, our proposed strategy is to first employ the loose bounds $L(D^0(G, Q_1) - D^0(G, Q_2))$ and $U(D^0(G, Q_1) - D^0(G, Q_2))$ to produce a smaller candidate set, and then apply the tight bounds $TU(D^0(G, Q_1) - D^0(G, Q_2))$ and $TL(D^0(G, Q_1) - D^0(G, Q_2))$ to this candidate set to increase the pruning power. With these bounds, the efficiency of the PEEDR clustering algorithm can be further improved. Algorithm 3 describes how to apply the pruning techniques in the PEEDR algorithm.

[Fig. 3. Possible world graph G1 and the cluster graph Q: (a) Possible world graph G1. (b) Cluster graph Q.]

4.2.4 Optimizing By Redefining Objective Function (OROF)

In the proposed algorithms, joint probability tables are repeatedly read when calculating the conditional probabilities of the edges.

Example 4.1. Consider the cluster graph $Q$ illustrated in Fig. 3 and another cluster graph $Q_2$ that is obtained by moving vertex $v_2$ from $C_1$ to $C_2$. The estimated expected edit distances $D^k(G, Q)$ and $D^k(G, Q_2)$ are calculated to determine whether $Q_2$ is a more effective cluster graph than $Q$. Assume that the $O_k$ order of the edges adjacent to $v_2$ is $\langle e_4, e_3, e_1 \rangle$. According to the definition of $D^k(G, Q)$ based on the edge conditional probability,

$$\begin{aligned} D_p^k(G, Q) &= 3 - \big[p(e_4 \mid SUE) + p(e_3 \mid e_4, SUE) + p(e_1 \mid e_3, e_4, SUE)\big] \\ &= 3 - \left[\frac{p(e_4, SUE)}{p(SUE)} + \frac{p(e_3, e_4, SUE)}{p(e_4, SUE)} + \frac{p(e_1, e_3, e_4, SUE)}{p(e_3, e_4, SUE)}\right]. \end{aligned} \tag{17}$$

Clearly, $p(e_4, SUE)$ and $p(e_3, e_4, SUE)$ are each computed twice when calculating $D_p^k(G, Q)$ based on the edge conditional probability.

To avoid this recomputation, we introduce the definition of joint existence states of adjacent edges and redefine the estimated expected edit distance $D^k(G, Q)$. We refer to $X_i(v_j)$ as the joint existence states of all the adjacent edges sharing a vertex $v_j$ in a possible world graph $G_i$. According to $p_k(X_i(e_j))$, we define $p_k(X_i(v_j))$ as the sum of the conditional probabilities of these edges, i.e., $\sum_d p_k(X_i(e_d))$, where $e_d$ is an edge that is adjacent to $v_j$. Considering a possible world graph $G_1$ in Fig.
3, we can infer that $p_k(X_1(v_2)) = p(e_4) + p(e_3 \mid e_4) + p(e_1 \mid e_3, e_4)$. As discussed before, each edge order (EO) $O_k$ is generated based on a corresponding vertex order, which can be easily obtained. Assuming the vertex order corresponding to the EO $O_k$ is $\langle v_{O_1}, v_{O_2}, \ldots, v_{O_{|V|}} \rangle$, we have $p_k(X_Q(v_i)) = p(X_Q(v_{O_i}) \mid X_Q(v_{O_1}), \ldots, X_Q(v_{O_{m-1}}))$, where $v_i = v_{O_m}$. The estimated expected edit distance from a correlated probabilistic graph $G = \{V, E, P, F\}$ to a cluster graph $Q$ can be redefined as

$$D_k(G, Q) = |E| - \frac{\sum_{i=1}^{|V|} p_k(X_Q(v_i))}{2}. \tag{18}$$

When calculating Equation 17 according to the joint existence states of vertices, the recalculation of $p(e_4, S_{UE})$ and $p(e_3, e_4, S_{UE})$ can be avoided.

The PEEDR clustering algorithm and its pruning techniques (in Sections 4.2.2 and 4.2.3) can be redefined according to the joint existence state.

Running time: The worst-case time complexity of the PEEDR algorithm is $O(|C| \cdot |V|)$, where $|C|$ is the number of clusters in the final cluster graph, and $|V|$ is the number of vertices in the original input graph. In the worst case, a vertex may be visited in the process of establishing every cluster in the final cluster graph, which gives this bound.

5 CPGS CLUSTERING ALGORITHM

5.1 Basic Framework of CPGS

The clustering process of the PEEDR algorithm starts from a local graph and establishes the cluster graph gradually. As vertices will never be separated once grouped into a cluster, it is essentially a greedy algorithm, and it may not meet the need for high precision. Besides, there exists no prior information about the number of final clusters. In some applications, graph clustering aims to partition vertices into a certain number of clusters.
Spectral clustering refers to a class of techniques which rely on the eigen-structure of a graph Laplacian matrix to partition vertices into disjoint clusters with high intra-cluster and low inter-cluster similarity [9]. However, applying spectral clustering to correlated probabilistic graphs has not been studied.

As surveyed by Jain et al. [7], the intra-cluster and inter-cluster similarities can be evaluated according to different measurements. To improve the effectiveness of deterministic graph clustering algorithms, Dhillon et al. [20] modeled each pairwise interaction with a similarity measurement $\omega(e_i)$. Therefore, the inter-cluster similarity can be modeled by $\sum_{e_i \notin E_Q} \omega(e_i)$ and the intra-cluster dissimilarity by $\sum_{e_i \in E_Q} (1 - \omega(e_i))$. Spectral clustering on this graph aims to minimize the inter-cluster similarity and the intra-cluster dissimilarity. By examining the definition of the estimated expected edit distance proposed in Equation 4, we find that our objective function has a form similar to that used by spectral clustering: minimizing $\sum_{e \in E_Q} (1 - p_k(e))$ ensures a high intra-cluster similarity, and minimizing $\sum_{e \notin E_Q} p_k(e)$ ensures a low inter-cluster similarity.

Based on this observation, one straightforward approach to extending the spectral clustering algorithm to correlated probabilistic graphs works as follows:

1) We map conditional probabilities into weights between each pair of adjacent vertices.
2) We extend the Dijkstra method to find the K nearest neighbors (K-NN) of each vertex. It enumerates part of the possible world graphs and calculates the probability that a vertex is the K-NN of the others when correlations exist among edges.
3) We establish a Laplacian matrix according to the K-NN results, and compute its eigenvectors with the power method [21].
4) We represent the vertices by points in a K-dimensional space, and cluster these points with a K-means algorithm.
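To make the objective targeted by these steps concrete, the following minimal sketch evaluates the two sums discussed above, $\sum_{e \in E_Q}(1 - p_k(e))$ over intra-cluster edges and $\sum_{e \notin E_Q} p_k(e)$ over inter-cluster edges, for a given assignment of vertices to clusters. It assumes that the conditional probabilities $p_k(e)$ have already been computed and that only edges of the input graph contribute; the function name `estimated_edit_distance` and the toy graph are hypothetical.

```python
def estimated_edit_distance(edge_prob, cluster_of):
    """Sum (1 - p) over intra-cluster edges and p over inter-cluster edges.

    edge_prob:  dict mapping an edge (u, v) to its (conditional) probability,
                assumed to be precomputed.
    cluster_of: dict mapping each vertex to a cluster identifier.
    """
    cost = 0.0
    for (u, v), p in edge_prob.items():
        if cluster_of[u] == cluster_of[v]:
            cost += 1.0 - p  # edge kept by the cluster graph but may not exist
        else:
            cost += p        # edge cut by the cluster graph but may exist
    return cost


# Toy path graph a-b-c-d with hypothetical probabilities.
edges = {("a", "b"): 0.9, ("b", "c"): 0.2, ("c", "d"): 0.8}
two_clusters = {"a": 0, "b": 0, "c": 1, "d": 1}
one_cluster = {"a": 0, "b": 0, "c": 0, "d": 0}
print(estimated_edit_distance(edges, two_clusters))  # lower cost: cuts the weak edge
print(estimated_edit_distance(edges, one_cluster))
```

In this toy case, cutting the low-probability edge between b and c yields the smaller cost, which is the behavior the objective is meant to reward.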
We call this straightforward method Spectral, and it will be used as a benchmark method in our experiments. However, directly applying the spectral clustering algorithm to correlated probabilistic graphs incurs high time complexity.

In this paper, we propose a more efficient algorithm called CPGS (Correlated Probabilistic Graphs Spectral) to cluster correlated probabilistic graphs. Given a correlated probabilistic graph $G$ and a cluster number $K$, we first reduce the number of objects by establishing DPTCs, and represent these DPTCs as the objects to be clustered. Second, we define the similarity between pairwise adjacent DPTCs to find the K-NN of each DPTC. Third, a Laplacian matrix is obtained according to the K-NN results, and we propose a new approach to calculating the eigenvectors of the Laplacian matrix. Here, we denote the eigenvectors as $U_1, U_2, \ldots, U_K$. Last, each DPTC $C_i$ is represented by a point $P_{Ki}(U_{1i}, U_{2i}, \ldots, U_{Ki})$ in a K-dimensional space, and these points are iteratively clustered with a K-means algorithm to obtain the final cluster graph. We proceed to present the four steps in detail.

5.2 Establishing DPTCs

In the CPGS algorithm, we cluster a graph by establishing DPTCs and representing these DPTCs as objects to be clustered. This improves the time efficiency, as the number of objects to be clustered is much smaller than the number of vertices.

This paper represents vertices in the graph by points in a multi-dimensional space (Section 5.5), as the distance between vertices in a graph is more complex to calculate than the distance between points in a multi-dimensional space. This is the intuition behind the spectral clustering method used in graphs. More details about this transformation can be found in [9]. Assuming $v_i$ and $v_j$ can be represented by $P_i(P_{i1}, \ldots, P_{im})$ and $P_j(P_{j1}, \ldots, P_{jm})$ in an m-dimensional space, we define the similarity between $P_i$ and $P_j$ as $sim(P_i, P_j) = \sum_{s=1}^{m} (P_{is} - P_{js})^2$.
Besides, we refer to $sim(v_i, v_j)$ between $v_i$ and $v_j$ as $\max_k sim_{r_k}(v_i, v_j)$, where $r_k$ is a route from $v_i$ to $v_j$ and $sim_{r_k}(v_i, v_j)$ is the existence probability of $r_k$. For any three vertices $v_i$, $v_j$ and $v_k$ represented by $P_i$, $P_j$ and $P_k$, we define the similarity accuracy probability of these representations as the probability of $sim(P_i, P_j) > sim(P_i, P_k)$ when $sim(v_i, v_j) > sim(v_i, v_k)$.

For any vertices $v_i$ and $v_j$ which are represented by $P_i$ and $P_j$, we would like to ensure that $sim(P_i, P_j) = sim(v_i, v_j)$ to keep the pairwise similarity. Accordingly, $\frac{n(n-1)}{2}$ equations (i.e., $sim(P_i, P_j) = sim(v_i, v_j)$ for any $i, j$) will be generated in a graph containing $n$ vertices. Assume that each vertex is represented by an m-dimensional point; then $m$ should be at least $\frac{n-1}{2}$ in order to satisfy the $\frac{n(n-1)}{2}$ equations, which contain $n \times m$ variables. Using an $\frac{n-1}{2}$-dimensional point to represent one vertex will cost too much space. Nevertheless, reducing the number of dimensions will lower the accuracy probability. In the spectral clustering method, the dimension $m$ of the space is usually fixed to $K$, which equals the number of output clusters. We would like to prove that, with a fixed dimension of the space, the smaller the number of vertices, the higher the similarity accuracy probability is.

Lemma 2. For the $n$ clustering objects in a correlated probabilistic graph, if we use points in a K-dimensional space to keep their pairwise similarities, the smaller $n$ is, the higher the similarity accuracy probability.

Proof. According to the former analysis, $n$ clustering objects can be represented by $n$ points in an $\frac{n-1}{2}$-dimensional space to preserve pairwise similarity properties. Assuming $m = \frac{n-1}{2}$, for three points $P_{m0}(0, 0, \ldots, 0)$, $P_{m1}(x_1, x_2, \ldots, x_m)$ and $P_{m2}(y_1, y_2, \ldots, y_m)$ in an m-dimensional space, the distance from $P_{m1}$ to the origin $P_{m0}$ is $x_1^2 + x_2^2 + \cdots + x_m^2$, and the distance from $P_{m2}$ to $P_{m0}$ is $y_1^2 + y_2^2 + \cdots + y_m^2$. When using points in a K-dimensional space to represent the three points, we denote them as $P_{K0}(0, 0, \ldots, 0)$, $P_{K1}(x_1, x_2, \ldots, x_K)$ and $P_{K2}(y_1, y_2, \ldots, y_K)$. Here, we define

$$\overline{x^2} = \frac{x_1^2 + x_2^2 + \cdots + x_m^2}{m}, \quad \overline{y^2} = \frac{y_1^2 + y_2^2 + \cdots + y_m^2}{m}, \quad \overline{x_K^2} = \frac{x_1^2 + x_2^2 + \cdots + x_K^2}{K}, \quad \overline{y_K^2} = \frac{y_1^2 + y_2^2 + \cdots + y_K^2}{K}.$$

The pairwise similarity accuracy can be defined as the value of $\Pr(\overline{x^2} > \overline{y^2})$ in the case $\overline{x_K^2} > \overline{y_K^2}$, and we would like to prove that the smaller the number $n$ of clustering objects, the higher $\Pr(\overline{x^2} > \overline{y^2})$ is. Let us assume $H_0$: $\overline{x^2} > \overline{y^2}$. Then the hypothesis $H_0$ is accepted in the case $\overline{x_K^2} > \overline{y_K^2}$ if

$$\frac{(\overline{x_K^2} - \overline{y_K^2})\sqrt{K - 1}}{\sqrt{S_1^2 + S_2^2 - 2 S_{12}}} > t_\alpha(K - 1), \tag{19}$$

where $S_1$ is the standard deviation of $\overline{x_K^2}$, $S_2$ is the standard deviation of $\overline{y_K^2}$, $S_{12}$ is the covariance of $\overline{x_K^2}$ and $\overline{y_K^2}$, $\alpha$ is a given confidence coefficient, and $t_\alpha(K - 1)$ is a constant for given $\alpha$ and $K$. As is known, the smaller $m = \frac{n-1}{2}$, the smaller $\sqrt{S_1^2 + S_2^2 - 2 S_{12}}$ is, and the more likely the hypothesis is accepted. The lemma is proved accordingly.

Generally speaking, pairs of vertices with small distances and high similarities are more likely to be grouped into the same cluster. As a consequence, we build DPTCs to group close and similar vertices and represent these DPTCs as objects to be clustered.

The algorithm of establishing DPTCs is executed as follows. From the set of vertices that are not in any DPTC, we select a vertex and then invoke Algorithm 2 to build a DPTC for the vertex. Similar to the method of generating the edge order $O_0$, we select vertices with high degrees when creating new DPTCs or adding vertices to DPTCs. This procedure is repeated until all the vertices are grouped into DPTCs. Finally, we obtain a DPTC graph, a graph composed of several disconnected DPTCs, denoted as $G_D$.

Establishing DPTCs increases the similarity accuracy probability, and improves the effectiveness of searching the K-NN of each DPTC proposed in Section 5.3.
However, it is disadvantageous to the effectiveness of the processes in Sections 5.4 and 5.5, compared with directly clustering vertices.

5.3 Searching the K-NN of each DPTC

In CPGS, we need to find the K nearest neighbor (K-NN) DPTCs of each DPTC to establish a graph Laplacian matrix. Before presenting the approaches to finding the K-NN DPTCs, we define the similarity between two adjacent DPTCs based on the definition of the estimated expected edit distance.

Given a correlated probabilistic graph $G = \{V, E, P, F\}$ and two adjacent DPTCs $C_i$ and $C_j$ in a DPTC graph $G_D$, the similarity between them is defined as $sim(C_i, C_j) = \sum_{e_p \in S(C_i, C_j)} p_k(X_D(e_p))$, where $S(C_i, C_j)$ is the set of edges between $C_i$ and $C_j$ in the correlated probabilistic graph, and $X_D(e_p)$ is the existence state of $e_p$ in the DPTC graph $G_D$. Furthermore, the probability that a walk starting at $C_i$ first hits one of its adjacent DPTCs $C_j$ is defined as

$$w(C_i, C_j) = \frac{sim(C_i, C_j)}{\sum_{C_k \in Adj(C_i)} sim(C_i, C_k)}. \tag{20}$$

In order to compute the K-NN DPTCs of a source DPTC $C_i$, we propose to simulate a random walk and approximate the stationary distribution of each DPTC by the frequency with which the DPTC was visited during the simulation [2]. Let $m$ denote the number of DPTCs of the DPTC graph. We construct an $m \times m$ matrix $W$ from the K-NN results, where each element $W_{ij}$ represents the similarity between two DPTCs $C_i$ and $C_j$.

5.4 Establishing a Laplacian Matrix and Calculating Eigenvectors

In this subsection, we present how to establish a Laplacian matrix from the graph matrix $W$ and an approach to calculating its eigenvectors. The K eigenvectors corresponding to the K largest eigenvalues of the Laplacian matrix are used to represent each object, namely each DPTC in our algorithm, with a point in a multi-dimensional space [9].

We first present how to establish a symmetric Laplacian matrix $L(G)$ from $W$. We set $W_{ij} = \max\{W_{ij}, W_{ji}\}$, and thus $W$ is a symmetric matrix. Then, a diagonal matrix $D$ can be obtained, where $D_{ii} = \sum_j W_{ij}$.
Subsequently, we have the Laplacian matrix $L(G) = D - W$. However, if we employ the classic power method, computing the K eigenvectors corresponding to the K biggest eigenvalues of $L(G)$ is costly, with a time complexity of $O(m^3)$. In the classic power method, the eigenvectors are obtained in ascending order of their corresponding eigenvalues, and the subsequent eigenvectors are calculated based on the former eigenvectors. Thus, to get the K eigenvectors needed in the CPGS clustering algorithm, we have to calculate all the eigenvectors of the matrix.

To this end, we develop a novel method to get the K eigenvectors corresponding to the K biggest eigenvalues instead of computing all the eigenvectors of the Laplacian matrix. The time complexity of calculating the K eigenvectors using our method is reduced to $O(Km^2)$. We select a value $\lambda$ which is larger than all the elements in $W$ to establish a "self-complementary" weight matrix $W'$, where

$$W' = \begin{pmatrix} 0 & \lambda - W_{12} & \cdots & \lambda - W_{1m} \\ \lambda - W_{21} & 0 & \cdots & \lambda - W_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda - W_{m1} & \lambda - W_{m2} & \cdots & 0 \end{pmatrix}. \tag{21}$$

Similarly, we define $D'$ as a diagonal matrix, where $D'_{ii} = \sum_j W'_{ij}$. Then we have another Laplacian matrix $L(G') = D' - W'$. According to [22], we can use the power method to efficiently calculate the K eigenvectors corresponding to the K smallest eigenvalues of $L(G')$, which are the same as the eigenvectors corresponding to the K biggest eigenvalues of $L(G)$.

5.5 Iteratively Clustering Vertices in a K-Dimensional Space

Denote the K eigenvectors of the graph Laplacian matrix as $U_1, U_2, \ldots, U_K$, where $U_i = [U_{i1}, U_{i2}, \ldots, U_{im}]$ and $m$ is the number of DPTCs in the DPTC graph. According to the spectral clustering method, each DPTC $C_i$ can be represented by a point $P_i(U_{1i}, U_{2i}, \ldots, U_{Ki})$. The distance between any two points $P_i$ and $P_j$ is defined as $d(P_i, P_j) = \sum_{k=1}^{K} (U_{ki} - U_{kj})^2$.
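The relation between $L(G)$ and $L(G')$ used in Section 5.4 can be checked numerically. Assuming $W$ is symmetric with a zero diagonal, a direct calculation gives $L(G') = \lambda m I - \lambda J - L(G)$, where $J$ is the all-ones matrix; hence any eigenvector of $L(G)$ orthogonal to the constant vector, with eigenvalue $\mu$, is an eigenvector of $L(G')$ with eigenvalue $\lambda m - \mu$ (the constant vector itself has eigenvalue 0 in both). The sketch below verifies this on a hypothetical 4-by-4 weight matrix. It uses a plain power iteration purely for the check and is not the $O(Km^2)$ procedure of the paper; all function names are illustrative.

```python
def laplacian(W):
    """L = D - W for a symmetric weight matrix W with zero diagonal."""
    m = len(W)
    return [[(sum(W[i]) if i == j else 0.0) - W[i][j] for j in range(m)]
            for i in range(m)]


def complement(W, lam):
    """Self-complementary weights: lam - W[i][j] off the diagonal, 0 on it."""
    m = len(W)
    return [[0.0 if i == j else lam - W[i][j] for j in range(m)]
            for i in range(m)]


def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dominant_eigenpair(A, iters=2000):
    """Plain power iteration; returns (eigenvalue, unit eigenvector)."""
    x = [float(i + 1) for i in range(len(A))]  # arbitrary non-constant start
    for _ in range(iters):
        y = matvec(A, x)
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    mu = sum(a * b for a, b in zip(x, matvec(A, x)))  # Rayleigh quotient
    return mu, x


# Hypothetical symmetric similarity matrix between four DPTCs.
W = [[0.0, 0.9, 0.2, 0.0],
     [0.9, 0.0, 0.3, 0.1],
     [0.2, 0.3, 0.0, 0.8],
     [0.0, 0.1, 0.8, 0.0]]
lam = 1.0  # larger than every element of W
m = len(W)
L = laplacian(W)
L_c = laplacian(complement(W, lam))

mu, u = dominant_eigenpair(L)  # largest eigenvalue of L(G)
# u should also be an eigenvector of L(G') with eigenvalue lam * m - mu.
residual = max(abs(a - (lam * m - mu) * b)
               for a, b in zip(matvec(L_c, u), u))
print(mu, lam * m - mu, residual)
```

The residual is at the level of floating-point rounding, which is consistent with the eigenvector for the largest eigenvalue of $L(G)$ being the eigenvector for the smallest non-trivial eigenvalue of $L(G')$.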
Those points can be clustered by the K-means method, and the clustering result is the final cluster graph of the corresponding DPTCs. To sum up, the CPGS clustering algorithm is presented in Algorithm 4.

Running time: Compared with directly applying spectral clustering in graphs with $O(|V|^3)$ time complexity, the cost of the CPGS clustering algorithm is reduced to $O(|V|^2 + |C|^3)$, where $|V|$ is the number of vertices in the original input graph and $|C|$ is the number of established DPTCs in the original graph. In the CPGS algorithm, $O(|V|^2)$ is the time cost of establishing DPTCs and $O(|C|^3)$ is the time complexity of clustering $|C|$ objects with the spectral clustering method.

6 EXPERIMENTS

6.1 Experimental Setup

In this section, we empirically study the performance of the proposed algorithms. The algorithms are implemented in Microsoft Visual Studio C++ on a PC with a 4 dual core CPU and 8GB memory. We use two real-life graph datasets in our experiments.

PPI network: We use a PPI network from the STRING database [4]. The network is modeled as a probabilistic graph by representing proteins as vertices, pairwise interactions as edges, and the reliability of each pairwise interaction as the edge probability. This graph has 4,179 vertices and 39,309 edges.

TABLE 2. Default Values

Fig. 4. PEEDR: efficiency vs vertex number: (a) PPI-vertex number (×100). (b) Youtube-vertex number (×100).

YouTube social network: In the YouTube social network¹, each vertex represents a user, and an edge between two vertices represents that there exists a connection between them. This dataset comprises 1,134,890 vertices and 5,987,624 edges. The edge existence probability is randomly generated to indicate the link reliability between users.

Correlation Simulation: These two datasets do not contain the correlation probabilities among adjacent edges. To generate these probabilities, we first present several definitions.
For the $m$ adjacent edges sharing the same vertex, we define the correlation rate $\gamma$ among these edges, so that the correlation only exists among $\gamma m$ edges, and the other edges are independent of each other. Additionally, we define the correlation coefficient between $e_i$ and $e_j$ as

$$\rho(e_i, e_j) = \frac{p(e_i, e_j) - p(e_i)\,p(e_j)}{\sqrt{var(e_i)}\,\sqrt{var(e_j)}},$$

where $p(e_i)$ and $p(e_j)$ denote the existence probabilities of edges $e_i$ and $e_j$, respectively, and $var(e_i)$ and $var(e_j)$ denote the variances of $e_i$ and $e_j$, respectively. Given a correlation rate $\gamma$, for any vertex $v_j$ and its $m$ edges, the correlations are generated as follows: we randomly select $\gamma m$ edges; then, for any edges $e_i$ and $e_{i+1}$ of these $\gamma m$ edges, the correlation coefficient $\rho(e_i, e_{i+1})$ is randomly generated in the interval $[0, 1]$, and we obtain $p(e_i, e_{i+1})$ from $\rho(e_i, e_{i+1})$. Furthermore, the joint probability table of these $\gamma m$ edges is obtained.

To evaluate the performance of the proposed algorithms, subgraphs from the two networks are generated by varying the vertex number. Based on each of the two networks, we generate a series of data graphs that contain $n$ vertices and the edges among these vertices by searching the $n - 1$ neighbors of a random vertex according to the BFS method.

We study the effect of different parameters on the efficiency and effectiveness of the proposed algorithms. The default values for the parameters used in our experiments are given in Table 2.

1. http://snap.stanford.edu/data/

Fig. 5. PEEDR: efficiency vs $\gamma$: (a) PPI-correlation rate. (b) Youtube-correlation rate.

In our problem, the optimal cluster graph cannot be obtained in polynomial time. According to the Chernoff bound, we sample $\frac{3}{\epsilon^2 \mu} \ln\left(\frac{2}{\delta}\right)$ cluster graphs, where $\epsilon = 0.1$, $\delta = 0.1$ and $\mu = \min_{e_i \in E} p(e_i)$, and take the one that minimizes the expected edit distance to the correlated probabilistic graph as the optimal cluster graph, denoted as $Q_{opt}$.
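The correlation simulation described earlier in this subsection can be illustrated with a short sketch that turns a correlation coefficient into a joint probability and a 2-by-2 joint probability table for a pair of edges. It assumes that each edge is a Bernoulli variable, so that $var(e) = p(e)(1 - p(e))$, and it clamps the joint probability so that the table remains a valid distribution; both the clamping step and the function names are assumptions made here for illustration, not details given in the paper.

```python
import math


def joint_from_correlation(p_i, p_j, rho):
    """p(e_i, e_j) from existence probabilities and a correlation coefficient.

    Inverts rho = (p(e_i, e_j) - p_i * p_j) / sqrt(var_i * var_j),
    assuming Bernoulli variances var = p * (1 - p).
    """
    var_i = p_i * (1.0 - p_i)
    var_j = p_j * (1.0 - p_j)
    joint = p_i * p_j + rho * math.sqrt(var_i * var_j)
    # Assumption: clamp to the feasible range so every table entry is >= 0.
    lower = max(0.0, p_i + p_j - 1.0)
    upper = min(p_i, p_j)
    return min(max(joint, lower), upper)


def joint_table(p_i, p_j, rho):
    """Joint probability table over (state of e_i, state of e_j)."""
    both = joint_from_correlation(p_i, p_j, rho)
    return {
        (1, 1): both,
        (1, 0): p_i - both,
        (0, 1): p_j - both,
        (0, 0): 1.0 - p_i - p_j + both,
    }


# Hypothetical pair of adjacent edges with positive correlation.
table = joint_table(0.6, 0.5, 0.5)
print(table)
```

With a coefficient of 0 the sketch reduces to the independent case $p(e_i, e_j) = p(e_i)p(e_j)$, and larger coefficients shift probability mass toward the states in which both edges exist or both are absent.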
Then, the accuracy of our algorithm is defined as $1 - \frac{D_k(G, Q) - D(G, Q_{opt})}{D(G, Q_{opt})}$, where $G$ is the input graph, $Q$ is the output cluster graph induced by our algorithm, and $D(G, Q_{opt})$ is the sampled minimum expected edit distance.

6.2 Performance of PEEDR Clustering Algorithm

In this subsection, we evaluate the performance of the PEEDR algorithm and its optimizations.

Efficiency of Optimizations: This set of experiments studies the effect of the optimizations for PEEDR in terms of the running time. Specifically, we compare the following algorithms:

BASIC: The PEEDR algorithm that is based on the DPTCs and does not employ any other optimizations proposed in Section 4.2.
OPSV: The difference of OPSV from the BASIC algorithm is that OPSV sorts vertices in descending order of their degrees.
PLB: Based on the OPSV algorithm, PLB adopts the pruning technique proposed in Section 4.2.2.
PTUB: Based on the PLB algorithm, the PTUB algorithm adopts the pruning technique in Section 4.2.3.
OROF: The PEEDR algorithm implemented with all the pruning techniques proposed in Section 4.

Fig. 4(a) and 4(b) report the efficiency of the different optimizations by varying the data size from 800 to 4,000 on the PPI and YouTube datasets, respectively. Fig. 5(a) and 5(b) illustrate the efficiency of the different optimizations by varying the correlation rate from 0.2 to 0.8.

Fig. 6. PEEDR: effectiveness vs vertex number: (a) PPI-vertex number (×100). (b) Vertex number (×100).

Fig. 7. PEEDR: effectiveness vs correlation rate: (a) PPI-correlation rate. (b) Youtube-correlation rate.

We observe that in general the runtime increases as the number of vertices increases, while it is relatively stable as the average correlation coefficient increases, especially for OROF. Furthermore, it increases as the correlation rate increases, because the number of correlated edges increases.

Comparing the runtime of the BASIC and the OPSV algorithms, we find that the order in OPSV can improve the efficiency of BASIC.
This is because the vertices with higher degrees are more likely to be the centers of clusters, and sorting vertices in descending order of their degrees reduces the number of clustering iterations in OPSV.

We observe that the PLB method reduces the runtime of OPSV by about 30%. PLB improves OPSV by avoiding the accurate calculation of the objective function. Fig. 5(a) and 5(b) show that the runtime can be reduced more significantly with a higher correlation rate, especially when $\gamma \geq 0.6$. This is because this pruning technique has been developed based on the correlation among adjacent edges and is more effective with a higher correlation rate.

The PTUB method further improves the efficiency of PLB. Although the calculation of the tight upper bound of the objective function is complex, it can further prune vertices that cannot be pruned by PLB.

The OROF method outperforms all the other methods. It reduces the runtime by combining the states of all the adjacent edges together.

The effectiveness of PEEDR: In this set of experiments, we evaluate the effectiveness of the PEEDR algorithm. PLB, PTUB and OROF only reduce computation and do not affect the effectiveness of the PEEDR algorithm, so they generate the same cluster graph as OPSV; only the BASIC algorithm may generate a different cluster graph from the other algorithms. Thus, we only discuss the effectiveness of the BASIC and the OPSV algorithms.

Fig. 6(a) and 6(b) show the accuracy rate by varying the vertex number from 800 to 4,000 on the PPI and YouTube networks. The accuracy decreases as the vertex number increases, since the larger the vertex number, the bigger the deviation of the generated cluster graph from the optimal cluster graph is. OPSV performs better than BASIC, as sorting vertices makes the vertices with higher degrees become the cluster centers, and thus OPSV generates better cluster graphs.

Fig. 8. CPGS: efficiency vs vertex number: (a) PPI-vertex number (×100). (b) Youtube-vertex number (×100).

Fig. 9. CPGS: efficiency vs K: (a) PPI-K. (b) Youtube-K.

Fig. 7 illustrates the effectiveness of the OPSV algorithm by varying the correlation rate on both PPI and YouTube. It is observed that varying the correlation rate has little effect on the effectiveness of the PEEDR clustering algorithm. That is because the correlation rate reveals the correlation among edges and does not affect the existence probabilities of edges; thus, the accuracy is not affected.

6.3 Performance of CPGS Clustering Algorithm

We aim to evaluate the efficiency and effectiveness of CPGS and its optimizations. The following algorithms were implemented.

Spectral: We implemented the CPGS algorithm using the basic spectral clustering algorithm without optimizations, as described in Section 5.1.
DPTC: Based on Spectral, we implemented the idea of DPTCs given in Section 5.2.
Random walk: We implemented the CPGS algorithm using DPTCs (Section 5.2) and the random walk method (Section 5.3).
SCMM: We implemented the CPGS algorithm using DPTCs (Section 5.2), the random walk method (Section 5.3), and the self-complementary matrix method (SCMM) given in Section 5.4.

The efficiency of CPGS: Fig. 8 reports the efficiency of the CPGS clustering algorithm and its different optimization versions by varying the vertex number. Fig. 8 shows that the running time grows exponentially with the vertex number. Fig. 9 illustrates that the runtime grows linearly with the value of K. Furthermore, the runtime is almost unaffected by the correlation rate, as presented in Fig. 10.

By comparing Spectral and DPTC, we can see that DPTC runs consistently faster than Spectral and the disparity increases as the vertex number increases. Spectral clusters vertices, while DPTC first establishes DPTCs and then clusters the generated DPTCs.
Although the process of establishing DPTCs takes time, it can greatly speed up the subsequent steps, including searching K-NN, calculating the K eigenvectors and applying K-means.

Fig. 10. CPGS: efficiency vs correlation rate: (a) PPI-correlation rate. (b) Youtube-correlation rate.

Fig. 11. CPGS: effectiveness vs vertex number: (a) PPI-vertex number (×100). (b) Youtube-vertex number (×100).

By comparing Random walk with DPTC, we can observe the benefits of the random walk method (Section 5.3) over the baseline Dijkstra method. The basic idea of the Dijkstra method is to enumerate parts of the possible world graphs and calculate the probability that a DPTC is the K-NN of another. As the random walk method avoids the enumeration of the possible world graphs, it is well suited to probabilistic problems. The experimental result illustrates that it performs better than the Dijkstra method on correlated probabilistic graphs.

By comparing SCMM and Random walk, we can see the benefits of the optimization using the self-complementary matrix method. Note that the self-complementary matrix method (SCMM) reduces the number of eigenvectors to be calculated compared to directly calculating the K eigenvectors, thus improving the efficiency.

The effectiveness of CPGS: Figs. 11 and 12 report the accuracy rates from the input correlated probabilistic graph to the output cluster graphs generated by varying the vertex number $n$ and $K$, respectively, on both networks. As the SCMM method has little effect on the accuracy compared with Random walk, we only evaluate the accuracy rate of Spectral, DPTC, and SCMM in this set of experiments. The accuracy rate decreases with the number of vertices and increases as $K$ increases. Similar to Fig. 7, the accuracy rate remains stable with different correlation rates, and we omit these experimental results due to the page constraints.
Establishing DPTCs improves the effectiveness of the process in Section 5.3, which can be proved according to Lemma 2. However, it is disadvantageous to the effectiveness of the processes in Sections 5.4 and 5.5, as all the vertices in the same DPTC are represented by the same point and cannot be partitioned into different clusters in these processes. It is observed that clustering DPTCs leads to a trivial 2% decrease in the accuracy rate, revealing that generating DPTCs does not significantly affect the effectiveness of the cluster graph. In addition, our experimental studies illustrate that the random walk method used in our algorithm has little effect on the effectiveness of the CPGS clustering algorithm.

Fig. 12. CPGS: effectiveness vs K: (a) PPI-K. (b) Youtube-K.

Fig. 13. Comparisons with existing methods.

6.4 Comparisons with Existing Methods

We compare our methods with existing graph clustering methods in this subsection. The variable $K$ of the CPGS algorithm is determined by the output cluster number of the PEEDR algorithm, so that they generate the same number of clusters. Specifically, we first compare with the Furthest algorithm [8], which runs on a probabilistic graph obtained by removing the correlations among edges from the original input graph. Additionally, we compare with two representative graph clustering methods applied to the deterministic graph, namely the Girvan-Newman algorithm [23] and the spectral clustering algorithm [9], by removing both the correlations and the uncertain information from the original input graph.

Fig. 13 reports the accuracy rate of different algorithms. The accuracy rate decreases as the vertex number increases. We can see that CPGS and PEEDR generate better cluster graphs than the Furthest algorithm, the Girvan-Newman algorithm and the spectral clustering algorithm do.
7 CONCLUSION

In this paper, we have addressed the problem of clustering correlated probabilistic graphs and proposed an efficient clustering algorithm named PEEDR. Based on the properties of joint probability, we introduced several pruning methods for PEEDR. To achieve better effectiveness of clustering, we also proposed another clustering algorithm named CPGS. A comprehensive performance evaluation verifies the efficiency and effectiveness of our algorithms and pruning methods.

ACKNOWLEDGMENTS

This work was supported in part by the National Basic Research Program of China under Grant 2012CB316201, in part by the National Natural Science Foundation of China (61003058, 61272179), and in part by the Fundamental Research Funds for the Central Universities (N130404010).

REFERENCES

[1] C. C. Aggarwal and H. Wang, Managing and Mining Graph Data, New York, NY, USA: Springer, 2010.
[2] M. Potamias, F. Bonchi, A. Gionis, and G. Kollios, "K-nearest neighbors in uncertain graphs," PVLDB, vol. 3, no. 1, pp. 997–1008, Sept. 2010.
[3] R. Jin, L. Liu, B. Ding, and H. Wang, "Distance-constraint reachability computation in uncertain graphs," PVLDB, vol. 4, no. 9, pp. 551–562, Jun. 2011.
[4] Y. Yuan, G. Wang, L. Chen, and H. Wang, "Efficient subgraph similarity search on large probabilistic graph databases," PVLDB, vol. 5, no. 9, pp. 800–811, May 2012.
[5] M. Hua and J. Pei, "Probabilistic path queries in road networks: Traffic uncertainty aware path selection," in Proc. 13th Int. EDBT, New York, NY, USA, 2010, pp. 347–358.
[6] W. C. Wang and L. A. Demsetz, "Model for evaluating networks under correlated uncertainty-NETCOR," J. Constr. Eng. Manage., vol. 126, no. 6, pp. 458–466, 2000.
[7] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: A review," ACM Comput. Surv., vol. 31, no. 3, pp. 264–323, Sept. 1999.
[8] G. Kollios, M. Potamias, and E. Terzi, "Clustering large probabilistic graphs," IEEE Trans. Knowl. Data Eng., vol. 25, no. 2, pp. 325–336, Feb. 2013.
[9] U. von Luxburg, "A tutorial on spectral clustering," Statist. Comput., vol. 17, no. 4, pp. 395–416, Dec. 2007.
[10] M. R. Ackermann, J. Blömer, D. Kuntze, and C. Sohler, "Analysis of agglomerative clustering," Algorithmica, vol. 69, no. 1, pp. 184–215, May 2014.
[11] G. W. Flake, R. Tarjan, and K. Tsioutsiouliklis, "Graph clustering and minimum cut trees," Internet Math., vol. 1, no. 4, pp. 385–408, 2003.
[12] F. R. Bach and M. I. Jordan, "Learning spectral clustering, with application to speech separation," J. Mach. Learn. Res., vol. 7, pp. 1963–2001, Oct. 2006.
[13] D. Yan, L. Huang, and M. I. Jordan, "Fast approximate spectral clustering," in Proc. 15th KDD, Paris, France, 2009, pp. 907–916.
[14] R. Kannan, S. Vempala, and A. Vetta, "On clusterings: Good, bad and spectral," J. ACM, vol. 51, no. 3, pp. 497–515, 2004.
[15] P. Sen and A. Deshpande, "Representing and querying correlated tuples in probabilistic databases," in Proc. ICDE, Istanbul, Turkey, 2007, pp. 596–605.
[16] X. Lian and L. Chen, "A generic framework for handling uncertain data with local correlations," PVLDB, vol. 4, no. 1, pp. 12–21, 2010.
[17] R. Shamir, R. Sharan, and D. Tsur, "Cluster graph modification problems," Discrete Appl. Math., vol. 144, no. 1–2, pp. 173–182, 2004.
[18] J. Pei, D. Jiang, and A. Zhang, "On mining cross-graph quasi-cliques," in Proc. 11th KDD, Chicago, IL, USA, 2005, pp. 228–238.
[19] D. Gibson, R. Kumar, and A. Tomkins, "Discovering large dense subgraphs in massive graphs," in Proc. 31st VLDB, Trondheim, Norway, 2005, pp. 721–732.
[20] I. S. Dhillon, Y. Guan, and B. Kulis, "Kernel k-means: Spectral clustering and normalized cuts," in Proc. 10th KDD, Seattle, WA, USA, 2004, pp. 551–556.
[21] G. M. D. Corso, "Estimating an eigenvector by the power method with a random start," SIAM J. Matrix Anal. Appl., vol. 18, no. 4, pp. 913–937, 1997.
[22] B. Mohar, "Laplace eigenvalues of graphs - A survey," Discrete Math., vol. 109, no. 1–3, pp. 171–183, 1992.
[23] M. Girvan and M. E. J. Newman, "Community structure in social and biological networks," in Proc. Natl. Acad. Sci. U.S.A., vol. 99, no. 12, pp. 7821–7826, 2002.

Yu Gu received the B.E., M.E., and Ph.D. degrees in computer software and theory from Northeastern University of China, in 2004, 2007, and 2010, respectively. Currently, he is an Associate Professor at Northeastern University, Shenyang, China. His current research interests include graph data management and spatial database management. He is a member of IEEE, the ACM and the CCF.

Chunpeng Gao received the B.E. degree in 2011 from China University of Mining and Technology. Currently, he is an M.E. candidate in computer software and theory, Northeastern University, Shenyang, China. His current research interests include uncertain data management.

Gao Cong received the Ph.D. degree in 2004 from the National University of Singapore. Currently, he is an Assistant Professor at Nanyang Technological University, Singapore. Prior to that, he worked at Aalborg University, Microsoft Research Asia, and the University of Edinburgh. His current research interests include geo-textual data management and data mining.

Ge Yu received the Ph.D. degree in computer science from Kyushu University of Japan in 1996. He is now a Professor at the Northeastern University, Shenyang, China. His current research interests include distributed and parallel database, OLAP and data warehousing, data integration, graph data management, etc. He has published more than 200 papers in refereed journals and conferences. He is a member of IEEE Computer Society, the ACM, and the CCF.