Efficient and Effective Community Search (DAMI 2015)

Nicola Barbieri (Yahoo Labs – London), Francesco Bonchi (ISI Foundation & Eurecat), Edoardo Galimberti (Deloitte), Francesco Gullo (UniCredit R&D)


Page 1: effective community search DAMI2015

Efficient and Effective Community Search

Nicola Barbieri, Yahoo Labs – London, [email protected]
Francesco Bonchi, ISI Foundation & Eurecat, [email protected]
Edoardo Galimberti, Deloitte, [email protected]
Francesco Gullo, UniCredit R&D, [email protected]

Page 2: effective community search DAMI2015

Community search & applications

§ Given a set of query vertices from a large graph, the community-search problem requires finding a cohesive (high-density), connected sub-graph that contains the given vertices.

Example of community search on the DBLP (co-authorship) dataset [1]. The query is specified by rectangular nodes.

[Figure: four different communities of Christos Papadimitriou found on the DBLP co-authorship graph: (a) Database theory, (b) Complexity theory, (c) Algorithms I, (d) Algorithms II. Rectangular nodes indicate the query nodes; elliptical nodes indicate nodes discovered by the algorithm.]

[10] C. Faloutsos, K. McCurley, and A. Tomkins. Fast discovery of connection subgraphs. In KDD, 2004.

[11] U. Feige, G. Kortsarz, and D. Peleg. The dense k-subgraph problem. Algorithmica, 29(3):410–421, 2001.

[12] G. W. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In KDD, 2000.

[13] G. W. Flake, S. Lawrence, C. L. Giles, and F. M. Coetzee. Self-organization and identification of web communities. Computer, 35(3):66–71, 2002.

[14] S. Fortunato and M. Barthelemy. Resolution limit in community detection. PNAS, 104(1), 2007.

[15] D. Gibson, R. Kumar, and A. Tomkins. Discovering large dense subgraphs in massive graphs. In VLDB, 2005.

[16] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. PNAS, 99(12):7821–7826, 2002.

[17] J. Hastad. Clique is hard to approximate within n^{1−ε}. Electronic Colloquium on Computational Complexity (ECCC), 4(38), 1997.

[18] G. Karypis and V. Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. JSC, 20(1), 1998.

[19] G. Kasneci, S. Elbassuoni, and G. Weikum. Ming: Mining informative entity relationship subgraphs. In CIKM, 2009.

[20] S. Khuller and B. Saha. On finding dense subgraphs. In ICALP, 2009.

[21] Y. Koren, S. C. North, and C. Volinsky. Measuring and extracting proximity graphs in networks. TKDD, 1(3), 2007.

[22] B. Korte and J. Vygen. Combinatorial Optimization: Theory and Algorithms (Algorithms and Combinatorics). Springer, 2007.

[23] L. Kou, G. Markowsky, and L. Berman. A fast algorithm for Steiner trees. Acta Informatica, 15(2):141–145, 1981.

[24] T. Lappas, K. Liu, and E. Terzi. Finding a team of experts in social networks. In KDD, 2009.

[25] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In WWW, 2008.

[26] M. Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69, 2003.

[27] P. Sevon, L. Eronen, P. Hintsanen, K. Kulovesi, and H. Toivonen. Link discovery in graphs derived from biological databases. In DILS, 2006.

[28] H. Tong and C. Faloutsos. Center-piece subgraphs: Problem definition and fast solutions. In KDD, 2006.

[29] S. White and P. Smyth. A spectral clustering approach to finding communities in graphs. In SDM, 2005.




Some applications:

Tag recommendation: the graph encodes associations between tags and pictures.
a) Given a new photo and a set of initial tags, provide the user with additional tags.
b) Given some tags/keywords, recommend related pictures.

Marketing: the graph encodes relationships between users and items purchased/clicked. Given an item i, to which users should it be recommended?

Discovering circles in social networks: the graph encodes friendship relationships between users. Given a user and a set of his friends, we can provide a compact visualization of their social circle.

[Slide diagram: a social network (e.g., Flickr, Tumblr, Twitter, Facebook) linking users and items, and photos to tags/features.]

§ It is a query-dependent variant of the community-detection problem!!

[Embedded paper page: Fig. 1, multiple-vertex query evaluation with varying number of query vertices |Q| (proposed GrCon method vs. GS and K-GS state-of-the-art methods) on Email, Web-ND, Web-G, Youtube, and Flickr; the figure also appears on Page 13.]

The results of this evaluation are shown in Table 3, which reports on query-vertex sets of variable size, from 2 to 32. For each graph, the (100) sets of query vertices used are sampled uniformly at random from various cores of the graphs, particularly from cores of order 1 (i.e., the whole graph), 2, 4, 8, 16, 32. The table reports results averaged over all those query sets. Since the K-GS method does not have any guarantee in terms of optimal minimum degree of the result, we also

[1] M. Sozio and A. Gionis. The community search problem and how to plan a successful cocktail party. In KDD, 2010.

Page 3: effective community search DAMI2015

Problem definition


The density measure adopted in this version of the problem is the minimum degree in the subgraph. More formally:

Problem 1 (Community Search). Given a graph G = (V,E) and a set of query vertices Q ⊆ V, find

H* = argmax_{Q ⊆ H ⊆ V, G[H] is connected} μ(H).

2.2 State of the art

Sozio and Gionis show that Problem 1 is solvable in polynomial time, precisely in O(n+m) time, by a simple greedy algorithm, which we refer to as Global Search (GS) [23]. This algorithm iteratively removes a vertex having the minimum degree along with all its incident edges, and it stops when the next vertex to be removed corresponds to a query vertex. This process generates a set of subgraphs. Among these subgraphs, the one having the maximum minimum degree, and such that all query vertices are connected, is returned.

Cui et al. [9] propose a local-search algorithm that aims at improving the efficiency of the GS algorithm. This algorithm, which we refer to as Local Search (LS), iteratively expands the neighborhood of the (unique) query vertex, until a subgraph that is guaranteed to contain an optimal solution has been built. Then, this subgraph is used as a reduced version of the input graph from which an optimal solution is extracted. The worst-case time complexity of the LS algorithm is still linear in the size of the whole input graph, but LS has been shown to achieve better efficiency than GS in practice [9]. However, a severe limitation of LS is that it works only when a single query vertex is provided as input. As we show in our experiments (e.g., in Table 1), both GS and LS tend to return very large communities. Motivated by this observation, in this work we aim at returning smaller and denser, but still optimal, communities.

Sozio and Gionis [23] also study the constrained version of CSP, where an upper bound on the size of the output community is given as input. They show that the problem is NP-hard and propose heuristics to solve it. Nevertheless, the objective-function value (i.e., the minimum degree) of a solution to size-bounded CSP may be arbitrarily far away from the value of a solution to unconstrained CSP. Moreover, the heuristics proposed in [23] are based on the maximum distance between a query vertex and every other vertex in the graph; they thus require a number of shortest-path-distance computations, which heavily affect efficiency. For these reasons, in this paper we do not assess these heuristics, and we only focus on exact methods for unconstrained CSP (as our methods are).

Finally, Cui et al. [9] also define the problem of finding, among all optimal solutions to CSP, the one of smallest size. They prove that the problem is NP-hard (for the case of a single query vertex), but they do not study it further: in particular, no algorithm for this problem is proposed.

2.3 Core decomposition

We next recall the concept of core decomposition [22], which is auxiliary to the proposed methods.

The k-core (or core of order k) of a graph G = (V,E) is defined as the maximal subgraph in which every vertex is connected to at least k other vertices within that subgraph. In the following we slightly abuse notation and identify a k-core with its vertex set, which we denote by Ck. It is easy to see that the order of a core corresponds to the minimum degree of a vertex in that core, i.e., k = μ(Ck).

The set C = {Ck}_{k=1}^{k*} forms the core decomposition of G. The core number (or core index) of a vertex u ∈ V, denoted c(u), is the highest order of a core that contains u: c(u) = max{k ∈ [0..k*] | u ∈ Ck}. All cores are nested into each other: V = C1 ⊇ C2 ⊇ ... ⊇ Ck*, and the "difference" between two consecutive cores, i.e., the set Sk = Ck \ Ck+1, is usually referred to as the k-shell. The set of all k-shells therefore forms a partition of the vertex set V. Note that a k-core (or a k-shell) does not necessarily induce a connected subgraph, and that the k-cores are not necessarily all distinct, i.e., it may happen that, for some k, Ck = Ck+1 (and the corresponding k-shell Sk = ∅).

Batagelj and Zaversnik [4] show how to efficiently compute the core decomposition of a graph G. The algorithm iteratively removes the smallest-degree vertex and sets the core number of the removed vertex accordingly. The algorithm runs in O(n+m) time.

3. PRECOMPUTATION

Here we present our proposal to solve CSP effectively and efficiently. Our approach is composed of two phases. In the preprocessing phase (which we discuss in the remainder of this section), the input graph is processed offline (una tantum), in order to compute and store information that can profitably be reused when a query is issued. The query-processing phase (which we present in Section 4) is in turn divided into two subphases: a retrieval phase (Section 4.1), where the proper information computed/stored during the preprocessing is retrieved, and an online processing phase (Section 4.2), where the retrieved information is further processed in order to obtain the ultimate answer to the query.

The preprocessing phase of our approach is based on an interesting relationship between core decomposition and CSP. This relationship consists of two main results that we formally state in the next theorem:² the highest-order (i.e., smallest-sized) core containing all query vertices and such that all query vertices are connected (i) is a solution to CSP, and (ii) contains all the solutions to CSP.

Theorem 1. Given a graph G = (V,E), its core decomposition C = {C1, ..., Ck*}, and a set of query vertices Q ⊆ V, let C*_Q be the highest-order (i.e., smallest-sized) core in C such that every q ∈ Q belongs to the same connected component of C*_Q. It holds that:

1. The connected component of C*_Q that contains Q is a solution to CSP;
2. The connected component of C*_Q that contains Q contains all possible solutions to CSP.

Proof. We prove both statements by contradiction. As for the first statement, let X denote the connected component of C*_Q that contains all query vertices and let k denote the minimum degree of a vertex in X. Assume that X is not a solution to CSP, i.e., assume that there exists another subgraph Y ≠ X of G that is connected, contains all query vertices, and has minimum degree > k. By the definition of k-core, this would imply that Y is (part of) a core of order higher than k. This contradicts the hypothesis that C*_Q is the highest-order core in C such that every q ∈ Q belongs to the same connected component of C*_Q.

² Similar results are shown in [9], but they hold only for a single query vertex.

We aim at finding a connected subgraph that contains all the query vertices Q and maximizes a quality measure of density μ(·).

As the density measure we adopt the minimum degree of the nodes in H (a minimal sketch follows below):
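A minimal, self-contained sketch of this measure, assuming a plain adjacency-set representation (the `mu` helper and the `adj` layout are illustrative, not from the paper):

```python
def mu(H, adj):
    """Minimum degree within the induced subgraph G[H].
    H: set of vertices; adj: dict mapping each vertex to its neighbor set in G."""
    return min(len(adj[u] & H) for u in H)

# Example: a triangle (1, 2, 3) plus a pendant vertex 4 attached to 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(mu({1, 2, 3}, adj))     # 2: every triangle vertex has degree 2 in G[H]
print(mu({1, 2, 3, 4}, adj))  # 1: vertex 4 has degree 1 in G[H]
```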

1.1 Our approach

Our approach is based on exploiting precomputed information, which at query time is retrieved and further processed to produce the final solution. The precomputation relies on the graph-theory notion of core decomposition of a graph [22]. The k-core of a graph is defined as a maximal subgraph in which every vertex is connected to at least k other vertices within that subgraph. The set of all k-cores of a graph G forms the core decomposition of G.

Given a graph G = (V,E) and a set of query vertices Q ⊆ V, let k be the largest integer such that a connected component of the k-core of G contains Q. It holds that all the optimal solutions to the community-search problem for the given query Q are contained in that k-core. Thanks to this property, our approach starts by retrieving from the precomputed information (i.e., the core decomposition) a subgraph H* that is itself an optimal answer to the query issued. However, as H* is also guaranteed to contain all other optimal answers, it may easily be not so good in terms of size and cohesiveness. For this reason, we further process H* in order to obtain a smaller-sized yet more cohesive subgraph that still meets the community-search requirements, i.e., it satisfies the constraints of containing all query vertices and being connected, and preserves the optimality in terms of minimum degree. Therefore, we formulate the problem of finding the smallest subgraph H*_min of H* that satisfies the community-search requirements. We show that the problem is NP-hard and we devise two bottom-up heuristics that trade off between effectiveness and efficiency.

1.2 Contributions

Our contributions are summarized as follows.

• We highlight, for the first time, a severe limitation in the state of the art for the community-search problem: the existing methods produce solutions of extremely large size, so large as to be practically useless in most contexts.

• To overcome this drawback, we devise an approach that exploits the core decomposition of the graph as precomputed information.

• We devise two different ways of organizing/storing the core decomposition: the first one requires more space, but it guarantees faster retrieval of the information needed at query time.

• For the online query evaluation, we devise two methods. The first is a greedy method that iteratively selects the vertex maximizing a score based on connection and minimum degree. The second is a two-step method that first aims at ensuring connection and afterwards selects vertices based on how well they contribute to the maximization of the minimum degree. The first method is faster, while the second produces higher-quality (i.e., more cohesive) solutions.

• Our extensive evaluation on a large variety of real-world graphs confirms that our methods produce higher-quality solutions: the extracted subgraphs are much smaller in size and denser. Moreover, our methods prove to be more efficient.

Table 1: Comparison between our method (Greedy) and the two state-of-the-art methods, Global Search (GS) [23] and Local Search (LS) [9], for the case of one-vertex queries. For each dataset we sample 100 random query vertices and report the median of the following measures: (i) size of the solution |H|; (ii) density δ = |E[H]| / C(|H|, 2), where |E[H]| denotes the number of edges induced by H; (iii) runtime t in seconds.

             Jazz    Celeg. N.  NotreDame  Google    Youtube   AS-Skitter
|H|   GS     118     140        50,572     192,968   101,772   438,984
      LS     92      119        30,264     62,047    78,228    383,218
      Ours   30      65         20         13        21,626    885
δ     GS     0.281   0.122      0.0004     0.0001    0.0002    0.00007
      LS     0.372   0.144      0.0008     0.0004    0.0004    0.00009
      Ours   0.8     0.228      0.6594     0.9190    0.0011    0.026
t     GS     2       1          32,220     133,593   669,384   333,516
      LS     6       5          3,273      6,241     31,540    114,565
      Ours   2       2          7          2         4,820     611

1.3 A preview of the results

Table 1 reports a comparison of our method to the methods by Sozio and Gionis [23] and by Cui et al. [9]. In this preview of the results, we only consider one-vertex queries (|Q| = 1) in order to be able to compare with the LS method [9], which can only deal with this type of query. Moreover, in order to avoid outliers that boost the variance in the experiments, we sample the 100 random query vertices from the fourth core; that is to say, peripheral query vertices belonging to the first three cores cannot be selected as queries. A more exhaustive assessment, with varying size of Q and varying cores for the query vertices, is provided later in Section 5, together with the characteristics of the graphs used. In the case |Q| = 1 our two proposed methods coincide (as will become clear later); therefore we report only one, which we refer to as Ours.

The results in Table 1 clearly show that our method outperforms the state-of-the-art methods in terms of cohesiveness of the solution extracted: smaller size and higher density. It is worth recalling that all three methods output optimal solutions, that is to say, they all return subgraphs with the same minimum degree. The results in Table 1 also confirm that our method is much faster.

2. PRELIMINARIES

In this section we first provide the formal statement of the problem studied; then we discuss the existing methods, to which we compare in our experimental evaluation.

2.1 Problem definition

Let G = (V,E) be an undirected graph, where V (|V| = n) is the set of vertices and E ⊆ V × V (|E| = m) is the set of edges. For a subset of vertices H ⊆ V, we denote by G[H] the subgraph of G induced by H, i.e., G[H] = (H, E[H]), where E[H] = {(u,v) ∈ E | u ∈ H, v ∈ H}. Finally, we denote by δ_H(u) the degree of u in G[H], and by μ(H) the minimum degree of a vertex in G[H], i.e., μ(H) = min_{u∈H} δ_H(u).

In this paper we study the Community Search problem (CSP), that is, finding a connected subgraph containing the query vertices and maximizing a quality measure of density, which in this specific version of the problem [23, 9] is the minimum degree in the subgraph (see Problem 1 above).


This choice allows us to exploit an optimal greedy algorithm, Global Search (GS) [1] (a sketch follows below):
• Init: H_0 = V.
• H_i: iteratively remove from the current subgraph the vertex with the minimum degree.
• Stopping criterion: the next candidate to be removed is a query vertex.
• Output: the process generates a series of sub-graphs; among these, select the one with the maximum minimum degree in which the query vertices are connected.

Intuition: by maximizing the degree of the least connected node we increase the overall density of the sub-graph!!
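The following is a hedged Python sketch of the GS peeling loop just described, using networkx; the function name is illustrative, and for clarity it recomputes connected components at every step rather than using the incremental bookkeeping that yields the paper's O(n+m) bound:

```python
import networkx as nx

def global_search(G, Q):
    """Greedy peeling for community search (sketch of GS [1]).
    Returns the connected subgraph containing all of Q with the largest
    minimum degree among the intermediate subgraphs, or None."""
    H = G.copy()
    best, best_mu = None, -1
    while True:
        # Among the current components, keep the one containing all query vertices.
        comps = [c for c in nx.connected_components(H) if set(Q) <= c]
        if comps:
            cand = H.subgraph(comps[0])
            m = min(d for _, d in cand.degree())
            if m > best_mu:
                best, best_mu = cand.copy(), m
        # Peel the minimum-degree vertex; stop when it is a query vertex.
        u = min(H.nodes, key=H.degree)
        if u in Q:
            return best
        H.remove_node(u)
```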

[1] M. Sozio and A. Gionis. The community search problem and how to plan a successful cocktail party. In KDD, 2010.

Page 4: effective community search DAMI2015

Size-bounded Community Search


The size-bounded version of community search includes an upper bound K on the size of the output community.

This version of CSP is NP-hard.

K-Global Search (K-GS) [1] is a heuristic approach for addressing CSP with a soft constraint on the desired size.

K-GS calls GS as a subroutine, enforcing each time a tighter distance bound until the size constraint is satisfied (or the query nodes become disconnected); a schematic sketch of the outer loop follows the lists below.

Design principle:
• Consider a new constraint on the maximum distance between a query vertex and each other vertex in the solution.
• A tighter distance constraint implies smaller communities.

Drawbacks:
• Shortest-path distance computation.
• K-GS does not guarantee optimality with respect to the maximum minimum degree.
• The upper bound on the size is a soft constraint.
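A schematic sketch of the K-GS outer loop, reusing the `global_search` sketch from Page 3; the starting distance bound and the filtering details are assumptions, not the authors' exact procedure:

```python
import networkx as nx

def k_global_search(G, Q, K):
    """Repeatedly tighten a distance bound d and re-run GS on the filtered
    graph until the (soft) size bound K is met or Q gets disconnected."""
    dist = {q: nx.single_source_shortest_path_length(G, q) for q in Q}
    d = max(max(dq.values()) for dq in dist.values())  # loosest useful bound
    best = None
    while d >= 1:
        # Keep only vertices within distance d of every query vertex.
        keep = [v for v in G if all(v in dq and dq[v] <= d for dq in dist.values())]
        sol = global_search(G.subgraph(keep), Q)
        if sol is None:
            break          # query vertices disconnected: stop
        best = sol
        if len(best) <= K:
            break          # soft size constraint satisfied
        d -= 1             # tighten the distance bound
    return best
```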

[1] M. Sozio and A. Gionis. The community search problem and how to plan a successful cocktail party. In KDD, 2010.

Page 5: effective community search DAMI2015

Local community search


The Local Search algorithm (LS) [2] improves the efficiency of GS.

[2] W. Cui, Y. Xiao, H. Wang, and W. Wang. Local search of communities in large graphs. In SIGMOD, 2014.

§ Iteratively expands the neighborhood of the (unique) query vertex, to retrieve a sub-graph that is guaranteed to contain an optimal solution.

§ This sub-graph is used as a reduced version of the input graph to retrieve the optimal solution.

LS has been shown to achieve better efficiency than GS in practice.

Main limitation: it works only when a single query vertex is provided as input.

Page 6: effective community search DAMI2015

Our contribution


Efficiency
We show how it is possible to pre-compute and organize information about the structure of the graph so as to address CSP queries in an efficient way.

Quality
We explicitly seek compact solutions to CSP queries. Hence, we study and address the problem of finding the smallest sub-graph that is a solution to CSP.

Page 7: effective community search DAMI2015

Addressing efficiency via k-core decomposition:


§ The k-core of a graph is defined as the maximal sub-graph in which every vertex is connected to at least k other vertices.
§ All cores are nested into each other: V ⊇ C1 ⊇ C2 ⊇ ... ⊇ Ck*.
§ The k-shell is defined as Sk = Ck \ Ck+1.
§ The core index c(u) of a vertex u is the highest order of a core that contains u.
§ The core decomposition can be computed in linear time [3]: iteratively remove the smallest-degree vertex and set its core number accordingly (a sketch follows below).
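A hedged sketch of this peeling procedure (names are illustrative); for brevity it uses a lazy heap, which costs O(m log n) instead of the O(n+m) achievable with the bucket queue of [3]:

```python
import heapq

def core_decomposition(adj):
    """adj: dict vertex -> set of neighbors. Returns dict vertex -> core number c(u)."""
    degree = {u: len(nbrs) for u, nbrs in adj.items()}
    heap = [(d, u) for u, d in degree.items()]
    heapq.heapify(heap)
    core, removed, k = {}, set(), 0
    while heap:
        d, u = heapq.heappop(heap)
        if u in removed or d != degree[u]:
            continue                  # stale heap entry
        k = max(k, degree[u])         # core number is non-decreasing while peeling
        core[u] = k
        removed.add(u)
        for v in adj[u]:
            if v not in removed:
                degree[v] -= 1
                heapq.heappush(heap, (degree[v], v))
    return core
```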

2.3 Core decomposition

The k-core (or core of order k) of a graph G = (V,E) is defined as the maximal subgraph in which every vertex is connected to at least k other vertices within that subgraph. In the following we slightly abuse notation and identify a k-core with its vertex set, which we denote by Ck. It is easy to see that the order of a core corresponds to the minimum degree of a vertex in that core, i.e., k = μ(Ck).

The set C = {Ck}_{k=1}^{k*} forms the core decomposition of G [16]. The core number (or core index) of a vertex u ∈ V, denoted c(u), is the highest order of a core that contains u: c(u) = max{k ∈ [0..k*] | u ∈ Ck}. All cores are nested into each other: V = C1 ⊇ C2 ⊇ ... ⊇ Ck*, and the "difference" between two consecutive cores, i.e., the set Sk = Ck \ Ck+1, is usually referred to as the k-shell. The set of all k-shells therefore forms a partition of the vertex set V. Note that a k-core (or a k-shell) does not necessarily induce a connected subgraph, and that the k-cores are not necessarily all distinct, i.e., it may happen that, for some k, Ck = Ck+1 (and the corresponding k-shell Sk = ∅).

Batagelj and Zaversnik [1] show how to compute the core decomposition of a graph G in linear time. The algorithm iteratively removes the smallest-degree vertex and sets the core number of the removed vertex accordingly.

3 Precomputation

Here we present our proposal to solve CSP effectively and efficiently. Our approach is composed of two phases. In the preprocessing phase (which we discuss in the remainder of this section), the input graph is processed offline (una tantum), in order to compute and store information that can profitably be reused when a query is issued. The query-processing phase (which we present in Section 4) is in turn divided into two subphases: a retrieval phase (Section 4.1), where the proper information computed/stored during the preprocessing is retrieved, and an online processing phase (Section 4.2), where the retrieved information is further processed in order to obtain the ultimate answer to the query.

The preprocessing phase of our approach is based on an interesting relationship between core decomposition and CSP. This relationship consists of two main results: the highest-order (i.e., smallest-sized) core containing all query vertices and such that all query vertices are connected (i) is a solution to CSP, and (ii) contains all the solutions to CSP. Similar results are also shown in [5], but they hold only for the special case where queries are composed of only one vertex. In the following theorem we prove these results for the general case where one can have multiple query vertices.

Theorem 1. Given a graph G = (V,E), its core decomposition C = {C1, ..., Ck*}, and query vertices Q ⊆ V, let C*_Q be the highest-order (i.e., smallest-sized) core in C such that every q ∈ Q belongs to the same connected component of C*_Q. It holds that:

1. The connected component of C*_Q that contains Q is a solution to CSP;
2. The connected component of C*_Q containing Q contains all solutions to CSP.

Proof. We prove both statements by contradiction. As for the first statement, let X denote the connected component of C*_Q that contains all query vertices and let k denote the minimum degree of a vertex in X. Assume that X is not a solution to CSP, i.e., assume that there exists another subgraph Y ≠ X of G that is connected, contains all query vertices, and has minimum degree > k. By the definition of k-core, this would imply that Y is (part of) a core of order higher than k.


[3] V. Batagelj and M. Zaversnik. Fast algorithms for determining (generalized) core groups in social networks. Advances in Data Analysis and Classification.

Page 8: effective community search DAMI2015

Retrieving solutions to CSP queries


How to organize the k-core decomposition to efficiently retrieve solutions to CSP queries?

CoreStruct
• Stores the k-cores and the connected components within each k-core.
• Replicates information.

ShellStruct
• Stores the k-shells in a hierarchical way.
• Each node at level k represents a connected component within the k-shell.
• Links between nodes at different levels represent connected components that are merged.

                Building time   Space       Retrieval
CoreStruct      O(h·n + m)      O(h·n)      O(|Q| · log h)
ShellStruct     O(h·n + m)      O(h·M)      O(|Q| · h + n)

where h is the number of distinct k-cores and M is the maximum number of connected components in a core.
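A hedged sketch of CoreStruct retrieval; the data layout (`cores[i]`: a dict mapping each vertex of the i-th stored core to its connected-component id, cores ordered by increasing order) is an assumption. Binary search is valid because cores are nested: if Q lies in one component of some core, it does so in every lower-order core as well.

```python
def retrieve_corestruct(cores, Q):
    """Return the index of the highest-order core in which all of Q lie in
    the same connected component, or None. O(|Q| log h) lookups."""
    def feasible(i):
        ids = {cores[i].get(q) for q in Q}      # component id of each query vertex
        return None not in ids and len(ids) == 1
    lo, hi, best = 0, len(cores) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            best, lo = mid, mid + 1             # try a higher-order (smaller) core
        else:
            hi = mid - 1
    return best
```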

Page 9: effective community search DAMI2015

The Minimum Community Search problem


§ Our goal is to further refine the solution H* provided by the k-core structure to extract a sub-graph that is still optimal and as small as possible.

Algorithm 2 RetrievalShellStruct
Input: information from ShellStruct; a set of query vertices Q.
Output: a set of vertices H* ⊇ Q.
1: k ← max{c(u) | u ∈ Q}
2: H ← {connected components of core Ck containing a vertex from Q}
3: H* ← ∪_{H ∈ H} vertices(H)
4: while |H| ≠ 1 ∧ Q ≠ ∅ do
5:   Q⁻ ← {u ∈ Q | u ∈ Sk}
6:   H′ ← {parent(H) | H ∈ H} ∪ {connected components at the current level containing a vertex from Q⁻}
7:   H* ← H* ∪ ∪_{H ∈ H′} vertices(H)
8:   Q ← Q \ Q⁻, H ← H′, k ← k − 1

The details of the procedure are reported as Algorithm 2. In this case, we need to visit all the levels of the tree, starting from level cmax(Q) = max{c(u) | u ∈ Q}, until a connected component containing all query vertices has been encountered. Since binary search is no longer possible and the content of H* needs to be reconstructed, the time complexity of the retrieval phase from ShellStruct is O(|Q|·h + |H*|), where |H*| is O(n). The time complexity can be lowered to O(|Q| + |H*|) by using a suitable lowest-common-ancestor (LCA) data structure to store the connected-component tree (e.g., by preprocessing the connected-component tree with the well-known Tarjan's offline lowest-common-ancestors algorithm, which allows constant-time online retrieval [8]). However, as |H*| is typically the dominant term in the overall time complexity, the efficiency gain would not be really significant in practice.

As anticipated above, CoreStruct compensates for its larger storage space with a faster retrieval phase, while ShellStruct requires less space at the price of slower retrieval.

4.2 Online processing

4.2.1 The Minimum Community Search problem

The retrieval phase described above finds the set H* containing all the solutions to CSP for a given query Q. The goal of the online processing phase is to further refine H* so as to extract a solution that is as small as possible. This corresponds to solving the following problem:

Problem 2 (MINIMUM COMMUNITY SEARCH (MIN-CSP)). Given a graph G = (V,E) and a set of query vertices Q ⊆ V, let H* ⊆ V be the subgraph of G containing all the solutions to CSP for Q. Find

H*_min = argmin_{Q ⊆ H ⊆ H*: G[H] is connected, μ(H) ≥ μ(H*)} |H|.

The above problem has been mentioned by Cui et al. in their work [5], but they do not study it further (no algorithm is proposed). They just provide a proof of NP-hardness by a reduction from MAX CLIQUE. Their proof, however, holds only for the special case |Q| = 1 and, as such, it does not exclude that the problem can admit polynomial-time solutions for the more general case |Q| > 1. In the following we show that this is not the case: the problem remains NP-hard even for |Q| > 1. We formally state this result in the next Theorem 2, by resorting to the well-known STEINER TREE problem: given a graph G = (V,E) and a set of terminal vertices T ⊆ V, find a connected subgraph G′ of G containing all terminal vertices and having the minimum number of edges. This problem is NP-hard and the fastest existing algorithm with provable guarantees corresponds to the


§ MIN-CSP is NP-hard

Page 10: effective community search DAMI2015

Heuristic approaches to MIN-CSP


Greedy:
• Start from the query vertices and add nodes by applying greedy selection on a score p(u) = ⟨p′(u), p″(u)⟩:
  • p′(u) is the number of distinct connected components that will be merged by adding u to the current solution;
  • p″(u) focuses on satisfying the optimality of the minimum degree:
    • the number of neighbors of u in H*_min that will satisfy the optimal degree;
    • how far u is from satisfying the optimal degree.

Thus, the minimum-degree score p″(u) can be defined as:

p″(u) = p″₊(u) − p″₋(u),  (2)

where p″₊(u) (gain effect) is the number of neighbors of u in G[H*_min] having degree still less than μ*:

p″₊(u) = |{v ∈ neigh(u, H*_min) : |neigh(v, H*_min)| < μ*}|,  (3)

and p″₋(u) (penalty effect) is the number of u's neighbors required to be added to H*_min for its degree to reach at least μ*:

p″₋(u) = max{0, μ* − |neigh(u, H*_min)|}.  (4)

The ultimate score exploited for the greedy selection is defined as

p(u) = ⟨p′(u), p″(u)⟩  (5)

and the queue for selecting the next node follows an inverse lexicographical order (u precedes w if p′(u) > p′(w), or if p′(u) = p′(w) and p″(u) > p″(w)). This way, we favor the connectivity constraint over the minimum-degree score.

Algorithm. The pseudocode of Greedy is reported as Algorithm 3. The algorithm takes as input a graph G, a set of query vertices Q, and the set H* containing all the solutions to CSP for Q, and returns a smallest-sized, yet optimal, solution H*_min ⊆ H*. The algorithm keeps (Line 1): the current solution H*_min, the set A of connected components of G[H*_min], the minimum degree μ* that the ultimate solution is required to have, and a priority queue P that stores the vertices u ∈ H* \ H*_min with priority equal to the score p(u) reported in Equation (5). At the beginning, all query vertices are added to the queue with priority +∞ (Line 2), so as to implicitly initialize H*_min as equal to the query vertex set Q.

The main cycle of the algorithm (Lines 3–38) runs until H*_min has become a valid solution to CSP for the query vertices Q, i.e., until there is a single connected component in A and the minimum degree of G[H*_min] is equal to μ*. At each step of the cycle, a vertex u is extracted from the queue and added to H*_min (Lines 4–6). Then, various operations need to be performed on the neighbors of u in H*.

First of all (Lines 7–10), all of u's neighbors v that are not in the queue are added to it (by first computing their priority p(v)). Afterwards (Lines 11–23), the connected components in A are updated: the connected components of each of u's neighbors are collected in an auxiliary set A′ (Line 20), and they are eventually merged into a unique connected component (Lines 22–23). At the same time, the degree α(v) of every neighbor v of u in H*_min is increased by one (Line 13), and, if α(v) becomes equal to the desired μ* value, the minimum-degree score p″(w) of every neighbor w of v in the queue is updated accordingly (Lines 13–19).

Given that the connected components in A (may) have changed, we also need to recompute the connection score p′ of every neighbor of a neighbor of u that belongs to one of the merged connected components, i.e., one of those in A′ (Lines 24–37). In particular, for every such vertex w, the connection score p′(w) needs to be (i) decreased by the number of connected components in A′ containing a neighbor of w (Lines 24–28), and (ii) increased by one because of the new connected component A* created (Line 30). Additionally, for every such vertex w that has u as a neighbor, the minimum-degree score p″(w) is also updated (Lines 31–35).

Running time. We now analyze the time complexity of Greedy. Let H̃ denote the set of vertices given by the union of the vertices in the output solution H*_min and the neighbors
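A hedged sketch of the score computation from Equations (2)–(5); helper names and data layout are illustrative, not the paper's Algorithm 3:

```python
def score(u, adj, Hmin, comp, mu_star):
    """adj: vertex -> neighbor set (within H*); Hmin: current partial solution;
    comp: vertex -> connected-component id of G[H*_min]; mu_star: target min degree.
    Returns (p'(u), p''(u)); higher wins under inverse lexicographic order."""
    nbrs_in = adj[u] & Hmin
    p1 = len({comp[v] for v in nbrs_in})          # components merged by adding u
    gain = sum(1 for v in nbrs_in
               if len(adj[v] & Hmin) < mu_star)   # p''+(u), Eq. (3)
    penalty = max(0, mu_star - len(nbrs_in))      # p''-(u), Eq. (4)
    return (p1, gain - penalty)                   # p(u) = <p'(u), p''(u)>, Eqs. (2), (5)

# Greedy selection: pick the candidate maximizing the pair, thereby favoring
# connectivity (p') over the minimum-degree score (p''):
# u = max(candidates, key=lambda u: score(u, adj, Hmin, comp, mu_star))
```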


Given a CSP query Q, retrieve the solution H* from the k-core structure. Start with H*_min = Q and add vertices until H*_min is optimal.

Connection:
• Build a Steiner tree on Q by exploring H*.
• Iteratively add more nodes according to the score p″(u) until all vertices satisfy the condition on the degree (see the sketch below).
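A hedged sketch of the Connection step; it uses the approximate Steiner tree available in networkx (based on the Kou–Markowsky–Berman algorithm), with a schematic augmentation loop and `p2` standing for the p″(u) score above:

```python
from networkx.algorithms.approximation import steiner_tree

def connection(Hstar, Q, mu_star, p2):
    """Hstar: networkx graph holding all optimal solutions (H*); p2(v, H):
    the p''(v) score w.r.t. the current set H. Returns a small connected
    vertex set containing Q whose minimum degree reaches mu_star."""
    H = set(steiner_tree(Hstar, list(Q)).nodes)   # first, connect the query vertices
    # Then grow H until every vertex reaches the optimal minimum degree mu*.
    while min(sum(1 for v in Hstar[u] if v in H) for u in H) < mu_star:
        frontier = {v for u in H for v in Hstar[u]} - H
        if not frontier:
            break   # cannot happen inside H*, which is itself an optimal solution
        H.add(max(frontier, key=lambda v: p2(v, H)))
    return H
```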

GreedyConnection:
• Run Greedy and Connection sequentially.

Page 11: effective community search DAMI2015

Experimental evaluation


§ We compare GS, K-GS, LS and GrCon on real-world datasets.

§ The evaluation focuses on three dimensions: time, size, and density.

§ We sample 100 sets of query vertices at random and average the results.

Table 2: Characteristics of the selected datasets: number of vertices (n), number of edges (m), number of distinct cores (h), graph type.

                 n           m            h     type
Email            33,696      180,811      43    Communication network
Web-NotreDame    325,729     1,090,108    41    Web graph
Web-Google       791,822     3,815,994    40    Web graph
Youtube          1,134,889   2,987,623    51    Social network
Flickr           1,624,992   15,476,835   567   Social network

GreedyConnection achieves the best tradeoff between efficiency and accuracy. Indeed, based on the above argument about the small minimum degree exhibited by real-world communities, running only the Greedy step would be very efficient but not that accurate (it produces larger communities). On the other hand, running only the Connection step, without profitably exploiting the input reduction provided by Greedy, would achieve high quality (smaller communities) but poor efficiency.

5 Experiments

In this section we empirically evaluate the performance of the proposed method GreedyConnection (for short, GrCon) and compare it to the state-of-the-art methods described in Section 2.2, i.e., GS and K-GS (for multiple-vertex queries), and LS (for one-vertex queries). We consider LS only for the special case of a single query vertex, since that method cannot handle multiple-vertex queries.

We use real-world, publicly available graphs whose main characteristics are reported in Table 2.³ For each graph we take its largest connected component.

We aim at assessing both efficiency and quality of the proposed method and its competitors. As far as quality is concerned, we evaluate the communities H produced by each method in terms of size |H|, density |E[H]| / C(|H|, 2), and query-biased density [19]. Density is a well-established measure for assessing the quality of a community, and it represents a desirable "side effect" that a community identified by optimizing any other criterion should in principle exhibit. Query-biased density is a quality measure that has recently been introduced to effectively assess the goodness of a community without suffering from the so-called free-rider effect (vertices attached by weak links to a strong group) [19].

The quality of our GreedyConnection is independent of the preprocessing strategy used (CoreStruct or ShellStruct). Preprocessing only affects efficiency: unless otherwise specified, the runtimes reported for our GreedyConnection always comprise the retrieval time of the slower of the two preprocessing methods (i.e., ShellStruct).

For each set of experiments, we sample 100 (sets of) query vertices at random (from either the whole graph or different parts of the graph; more details on this later) and, for each measure, we report the average over this set of queries, along with the statistical significance of the difference among the various methods. For the latter we report p-values computed according to the well-established Wilcoxon signed-rank test [6].
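A hedged example of the significance test just mentioned, via SciPy's Wilcoxon signed-rank test over paired per-query measurements (the numbers are purely illustrative):

```python
from scipy.stats import wilcoxon

grcon_sizes = [30, 18, 42, 25, 61, 33]    # illustrative paired per-query sizes
ls_sizes    = [92, 75, 130, 88, 210, 95]
stat, p_value = wilcoxon(grcon_sizes, ls_sizes)
print(f"Wilcoxon signed-rank p-value: {p_value:.4f}")
```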

All methods are implemented in Java 1.7, and experiments are run on a machine with an Intel Xeon CPU at 2.2GHz and 96GB RAM, which we limit to 30GB. The source code is available at http://bit.ly/1b6WbSQ.

³ Flickr is available at http://socialnetworks.mpi-sws.org/datasets, while all other graphs are available at https://snap.stanford.edu/data.


Page 12: effective community search DAMI2015

Preprocessing-phase & Retrieval


[Fig. 4: One-vertex query-evaluation cumulative times (preprocessing time + query time) of the proposed GrCon vs. the state-of-the-art LS method, varying the number of queries (100 to 5000), on Email, Web-ND, Web-G, Youtube, and Flickr. Core index of the query vertex c(v) = 8. Times are shown in logarithmic scale.]

Table 5: Building time and storage space of the proposed preprocessing methods.

           building time (s)            space (MB)
           CoreStruct   ShellStruct     CoreStruct   ShellStruct
Email      1            3               7            4
Web-ND     12           574             41           28
Web-G      43           3,155           147          90
Youtube    201          5,946           130          100
Flickr     1,765        15,865          564          347

a simplified version of our GreedyConnection method which comprises only the first of the two steps described in Section 4.2 (i.e., only the Greedy step).

The results of this evaluation are reported in Figure 3 (query vertices sampled uniformly at random from different cores of the graph) and Table 4 (summary on query vertices sampled from core 8, along with statistical-significance evaluation). GrCon consistently outperforms LS in all cases, in terms of both accuracy and efficiency. Indeed, GrCon is always at least one or two orders of magnitude faster than LS, and in some cases the gap is even three orders of magnitude. At the same time, the solutions produced by GrCon have much smaller size and larger density than LS: in most cases the size gap is one order of magnitude, while the density exhibited by GrCon's solutions is on average 73% larger than LS's.

Table 4 shows that the superiority of our GrCon is always statistically significant, and it is confirmed in terms of the query-biased density measure too.

Finally, in Figure 4 we show cumulative times for various numbers of queries. Times of our method here include the time needed for building the ShellStruct structure.

5.3 Preprocessing-phase evaluation

We now focus on the preprocessing methods CoreStruct and ShellStruct. Table 5 contains building time and storage space, while Table 6 shows running times of the retrieval phases. These results confirm what was theoretically stated in Section 3: ShellStruct requires less storage space but more building time than CoreStruct. However, for both methods the retrieval time is always < 1 ms, thus being negligible compared to the online-processing times.


Table 6: Retrieval time (µs) of the proposed preprocessing methods with varying |Q|. For each |Q|, the two columns report CoreStruct and ShellStruct.

           |Q|=1       |Q|=2       |Q|=4       |Q|=8       |Q|=16      |Q|=32
           Core Shell  Core Shell  Core Shell  Core Shell  Core Shell  Core Shell
Email      14   17     18   48     26   115    41   284    69   361    121  153
Web-ND     17   16     18   50     26   106    42   232    76   227    144  420
Web-G      36   19     20   69     30   169    49   242    115  311    197  350
Youtube    13   17     19   39     27   81     47   198    77   335    137  338
Flickr     13   18     18   57     26   128    40   275    68   485    123  129

6 Related Work

Community search. Community search is the problem of finding a good community for a set of query vertices. A number of community-quality measures have been proposed, based on notions such as random walk [18,11], α-adjacency-γ-quasi-k-clique [4], k-truss [10], influence [14], or query-biased density [19]. Sozio and Gionis [17] are the first to provide a combinatorial-optimization formulation of the community-search problem; in particular, they ask for a connected subgraph that contains all query vertices and maximizes the minimum degree. This formulation of community search has been further studied in the literature [5] and is the specific formulation we focus on in this work. All the main related research on this topic has already been discussed in more detail in Section 2.2.

Dense-subgraph discovery. Finding dense subgraphs is a long-standing problem in data mining [13]. This problem aims at finding a subgraph of a given input graph that maximizes some notion of density. A density notion widely employed in the literature is the average degree. Finding a subgraph that maximizes the average degree can be solved in polynomial time [9], and approximated within a factor of 1/2 in linear time [3]. Dense-subgraph discovery differs from community search as it does not permit specifying any query vertices to be contained in the output subgraph/community.

Community detection. Community detection and the related problem of graph partitioning have been extensively studied in the literature. The goal of such problems is to partition the vertices of a graph into communities so as to maximize edges between vertices in the same community and minimize edges between vertices in different communities. Representative approaches to community detection include spectral, minimum-cut, and modularity-based methods, and label propagation [20]; in fact, the literature on the topic is so extensive that we do not attempt a proper review here; a comprehensive survey can be found in [7]. The goal of community detection is to find a global community structure for the whole input graph; it thus differs from community search, whose objective is instead to find a community for a set of input query vertices in an online fashion.

7 Conclusions

Given a graph and a set of query vertices, the community-search problem requires finding a cohesive subgraph that contains the query vertices. A well-established criterion to measure the quality of the community to be discovered is the minimum degree of the output subgraph. A number of methods have been proposed for this specific formulation of community search, but they all have shortcomings in terms of efficiency, as they need to visit (a large part of) the whole input graph; effectiveness, as they tend to find large and not really cohesive communities; or generality, as they cannot handle multiple query vertices, do not find communities with optimal minimum degree, and/or require input parameters.


Page 13: effective community search DAMI2015

Evaluation: multiple-vertex queries


[Fig. 1: Multiple-vertex query evaluation with varying number of query vertices |Q| (2 to 32): proposed GrCon method vs. GS and K-GS state-of-the-art methods, on Email, Web-ND, Web-G, Youtube, and Flickr; panels report time (ms), size, and density. Results are shown in logarithmic scale.]

As a general trend exhibited by all methods, the size (resp. density) of the solutions produced increases (resp. decreases) as the number of query vertices increases. This is expected: the more query vertices there are, the larger the number of other vertices needed to make them connected, and, hence, the larger the size and the smaller the density of the resulting solution. Also, the running time increases with the number of query vertices only for GrCon and K-GS, while it is roughly constant for GS. This is in accordance with the time complexities of the methods.


Page 14: effective community search DAMI2015

Evaluation: one-vertex queries


[Fig. 3: One-vertex query evaluation with varying core index c(v) of the query vertex v (1 to 32): proposed GrCon method vs. the state-of-the-art LS method, on Email, Web-ND, Web-G, Youtube, and Flickr; panels report time (ms), size, and density. Results are shown in logarithmic scale.]

5.2 One-vertex queries

For the special case of one-vertex queries we compare the proposed GreedyConnection against LS, which is the state-of-the-art method for this special type of query and has been recognized as more efficient and effective than GS and K-GS [5]. Note that for this special case no connection constraint among query vertices is required; thus here we use a simplified version of our GreedyConnection method comprising only the Greedy step (cf. Page 12).


Page 15: effective community search DAMI2015

Summary


Future work

§ The Minimum Wiener Connector problem, SIGMOD 2015.
§ The CSP problem on heterogeneous graphs, with constraints on admissible labels of nodes/edges.

Source code and datasets are available online!!

Page 16: effective community search DAMI2015

Thank you!