Optimizing Generalized PageRank Methods for Seed-Expansion Community Detection

Pan Li, UIUC, [email protected]
Eli Chien, UIUC, [email protected]
Olgica Milenkovic, UIUC, [email protected]

Abstract

Landing probabilities (LP) of random walks (RW) over graphs encode rich information regarding graph topology. Generalized PageRanks (GPR), which represent weighted sums of LPs of RWs, utilize the discriminative power of LP features to enable many graph-based learning studies. Previous work in the area has mostly focused on evaluating suitable weights for GPRs, and only a few studies so far have attempted to derive the optimal weights of GPRs for a given application. We take a fundamental step forward in this direction by using random graph models to better our understanding of the behavior of GPRs. In this context, we provide a rigorous non-asymptotic analysis for the convergence of LPs and GPRs to their mean-field values on edge-independent random graphs. Although our theoretical results apply to many problem settings, we focus on the task of seed-expansion community detection over stochastic block models. There, we find that the predictive power of LPs decreases significantly slower than previously reported based on asymptotic findings. Given this result, we propose a new GPR, termed Inverse PR (IPR), with LP weights that increase for the initial few steps of the walks. Extensive experiments on both synthetic and real, large-scale networks illustrate the superiority of IPR compared to other GPRs for seeded community detection.¹

1 Introduction

PageRank (PR), an algorithm originally proposed by Page et al. for ranking web pages [1], has found many successful applications, including community detection [2, 3], link prediction [4] and recommender system design [5, 6]. The PR algorithm involves computing the stationary distribution of a Markov process that, starting from a seed vertex, either performs one step of a random walk (RW) to a neighbor of the current vertex or jumps to another vertex according to a predetermined probability distribution. The RW aids in capturing topological information about the graph, while the jump probabilities incorporate modeling preferences [7]. A proper selection of the RW probabilities ensures that the stationary distribution induces an ordering of the vertices that may be used to determine the "relevance" of vertices or the structure of their neighborhoods.

Despite the wide utility of PR [7, 8], recent work in the field has shifted towards investigating various generalizations of PR. Generalized PR (GPR) values enable more accurate characterizations of vertex distances and similarities, and hence lead to improved performance of various graph learning techniques [9]. GPR methods make use of arbitrarily weighted linear combinations of landing probabilities (LP) of RWs of different lengths, defined as follows. Given a seed vertex and another arbitrary vertex v in the graph, the k-step LP of v, x_v^{(k)}, equals the probability that a RW starting from the seed lands at v after k steps; the GPR value for vertex v is defined as ∑_{k=0}^∞ γ_k x_v^{(k)}, for some weight sequence {γ_k}_{k≥0}. Certain GPR representations, such as personalized PR (PPR) [10] or heat-kernel PR (HPR) [11], are associated with weight sequences chosen in a heuristic manner:

¹Pan Li and Eli Chien contributed equally to this work.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.


PPR uses traditional PR weights, γ_k = (1 − α)α^k, for some α ∈ (0, 1), and a seed set that captures locality constraints. On the other hand, HPR uses weights of the form γ_k = (h^k / k!) e^{−h}, for some h > 0. A question that naturally arises is what are the provably near-optimal or optimal weights for a particular graph-based learning task.
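For concreteness, the two canonical weight sequences are simple to generate numerically. The following is a minimal Python/NumPy sketch (the truncation length K is our own implementation choice, not part of the definitions):

```python
import numpy as np

def ppr_weights(alpha, K):
    """PPR weights gamma_k = (1 - alpha) * alpha^k, for k = 0, ..., K-1."""
    return np.array([(1 - alpha) * alpha**k for k in range(K)])

def hpr_weights(h, K):
    """HPR weights gamma_k = (h^k / k!) * exp(-h), computed recursively to avoid overflow."""
    w = np.empty(K)
    w[0] = np.exp(-h)
    for k in range(1, K):
        w[k] = w[k - 1] * h / k      # h^k / k! = (h^{k-1} / (k-1)!) * (h / k)
    return w

# Both sequences sum to (approximately) 1 for large enough K, so the resulting
# GPR remains a probability distribution whenever the initial distribution is one.
print(ppr_weights(0.95, 50).sum(), hpr_weights(10.0, 50).sum())
```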

Clearly, there is no universal approach for addressing this issue, and prior work has mostly reported comparative analytic or empirical studies for selected GPRs. As an example, for community detection based on seed-expansion (SE), where the goal is to identify a densely linked component of the graph that contains a set of a priori defined seed vertices, Chung [12] proved that the HPR method produces communities with better conductance values than PPR [13]. Kloster and Gleich [14] confirmed this finding via extensive experiments over real world networks. Avron and Horesh [15] leveraged time-dependent PRs, a convolutional form of HPR and PPR [16], and showed that this new PR can outperform HPR on a number of real network datasets. Another line of work considered adaptively learning the GPR weights, given access to sufficiently many both within-community and out-of-community vertex labels [17, 18]. Related studies were also conducted in other application domains, such as web-ranking [8] and recommender system design [19].

Recently, Kloumann et al. [20] took a fresh look at the GPR-based seed-expansion community detection problem. They viewed LPs of different steps as features relevant to membership in the community of interest, and the GPRs as scores produced by a linear classifier that digests these features. A key observation in this setting is that the GPR weights have to be chosen with respect to the informativeness of these features. Based on the characterization of the mean-field values of the LPs over a modified stochastic block model (SBM) [21], Kloumann et al. [20] determined that PPR with a proper choice of the parameter α corresponds to the optimal classifier if only the first-order moments are available. Unfortunately, as the variance of the LPs was ignored, the performance of PPR was shown to be suboptimal even for synthetic graphs obeying the generative modeling assumptions used in [20].

We report substantial improvements on the described line of work by characterizing the non-asymptotic behavior of the LPs over random graphs. More precisely, we derive non-asymptotic conditions for the LPs to converge to their mean-field values. Our findings indicate that in the non-asymptotic setting, the discriminative power of k-step LPs does not necessarily deteriorate as k increases; this follows since our bounds on the variance decay even faster than the distance between the means of LPs within the same and across two different communities. We leverage this finding and propose new weights that suitably increase with the length of RWs for small values of k. This choice differs significantly from the geometrically decaying weights used in PPR, as suggested by [20].

The reported results may also provide useful means for improving graph neural networks (GNN) [22, 23, 24] and their variants [25, 26] for vertex classification tasks. Currently, the typical number of layers in graph neural networks is 2-3, as such a choice offers the best empirical performance [24, 25]. More layers may over-smooth vertex features and thus provide worse results. However, in this setting, long paths in the graphs may not be properly utilized, even though our work demonstrates that these paths may have strong discriminative power for community detection. Hence, a natural research direction regarding GNNs is to investigate how to leverage long paths over graphs without over-smoothing the vertex features. Concurrent to this work, several empirical studies were performed to address the same problem. The work in [27, 28] decoupled the non-linear transformation of features from the PR propagation over graphs, while [29] used GNNs over graphs that are transformed based on GPRs.

Our contributions are multifold. First, we derive the first non-asymptotic bound on the distance between LP vectors and their mean-field values over random graphs. This bound allows us to better our understanding of a class of GPR-based community detection approaches. For example, it explains why PPR with a parameter α ≈ 1 often achieves good community detection performance [30], and why HPR statistically outperforms PPR for community detection, which matches the combinatorial demonstration proposed previously [12]. Second, we describe the first non-asymptotic characterization of GPRs with respect to their mean-field values over edge-independent random graphs. The obtained results improve the previous analysis of standard PR methods [31, 32], as one needs fewer modeling assumptions and arrives at more general conclusions. Third, we introduce a new PR-type classifier for SE community detection, termed inverse PR (IPR). IPR carefully selects the weights for the first several steps of the RW by taking into account the variance of the LPs, and offers significantly improved SE community detection performance compared to canonical PR diffusions (such as HPR and PPR) over SBMs. Fourth, we present extensive experiments for detecting seeded communities in real large-scale networks using IPR. Although real world networks do not share the properties of the SBMs used in our analysis, IPR still significantly outperforms both HPR and PPR for networks with non-overlapping communities, and offers performance improvements over two examined networks with overlapping community structures.

2 Preliminaries

We start by formally introducing LPs, GPR methods, random graphs and other relevant notions.

Generalized PageRank. Consider an undirected graph G = (V, E) with |V| = n. Let A be the adjacency matrix, and let D be the diagonal degree matrix of G. The RW matrix of G equals W = AD^{−1}. Let {λ_i}_{i∈[n]} be the eigenvalues of W, ordered as 1 = λ_1 ≥ λ_2 ≥ ... ≥ λ_n ≥ −1. Furthermore, let d_min and d_max stand for the minimum and maximum degree of vertices in V, respectively. A distribution over the vertex set V is a mapping x : V → [0, 1] such that ∑_{v∈V} x_v = 1, with x_v denoting the probability of vertex v. Given an initial distribution x^{(0)}, the k-step LPs equal x^{(k)} = W^k x^{(0)}. The GPRs are parameterized by a sequence of nonnegative weights γ = {γ_k}_{k≥0} and an initial potential x^{(0)}: pr(γ, x^{(0)}) = ∑_{k=0}^∞ γ_k x^{(k)} = ∑_{k=0}^∞ γ_k W^k x^{(0)}. For an in-depth discussion of PageRank methods, the interested reader is referred to the review [7]. In some practical GPR settings, the bias caused by varying degrees is compensated for through degree normalization [33]. The k-step degree-normalized LPs (DNLP) are defined as z^{(k)} = (∑_{v∈V} d_v) D^{−1} x^{(k)}.
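These definitions translate directly into code. The sketch below (Python/NumPy, dense matrices for readability; all function names are ours, and a graph with no isolated vertices is assumed) computes the k-step LPs, a truncated GPR, and the DNLPs:

```python
import numpy as np

def landing_probs(A, x0, K):
    """Return [x^(0), ..., x^(K)], where x^(k) = W^k x^(0) and W = A D^{-1}."""
    d = A.sum(axis=1)                  # vertex degrees (assumed nonzero)
    W = A / d[None, :]                 # divide column v by d_v: W = A D^{-1}
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(K):
        xs.append(W @ xs[-1])          # one step of the random walk
    return xs

def gpr(A, x0, gammas):
    """Truncated GPR: pr(gamma, x^(0)) = sum_k gamma_k x^(k)."""
    xs = landing_probs(A, x0, len(gammas) - 1)
    return sum(g * x for g, x in zip(gammas, xs))

def dnlp(A, xk):
    """Degree-normalized LP: z^(k) = (sum_v d_v) D^{-1} x^(k)."""
    d = A.sum(axis=1)
    return d.sum() * xk / d
```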

Random graphs. Throughout the paper, we assume that the graph G is sampled according to a probability distribution P. The mean-field of G with respect to P is an undirected graph Ḡ with adjacency matrix Ā = E[A], where the expectation is taken with respect to P. Similarly, the mean-field degree matrix is defined as D̄ = E[D], and the mean-field random walk matrix as W̄ = ĀD̄^{−1}. The mean-field GPR reads as p̄r(γ, x^{(0)}) = ∑_{k=0}^∞ γ_k x̄^{(k)} = ∑_{k=0}^∞ γ_k W̄^k x^{(0)}. We also use the notation z̄^{(k)}, d̄_min, and d̄_max for the mean-field counterparts of z^{(k)}, d_min, and d_max, respectively.

For the convergence analysis, we consider a sequence of random graphs {G^{(n)}}_{n≥0} of increasing size n, sampled using a corresponding sequence of distributions {P^{(n)}}_{n≥0}. For a given initial distribution {x^{(0,n)}}_{n≥0} and weights {γ^{(n)}}_{n≥0}, we aim to analyze the conditions under which the LPs x^{(k,n)} and GPRs pr(γ^{(n)}, x^{(0,n)}) converge to their corresponding mean-field counterparts x̄^{(k,n)} and p̄r(γ^{(n)}, x^{(0,n)}), respectively. We say that an event occurs with high probability (w.h.p.) if it has probability at least 1 − n^{−c}, for some constant c. If no confusion arises, we omit n from the subscript. We also use ‖x‖_p = (∑_{v∈V} |x_v|^p)^{1/p} to measure the distance between LPs.

Edge-independent random graphs and SBMs. Edge-independent models include a wide range of random graphs, such as Erdős-Rényi graphs [34], Chung-Lu models [35], stochastic block models (SBM) [21] and degree-corrected SBMs [36]. In an edge-independent model, for each pair of vertices u, v ∈ V, an edge uv is drawn according to the Bernoulli distribution with parameter p_{uv} ∈ [0, 1], and the draws for different edges are performed independently. Hence, E[A_{uv}] = p_{uv}, and A_{uv}, A_{u'v'} are independent whenever uv, u'v' are different unordered pairs.

Some of our subsequent discussion focuses on two-block SBMs. In this setting, we let C_1, C_0 ⊂ V denote the two blocks, such that |C_1| = n_1 and |C_0| = n_0. For any pair of vertices from the same block, u, v ∈ C_i, we set p_{uv} = p_i, for some p_i ∈ (0, 1), i ∈ {0, 1}. Note that we allow self-loops, i.e., we allow u = v, which makes for simpler notation without changing our conclusions. For pairs uv such that u ∈ C_1 and v ∈ C_0, we set p_{uv} = q, for some q ∈ (0, 1). A two-block SBM in this setting is parameterized by (n_1, p_1, n_0, p_0, q).
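For experimentation, a two-block SBM following the convention above (self-loops allowed, one independent Bernoulli draw per unordered pair) can be sampled as in the following sketch; the function name and seeding are our own choices:

```python
import numpy as np

def sample_two_block_sbm(n1, p1, n0, p0, q, seed=0):
    """Sample a symmetric adjacency matrix; vertices 0, ..., n1-1 form block C1."""
    rng = np.random.default_rng(seed)
    n = n1 + n0
    P = np.full((n, n), q)             # cross-block edge probability q
    P[:n1, :n1] = p1                   # within-block probability for C1
    P[n1:, n1:] = p0                   # within-block probability for C0
    A = np.triu(rng.random((n, n)) < P).astype(float)  # one draw per unordered pair
    return A + np.triu(A, 1).T         # symmetrize; self-loops (diagonal) are kept

A = sample_two_block_sbm(500, 0.05, 500, 0.05, 0.02)
```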

3 Mean-field Convergence Analysis of LPs and GPRs

In what follows, we characterize the conditions under which x^{(k)} and pr(γ, x^{(0)}) converge to their mean-field counterparts x̄^{(k)} and p̄r(γ, x^{(0)}), respectively. The derived results enable a subsequent analysis of the variance of LPs over SBMs, as outlined in the sections to follow (all proofs are postponed to Section B of the Supplement). Note that since GPRs are linear combinations of LPs, the convergence properties of x^{(k)} may be used to analyze the convergence properties of pr(γ, x^{(0)}). More specifically, given a sequence of graphs of increasing sizes, with G^{(n)} ∼ P^{(n)}, the first question of interest is to derive non-asymptotic bounds for ‖x^{(k)} − x̄^{(k)}‖_1, as both x^{(k)} and x̄^{(k)} have unit ℓ_1-norms². The following lemma shows that under certain conditions, one cannot expect convergence in the ℓ_1-norm for arbitrary values of k.

Lemma 3.1. If there exists a vertex v, possibly dependent on n, such that d_v = ω(1) and d_v ≤ (1 − ε)n for some ε > 0, then setting x^{(0)} = 1_v gives lim_{n→∞} P[‖x^{(1)} − x̄^{(1)}‖_1 ≥ ε] = 1.

Consequently, we start with an upper bound for ‖x^{(k)} − x̄^{(k)}‖_2. Then, we provide conditions that ensure that ‖x^{(k)} − x̄^{(k)}‖_2 = o(√(1/n)). As ‖x^{(k)} − x̄^{(k)}‖_1 ≤ √n ‖x^{(k)} − x̄^{(k)}‖_2, we subsequently arrive at sufficient conditions for convergence in the ℓ_1-norm. The novelty of our proof technique is to use mixing results for RWs to characterize the upper bound for the convergence of landing probabilities for each k. The results establish uniform convergence of GPRs as long as ∑_k γ_k < ∞. This finding improves the results in [31, 32, 37] for GPRs with weights γ_k that scale as O(c^k), where c ∈ (0, 1) denotes the damping factor.

Our first relevant results are non-asymptotic bounds for the ℓ_2-distance between LPs and their mean-field values. The obtained bounds lead to non-asymptotic bounds for the ℓ_2-distance between GPRs and their mean-field values, described in Lemma 3.2. Lemma 3.2 is then used to derive conditions for convergence of LPs and GPRs in the ℓ_1-distance, summarized in Theorems 3.3 and 3.4, respectively.

Lemma 3.2. Let λ = max{|λ_2|, |λ_n|}. Suppose that d_min = ω(log n). Then, with high probability, and for some constants C_1, C_2, C_3 that do not depend on n or k, one has

‖x^{(k)} − x̄^{(k)}‖_2 / ‖x^{(0)}‖_2 ≤ C_1 √(log n / (n d_min)) · (1 / ‖x^{(0)}‖_2) + C_2 k (λ + C_3 √(log n / d_min))^{k−1} √(d_max log n / d_min²).   (1)

Moreover, let g(γ, λ, d_min) = ∑_{k≥1} γ_k k (λ + C_3 √(log n / d_min))^{k−1}. Then,

‖pr(γ, x^{(0)}) − p̄r(γ, x^{(0)})‖_2 / ‖x^{(0)}‖_2 ≤ C_1 √(log n / (n d_min)) · (1 / ‖x^{(0)}‖_2) + C_2 g(γ, λ, d_min) √(d_max log n / d_min²).   (2)

Lemma 3.2 allows us to establish the following conditions for ℓ_1-convergence of the LPs.

Theorem 3.3. 1) If ‖x^{(0)}‖_2 = O(1/√n) and d_max log n / d_min² = o(1), then for any sequence {k(n)}_{n≥0}, ‖x^{(k(n))} − x̄^{(k(n))}‖_1 = o(1), w.h.p.; 2) If d_min = ω(log n) and λ < 1 − c, for some c > 0 and n ≥ n_0 such that c/3 > C_4 √(log n / d_min), where n_0, C_4 are constants, then for any x^{(0)} and any sequence {k(n)}_{n≥n_0} that satisfies k(n) ≥ (log n + log(d_max/d_min))/c, we have ‖x^{(k(n))} − x̄^{(k(n))}‖_1 = o(1), w.h.p.

Theorem 3.3 asserts that either broadly spreading the seeds, i.e., ‖x^{(0)}‖_2 = O(1/√n), or allowing the RW to progress until the mixing time, i.e., k(n) ≥ (log n + log(d_max/d_min))/c, ensures that the LPs converge in ℓ_1-distance. One also has the following corresponding convergence result for GPRs.

Theorem 3.4. 1) If ‖x^{(0)}‖_2 = O(1/√n), d_max log n / d_min² = o(1), and λ < 1 − c for some c > 0, then for any weight sequence {γ^{(n)}}_{n≥0} such that ∑_k γ_k^{(n)} < ∞, one has ‖pr(γ^{(n)}, x^{(0)}) − p̄r(γ^{(n)}, x^{(0)})‖_1 = o(1), w.h.p. 2) If γ_0^{(n)} / ∑_k γ_k^{(n)} ≥ C_5 > 0 for some constant C_5, λ < 1 − c for some c > 0, and d_max log n / d_min² = o(1), then for any x^{(0)} one has ‖pr(γ^{(n)}, x^{(0)}) − p̄r(γ^{(n)}, x^{(0)})‖_2 / ‖p̄r(γ^{(n)}, x^{(0)})‖_2 = o(1) w.h.p. 3) If d_min = ω(log n) and g(γ^{(n)}, λ^{(n)}) = ∑_{k≥1} γ_k^{(n)} k (λ^{(n)} + C_6)^{k−1} = O(√(d_min / (n d_max))) for some constant C_6 > 0, then for any x^{(0)} one has ‖pr(γ^{(n)}, x^{(0)}) − p̄r(γ^{(n)}, x^{(0)})‖_1 = o(1) w.h.p.

²In some cases, both ‖x^{(k)}‖_2 and ‖x̄^{(k)}‖_2 naturally equal o(1), which leads to the obvious, yet loose, bound ‖x^{(k)} − x̄^{(k)}‖_2 ≤ ‖x^{(k)}‖_2 + ‖x̄^{(k)}‖_2 = o(1).


Remarks pertaining to Theorem 3.4: The result in 1) requires weaker conditions than Proposition 1 in [31] for the standard PR: we disposed of the constraint λ = o(1) and of bounded d_max/d_min. As a result, GPR converges in the ℓ_1-norm as long as the initial seeds are sufficiently spread and d_max log n / d_min² = o(1). The result in 2) implies that for fixed weights that do not depend on n, both the standard PR and HPR have guaranteed convergence in the relative ℓ_2-distance. This generalizes Theorem 1 in [32], stated for the standard PR on SBMs. The result in 3) implies that as long as the weights γ_k^{(n)} appropriately depend on n, convergence in the ℓ_1-norm is guaranteed (e.g., for HPR with h > (ln n + ln(d_max/d_min))/(2 − 2λ)).

The following lemma uses the same proof techniques as Lemma 3.2 to provide an upper bound on the distance between the DNLPs z^{(k)} and z̄^{(k)}, which we find useful in what follows. The result essentially removes the dependence on the degrees in the first term of the right-hand side of (1).

Lemma 3.5. Suppose that the conditions of Lemma 3.2 are satisfied. Then, one has

‖z^{(k)} − z̄^{(k)}‖_2 / ‖z^{(0)}‖_2 ≤ C_1 k (λ + C_2 √(log n / d_min))^{k−1} √(d_max log n / d_min²), w.h.p.
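The concentration phenomena captured by Lemmas 3.2 and 3.5 can be probed empirically: sample a graph, build the mean-field quantities directly from the edge probabilities, and track the distance between the LPs and their mean-field counterparts. Below is a sketch reusing the hypothetical helpers from Section 2, with the parameters of Figure 1:

```python
import numpy as np

# Degrees concentrate around 500 * (0.2 + 0.05) = 125, so d_min = omega(log n) holds.
n1 = n0 = 500; p = 0.2; q = 0.05
A = sample_two_block_sbm(n1, p, n0, p, q)
Abar = np.full((n1 + n0, n1 + n0), q)       # mean-field adjacency matrix A_bar = E[A]
Abar[:n1, :n1] = p; Abar[n1:, n1:] = p

x0 = np.zeros(n1 + n0); x0[0] = 1.0         # single-vertex seed in C1
xs, xbars = landing_probs(A, x0, 30), landing_probs(Abar, x0, 30)
for k in (1, 5, 10, 20, 30):
    # The distance first decays roughly geometrically, then flattens at the
    # degree-noise floor given by the first term on the right-hand side of (1).
    print(k, np.linalg.norm(xs[k] - xbars[k]))
```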

4 GPR-Based SE Community Detection

One important application of PRs is in SE community detection: for each vertex v, the LPs {x_v^{(k)}}_{k≥0} may be viewed as features, and the GPR as a score used to predict the community membership of v by comparing it with some threshold [20]. Kloumann et al. [20] investigated mean-field LPs, i.e., {x̄_v^{(k)}}_{k≥0}, and showed that under certain symmetry conditions, PPR with α = λ̄_2 corresponds to an optimal classifier for one block in an SBM, given only the first-order moment information. However, accompanying simulations revealed that PPR underperforms with respect to classification accuracy. As a result, Fisher's linear discriminant [38] was used instead [20] by empirically leveraging information about the second-order moments of the LPs, and was shown to have a performance almost matching that of belief propagation, a statistically optimal method for SBMs [39, 40, 41].

In what follows, we rigorously derive an explicit formula for a variant of Fisher's linear discriminant by taking into account the individual variances of the features while neglecting their correlations. This explicit formula provides new insight into the behavior of GPR methods for SE community detection in SBMs, and will later be generalized to handle real world networks (see Section 5).

4.1 Pseudo Fisher’s Linear Discriminant

Suppose that the mean vectors and covariance matrices of the features from the two classes C_0, C_1 equal (μ_0, Σ_0) and (μ_1, Σ_1), respectively. For simplicity, assume that the covariance matrices are identical, i.e., Σ_0 = Σ_1 = Σ. Fisher's linear discriminant depends on the first two moments (mean and variance) of the features [38], and may be written as F(x) = [Σ^{−1}(μ_1 − μ_0)]^T x. The label of a data point x is determined by comparing F(x) with a threshold.

Neglecting the differences in the second-order moments by assuming that Σ = σ²I, Fisher's linear discriminant reduces to G(x) = (μ_1 − μ_0)^T x, which induces a decision boundary that is orthogonal to the difference between the means of the two classes; G(x) is optimal under the assumption that only the first-order moments μ_1 and μ_0 are available.

The two linear discriminants have different advantages and disadvantages in practice. On the one hand, Σ can differ significantly from σ²I, in which case G(x) performs much worse than F(x). On the other hand, estimating the covariance matrix Σ is nontrivial, and hence F(x) may not be available in closed form. One possible choice to mitigate the above drawbacks is to use what we call the pseudo Fisher's linear discriminant,

SF(x) = [diag(Σ)^{−1}(μ_1 − μ_0)]^T x,   (3)

where diag(Σ) is the diagonal matrix of Σ; diag(Σ) preserves the information about the variances, but neglects the correlations between the entries of x. This discriminant essentially allows each feature to contribute equally to the final score. More precisely, given a feature of a vertex v, say x_v^{(k)}, its corresponding weight according to SF(·) equals (μ_1^{(k)} − μ_0^{(k)})/(σ^{(k)})², where (σ^{(k)})² denotes the variance of the feature (i.e., the k-th component in the diagonal of Σ). Note that this weight may be rewritten as [(μ_1^{(k)} − μ_0^{(k)})/σ^{(k)}] × [1/σ^{(k)}]; the first term is a frequently used metric for characterizing the predictiveness of a feature, called the effect size [42], while the second term is a normalization that places all features on the same scale.
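In code, the pseudo Fisher's linear discriminant amounts to a single weighted inner product per vertex. The sketch below assumes the per-feature means and variances have already been estimated by some external procedure:

```python
import numpy as np

def pseudo_fisher_score(X, mu1, mu0, var):
    """SF(x) = [diag(Sigma)^{-1} (mu1 - mu0)]^T x, evaluated for each row x of X.

    X   : (num_vertices, num_features) matrix of LP or DNLP features
    mu1 : per-feature means inside the community; mu0 : outside
    var : per-feature variances, i.e., the diagonal of Sigma
    """
    weights = (mu1 - mu0) / var        # one weight per feature, as in (3)
    return X @ weights                 # one score per vertex; threshold it to classify
```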

Next, we derive an expression for SF(x) pertinent to SE community detection, following the setting proposed for Fisher's linear discriminant in [20]. To model the community to be detected with seeds and the out-of-community portion of the graph, respectively, we focus on two-block SBMs with parameters (n_1, p_1, n_0, p_0, q), and characterize both the means μ_1, μ_0 and the variances diag(Σ). For notational simplicity, we first work with the DNLPs {z_v^{(k)}}_{k≥0} as the features of choice, as they can remove degree-induced noise; the results for LPs {x_v^{(k)}}_{k≥0} are only stated briefly.

4.2 SF (·) Weights and the Inverse PageRank

Characterization of the means. Consider a two-block SBM with parameters (n_1, p_1, n_0, p_0, q). Without loss of generality, assume that the seed lies in block C_1. Due to the block-wise symmetry of Ā, for a fixed k ≥ 1, the DNLP z̄_v^{(k)} is constant for all v ∈ C_i within the same community C_i, i ∈ {0, 1}. Consequently, the mean of the k-th DNLP (feature) of block C_i is set to μ_i^{(k)} = z̄_v^{(k)}, v ∈ C_i, i ∈ {0, 1}. Note that z̄_v^{(k)} does not match the traditional definition of the expectation E(z_v^{(k)}), although the two definitions are consistent as n_1, n_0 → ∞, due to Lemma 3.5.

Choosing the initial seed set to lie within one single community, e.g., ∑_{v∈C_1} x_v^{(0)} = 1, and using some algebraic manipulations (see Section C of the Supplement), we obtain

μ_1^{(k)} − μ_0^{(k)} = c λ̄_2^k,  c = (1 − λ̄_2) / (n_1 (n_1 p_1 + n_0 q)).   (4)

Recall that λ̄_2 stands for the second largest eigenvalue of the mean-field random walk matrix W̄. The result in (4) shows that the distance between the means of the DNLPs of the two classes decays with k at a rate λ̄_2. This result is similar to its counterpart in [20] for the LPs {x_v^{(k)}}_{k≥0}, but the result in [20] additionally requires d̄_v = d̄_u for vertices u and v belonging to different blocks. By only using the difference μ_1^{(k)} − μ_0^{(k)}, without the variance, the authors of [20] proposed the discriminant G(x) = (μ_1 − μ_0)^T x, which corresponds to PPR with α = λ̄_2.
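The geometric decay rate in (4) can be verified numerically from the mean-field matrix alone. The following sketch (again reusing the hypothetical helpers above) prints the ratio of consecutive DNLP gaps, which should approach λ̄_2:

```python
import numpy as np

n1 = n0 = 500; p1 = p0 = 0.05; q = 0.02
Abar = np.full((n1 + n0, n1 + n0), q)
Abar[:n1, :n1] = p1; Abar[n1:, n1:] = p0

x0 = np.zeros(n1 + n0); x0[:n1] = 1.0 / n1   # seed mass spread over C1
xbars = landing_probs(Abar, x0, 10)
gaps = [dnlp(Abar, xk)[0] - dnlp(Abar, xk)[-1] for xk in xbars]  # mu1 - mu0 per step
print([gaps[k + 1] / gaps[k] for k in range(9)])
# In this symmetric case, each printed ratio equals
# lambda_2_bar = (0.05 - 0.02) / (0.05 + 0.02) = 3/7 ≈ 0.43,
# matching the heuristic used for the synthetic experiments in Section 5.
```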

Characterization of the variances. Characterizing the variance of each feature is significantly harder than characterizing the means. Nevertheless, the results reported in Lemma 3.5 and Lemma 3.2 allow us to determine both E(z_v^{(k)} − z̄_v^{(k)})² and E(x_v^{(k)} − x̄_v^{(k)})². Let us consider z_v^{(k)} first. Lemma 3.5 implies that with high probability, ‖z^{(k)} − z̄^{(k)}‖_2 ≤ k(λ + o(1))^{k−1} for all k. Figure 1 (Left) depicts the empirical value of E[‖z^{(k)} − z̄^{(k)}‖_2²] for a given set of parameter choices. As may be seen, the expectation decays at a rate roughly equal to λ_2^{2k}, where λ_2 is the second largest eigenvalue of the RW matrix W. With regards to x_v^{(k)}, Lemma 3.2 establishes that ‖x^{(k)} − x̄^{(k)}‖_2 is upper bounded by k(λ + o(1))^{k−1}; for large k, the norm is dominated by the first term in (1), induced by the variance of the degrees. Figure 1 (Left) plots the empirical values of E[‖x^{(k)} − x̄^{(k)}‖_2²] to support this finding.

Combining the characterizations of the means and variances, we arrive at the following conclusions.

Normalized degree case. Although the expression established in (4) reveals that the distance between the means of the landing probabilities decays as λ̄_2^k, the corresponding standard deviation σ^{(k)} ∝ E[‖z^{(k)} − z̄^{(k)}‖_2] also roughly decays as λ_2^k. Hence, for the classifier SF(·), the appropriate weights are γ_k = (μ_1^{(k)} − μ_0^{(k)})/(σ^{(k)})² ∼ λ̄_2^k / λ_2^{2k} = (λ̄_2/λ_2)^k λ_2^{−k}. The first term (λ̄_2/λ_2)^k in the product weighs different DNLPs according to their effect sizes [42]. Since λ_2 → λ̄_2 as n → ∞, the ratio may decay very slowly as k increases. As shown in Figure 1 (Right), the classification error rate based on a single k-step DNLP remains largely unchanged even as k increases to some value exceeding the mixing time. The second term in the product, λ_2^{−k}, may be viewed as a factor that balances the scale of all DNLPs. Due to the observed variance, DNLPs with large k should be assigned weights much larger than those used in G(x), i.e., γ_k = μ_1^{(k)} − μ_0^{(k)} ∝ λ̄_2^k, as suggested in [20].


[Figure 1 comprises four panels plotted against the number of steps k: E[‖z^{(k)} − z̄^{(k)}‖_2² / b^{2k}] for b ∈ {λ_2, λ̄_2, 1.02·λ̄_2}, and E[‖n x^{(k)} − n x̄^{(k)}‖_2²] (left pair); classification error rates with and without degree normalization for q ∈ {0.01, 0.02, 0.03} (right pair).]

Figure 1: Left: Empirical results illustrating the decay rate of the variances ‖z^{(k)} − z̄^{(k)}‖_2² and ‖x^{(k)} − x̄^{(k)}‖_2², for an SBM with parameters (500, 0.2, 500, 0.2, 0.05), averaged over 1000 tests. With high probability, λ_2 slightly exceeds the corresponding mean-field value λ̄_2 [43]; Right: Classification errors based on single-step DNLPs or LPs for SBMs with parameters (500, 0.05, 500, 0.05, q), q ∈ {0.01, 0.02, 0.03}.

The unnormalized degree case. The standard deviation σ^{(k)} ∝ E[‖x^{(k)} − x̄^{(k)}‖_2] roughly scales as φ + λ_2^k, where φ captures the noise introduced by the degrees. Typically, for a small number of steps k, the noise introduced by degree variation is small compared to the total noise (see Figure 1 (Left)). Hence, we may assume that φ < λ_2^0 = 1. The classifier SF(·) suggests using the weights γ_k = (μ_1^{(k)} − μ_0^{(k)})/(σ^{(k)})² ∼ λ̄_2^k/(φ + λ_2^k) × (φ + λ_2^k)^{−1}, where λ̄_2^k/(φ + λ_2^k) represents the effect size of the k-th LP. This result is confirmed by simulations: in Figure 1 (Right), the classification error rate based on a single k-step LP decreases for small k and increases after k exceeds the mixing time. Moreover, by recalling that x_v^{(k)} → d_v/∑_v d_v as k → ∞, one can confirm that the degree-based noise deteriorates the classification accuracy.

Inverse PR. As already observed, for finite n and with high probability, λ_2 only slightly exceeds λ̄_2. Moreover, for SBMs with unknown parameters or for real world networks, λ̄_2 may not be well-defined, or it may be hard to compute numerically. Hence, in practice, one may need to use the heuristic value λ_2 = λ̄_2 = θ, where θ is a parameter to be tuned. In this case, SF(·) with degree normalization is associated with the weights γ_k = θ^{−k}, while SF(·) without degree normalization is associated with the weights γ_k = θ^k/(φ + θ^k)². When k is small, γ_k roughly increases as θ^{−k}; we term a PR with this choice of weights the Inverse PR (IPR). Note that IPR with degree normalization may not converge in practice, and LP information may be estimated only for a limited number of steps k. Our experiments on real world networks reveal that a good choice for the maximum value of k is 4-5 times the maximal length of the shortest paths from all unlabeled vertices to the set of seeds.
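The two IPR weight schedules take one line each; in the sketch below, θ, φ, and the truncation K are the tuning parameters discussed above. Feeding these weights to the hypothetical gpr helper from Section 2 yields the IPR score vector.

```python
import numpy as np

def ipr_weights_normalized(theta, K):
    """IPR weights with degree normalization: gamma_k = theta^{-k}."""
    return theta ** -np.arange(K, dtype=float)

def ipr_weights_unnormalized(theta, phi, K):
    """IPR weights without degree normalization: gamma_k = theta^k / (phi + theta^k)^2."""
    tk = theta ** np.arange(K, dtype=float)
    return tk / (phi + tk) ** 2

# Example: theta = 0.99; choosing phi = theta^10 places the peak weight at k = 10.
gammas = ipr_weights_unnormalized(0.99, 0.99 ** 10, 50)
```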

Other insights. Note that IPR resembles HPR when k is small and γ_k increases, as it dampens the contributions of the first several steps of the RW. This result also agrees with the combinatorial analysis in [12] that advocates the use of HPR for community detection. Note that IPR with degree normalization has monotonically increasing weights, which reflects the fact that community information is preserved even in large-step LPs. To some extent, this result can be viewed as a theoretical justification for the empirical fact that PPR is often used with α ≈ 1 to achieve good community detection performance [30].

5 Experiments

We evaluate the performance of the IPR method over synthetic and large-scale real world networks.

Datasets. The network data used for evaluation may be classified into three categories. The first category contains networks sampled from two-block SBMs that satisfy the assumptions used to derive our theoretical results. The second category includes three real world networks, Citeseer [44], Cora [45] and PubMed [46], all frequently used to evaluate community detection algorithms [47, 48]. These networks comprise several non-overlapping communities, and may be roughly modeled as SBMs. The third category includes the Amazon (product) network and the DBLP (collaboration) network from the Stanford Network Analysis Project [49]. These networks contain thousands of overlapping communities, and their topologies differ significantly from SBMs (see Table 2 in the Supplement for more details). For synthetic graphs, we use single-vertex seed sets; for real world graphs, we select 20 seeds uniformly at random from the community of interest.

Comparison of the methods. We compare the proposed IPRs with PPR and HPR methods, both widely used for SE community detection [14, 50]. Methods that rely on training the weights were not considered, as they require outside-community vertex labels.

[Figure 2 shows recall curves for IPR-d, PPR-d 0.95, HPR-d 10, and BP versus the number of steps k for the SBMs and for the Citeseer, Cora and PubMed networks, as well as recall versus the percentile Q/|C| × 100 for the three citation networks.]

Figure 2: (Left): Recalls (mean ± std) for different PRs over SBMs with parameters (500, 0.05, 500, 0.05, q), q ∈ {0.02, 0.03}; (Right): Results over the Citeseer, Cora and PubMed networks (from left to right). First line: Recalls (mean ± std) of different PRs vs. steps. Second line: Averaged recalls of different PRs for the top-Q vertices, obtained by accumulating the LPs with k ≤ 50.

For all three approaches, the default choice is degree normalization, indicated by the suffix "-d". For synthetic networks, the parameter θ in IPR is set to λ̄_2 = (0.05 − q)/(0.05 + q), following the recommendations of Section 4.2. For real world networks, we avoid computing λ_2 exactly and set θ ∈ {0.99, 0.95, 0.90}. The parameters of PPR and HPR are chosen to satisfy α ∈ {0.9, 0.95} and h ∈ {5, 10} and to offer the best performance, as suggested in [50, 51, 14]. The results for all PRs are obtained by accumulating the values over the first k steps; the choice of k is specified for each network individually.

Evaluation metric. We adopt a metric similar to the one used in [50]. There, one is given a graph, a hidden community C to detect, and a vertex budget Q. For a potential ordering of the vertices, obtained via some GPR method, the top-Q set of vertices represents the predicted community P. The evaluation metric used is |P ∩ C|/|C|. By default, Q = |C|, if not specified otherwise. Other metrics, such as the Normalized Mutual Information and the F-score, may be used instead, but since they require additional parameters to determine the GPR classification threshold, the results may not allow for simple and fair comparisons. For SBMs, we independently generated 1000 networks for every set of parameters. For each network, the results are summarized based on 1000 independently chosen seed sets for each community-network pair, and then averaged over all communities.
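Computing this metric from a GPR score vector is straightforward; in the sketch below, the handling of seed vertices (included in or excluded from the ranking) is left to the caller:

```python
import numpy as np

def top_q_recall(scores, community, Q=None):
    """Return |P ∩ C| / |C|, where P is the set of the Q highest-scoring vertices."""
    C = set(community)
    Q = len(C) if Q is None else Q
    P = set(np.argsort(scores)[::-1][:Q])   # indices of the top-Q scores
    return len(P & C) / len(C)
```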

5.1 Performance Evaluation

Synthetic graphs. In synthetic networks, all three PRs with degree normalization perform significantly better than their unnormalized counterparts. Thus, we only present results for the first class of methods in Figure 2 (Left). As predicted in Section 4.2, IPR-d offers substantially better detection performance than either PPR-d or HPR-d, and is close in quality to belief propagation (BP). Note that the recall of IPR-d keeps increasing with the number of steps. This means that even for large values of k, the landing probabilities remain predictive of the community structure, and decreasing the weights with k, as in HPR and PPR, is not appropriate for these synthetic graphs. The classifier G(x), i.e., a PPR with parameter (p − q)/(p + q) as suggested by [20], has worse performance than the PPR method with parameter 0.95 and is hence not depicted.

Citeseer, Cora and PubMed. Here as well, PRs with degree normalization perform better than PRs without degree normalization. Hence, we only display the results obtained with degree normalization. The first line of Figure 2 (Right) shows that IPR-d 0.99 significantly outperforms both PPR-d and HPR-d for all three networks. Moreover, the performance of IPR-d 0.99 improves with increasing k, once again establishing that LPs for large k are still predictive. The results for IPR-d 0.90 and 0.95 and a related discussion are postponed to Section A.1 of the Supplement.

The second line of Figure 2 (Right) illustrates the rankings of vertices within the predicted community given the first 50 steps of the RW. Note that only for the Citeseer network does PPR provide a better ranking of vertices in the community for small Q; for the other two networks, IPR outperforms PPR and HPR over the whole ranking of vertices.


Figure 3: Recalls based on one-step LPs and one-step DNLPs (curves: Amazon, Amazon-d, DBLP, DBLP-d).

Steps k    |     Amazon (std: ±0.12)      |      DBLP (std: ±0.09)
           |    5      10     15     20   |    5      10     15     20
IPR 0.99   |  46.63  48.03  48.43  48.53  |  27.58  28.78  29.18  29.27
IPR 0.95   |  46.64  48.04  48.44  48.53  |  27.60  28.94  29.20  29.28
IPR 0.90   |  46.67  48.08  48.45  48.53  |  27.64  29.14  29.26  29.32
PPR        |  46.57  47.92  48.30  48.43  |  27.46  28.49  28.90  29.06
HPR        |  47.20  48.36  48.54  48.55  |  28.24  28.80  28.85  28.85

Table 1: Recalls (mean ± std) for different PRs over the Amazon and the DBLP networks. The boldfaced values are those within one std of the optimal values for a given fixed k.

Amazon, DBLP. We first preprocess these networks by following a standard approach described in Section A.2 of the Supplement. As opposed to the networks in the previous two categories, the information in the vertex degrees is extremely predictive of community membership for this category. Figure 3 shows the predictiveness of one-step LPs and DNLPs for these two networks. As may be seen, degree normalization may actually hurt the predictive performance of LPs for these two networks. This observation coincides with the findings in [50]. Hence, in this case, we do not perform degree normalization. As recommended in Section 4.2, the weights are chosen as γ_k = θ^k/(θ^k + φ)², where θ, φ are parameters to be tuned. The value of φ typically depends on how informative the degree of a vertex is. Here, we simply set φ = θ^{10}, which makes γ_k achieve its maximal value at k = 10. We also find that for both networks, α = 0.95 is a good choice for PPR, while for HPR, h = 10 and h = 5 are adequate for the Amazon and the DBLP network, respectively.

Further results are listed in Table 1, indicating that HPR outperforms the other PR methods when k = 5; HPR is used with parameter h ≥ 5, so its weights for the first 5 steps increase. This yet again confirms our findings regarding the predictiveness of large-step LPs. For larger k, IPR matches the performance of HPR, and even outperforms HPR on the DBLP network. Vertex rankings within the communities are available in Section A.2 of the Supplement.

6 Discussion and Future Directions

There are many directions that may be pursued in future studies, including:

(1) Our non-asymptotic analysis works for relatively dense graphs, for which the minimum degree equals d_min = ω(log n). A relevant problem is to investigate the behavior of GPRs over sparse graphs.

(2) The derived weights ignore the correlations between LPs corresponding to different step-lengths. Characterizing these correlations is a particularly challenging and interesting problem.

(3) Recently, research in network analysis has focused on networks with higher-order structures. PPR- and HPR-based methods have been generalized to the higher-order setting [52, 53]. Analysis has shown that these higher-order GPR methods may be used to detect communities of networks that approximate higher-order network (motif/hypergraph) conductance [54, 53]. Related work has also shown that PR-based approaches are powerful for practical community detection with higher-order structures [55]. Hence, generalizing our analysis to higher-order structure clustering is another topic for future consideration. A follow-up work on the mean-field analysis of higher-order GPR methods may be found in [56].

(4) Our work provides new insights regarding SE community detection. Re-deriving the non-asymptotic results for other GPR-based applications, including recommender system design and link prediction, is another class of problems of interest. For example, GPR/RW-based approaches are frequently used on commodity-user bipartite graphs of recommender systems. There, one may model the network as a random graph with independent edges that correspond to one-time purchases governed by the preference scores of the users. Similarities of vertices can also be characterized by GPRs and used to predict emerging links in networks [4]. In this setting, it is reasonable to assume that the graph is edge-independent, but with different edge probabilities. Analyzing how the GPR weights influence the similarity scores used to infer edge probabilities may improve the performance of current link prediction methods.

7 Acknowledgement

This work was supported by the NSF STC Center for Science of Information at Purdue University. The authors also gratefully acknowledge useful discussions with Prof. David Gleich from Purdue University.


References

[1] L. Page, S. Brin, R. Motwani, and T. Winograd, "The pagerank citation ranking: Bringing order to the web," Stanford InfoLab, Tech. Rep., 1999.

[2] S. Fortunato, "Community detection in graphs," Physics Reports, vol. 486, no. 3-5, pp. 75–174, 2010.

[3] J. J. Whang, D. F. Gleich, and I. S. Dhillon, "Overlapping community detection using seed set expansion," in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. ACM, 2013, pp. 2099–2108.

[4] D. Liben-Nowell and J. Kleinberg, "The link-prediction problem for social networks," Journal of the American Society for Information Science and Technology, vol. 58, no. 7, pp. 1019–1031, 2007.

[5] P. Massa and P. Avesani, "Trust-aware recommender systems," in Proceedings of the 2007 ACM Conference on Recommender Systems. ACM, 2007, pp. 17–24.

[6] Z. Abbassi and V. S. Mirrokni, "A recommender system based on local random walks and spectral methods," in Proceedings of the 9th WebKDD and 1st SNA-KDD Workshop. ACM, 2007, pp. 102–108.

[7] D. F. Gleich, "Pagerank beyond the web," SIAM Review, vol. 57, no. 3, pp. 321–363, 2015.

[8] R. Baeza-Yates, P. Boldi, and C. Castillo, "Generalizing pagerank: Damping functions for link-based ranking algorithms," in Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2006, pp. 308–315.

[9] J. Kleinberg, "Link prediction with combinatorial structure: Block models and simplicial complexes," in Companion of The Web Conference 2018 (BigNet), 2018.

[10] G. Jeh and J. Widom, "Scaling personalized web search," in Proceedings of the 12th International Conference on World Wide Web. ACM, 2003, pp. 271–279.

[11] F. Chung, "The heat kernel as the pagerank of a graph," Proceedings of the National Academy of Sciences, vol. 104, no. 50, pp. 19735–19740, 2007.

[12] ——, "A local graph partitioning algorithm using heat kernel pagerank," Internet Mathematics, vol. 6, no. 3, pp. 315–330, 2009.

[13] F. R. Chung and F. C. Graham, Spectral Graph Theory. American Mathematical Soc., 1997, no. 92.

[14] K. Kloster and D. F. Gleich, "Heat kernel based community detection," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014, pp. 1386–1395.

[15] H. Avron and L. Horesh, "Community detection using time-dependent personalized pagerank," in Proceedings of the International Conference on Machine Learning, 2015, pp. 1795–1803.

[16] D. F. Gleich and R. A. Rossi, "A dynamical system for pagerank with time-dependent teleportation," Internet Mathematics, vol. 10, no. 1-2, pp. 188–217, 2014.

[17] B. Jiang, K. Kloster, D. F. Gleich, and M. Gribskov, "AptRank: an adaptive pagerank model for protein function prediction on bi-relational graphs," Bioinformatics, vol. 33, no. 12, pp. 1829–1836, 2017.

[18] D. Berberidis, A. Nikolakopoulos, and G. Giannakis, "Adaptive diffusions for scalable learning over graphs," in Mining and Learning with Graphs Workshop @ ACM KDD 2018, Aug. 2018, p. 1.

[19] Z. Yin, M. Gupta, T. Weninger, and J. Han, "A unified framework for link recommendation using random walks," in Advances in Social Networks Analysis and Mining (ASONAM), 2010 International Conference on. IEEE, 2010, pp. 152–159.

[20] I. M. Kloumann, J. Ugander, and J. Kleinberg, "Block models and personalized pagerank," Proceedings of the National Academy of Sciences, vol. 114, no. 1, pp. 33–38, 2017.

[21] P. W. Holland, K. B. Laskey, and S. Leinhardt, "Stochastic blockmodels: First steps," Social Networks, vol. 5, no. 2, pp. 109–137, 1983.

[22] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, "Spectral networks and locally connected networks on graphs," arXiv preprint arXiv:1312.6203, 2013.


[23] M. Defferrard, X. Bresson, and P. Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, 2016, pp. 3844–3852.

[24] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," arXiv preprint arXiv:1609.02907, 2016.

[25] W. Hamilton, Z. Ying, and J. Leskovec, "Inductive representation learning on large graphs," in Advances in Neural Information Processing Systems, 2017, pp. 1024–1034.

[26] P. Velickovic, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.

[27] J. Klicpera, A. Bojchevski, and S. Günnemann, "Predict then propagate: Graph neural networks meet personalized pagerank," in International Conference on Learning Representations, 2019.

[28] A. Bojchevski, J. Klicpera, B. Perozzi, M. Blais, A. Kapoor, M. Lukasik, and S. Günnemann, "Is pagerank all you need for scalable graph neural networks?" in ACM KDD, MLG Workshop, 2019.

[29] J. Klicpera, S. Weißenberger, and S. Günnemann, "Diffusion improves graph learning," in Advances in Neural Information Processing Systems, 2019, pp. 13333–13345.

[30] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, "Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters," Internet Mathematics, vol. 6, no. 1, pp. 29–123, 2009.

[31] K. Avrachenkov, A. Kadavankandy, L. O. Prokhorenkova, and A. Raigorodskii, "Pagerank in undirected random graphs," in International Workshop on Algorithms and Models for the Web-Graph. Springer, 2015, pp. 151–163.

[32] K. Avrachenkov, A. Kadavankandy, and N. Litvak, "Mean field analysis of personalized pagerank with implications for local graph clustering," arXiv preprint arXiv:1806.07640, 2018.

[33] R. Andersen, F. Chung, and K. Lang, "Local graph partitioning using pagerank vectors," in Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society, 2006, pp. 475–486.

[34] P. Erdős and A. Rényi, "On random graphs I," Publ. Math. Debrecen, vol. 6, pp. 290–297, 1959.

[35] W. Aiello, F. Chung, and L. Lu, "A random graph model for power law graphs," Experimental Mathematics, vol. 10, no. 1, pp. 53–66, 2001.

[36] B. Karrer and M. E. Newman, "Stochastic blockmodels and community structure in networks," Physical Review E, vol. 83, no. 1, p. 016107, 2011.

[37] N. Chen, N. Litvak, and M. Olvera-Cravioto, "Generalized pagerank on directed configuration networks," Random Structures & Algorithms, vol. 51, no. 2, pp. 237–274, 2017.

[38] R. A. Fisher, "The use of multiple measurements in taxonomic problems," Annals of Eugenics, vol. 7, no. 2, pp. 179–188, 1936.

[39] E. Mossel, J. Neeman, and A. Sly, "Belief propagation, robust reconstruction and optimal recovery of block models," in Conference on Learning Theory, 2014, pp. 356–370.

[40] P. Zhang and C. Moore, "Scalable detection of statistically significant communities and hierarchies, using message passing for modularity," Proceedings of the National Academy of Sciences, vol. 111, no. 51, pp. 18144–18149, 2014.

[41] E. Abbe and C. Sandon, "Detection in the stochastic block model with multiple clusters: proof of the achievability conjectures, acyclic BP, and the information-computation gap," arXiv preprint arXiv:1512.09080, 2015.

[42] K. Kelley and K. J. Preacher, "On effect size," Psychological Methods, vol. 17, no. 2, p. 137, 2012.

[43] T. Tao, Topics in Random Matrix Theory. American Mathematical Soc., 2012, vol. 132.

[44] C. L. Giles, K. D. Bollacker, and S. Lawrence, "CiteSeer: An automatic citation indexing system," in Proceedings of the Third ACM Conference on Digital Libraries. ACM, 1998, pp. 89–98.

[45] A. K. McCallum, K. Nigam, J. Rennie, and K. Seymore, "Automating the construction of internet portals with machine learning," Information Retrieval, vol. 3, no. 2, pp. 127–163, 2000.


[46] G. Namata, B. London, L. Getoor, and B. Huang, "Query-driven active surveying for collective classification," in 10th International Workshop on Mining and Learning with Graphs, 2012.

[47] T. Yang, R. Jin, Y. Chi, and S. Zhu, "Combining link and content for community detection: a discriminative approach," in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 927–936.

[48] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, "Collective classification in network data," AI Magazine, vol. 29, no. 3, p. 93, 2008.

[49] J. Leskovec and R. Sosic, "SNAP: A general-purpose network analysis and graph-mining library," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 8, no. 1, p. 1, 2016.

[50] I. M. Kloumann and J. M. Kleinberg, "Community membership identification from small seed sets," in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014, pp. 1366–1375.

[51] Y. Li, K. He, D. Bindel, and J. E. Hopcroft, "Uncovering the small community structure in large networks: A local spectral approach," in Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2015, pp. 658–668.

[52] P. Li, N. He, and O. Milenkovic, "Quadratic decomposable submodular function minimization," in Advances in Neural Information Processing Systems, 2018, pp. 1062–1072.

[53] M. Ikeda, A. Miyauchi, Y. Takai, and Y. Yoshida, "Finding Cheeger cuts in hypergraphs via heat equation," arXiv preprint arXiv:1809.04396, 2018.

[54] P. Li, N. He, and O. Milenkovic, "Quadratic decomposable submodular function minimization: Theory and practice," arXiv preprint arXiv:1902.10132, 2019.

[55] H. Yin, A. R. Benson, J. Leskovec, and D. F. Gleich, "Local higher-order graph clustering," in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017, pp. 555–564.

[56] E. Chien, P. Li, and O. Milenkovic, "Landing probabilities of random walks for seed-set expansion in hypergraphs," 2019.

[57] K. He, Y. Sun, D. Bindel, J. Hopcroft, and Y. Li, "Detecting overlapping communities from local spectral subspaces," in Data Mining (ICDM), 2015 IEEE International Conference on. IEEE, 2015, pp. 769–774.

[58] F. Chung and M. Radcliffe, "On the spectra of general random graphs," The Electronic Journal of Combinatorics, vol. 18, no. 1, p. 215, 2011.

[59] H. Weyl, "Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung)," Mathematische Annalen, vol. 71, no. 4, pp. 441–479, 1912.
