a tale of three graphs: sampling design on hybrid social ... · sampling framework, e.g., when...

A Tale of Three Graphs: Sampling Design onHybrid Social-Affiliation Networks

Junzhou Zhao #1, John C.S. Lui ∗2, Don Towsley +3, Pinghui Wang $4, Xiaohong Guan #5

# MOEKLINNS Lab, Xi’an Jiaotong University, China1 [email protected] [email protected]

∗ Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong2 [email protected]

+ School of Computer Science, University of Massachusetts at Amherst, USA3 [email protected]

$ Huawei Noah’s Ark Lab, Hong Kong4 [email protected]

Abstract—Random walk-based graph sampling methods havebecome increasingly popular and important for characterizinglarge-scale complex networks. While powerful, they are knownto exhibit problems when the graph is loosely connected, whichslows down the convergence of a random walk and can result inpoor estimation accuracy. In this work, we observe that manygraphs under study, called target graphs, usually do not existin isolation. In many situations, a target graph is often relatedto an auxiliary graph and an affiliation graph, and the targetgraph becomes better connected when viewed from these threegraphs as a whole, or what we called a hybrid social-affiliationnetwork. This viewpoint brings extra benefits to the graphsampling framework, e.g., when directly sampling a target graphis difficult or inefficient, we can efficiently sample it with theassistance of auxiliary and affiliation graphs. We propose threesampling methods on such a hybrid social-affiliation networkto estimate target graph characteristics, and conduct extensiveexperiments on both synthetic and real datasets, to demonstratethe effectiveness of these new sampling methods.

I. INTRODUCTION

Online social networks (OSNs) such as Facebook, SinaWeibo, and Twitter have attracted much attention in recentyears because of their ever-increasing popularity and impor-tance in our daily lives [1]–[3]. An OSN not only providesa platform for people to connect with their friends, but alsoprovides an opportunity to study user characteristics, whichare valuable in a variety of applications. For example, Twitterusers’ tweeting activities (e.g., number of tweets related to amovie) can be used to predict movie box-office revenues [4],and OSN users’ mood characteristics can forecast stock marketprices [5]. Therefore, measuring user characteristics in OSNsis an important task.

Exactly calculating user characteristics requires the com-plete OSN data. For third parties who do not possess thedata, they can only rely on public APIs to crawl the OSN.However, commercial OSNs are typically unwilling to grantthird parties full permission to access the data due to userprivacy and business secrecy. They often impose barriers tolimit third party’s large-scale crawling [6], and restrict the rate

of requesting APIs [7]. As a result, crawling the complete dataof a large-scale OSN is practically impossible.

To address this challenge, sampling methods have beendeveloped, i.e., a small fraction of OSN users are sampledand used to estimate user characteristics. In the literature,random walk-based graph sampling methods have gainedpopularity [8]–[15]. In random walk sampling, a walker islaunched over a graph, which continuously moves from anode to one of its neighbors selected uniformly at random,to obtain a collection of node samples. These samples canyield unbiased estimates of graph characteristics [15,16].

Motivation. While random walk sampling is powerful, if agraph is loosely connected, e.g., consists of communities, itwill suffer from slow mixing [17], i.e., requires a long “burn-in” period to reach steady state, which results in the need ofa large number of samples in order to achieve good estima-tion accuracy. Recent studies have found that mixing timesin many real-world networks are larger than expected [18].To overcome the slow mixing problem, one approach is toincorporate uniform node sampling (UNI) into random walksampling, which is also known as random walk with jumps(RWwJ) [9,12,14]. In UNI, nodes are independently sampleduniformly at random by querying randomly generated nodeIDs in a graph. For example, users in many OSNs have uniquenumeric IDs, and UNI can be conducted by generating randomIDs in the user ID space and including those valid IDs intosamples. RWwJ then leverages UNI to perform jumps on agraph. Specifically, at each step of RWwJ, the walker jumpswith a probability determined by the node where it currentlyresides, to a node sampled by UNI. By incorporating UNIinto random walk sampling, the walker can jump out of acommunity or disconnected component of a graph, and avoidbeing trapped, thereby reducing mixing times [9].

The main drawback of RWwJ is that UNI can be resource-intensive when the valid node ID space of a graph is verysparsely populated. For example, the fractions of valid IDs

in MySpace and Flickr are only about 10% and 1.3% respec-tively [12], and as a result, one has to generate about 10 (or 77)random IDs to obtain a valid ID in MySpace (or Flickr). Thisproblem can become even worse in some practical situationsas illustrated by the following example.

Example 1. A restaurant company wants to build a newchain store in one of two small candidate cities in China. Amarket surveyor is sent to evaluate the consumption abilitiesof inhabitants of the two cities. In China, Sina Weibo [19]is the most popular microblogging website and provides acheck-in service1 that enables people to share consumptioninformation with their friends. For example, a Weibo usercan share the location of a restaurant and photos of herdinner with her friends via a mobile application. Since mostcitizens use this check-in service to share their consumptioninformation on Weibo, the surveyor decides to use Weibo asa platform to conduct his research. To measure the averageconsumption abilities of citizens in the two cities, he plansto uniformly sample two collections of Weibo users in thetwo cities respectively. It is known that every Weibo userID consists of ten digits ranging from “1000000000” to“5058913818” (as of March 25, 2014). He generates randomnumbers in this range as test IDs and finds that about 11%of the test IDs are valid Weibo users. However, because thepopulation sizes of the two cities are small, e.g., hundredsof thousands of citizens compared to hundreds of millions ofWeibo users, the percentage of a valid user residing in the twocities is on the order of 0.1%.

In the above example, the surveyor expects a test ID to fallinto one of the two cities to obtain a valid Weibo user sample,but UNI becomes extremely inefficient because the probabilitythat UNI obtains a valid sample equals P (ID is valid) ×P (ID falls into one city) = 0.11×0.001 ≈ 10−4. This resultsin the need for a surveyor to try 104 times on average toobtain a single valid user residing in one of the two cities.Even worse, in some OSNs such as Pinterest [21], userIDs are represented by arbitrary-length strings, which makesUNI practically impossible. Without the ability to efficientlyconduct UNI on a graph, RWwJ will also become inefficient.This raises the following question: how do we sample nodesefficiently on a graph when uniform node sampling is ineffi-cient or impractical at all?

Present Work. In the previous example, the problem is howto effectively sample Weibo users in the two cities, andUNI is inefficient because of the sparsity of user ID space.Since directly sampling users is inefficient, we propose tosample users in an indirect manner. We notice that check-ininformation (i.e., which user checked in which place) sharedby users often contain the venue information, e.g., the locationof the restaurant (i.e., a latitude-longitude coordinate on a map)

1The Check-in service [20] in Weibo allows a registered user to “check in”at venues using a mobile application by selecting from a list of venues nearbythe user, and the location of the user is usually based on GPS hardware inthe mobile device. In addition, a user can attach a text description or severalphotos associated with the venue while checking in.

where the user lunched. Most such OSNs provide APIs forquerying venues within an area specified by a rectangle re-gion with south-west and north-east corners latitude-longitudecoordinates given [22,23]. This function can be used to designefficient sampling methods for sampling venues in an area ona map [24]–[26]. Since we can easily sample venues within anarea of interest, we are then able to indirectly sample Weibousers in an area by relating users to venues through check-inrelations that exist between them. This will be more efficientthan directly sampling users in an area. We present the detaileddesign of this sampling method in Section III and evaluate itin Section IV.

An important lesson learnt from Example 1 is that, whendirect sampling of the user space is inefficient, we can switchto sample the venue space, and the relations that exist betweenthe two sample spaces allow us to efficiently sample the userspace. In general, we use three graphs to represent the twosample spaces and their relations. In Example 1, we considera venue as another type of node besides user node, and buildthe following three graphs: (1) a user graph formed by usersand their relations, (2) a venue graph formed by venues andtheir relations (the edge set is actually empty in Example 1),and (3) a bipartite graph formed by users, venues and theircheck-in relations. In Example 1, although directly samplingthe user graph is very difficult or extremely inefficient, we caneasily sample the venue graph, and use the bipartite graph toconnect the two sample spaces to indirectly sample the usergraph with efficiency.

Because the affiliation relationship between users andvenues plays an important role in this method, we refer tothe three graphs jointly as a hybrid social-affiliation network.The formal definition of hybrid social-affiliation network willbe given in Section II, and the detailed design of the samplingmethods on hybrid social-affiliation networks will be presentedin Section III.Contributions. We make three contributions in this work:• We introduce the concept of hybrid social-affiliation

network and formulate a sampling problem over it tocharacterize graphs. (Section II).

• We design three sampling methods over such a hybridsocial-affiliation network. These methods allow us toefficiently sample a graph when it is difficult to samplethe graph directly (Section III).

• We conduct extensive experiments to validate the pro-posed methods on both synthetic and real-world datasets(Section IV).

II. PROBLEM DEFINITION

We begin with introducing the graph characteristics ofinterest to us, and then formally define the hybrid social-affiliation network.

A. Graph Characteristics

We model an OSN as an undirected graph G(U , E), where Uand E are sets of users and relations among users, respectively.Users in the graph are labeled. We let L be a set of user labels

associated with the graph, and each user is mapped to a subsetof labels he owns by a characteristic function L : U 7→ 2L.For example, if L = {male, female}, then L(u) represents thegender of user u.

In many applications, we are interested in estimating thefractions of users having some labels, e.g., the fraction ofmale/female customers buying a product. This can be rep-resented by the label distribution {θs}s⊆L, where θs is thefraction of users with labels s in graph G. That is

θs =1

n

∑u∈U

1 {s ⊆ L(u)} , s ⊆ L,

where n = |U| is the graph size, 1 {·} is the indicator functionand 1 {C} = 1 if condition C is true; otherwise 1 {C} = 0.

With this definition of graph characteristics, our objectiveof this paper is to design an algorithm for collecting nodesamples within a budget B from graph G and for providingunbiased estimates of {θs}s⊆L with low statistical error.

B. Hybrid Social-Affiliation Network

Example 1 motivates us to introduce a hybrid social-affiliation network which can be used to design efficientestimators when direct sampling of graph G is difficultor inefficient. According to our previous analysis, a hybridsocial-affiliation network consists of three graphs: G(U , E),G′(V, E ′), and Gb(U ,V, Eb), where U ,V are sets of nodes,and E , E ′, Eb are sets of edges. More specifically,

• G(U , E) is the target graph, whose characteristics{θs}s⊆L are of interest to us and need to be measured,e.g., the user social network in Example 1.

• G′(V, E ′) is an auxiliary graph, which can be moreefficiently sampled than the target graph, e.g., the venuegraph (with E ′ = ∅) in Example 1.

• Gb(U ,V, Eb) is an affiliation graph [27, Chapter 8], whichis a bipartite graph connecting nodes in the target andauxiliary graphs, e.g., the graph formed by users, venuesand their check-in relations in Example 1.

An example of such a hybrid social-affiliation networkis illustrated in Fig. 1, from which we observe that thedisconnected target graph becomes better connected with theassistance of the auxiliary and affiliation graphs, so the targetgraph can be more efficiently sampled.

G

Gb

G′

auxiliary graph

affiliation graph

target graph

Fig. 1. Illustration of a hybrid social-affiliation network. Target graphtogether with auxiliary and affiliation graphs form a better connected graphthan target graph itself, which improves sampling efficiency on target graph.

In addition to Example 1, many other graph measurementproblems can be formulated as hybrid social-affiliation net-work sampling problems. To illustrate, we consider anotherexample.

Example 2. Mtime [28] is an online movie database in China,which comprises two types of accounts: users and actors.Users in Mtime can follow each other to form a social network.Movie actors can also form connections with each other if theycooperated in same movies. Moreover, a user can follow movieactors if he/she is a fan of the actor.

In the above example, if we want to measure the character-istics of the graph formed by Mtime users, and direct samplingof this user graph is inefficient because the user graph is notwell connected due to user interest differences, geographicconstraints, etc. (A detailed analysis of the Mtime networkcan be found in Section IV.) However, we can build a hybridsocial-affiliation network as follows:• Target graph G consists of Mtime users and their follow-

ing relations.• Auxiliary graph G′ consists of actors and their coopera-

tion relations.• Affiliation graph Gb consists of Mtime users, actors and

the fan relations between them.Unlike ordinary people, movie actors, especially pop stars,

are more easily to form connections with each other becausethey have more opportunities to participate in the same eventssuch as Oscar and Cannes. In other words, auxiliary graph ismore likely to be well connected than target graph. We canleverage this feature to measure target graph characteristicsmore efficiently.

III. SAMPLING DESIGN ON HYBRID SOCIAL-AFFILIATIONNETWORKS

In this section, we design three methods for characterizinga target graph by sampling a hybrid social-affiliation network.

A. Indirectly Sampling Target Graph by Vertex Sampling onAuxiliary Graph (VSA)

The first method is based on the assumption that vertexsampling, e.g., UNI, is easy to conduct on the auxiliary graphbut not on the target graph, as is the case in Example 1. Wepresent a sampling method VSA to indirectly sample the targetgraph under this setting. The basic idea of VSA is illustratedin Fig. 2.

u1 u2 u3 · · · un G

v1 v2 v3 · · · vn′ G′

pv1 pv2 pv3 pvn′

Fig. 2. Illustration of VSA. Edges in target and auxiliary graphs are omitted.

In VSA, we assume that a node v ∈ V can be sampled withprobability pv in auxiliary graph G′. For example, when graphG′ supports uniform node sampling, then pv = 1/n′,∀v ∈ V ,where n′ = |V| is the size of graph G′. When a node v ∈ Vis sampled, we collect all of its neighbors in the affiliationgraph as samples. We will describe how these samples canyield unbiased estimates of target graph characteristics. Thedetailed design of VSA is described in the follows.Sampling Design. Sampling design of VSA consists of thefollowing two steps:

(i) Sample a collection of B′ nodes with replacement in theauxiliary graph G′. Denote these samples as sequenceS ′ = [y1, . . . , yB′ ].

(ii) For each v ∈ S ′, let Uv ⊆ U denote the set of neighborsof v in Gb. Include the pair (v,Uv) into sequence S =[(y1,Uy1), . . . , (yB′ ,UyB′ )].

Using sequence S , VSA estimates target graph characteris-tics {θs}s⊆L according to following estimators.Estimators. If in advance, we know the graph size of G is n,we can use the following estimator to estimate θs,

θVSA

s =1

nB′

B′∑i=1

1

pyi

∑u∈Uyi

1 {s ⊆ L(u)}d(b)u

, (1)

where d(b)u is the degree of node u in affiliation graph Gb.When n is unknown, we introduce another estimator

θVSA

s =1

Z

B′∑i=1

1

pyi

∑u∈Uyi

1 {s ⊆ L(u)}d(b)u

, (2)

where

Z =

B′∑i=1

1

pyi

∑u∈Uyi

1

d(b)u

.

The following theorem guarantees the unbiasedness of thetwo estimators.

Theorem 1. Estimator (1) is an unbiased estimator of θs.Estimator (2) is an asymptotically unbiased estimator of θs.

Proof. To prove Estimator (1) is unbiased, we show that

E[θVSA

s

]=

1

nB′

B′∑i=1

E

1

pyi

∑u∈Uyi

1 {s ⊆ L(u)}d(b)u

=

1

n

∑v∈V

pv1

pv

∑u∈Uv

1 {s ⊆ L(u)}d(b)u

=1

n

∑u∈U

1 {s ⊆ L(u)}

= θs.

The second equality holds because that y1, . . . , yB′ are i.i.drandom variables. The third equality holds because that whenwe merge two summations into one summation with respectto u, an item in the inner summation is added d(b)u times foreach u. Hence, θVSA

s is unbiased.

To prove that estimator (2) is asymptotically unbiased, wecan use the ratio form of the law of large numbers in [16,Theorem 17.2.1 on P. 428]. Hence,

limB′→∞

θVSA

s =E[nB′θVSA

s

]E [Z]

= θs,

where E [Z] = nB′ can be proved in a similar way as of theproof of unbiasedness of θVSA

s .

Remark. It is important to know that VSA can provideunbiased estimates of target graph characteristics under thecondition that every node in the target graph is connected tonodes in the auxiliary graph. Because VSA can only samplenodes in U satisfying d

(b)u > 0 according to the design of

VSA. If a node u is not connected to any node in V , u cannotbe indirectly sampled by VSA. In Example 1, since we areonly interested in users who share their check-ins in Weibo,therefore Example 1 satisfies this condition.

B. Random Walk on Target Graph Incorporating with VertexSampling on Auxiliary Graph (RWTVSA)

In some situations, d(b)u = 0 for some u ∈ U . For example,some Mtime users in Example 2 may not follow any movieactor, and these users cannot be sampled by VSA (and as aresult, VSA can not provide unbiased estimates of Mtime usercharacteristics). To address this issue, we propose a secondsampling method RWTVSA, which combines random walksampling on the target graph with vertex sampling on theauxiliary graph.

The basic idea of RWTVSA is that, we launch a random walkon the target graph, and at each step allow the walker to jumpwith a probability dependent on the node of which it currentlyresides. This is similar to RWwJ [9,12,14] on the target graphG, but with the major difference that in RWTVSA the walkerjumps to a node in G by jumping first to a node in G′, andthen randomly selecting one of its neighbors in Gb. We refer tothis as an indirect jump, and show in experiments that indirectjumps in RWTVSA bring similar benefits as the direct jumpsin RWwJ. An additional advantage of using random walk onthe target graph is that it better characterizes highly connectednodes than uniform node sampling as random walks are biasedtowards high degree nodes in G. We depict RWTVSA in Fig. 3,where each node in G is connected to a virtual jumper nodeto conduct indirect jumps, through doing vertex sampling overauxiliary graph G′.

Similar to VSA, we assume that a node v in G′ can besampled with probability pv . In RWTVSA, we virtually connecteach node u ∈ U to a jumper node j with edge (u, j). Eachedge (u, j) is assigned a weight ωu, and the walker residing atnode u moves to j to perform an indirect jump with probability

puj =ωu

du + ωu.

To determine {ωu}u∈U , we note that in an indirect jump, thewalker jumps from j to node u with probability

pju = qu =∑v∈Vu

pv

d(b)v

, (3)

u1 u2 u3 · · · un

j

G

Jumper node

v1 v2 v3 · · · vn′ G′

pv1 pv2 pv3 pvn′

ωu1 ωu2 ωu3 ωun

(i)(ii)

Fig. 3. Illustration of RWTVSA and indirect jump. Each node u in Gis virtually connected to a jumper node j with weight ωu. An indirect jumpis performed by: (i) randomly sampling a node v in G′, and (ii) randomlychoosing a neighbor of v in Gb as the target node to jump to.

where Vu ⊆ V is the set of neighbors of u in the affiliationgraph Gb, and d(b)v is the degree of node v in Gb. By settingthe weights to satisfy

ωu = αqu, u ∈ U , (4)

for any α ≥ 0, we ensure that the stationary probabilities ofthe random walk visiting u ∈ U , and j are

πu =du + ωu

2|E|+ 2αand πj =

α

2|E|+ 2α.

Note that if d(b)u = 0, then puj = pju = 0, and the walkercannot jump from u (or to u). However, u can still be sampledby the random walk on G.

RWTVSA exhibits similar properties as RWwJ. That is,when α = 0, RWTVSA becomes a simple random walk onthe target graph, and when α → ∞, RWTVSA is equivalentto VSA.Sampling Design. Suppose the random walk starts at nodex1 ∈ U , and at step i the random walk is at node xi.We calculate the probability qxi

according to Eq. (3) andωxi

= αqxi. At step i, the walker jumps with probability

ωxi/(dxi +ωxi); otherwise, the walker moves to a neighbor uof xi chosen uniformly at random and xi+1 = u. An indirectjump is performed as follows:

(i) sample a node v ∈ V in the auxiliary graph withprobability pv .

(ii) sample a neighbor u of v uniformly at random in theaffiliation graph, and let xi+1 = u.

Estimator. Based on the sample path S = (xi, . . . , xB)collected by RWTVSA, an estimator for θs is

θRWTVSA

s =1

Z

B∑i=1

1 {s ⊆ L(xi)}dxi

+ ωxi

, (5)

where Z =∑B

i=1 1/(dxi+ ωxi

).

Theorem 2. Estimator (5) is asymptotically unbiased.

Proof. First, we know that when RWTVSA reaches steadystate, each node u is sampled with probability

pu =πu

1− πj=du + ωu

2|E|+ α. (6)

Next, let D ,∑B

i=1 1 {s ⊆ L(xi)} /(dxi+ ωxi

), then

E [D] = E

[B∑i=1

1 {s ⊆ L(xi)}dxi

+ ωxi

],

= B∑u∈U

pu1 {s ⊆ L(u)}du + ωu

=Bn

2|E|+ αθs.

Similarly, we can show that

E [Z] = E

[B∑i=1

1

dxi+ ωxi

],

= B∑u∈U

pu1

du + ωu=

Bn

2|E|+ α.

Now, we invoke Theorem 17.2.1 in [16, P. 428], which is theratio form of the law of large numbers, and we have

limB→∞

θRWTVSA

s =E [D]

E [Z]= θs.

Remark. Note that RWTVSA requires vertex sampling (e.g.,UNI) on the auxiliary graph G′. If vertex sampling is also notallowed on G′, RWTVSA cannot be applied. However, one canreplace the vertex sampling on G′ by a random walk on G′.Unfortunately, this naive approach can perform very poorlywhen the auxiliary graph G′ is not well connected, because apoorly connected graph can easily trap a simple random walkin a community. In what follows, we design a third method toaddress this challenge.

C. Random Walk on Target Graph Incorporating with RandomWalk on Auxiliary Graph (RWTRWA)

When both the target and auxiliary graphs do not supportvertex sampling, neither VSA nor RWTVSA can be applied.Therefore, we design the RWTRWA method to address thissetting. RWTRWA consists of two parallel random walks on Gand G′ respectively. The two parallel random walks cooperatewith each other, and can be considered as two RWwJs, asillustrated in Fig. 4. Different from RWTVSA, nodes in G andG′ are both virtually connected to two jumper nodes j and j′

to perform indirect jumps on G and G′ respectively.

u1 u2 u3 · · · un

j

G

ωu1 ωu2 ωu3 ωun

v1 v2 v3 · · · vn′

j′

G′

Jumper nodes

wv1 wv2 wv3 wvn′

Fig. 4. Illustration of RWTRWA and indirect jumps. Nodes in G andG′ are virtually connected to two jumper nodes j and j′ to perform indirectjumps, respectively. Indirect jumps are illustrated using dashed blue arcs.

The basic idea behind RWTRWA is as follows. Suppose thetwo random walks are RW on G and RW ′ on G′, and at stepi, they reside at xi ∈ U and yi ∈ V , respectively. If one randomwalk needs to jump at step i, say RW on G, then it jumps toa uniformly at random chosen neighbor of yi in the affiliationgraph, which is assigned to xi+1. Similar jumping procedurealso applies to RW ′ on G′. Hence, they are equivalent to twoRWwJs, and both can avoid being trapped on G and G′.

We need to determine edge weights {ωu}u∈U and {wv}v∈V ,which control the probability of jumping of the randomwalks on G and G′, respectively. Obviously, the stationarydistributions {πu}u∈U and {πv}v∈V of the two random walkson G and G′ are related to these weights. Here we can leverageour previous analysis of indirect jumps in RWTVSA, andderive that, when parameters ωu and wv satisfy the followingconditions

ωu = α∑v∈Vu

πv

d(b)v

, u ∈ U , (7)

wv = β∑u∈Uv

πu

d(b)u

, v ∈ V, (8)

for any α, β > 0, the stationary distributions of the tworandom walks on G and G′ (discarding states j and j′) are

πu =du + ωu

2|E|+ α, u ∈ U , (9)

πv =dv + wv

2|E ′|+ β, v ∈ V. (10)

Arranging Eqs. (7)–(10) in matrix forms, we obtain

πU =dU + ωU2|E|+ α

, πV =dV + wV2|E ′|+ β

, (11)

ωU = αAD−1V πV , wV = βATD−1U πU , (12)

where An×n′ is the adjacency matrix of Gb, ωU = [ωu]Tu∈U ,wV = [wv]Tv∈V , πU = [πu]Tu∈U , πV = [πv]Tv∈V , dU = [du]Tu∈Uand dV = [dv]Tv∈V are vectors, DU = diag(d

(b)u1 , . . . , d

(b)un) and

DV = diag(d(b)v1 , . . . , d

(b)vn′ ) are diagonal matrices.

Equations (11)–(12) uniquely determine ωU and wV , i.e.,

ω∗U=c(I−cc′AD−1V ATD−1U )−1AD−1V (dV+c′ATD−1U dU ),

w∗V=c′(I−cc′ATD−1U AD−1V )−1ATD−1U (dU+cAD−1V dV)

where c=α/(2|E ′|+ β) and c′=β/(2|E|+ α) are constants.The above results illustrate that, when α and β are given,

ωU and wV are uniquely determined. However, one needscomplete knowledge of G, G′ and Gb to determine theirvalues. While, we are interested in sampling the target graphwithout having to know either G, G′ or Gb in advance. In whatfollows, we design RWTRWA in a way that only makes useof local knowledge of these graphs during the random walks.

First, we need to know what happens when ωU 6= ω∗Uor wV 6= w∗V . If ωU deviates from ω∗U (e.g., ωU is givenand different from ω∗U ), we can still derive a “stationarydistribution” {πu} for the random walk over target graph byEq. (9). Because {wv} are related to {πu} according to Eq. (8),we can use local knowledge (by querying Uv from Gb) to

obtain wv,∀v, and derive a “stationary distribution” {πv} forthe random walk over auxiliary graph by Eq. (10). However,because ωU 6= ω∗U , Eq. (7) will not hold, and as a result, {πu}and {πv} will not be the stationary distributions.

To solve this contradiction, we notice that Eq. (7) actuallyrelates to the indirect jumps on target graph G. The walkerindirectly jumps to a node u ∈ U with probability qu , ωu/αaccording to our analysis in RWTVSA (see Eq. (3)). When ωUdeviates from ω∗U , we derive a different ω′u by Eq. (7), whichindicates the worker actually jumping to u with probabilityq′u , ω′u/α 6= qu. The walker expects to jump to a node thatfollows distribution {qu}, but it actually jumps to a node thatfollows distribution {q′u}. This is another way to explain thecontradiction. Fortunately, with this understanding, the contra-diction can be easily solved by applying a Metropolis-Hastings(MH) sampler [29, Chapter 7], by considering {qu} as thedesired distribution and {q′u} as the proposal distribution. Wecan use a MH sampler to build a Markov chain (referred asthe MH chain) that generates samples with desired distribution{qu}, and each time when the walker requires jumping, itjumps to a latest sample of MH chain. This guarantees thatthe walker jumps to u following distribution {qu}, and ensuresthat {πu} and {πv} are still the stationary distributions of therandom walks on target and auxiliary graphs.

Sampling Design. The complete sampling design ofRWTRWA comprises three parallel Markov chains as illus-trated in Fig. 5, and we need to specify a desired distribution{qu}u∈U in advance, e.g., a uniform distribution.

x1 · · · xi xi+1 · · ·RW on G:

x′1 · · · x′i x′i+1 · · ·MH chain:

y1 · · · yi yi+1 · · ·RW on G′:

Fig. 5. Three parallel Markov chains in RWTRWA.

•Random Walk on Auxiliary Graph G′: Suppose the randomwalk resides at node yi ∈ V at step i. Then we can calculatewyi

according to Eq. (8). At step i + 1, the random walkexecutes one of the following two steps.

Jump: With probability wyi/(dyi + wyi), the walker jumpsto a random neighbor v ∈ V of node xi in Gb, andyi+1 = v;

Walk: Otherwise, the walker moves to a random neighborv ∈ V of yi in G′, and yi+1 = v.

•Metropolis-Hastings (MH) Chain: Suppose the MH chainresides at node x′i at step i. At step i+1, we randomly choosea neighbor u ∈ U of yi in Gb. This is equivalent to sample anode u ∈ U with probability q′u.

Acceptance: With probability ri, we accept u and x′i+1 =u,where ri =min{1, (quq′x′

i)/(qx′

iq′u)};

Rejection: Otherwise, we reject u and x′i+1 = x′i.

•Random Walk on Target Graph G: Suppose the random walkresides at node xi ∈ U at step i. ωxi

= αqxiwhere qxi

isspecified in advance. At step i + 1, the walker executes oneof the following two steps.

Jump: With probability ωxi/(dxi

+ ωxi), the walker jumps

to x′i+1, and xi+1 = x′i+1;Walk: Otherwise, the walker moves to a random neighbor

u ∈ U of xi in G, and xi+1 = u.This sampling design ensures that we use only local knowl-

edge of the three graphs to obtain a sample path S =(x1, . . . , xB), which can yield unbiased estimates of targetgraph characteristics.

Estimator. We use the sample path S = (x1, . . . , xB) gener-ated by the random walk on G to design an estimator for θsas follows,

θRWTRWA

s =1

Z

B∑i=1

1 {s ⊆ L(xi)}dxi

+ ωxi

, (13)

where Z =∑B

i=1 1/(dxi+ ωxi

).

Theorem 3. Estimator (13) is asymptotically unbiased.

Proof. First we note that the random walk on target graph Ghas the same form of stationary distribution with RWTVSA

given by Eq. (6). (Note that we have removed jumper nodej when calculating πu in Eq. (9). Hence, πu in Eq. (9) isequivalent to pu in Eq. (6).) The remaining of the proof issimilar to the proof of Theorem 2.

IV. EXPERIMENTS

In this section, we conduct experiments on both syntheticand real datasets to evaluate our proposed methods. Our goalis to demonstrate the unbiasedness of proposed estimators andstudy their estimation errors with respect to different factorssuch as sampling budget B (i.e., the number of sampled nodesin target graph) and parameter values, i.e., α and β.

For a graph, it is natural to consider node degrees as labels,i.e., L = {0, 1, . . . ,M} where M is the maximum degree inthe graph. If we choose the characteristic function L(u) ={du}, and let θl , θ{l} = 1

n

∑u∈U 1 {l ∈ L(u)} , l ∈ L, then

θl is the fraction of nodes with degree l in the graph, and{θl}l∈L therefore is the degree distribution (or PDF) of thegraph. We can also consider choosing L(u) = {0, 1, . . . , du−1}, then θl is the fraction of nodes with degree larger thanl in the graph, and {θl}l∈L is the complementary cumulativedegree distribution (or CCDF) of the graph. To distinguish thetwo characteristic functions, in the following discussion, wewill use θl to denote the parameter for PDF and Θl for CCDF.2

A. Experiments on Synthetic Data

In our first experiment, we examine the soundness of theproposed sampling methods using synthetic data.

2Usually, CCDF is the plot of choice when people show degree distribution.Therefore, in some experiments, we mainly show the results of CCDF in thispaper, and results for PDF can be found in our technical report [30].

Synthetic Data. We generate a hybrid social-affiliation net-work by connecting three Barabasi-Albert (BA) graphs [31],namely G1, G2 and G3. Each BA graph contains 100,000nodes, and the three BA graphs have different average degrees:4, 10 and 20 respectively. G1 and G3 are connected by oneedge to form the target graph G. G2 is the auxiliary graph G′,and the affiliation graph Gb is formed by connecting nodes inG and G′ according to the following two steps:

1) connect every node in G to a randomly selected node inG′;

2) randomly choose 200,000 pairs of nodes in G and G′

and connect them to form the remaining edges in Gb.

The first step ensures that every node in U satisfies d(b)u > 0so that we can apply VSA method on this dataset.

Results and Analysis. First we demonstrate that the proposedestimators θVSA

l , θRWTVSA

l and θRWTRWA

l are indeed asymptoti-cally unbiased. To show this, we apply these sampling methodsto estimate θ2 and θ12, i.e., the fraction of nodes with degree2 and 12 in target graph, and compare their estimates to theground truth for different sampling budgets B (from 0 to0.01n, where n = |U|). The results are shown in Fig. 6. Itis clear to see that when sampling budget B increases, allestimators converge to the ground truth. Hence, the proposedestimators are asymptotically unbiased.

Next, we study the estimation error of each estimator forestimating the PDF and CCDF of degree distribution. Wechoose the normalized rooted mean squared error (NRMSE)as a metric to evaluate the estimation error of an estimator,which is defined as follows

NRMSE(θl) =

√E[(θl − θl)2

]θl

.

NRMSE measures the relative difference between an estimatedvalue and a real value. The smaller the NRMSE, the better anestimator is. NRMSE can also be defined on Θl, which weomit here. To compare the NRMSE of different estimators, wefix the sampling budget B to be 1% of the number of nodes inthe target graph, and calculate the averaged empirical NRMSEover 1, 000 runs. The results are shown in Fig. 7.

In Figure 7, we also show the NRMSE for a simplerandom walk (RW) estimator on the target graph. Because thetarget graph G has a bottleneck, i.e., its two components areconnected by a single edge. RW can hardly converge withinB = 0.01n steps. Therefore, we observe that NRMSE for RWis almost the largest among all estimators. Comparing VSA

with RW in Figs. 7(a) and 7(d), we find that VSA can providesmaller NRMSE for low degree nodes than RW. However,VSA produces larger NRMSE for large degree nodes than RW.Therefore, VSA can better estimate small degree nodes thanlarge degree nodes in a graph.

The weakness of VSA can be overcome by RWTVSA

and RWTRWA. From Figs. 7(b), 7(e) and 7(c), 7(f) we cansee that when indirect jumps are incorporated into randomwalks in RWTVSA and RWTRWA, NRMSE for large degree

0.0

0.1

0.2

0.3

0.4

0.5

0 500 1000 1500 2000

θV

SA

l

# of samples in target graph

θ2=0.250

θ12=0.052

(0.01n)

(a) θVSA

l

0.0

0.1

0.2

0.3

0.4

0.5

0 500 1000 1500 2000

θR

WT V

SA

l


θ2=0.250

θ12=0.052

(0.01n)

(b) θRWTVSA

l (α = 10)

0.0

0.1

0.2

0.3

0.4

0.5

0 500 1000 1500 2000

θR

WT R

WA

l


θ2=0.250

θ12=0.052

(0.01n)

(c) θRWTRWA

l (α = β = 10)

Fig. 6. Asymptotic unbiasedness of the estimators (l = 2, 12).

10-2

10-1

100

101

102

100 101 102 103

NR

MS

E

degree

RWVSA

(a) VSA PDF NRMSE

10-2

10-1

100

101

102

100 101 102 103

NR

MS

E

degree

RWRWTVSA (α=1)RWTVSA (α=10)RWTVSA (α=100)

(b) RWTVSA PDF NRMSE

10-2

10-1

100

101

102

100 101 102 103

NR

MS

E

degree

RWRWTRWA (α=β=1)RWTRWA (α=β=10)RWTRWA (α=β=100)

(c) RWTRWA PDF NRMSE

10-2

10-1

100

101

102

100 101 102 103

NR

MS

E

degree

RWVSA

(d) VSA CCDF NRMSE

10-2

10-1

100

101

102

100 101 102 103

NR

MS

E

degree

RWRWTVSA (α=1)RWTVSA (α=10)RWTVSA (α=100)

(e) RWTVSA CCDF NRMSE

10-2

10-1

100

101

102

100 101 102 103

NR

MS

E

degree

RWRWTRWA (α=β=1)RWTRWA (α=β=10)RWTRWA (α=β=100)

(f) RWTRWA CCDF NRMSE

Fig. 7. NRMSE for different estimators. (Each result is averaged over 1, 000 runs, and B = 0.01n.)

nodes decreases, and NRMSE for small degree nodes remainssmaller than RW. If we increase the probability of jumping ateach step of random walk by increasing α and β, we observethat NRMSE for small degree nodes decreases, but NRMSEfor large degree nodes increases. This behavior is similar toRWwJ [9,12] because we have declared that RWTVSA andRWTRWA are equivalent to RWwJs in their designs.

B. Experiments on LBSN Datasets

In the second experiment, we apply VSA method on tworeal-world location-based social network (LBSN) datasets tosolve the problem mentioned in Example 1, i.e., measure usercharacteristics within an area of interest.

LBSN Datasets. We obtain two public LBSN datasets fromBrightkite and Gowalla [32]. Brightkite and Gowalla areonce two popular LBSNs where users shared their locationsby checking-in. Users in the two social networks are also

connected by undirected friendship relations, which form twouser social networks. The statistics of these two datasets aresummarized in Table I.

Because we are only interested in users that have check-ins, VSA can be applied to these two datasets. Suppose thewe want to measure characteristics of users located aroundNew York City (NYC), as specified by a rectangle region on amap: latitude range 40.4◦ ∼ 41.4◦, longitude range −74.3◦ ∼−73.3◦ (see Fig. 8). The goal is to estimate degree distributionof the users who checked in this region. As we explained inIntroduction, directly sampling users is inefficient. Here, weapply the VSA method along with a venue sampling method— Random Region Zoom-In (RRZI) [26] to sample users inNYC more efficiently.Venue Sampling. RRZI [26] utilizes a venue query APIprovided by most LBSNs to sample venues on a map. TheAPI requires a user to specify a rectangle region by provid-

TABLE ISUMMARY OF TWO LBSN DATASETS.

dataset Brightkite Gowalla

G

network type undirected undirected# of users 58, 228 196, 591# of friendship edges 214, 078 950, 327# of users in LCC1 56, 739 196, 591# of edges in LCC 212, 945 950, 327

G′ and Gb

# of venues 772, 966 1, 280, 969# of users having check-ins 51, 406 107, 092# of check-ins 4, 491, 143 6, 442, 890

G′ and Gb

for NYC

# of venues in NYC2 23, 484 26, 448# of users checking in NYC 4, 257 7, 399# of check-ins in NYC 33, 656 113, 423

1 The largest connected component.2 The New York City (Fig. 8).

40.4

40.5

40.6

40.7

40.8

40.9

41

41.1

41.2

41.3

41.4

-74.3 -74.2 -74.1 -74 -73.9 -73.8 -73.7 -73.6 -73.5 -73.4 -73.3

latit

ude

longitude

Queens

Stamford

New YorkManhattan

Fig. 8. Venue distribution in New York City and illustration of accessiblesubregions used by RRZI. Each subregion contains less than K venues.

ing the south-west and north-east corners latitude-longitudecoordinates, and then the API returns a set of venues in thisregion. Usually, the API can only return at most K venues ina queried region. RRZI regularly zooms in the region until thesubregion is fully accessible, i.e., the API returns less than Kvenues in the subregion. The zooming-in process is equivalentto dividing the region into many non-overlapping accessiblesubregions (as illustrated in Fig. 8), and each subregion isassociated with a fixed probability related to the zooming-in strategy. This feature enables RRZI to provide samples ofvenues within an area of interest.

Results. Combining VSA with RRZI, we conduct two ex-periments to indirectly sample users in NYC on Brightkiteand Gowalla. We totally sample 5% of venues in NYC andcalculate the degree distribution of users in NYC. The resultsare shown in Figs. 9 and 10.

From Figures 9(a) and 10(a), we observe that RRZI-VSA

method can provide good estimates of user characteristics inNYC. The estimates for low degree users are better than highdegree users, and this is clear to see from the NRMSE plots inFigs. 9(b), 9(c) and 10(b), 10(c). This feature coincides withour previous analysis using synthetic data. From the NRMSEplots, we can also find an approximate law that a larger K, i.e.,the maximum number of venues the API can return, reducesthe estimation error of RRZI-VSA. However, it is not true for

estimating large degree users on Gowalla in Fig. 10(c). In fact,a better way to reduce estimation error is to combine VSA withother better venue sampling methods introduced in [24]–[26].However, we omit this due to space limitation.

C. Experiments on Mtime Dataset

In the third experiment, we apply RWTVSA and RWTRWA

on Mtime to measure Mtime user characteristics as we haveintroduced in Example 2.Mtime Dataset. As we introduced in Example 2, users andactors in Mtime naturally form a hybrid social-affiliationnetwork. To build a ground-truth dataset as the testbed ofRWTVSA and RWTRWA, we downloaded the complete Mtimenetwork data by traversing user IDs ranging from 100000 to10000000, and actor IDs ranging from 892000 to 2100000.

For each Mtime user, we collect the set of users he/shefollows and users who follow him. This builds up a directedfollower network among Mtime users. Each Mtime usermaintains a list including a subset of movie actors he/she isinterested in. This information is used to build up the fan-relations between users and actors. For each movie actor, wecollect the movies he/she participated in, and if two actorsparticipated in a same movie, we connect them. This buildsup a cooperative network among actors. The complete Mtimedataset is summarized in Table II.

TABLE IISUMMARY OF THE MTIME DATASET.

G

user follower network type directedtotal users (isolated and non-isolated)3 1, 878, 127# of non-isolated users in follower network 1, 035, 164# of following relations 14, 861, 383# of users in LCC 987, 055# of following relations in LCC 14, 791, 482

G′

actor cooperative network type undirectedtotal actors (isolated and non-isolated) 1, 123, 340# of non-isolated actors in cooperative network 1, 122, 166# of cooperative relations 10, 344, 364# of actors in LCC 1, 114, 065# of cooperative relations in LCC 10, 328, 904

Gb

# of fan relations 225, 558, 343# of users following actors 1, 419, 339# of isolated users following actors 842, 963# of actors having fans 441, 413# of isolated actors having fans 1, 174# of isolated actors having only isolated fans 225# of isolated users following only isolated actors 393

3 An isolated node in a graph is a node with zero degree.

Analysis of the Dataset. First we provide some statistics ofMtime dataset. In Table II, we compare the first block withsecond block, which are related to target graph G and auxiliarygraph G′ respectively. We find that about 19% of the userIDs and 93% of the actor IDs are valid. This indicates thatconducting UNI on the auxiliary graph is more efficient thanconducting UNI on the target graph. Moreover, we find thatmore than 47% of the Mtime users are not in LCC, but thesame number for actors is less than 0.1%. This indicates thatthe auxiliary graph is better connected than the target graph.Although a large fraction of users are isolated nodes in the

10-4

10-3

10-2

10-1

100

100 101 102 103

CC

DF

degree

ground truthK=10K=20K=50

(a) RRZI-VSA estimates

10-1

100

101

102

100 101 102 103

NR

MS

E

degree

K=10K=20K=50

(b) RRZI-VSA PDF NRMSE

10-1

100

101

102

100 101 102 103

NR

MS

E

degree

K=10K=20K=50

(c) RRZI-VSA CCDF NRMSE

Fig. 9. User characteristics estimation in NYC on Brightkite. (B′ = 0.05n′ and each result is averaged over 1000 runs.)

10-510-410-310-210-1100

100 101 102 103 104

CC

DF

degree

ground truthK=10K=20K=50

(a) RRZI-VSA estimates

10-1

100

101

102

100 101 102 103 104

NR

MS

E

degree

K=10K=20K=50

(b) RRZI-VSA PDF NRMSE

0.05

20

0.1

1

10

100 101 102 103 104

NR

MS

E

degree

K=10K=20K=50

(c) RRZI-VSA CCDF NRMSE

Fig. 10. User characteristics estimation in NYC on Gowalla. (B′ = 0.05n′ and each result is averaged over 1000 runs.)

target graph, from the last block (regarding the affiliation graphGb), we find that almost all the isolated users are connectedto non-isolated actors (except a few hundreds of them). Sothe majority of isolated users are indirectly connected toother users through actors. This is illustrated in Fig. 11. Theadvantage of introducing the hybrid social-affiliation networkis now clear for Mtime dataset, i.e., we can study a largeruser space than simply the LCC of target graph (when UNIon target graph is inefficient or not allowed).

LCC of G(987, 055 users)

Isolated parts of G(858, 733 users)

LCC of G′

(1, 114, 065 actors)Isolated parts of G′

(1, 658 actors)

Fig. 11. The Mtime network components. Dashed red lines denote fanrelations between actors and users.

Results. Using the Mtime dataset as a testbed, we demonstratethat RWTVSA and RWTRWA methods can provide goodestimates of user characteristics. Although the user followernetwork is directed, we can build an undirected version of thetarget graph on-the-fly while sampling because a user’s in-coming and out-going neighbors are known once the user isqueried [10,12]. Different from previous experiments, here theuser labels can be in-degrees or out-degrees. Hence, we will

evaluate the in-degree and out-degree distribution estimationsof the two estimators, respectively.

Results of the RWTVSA method are depicted in Fig. 12.In Fig. 12(a) and (e), we show the in-degree and out-degreeCCDF estimates. We can see that RWTVSA can provideunbiased estimates. From Fig. 12(b) and (f), we observe thatwhen sampling budget increases, the NRMSE decreases forboth in-degree and out-degree estimations. From Fig. 12(c)and (g), we observe that when more jumps are allowed byincreasing α from 1 to 100, estimation accuracy also increases.

Results for RWTRWA are similar to the results of RWTVSA,and we show them in Fig. 13. First, from Fig. 13(a) and(e), we observe that RWTRWA can also well estimate the in-degree and out-degree distributions. Second, from Fig. 13(b)and (f), we can find that when sampling budget increases, theestimation accuracy increases significantly for both in-degreeand out-degree estimations. Last, from Fig 13(c) and (g), wefind that when more jumps are allowed (by increasing α andβ), the NRMSE also decreases.

However, it is worth noting that α and β should not betoo large for both RWTVSA and RWTRWA. Because weknow that when α → ∞, RWTVSA becomes VSA, which isbiased on the Mtime dataset, and hence causes large NRMSE.Similar behavior happens to RWTRWA, too. We depict theseobservations in Figs. 12(d), 12(h) and 13(d), 13(h).

V. RELATED WORK

We briefly review the related literature in this section.Sampling methods, especially random walk-based graph

sampling methods, have been widely used to characterizelarge-scale complex networks. These applications include, butare not limited to, estimating peer statistics in peer-to-peernetworks [8,33], uniformly sampling users from online social

10-710-610-510-410-310-210-1100

0 100 101 102 103 104 105 106

CC

DF

in-degree

ground truthB=0.01nB=0.02nB=0.05n

(a) In-degree estimates (α = 1)

0.5

0.6

0.7

0.8

0.9

0 100 101 102 103 104 105 106

NR

MS

E

in-degree

B=0.01nB=0.02nB=0.05n

(b) CCDF NRMSE (α = 1)

0.3 0.4 0.5 0.6 0.7 0.8 0.9

0 100 101 102 103 104 105 106

NR

MS

E

in-degree

α=1α=10α=100

(c) CCDF NRMSE (B = 0.01n)

0.3 0.4 0.5 0.6 0.7 0.8 0.9

1

0 100 101 102 103 104 105 106

NR

MS

E

in-degree

α=100α=1000

(d) Too frequent jumping (B = 0.01n)

10-710-610-510-410-310-210-1100

0 100 101 102 103 104

CC

DF

out-degree


(e) Out-degree estimates (α = 1)

0.5 0.6 0.7 0.8 0.9

1

0 100 101 102 103 104

NR

MS

E

out-degree

B=0.01nB=0.02nB=0.05n

(f) CCDF NRMSE (α = 1)

0.3 0.4 0.5 0.6 0.7 0.8 0.9

1

0 100 101 102 103 104

NR

MS

E

out-degree

α=1α=10α=100

(g) CCDF NRMSE (B = 0.01n)

0.3 0.4 0.5 0.6 0.7 0.8 0.9

1

0 100 101 102 103 104

NR

MS

E

out-degree

α=100α=1000

(h) Too frequent jumping (B = 0.01n)

Fig. 12. RWTVSA degree distribution estimates and NRMSE analysis. Each result is averaged over 10, 000 runs.

10-710-610-510-410-310-210-1100

0 100 101 102 103 104 105 106

CC

DF

in-degree


(a) In-degree estimates (α = β = 0.1)

0.1

0.2

0.3

0.4

0.5

0 100 101 102 103 104 105 106

NR

MS

E

in-degree

B=0.01nB=0.02nB=0.05n

(b) CCDF NRMSE (α = β = 0.1)

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

0 100 101 102 103 104 105 106

NR

MS

E

in-degree

α=β=0.01α=β=0.05α=β=0.1

(c) CCDF NRMSE (B = 0.01n)

0

0.5

1

1.5

2

0 100 101 102 103 104 105 106

NR

MS

E

in-degree

α=β=0.1α=β=10

(d) Too frequent jumping (B = 0.01n)

10-710-610-510-410-310-210-1100

0 100 101 102 103 104

CC

DF

out-degree


(e) Estimates (α = β = 0.1)

0.1

0.2

0.3

0.4

0.5

0 100 101 102 103 104

NR

MS

E

out-degree

B=0.01nB=0.02nB=0.05n

(f) CCDF NRMSE (α = β = 0.1)

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

0 100 101 102 103 104

NR

MS

E

out-degree

α=β=0.01α=β=0.05α=β=0.1

(g) CCDF NRMSE (B = 0.01n)

0

0.5

1

1.5

2

0 100 101 102 103 104

NR

MS

E

out-degree

α=β=0.1α=β=10

(h) Too frequent jumping (B = 0.01n)

Fig. 13. RWTRWA degree distribution estimates and NRMSE analysis. Each result is averaged over 10, 000 runs.

networks [11,13,14,34], characterizing structure properties oflarge-scale networks [35]–[38], and measuring statistics ofpoint-of-interests on maps [26]. The above literature is mostlyconcerned with sampling methods that seek to directly samplenodes (or samples) in target graphs (or sample spaces). How-ever, direct sampling is not always efficient as we argued inthis work.

When the target graph (or sample space) can not be directlysampled or direct sampling is inefficient, several methodsbased on graph manipulation have been proposed to improvesampling efficiency. For example, Gjoka et al. [39] study anapproach to improve sampling efficiency through building a

multigraph using different kinds of relations (i.e., edges) thatexist on an OSN. A multigraph is better connected than anyindividual graph formed by only one kind of relations. There-fore, the random walk can converge fast on this multigraph.Zhou et al. [40] exploit several criteria to rewire the targetgraph on-the-fly to increase the graph conductance [17] andreduce mixing time of random walks. Our method differs fromtheirs that we do not manipulate target graphs. We study anew approach that utilizes an auxiliary graph and an affiliationgraph to assist sampling on target graph indirectly.

Birnbaum and Sirken [41] designed a survey method forestimating the number of diagnosed cases of a rare disease in

a population. Directly sampling patients of a rare disease fromthe huge human population is obviously inefficient, so theystudied how to sample hospitals so as to sample patients indi-rectly. Their method motivates us to design the VSA method.However, as we pointed out, VSA method cannot sample nodesthat are not connected to auxiliary graph, and we overcome thisproblem by designing RWTVSA and RWTRWA methods. Ourwork also complements existing sampling methods related torandom walk with jumps [9,12,14] by removing the necessityof uniform node sampling on target graphs.

VI. CONCLUSION

When graphs become large in scale, sampling methodsbecome necessary tools in the study of characterizing theirproperties. Among these sampling methods, random walk-based crawling methods have gained popularity. However, ifthe graph under study is not well connected, random walk-based graph sampling methods suffer from the slow mixingproblem. In this work, we observe that a graph usually doesnot exist in isolation. Usually, the target graph is accompaniedwith an auxiliary graph and an affiliation graph, and they forma hybrid social-affiliation network together. We find that thetarget graph becomes better connected with the assistances ofthe other two graphs. This new viewpoint brings benefits to thegraph sampling framework. We design three sampling methodsto measure the target graph from this new viewpoint, and thesemethods are demonstrated to be effective on both syntheticand real datasets. Therefore, our method complements existingmethods in the literature of graph sampling.

ACKNOWLEDGEMENT

This research was sponsored in part by the ARL CooperativeAgreement W911NF-09-2-0053 of US; National Natural Sci-ence Foundation (61221063, U1301254, 61202392, 61103240,61103241, 91118005), 863 High Tech Development Plan(2012AA011003) and 111 International Collaboration Pro-gram, of China.

REFERENCES

[1] D. J. Watts, “The new science of networks,” Annual Review of Sociology,vol. 30, no. 1, pp. 243–270, 2004.

[2] D. Lazer and et al., “Computational social science,” Science, vol. 323,no. 5915, pp. 721–723, 2009.

[3] S. Aral and D. Walker, “Identifying influential and susceptible membersof social networks,” Science, vol. 337, no. 6092, pp. 337–341, 2012.

[4] S. Asur and B. A. Huberman, “Predicting the future with social media,”in WI-IAT, 2010.

[5] J. Bollen, H. Mao, and X.-J. Zeng, “Twitter mood predicts the stockmarket,” Journal of Computational Science, vol. 2, no. 1, pp. 1–8, 2011.

[6] M. Mondal, B. Viswanath, P. Druschel, K. P. Gummadi, A. Clement,A. Mislove, and A. Post, “Defending against large-scale crawls in onlinesocial networks,” in CoNEXT, 2012.

[7] “Weibo API limits,” http://open.weibo.com/wiki/Rate-limiting, 2014.[8] L. Massoulie, E. L. Merrer, A.-M. Kermarrec, and A. Ganesh, “Peer

counting and sampling in overlay networks: Random walk methods,” inPODC, 2006.

[9] K. Avrachenkov, B. Ribeiro, and D. Towsley, “Improving random walkestimation accuracy with uniform restarts,” in WAW, 2010.

[10] B. Ribeiro and D. Towsley, “Estimating and sampling graphs withmultidimensional random walks,” in IMC, 2010.

[11] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou, “Practicalrecommendations on crawling online social networks,” JSAC, vol. 29,no. 9, pp. 1872–1892, 2011.

[12] B. Ribeiro, P. Wang, F. Murai, and D. Towsley, “Sampling directedgraphs with random walks,” in INFOCOM, 2012.

[13] C.-H. Lee, X. Xu, and D. Y. Eun, “Beyond random walk and Metropolis-Hastings samplers: Why you should not backtrack for unbiased graphsampling,” in SIGMETRICS, 2012.

[14] X. Xu, C.-H. Lee, and D. Y. Eun, “A general framework of hybrid graphsampling for complex network analysis,” in INFOCOM, 2014.

[15] P. Wang, J. Zhao, J. C. S. Lui, D. Towsley, and X. Guan, “Samplingnode pairs over graphs,” in ICDE, 2013.

[16] S. Meyn and R. L. Tweedie, Markov Chains and Statistic Stability,2nd ed. Cambridge University Press, 2009.

[17] A. Sinclair and M. Jerrum, “Approximate counting, uniform generationand rapidly mixing markov chains,” Information and Computation,vol. 82, no. 1, pp. 93–133, 1989.

[18] A. Mohaisen, A. Yun, and Y. Kim, “Measuring the mixing time of socialgraphs,” in IMC, 2010.

[19] “Sina Weibo,” http://weibo.com, July 2014.[20] “Weibo place,” http://place.weibo.com, July 2014.[21] “Pinterest,” http://www.pinterest.com, July 2014.[22] “Weibo search API,” http://open.weibo.com/wiki/2/location/pois/search/

by area, July 2014.[23] “Foursquare search API,” https://developer.foursquare.com/docs/venues/

search, July 2014.[24] Y. Li, M. Steiner, L. Wang, Z.-L. Zhang, and J. Bao, “Dissecting

Foursquare venue popularity via random region sampling,” in CoNEXT,2012.

[25] Y. Li, L. Wang, M. Steiner, J. Bao, and T. Zhu, “Region sampling andestimation of geosocial data with dynamic range calibration,” in ICDE,2014.

[26] P. Wang, W. He, and X. Liu, “An efficient sampling method forcharacterizing points of interests on maps,” in ICDE, 2014.

[27] S. Wasserman and K. Faust, Social Network Analysis: Methods andApplications. Cambridge University Press, 1994.

[28] “Mtime,” http://www.mtime.com, July 2014.[29] C. P. Robert and G. Casella, Monte Carlo Statistic Methods, 2nd ed.

Springer, 2004.[30] “Technical report,” http://nskeylab.xjtu.edu.cn/dataset/jzzhao/

ICDE2015TR.pdf, Nov. 2014.[31] A. L. Barabasi and R. Albert, “Emergence of scaling in random

networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.[32] E. Cho, S. A. Myers, and J. Leskovec, “Friendship and mobility: User

movement in location-based social networks,” in KDD, 2011.[33] C. Gkantsidis, M. Mihail, and A. Saberi, “Random walks in peer-to-peer

networks: Algorithms and evaluation,” Performance Evaluation, vol. 63,no. 3, pp. 241–263, 2006.

[34] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou, “Walking inFacebook: A case study of unbiased sampling of OSNs,” in INFOCOM,2010.

[35] L. Katzir, E. Liberty, and O. Somekh, “Estimating sizes of socialnetworks via biased sampling,” in WWW, 2011.

[36] S. J. Hardiman and L. Katzir, “Estimating clustering coefficients andsize of social networks via random walk,” in WWW, 2013.

[37] C. Seshadhri, A. Pinar, and T. G. Kolda, “Triadic measures on graphs:The power of wedge sampling,” in SDM, 2013.

[38] P. Wang, J. C. Lui, B. Ribeiro, D. Towsley, J. Zhao, and X. Guan,“Efficiently estimating motif statistics of large networks,” TKDD, 2014.

[39] M. Gjoka, C. T. Butts, M. Kurant, and A. Markopoulou, “Multigraphsampling of online social networks,” JSAC, vol. 29, no. 9, pp. 1893–1905, 2011.

[40] Z. Zhou, N. Zhang, Z. Gong, and G. Das, “Faster random walks byrewiring online social networks on-the-fly,” in ICDE, 2013.

[41] Z. W. Birnbaum and M. G. Sirken, “Design of sample surveys to estimatethe prevalence of rare diseases: Three unbiased estimates,” Vital andHealth Statistics, vol. 2, no. 11, pp. 1–8, 1965.

a tale of three graphs: sampling design on hybrid social ... · sampling framework, e.g., when...

Documents