network sampling: methods and applications · network sampling: methods and ... are within a...
TRANSCRIPT
8/12/2013
1
Network Sampling: Methods and Applications
Mohammad Al Hasan
Assistant Professor, Computer Science
Indiana University Purdue University, Indianapolis, IN
Nesreen K. Ahmed
Final year PhD Student, Computer Science
Purdue University, West Lafayette, IN
Jennifer Neville
Associate Professor, Computer Science
Purdue University, West Lafayette, IN
Tutorial Outline
• Introduction (15 minutes)
– Motivation, Challenges
• Sampling Background (15 minutes)
– Assumption, Definition, Objectives
• Network Sampling methods (full access, restricted access, streaming access)
– Estimating nodal or edge characteristics (45 minutes)
– Sampling representative sub‐networks (30 minutes)
– Sampling and counting of sub‐structure of networks (30 minutes)
• Conclusion and Q/A (15 minutes)
8/12/2013
2
Introductionmotivation and challenges
Network analysis in different domains
Protein‐Interaction network Political Blog network Social network
LinkedIn network Food flavor network Professional Network
8/12/2013
3
Network characteristics (1)
• Assume G(V, E) is a graph
–
• Average degree
– For any vertex v, d(v) represents the degree of v
– Average degree,
• Average clustering co‐efficient
– C(v), is the fraction of the (ordered) pairs such that ,
• Diameter of the network
– The maximum value for the length of a shortest path between a pair of vertices
– For disconnected network, we compute diameters only over the giant component
• Max k‐core: Maximum k‐value such that an induced subgraph exist in which every vertices in that subgraph has a minimum degree of k
| | , | |V n E m
1( )
v V
d d vn
,u w ( ), and ( , ), adj vu u w Ew
( )G
Network characteristics (2)• Degree distribution
– , for a degree value , is the fraction of vertices with that degree, ∑ 1
– For directed graph, we can consider both in‐degree distribution, and out‐degree distribution
• Hop‐plot distribution
– , for a integer , denotes the fraction of ordered pairs of vertices , that are within a distance of or less
– This is a cumulative distribution
• Clustering coefficient distribution
– ,for a degree value , is the average clustering coefficient over all the vertices with degree
• Distribution of between‐ness centrality, and close‐ness centrality of vertices
• Other network parameters are distribution of singular values of the graph adjacency matrix
8/12/2013
4
Network analysis tasks
• Study the node/edge properties in networks
– E.g., Investigate the correlation between attributes and local structure
– E.g., Estimate node activity to model network evolution
– E.g., Predict future links and identify hidden links
Network evolution
• 46% of online American adults 18 and older use a social networking site like MySpace, Facebook or LinkedIn
• 65% of teens 12‐17 use online social networks as of Feb 2008
8/12/2013
5
Link and communities!
Communities of physicists that work on social networks
Link Prediction: What is the probability that and will be connected in future?
Network analysis tasks
• Study the node/edge properties in networks
– E.g., Investigate the correlation between attributes and local structure
– E.g., Estimate node activity to model network evolution
– E.g., Predict future links and identify hidden links
• Study the connectivity structure of networks and investigate the behavior of processes overlaid on the networks
– E.g., Estimate centrality and distance measures in communication and citation networks
– E.g., Identify communities in social networks
– E.g., Study robustness of physical networks to attack
8/12/2013
6
Various centrality metrics
Red nodes has high closeness centrality Red nodes has high between‐ness centrality
Other centralities:Degree centralityEigenvector CentralityPagerank centrality
Centrality analysis is used to identify new pharmacological strategies
• Yellow node is the disease nodes, protein complexes are circles, and diamond nodes are drugs.
• Links between the disease node and protein complexes represent associations between genes involved in these complexes and the named disease, as specified by the Disease Ontology.
• A drug is connected to a protein complex if at least one protein target of the drug is also a subunit of the protein complex.
Source: Modularity in Protein Complex and Drug
Interactions Reveals New PolypharmacologicalProperties, Jose C. Nacher mail, and Jean‐Marc Schwartz, PloS One, 2012
8/12/2013
7
Network analysis tasks
• Study the node/edge properties in networks
– E.g., Investigate the correlation between attributes and local structure
– E.g., Estimate node activity to model network evolution
– E.g., Predict future links and identify hidden links
• Study the connectivity structure of networks and investigate the behavior of processes overlaid on the networks
– E.g., Estimate centrality and distance measures in communication and citation networks
– E.g., Identify communities in social networks
– E.g., Study robustness of physical networks to attack
• Study local topologies and their distributions to understand local phenomenon
– E.g., Discovering network motifs in biological networks
– E.g., Counting graphlets to derive network “fingerprints”
– E.g., Counting triangles to detect Web (i.e., hyperlink) spam
Graphlet histogram is used for building network fingerprints
• Build fingerprints for large networks through frequency counts of graphlets
• Useful in anomaly detection (e.g., security applications) and differentiate network from different domains (e.g., biology applications)
8/12/2013
8
Computational complexity makes analysis difficult for very large graphs
• Best time complexities for various tasks: vertex count (n), edge count (m)
– Computing centrality metrics
– Community Detection using Girvan‐Newman Algorithm,
– Triangle counting .
– Graphlet counting for size k
– Eigenvector computation
• Pagerank computation uses eigenvectors
• Spectral graph decomposition also requires eigenvalues
For graphs with billions of nodes, none of these tasks can be solved in a reasonable amount of time!
Other issues for network analysis
• Many networks are toomassive in size to process offline
– In October 2012, Facebook reported to have 1 billions users. Using 8 bytes for userID, 100 friends per user, storing the raw edges will take 1 Billions x 100 x 8 bytes = 800 GB
• Some network structure may hidden or inaccessible due to privacy concerns
– For example, some networks can only be crawled by accessing the one‐hop neighbors of currently visiting node – it is not possible to query the full structure of the network
• Some networks are dynamic with structure changing over time
– By the time a part of the network has been downloaded and processed, the structure may have changed
8/12/2013
9
Network sampling motivation
• We can sample a set of vertices (or edges) and estimate nodal or edge properties of the original network
– E.g., Average degree and degree distribution
• Instead of analyzing the whole network, we can sample a small subnetwork similar to the original network
– Goal is to maintain global structural characteristics as much as possiblee.g., degree distribution, clustering coefficient, community structure, pagerank
• We can also sample local substructures from the networks to estimate their relative frequencies or counts
– E.g., sampling triangles, graphlets, or network motifs
Task 1
Task 2
Task 3
Sampling BackgroundAssumption, Definitions, Objectives
8/12/2013
10
Sampling scenarios
• Full access assumption
– The entire network is visible
– A random node or a random edge in the network can be selected
• Restricted access assumption
– The network is hidden, however it supports crawling, i.e. it allows to explore the neighbors of a given node.
– Access to one seed node or a collection of seed nodes are given.
• Streaming access assumption (limited memory and fast moving data)
– In the data stream, edges arrives in an arbitrary order (arbitrary edge order)
– In the data stream, edges incident to a vertex arrives together (incident edge order)
– Stream assumption is particularly suitable for dynamic networks
Sampling scenarios (cont.)
• For static and/or small network, computation model can assume that the entire network is in the memory
• For large but static network, part of the network can be loaded in the memory, but the entire network remains in disk or graph database
• Streaming scenario works well for both dynamic and large networks
8/12/2013
11
Sampling objectives (Task 1)
• Estimate network characteristics by sampling vertices (or edges) from the original networks
• Population is the entire vertex set (for vertex sampling) and the entire edge set (for edge sampling)
• Sampling is usually with replacement
Sample (S)
sampling
Estimating degree distribution of the original network
Task 1 applications (estimate node/edge attributes)
Sample (S)
sampling
• Full Network:
– Node‐set:
– Node‐attributes:
• Sample:
• Goal: Compute for some function faof node attribute a
Compute fa(S)using S
( , )G V E
1 2, , , nv v v
1 2( ), ( ), , ( ) for i i k i ia v a v a v v V
S V
( ) ( )a af S f V
8/12/2013
12
Sampling objectives (Task 2)
• Goal: From G, sample a subgraph with k nodes which preserves the value of key network characteristics of G, such as:
– Clustering coefficient, degree Distribution, diameter, centrality, and community structure
– Note that, the sampled network is smaller, so there is a scaling effect on some of the statistics; for instance, average degree of the sampled network is smaller
• Population: All subgraph of size k
Sample subgraph (GS)
sampling
Netw
ork C
haracte
ristics Original graph (G)
Sampling objective (Task 3)
• Sample sub‐structure of interest
– Find frequent induced subgraph (network motif)
– Sample sub‐structure for solving other tasks, such as counting, modeling, and making inferences
Sample (S)
sampling
Build patterns for graph mining applications
8/12/2013
13
Evaluation of sample quality
• For evaluating scalar measurement, such as average degree, average clustering coefficients, effective diameter
– Analytically prove that the sampler provides unbiased estimate
– Empirically estimate the accuracy of sample mean and variance
• For comparing two distributions
– Kolmogorov‐Smirnov (KS) D‐Statistics can be used, which is the maximum difference between two cdfs
– Another option to use K‐L divergence between two distribution smooth version of it:
– Neither of the above evaluation metrics address the issue of scaling, however, while comparing between two distributions that have scale mismatch (Task 2), D‐statistics should be used
1 2 1 2, ) | ( ) ( ) |( xF max F x FD F x
1 2 2 1( (1 ) (1 ) )f fKL f f ‖
Sampling methodsmethodologies, comparison, analysis
8/12/2013
14
TASK 1
Estimating node/edge properties of the original network
Task 1 Task 2 Task 3
Full Access
Restricted Access
Data Stream Access
Sampling Objective
Data Access
Assumption
8/12/2013
15
Real‐life Example : Birds of same feather flocks together!
• Red borders are women, blue borders are men
• Size of a node represents BMI
• Orange nodes are obese, and green nodes are normal
• Purple links are close genetic connection
• Gray node denotes non‐genetic ties (friends, spouse, co‐workers)
The New England Journal of Medicine. The study's authors suggest that obesity isn't just spreading; rather, it may be contagious between people, like a common cold.
Read more: http://www.time.com/time/health/article/0,8599,1646997,00.html#ixzz2XAtfg02V
How to conduct the same analysis on Facebook data where n=1 billion?
https://www.facebook.com/note.php?note_id=469716398919
8/12/2013
16
Sampling methodsTask 1, Full Access assumption
• Uniform node sampling
– Random node selection (RN)
• Non‐uniform node sampling
– Random degree node sampling (RDN)
– Random pagerank node sampling (RPN)
• Uniform edge sampling
– Random Edge selection (RE)
• Non‐uniform edge sampling
– Random node‐edge (RNE)
– RNE‐RE Hybrid (HYB)
Random Node Selection (RN)• In this strategy, a node is selected uniformly and independently from the set of all
nodes
– The sampling task is trivial if the network is fully accessible.
• It provides unbiased estimates for any nodal attributes:
– Average degree and degree distribution
– Average of any nodal attribute
– f(u) where f is a function that is defined over the node attributes
Sample (S)
RN
8/12/2013
17
Random Degree node selection (RDN)• In this sampling, selection of a node is proportional to its degree
– If is the probability of selecting a node,
– Node can be sampled using inverse‐transform method, For , we simply need to construct its cmf, say , then the sampled node, x is obtained by,
– A second method is to first choose one edge uniformly, then choose one of its end‐point with equal probability, clearly,
( )u ( )( )
2
d uu
m
1( ), where, ~ (0,1)U U Unix
( )
1 ( )( )
2 2e inc x
d ux
m m
• High degree nodes have higher chances to be selected
– Average degree estimation is higher than actual, and degree distribution is biased towards high‐degree nodes.
– Any nodal estimate is biased towards high‐degree nodes.
Random pagerank node sampling (RPN) [Leskovec ’06]
• Pagerank in the stationary distribution vector of a specially constructed Markov process
– Visit a neighbor following an outgoing link with probability c (typically kept at 0.85), and jump to a random node (uniformly) with probability 1‐c
– Pagerank vector satisfies following eigenvector equations: , where is the eigenvector of M with the eigenvalue 1
– Pagerank can be computed efficiently by power method
• Node u is sampled with a probability which is proportional to
– When c=1, the sampling is similar to RDN for the directed graph,
– When c=0 the sampling is similar to RN
• Nodes with high in‐degree has higher chances to be selected
– Due to the uniform random jump, high degree bias of RDN is somewhat reduced. So, it provides better estimation accuracy than RDN for average degree and degree distribution.
1 (1 )TcA D c U M
( ) 1ind u c
m nc
8/12/2013
18
Random Edge Selection (RE)• In a random edge selection, we uniformly select a set of edges and the
sampled network is assumed to comprise of those edges.
• A vertex is selected in proportion to its degree
– If , the probability of a vertex u to be selected is:
– when the probability is:
– With more sample degree bias is reduced
– The selection of vertices are not independent as both endpoint of an edge are selected
• Nodal statistics will be biased to high‐degree vertices
• Edge statistics is unbiased due to the uniform edge selection.
| |
| |SE
E ( )1 (1 )d u
0 ( )d u
Random node‐edge selection (RNE) [Leskovec ’06]
• Select a vertex uniformly, and then pick an edge incident to the selected vertex (uniformly)
• Probability of selecting a vertex, is proportional to
• Node sampling is biased towards high degree vertices that are adjacent to many low degree vertices.
– If the graph is assortative (social networks exhibits this property), the probability is almost uniform for all the vertices, so nodal estimates are better than the case of RE.
• Edge sampling is also non‐uniform, edges that are incident to high degree nodes are under‐sampled, and those the are incident to low degree nodes, are oversampled
– An edge is sampled with the probability:
( )
1 11
| | ( )x adj uV adj x
( )
1 1
| | ( )x inc eV adj x
8/12/2013
19
Task 1 Task 2 Task 3
Full Access
Restricted Access
Data Stream Access
Sampling Objective
Data Access
Assumption
Public Facebook data can only be accessed by generating user IDs
and crawling
http://www.wired.com/wiredscience/2012/04/facebook‐disease‐friends/
8/12/2013
20
Sampling under restricted (or full) access Assumption
• Assumptions
– The network is connected, (if not) we can ignore the isolated nodes
– The network is hidden, however it supports crawling, i.e. it allows to explore the neighbors of a given node. Access to one seed node or a collection of seed nodes is given.
• Methods
– Graph traversal techniques (exploration without replacement)
• Breadth‐First Search (BFS)
• Depth‐First Search (DFS)
• Forest Fire (FF)
• Snowball Sampling (SBS)
• Respondent Driven Sampling (RDS)
– Random walk techniques (exploration with replacement)
• Classic Random Walk
• Markov Chain Monte Carlo (MCMC) using Metropolis‐Hastings algorithm
• Random walk with restart (RWR)
• Random walk with random jump (RWJ)
• During the traversal or walk, the visited nodes are collected in a sample set and those are used for estimating network parameters
Breadth‐first Sampling• At each iteration, earliest discovered but not yet
visited node is selected
• For a selected node, the node is visited and all its neighbors are discovered
• It samples nodes from a specific region of the network
• It discovers all nodes within some distance from the seed node
• Nodal statistics are taken over the selected nodes.
• This sampling is biased as high‐degree nodes have higher change to be sampled
• Nodal estimations are biased towards nodes with higher degree.
8/12/2013
21
Depth‐first Sampling
• At each iteration, we select the latest explored but, not‐yet‐visited node
• DFS explores first the nodes that are faraway from the seed
• The sampling is biased as high degree nodes has higher chance to be selected
• Same as BFS walk, estimation is biased towards nodes with higher degree.
Forest Fire (FF) Sampling [Leskovec ’06]
• FF is a randomized version of BFS
• Every neighbor of current node is visited with a probability p. For p=1 FF becomes BFS.
• FF has a chance to die before it covers all nodes.
• It is inspired by a graph evolution model and is used as a graph sampling technique
• its performance is similar as BFS sampling
8/12/2013
22
Snowball Sampling
• Similar to BFS
• n‐name snowball sampling is similar to BFS
• at every node v, not all of its neighbors but exactly n neighbors are chosen randomly to be scheduled
• A neighbor is chosen only if it has not been visited before.
• Performance of snowball sampling is also similar to BFS sampling
Snowball: n=2
Classic Random Walk Sampling (RWS)
• At each iteration, one of the neighbors of currently visiting node is selected to visit
• For a selected node, the node and all its neighbors are discovered
• The sampling follows depth‐first search pattern
,
1 / ( ) f
0 otherwise
( )u v
adjd u i vp
u
, ,
1( ) ( )
2u v v uu P v Pm
( )( )
2
d uu
m
• This sampling is biased as high‐degree nodes have higher change to be sampled, probability that node u is sampled is
• Note that this method sample each edge uniformly, as it satisfy the detailed balanced equation,
8/12/2013
23
Other variants of random walk
• Random walk with restart (RWR)
– Behaves like RWS, but with some probability (1‐c) the walk restarts from a fixed node, w
– The sampling distribution over the nodes models a non‐trivial distance function from the fixed node, w
• Random walk with random jump (RWJ)
– Access to arbitrary node is required
– RWJ is motivated from the desire of simulating RWS on a directed network, as on directed network RWS can get stuck in a sink node, and thus no stationary distribution can be achieved
– Behaves like RWS, but with some probability (1‐c), the walk jumps to an arbitrary node with uniform probability.
– The stationary distribution is proportional to the pagerank score of a node, so the analysis is similar to RPN sampling
Exploration based sampling will be biased toward high degree nodes
How can we modify the algorithms to ensure nodes are sampled uniformly at random?
8/12/2013
24
Uniform sampling by exploration
• Traversal/Walk based sampling are biased towards high‐degree nodes
• Can we perform random walk over the nodes while ensuring that we sample each node uniformly?
• Challenges
– We have no knowledge about the sample space
– At any state, only the currently visiting nodes and its neighbors are accessible
• Solution
– Use random walk with the Metropolis‐Hastings correction to accept or reject a proposed move
– This can guaranty uniform sampling (with replacement) over all the nodes
A bit of Theory: Metropolis‐Hastings (MH) algorithm
• Problem: Assume that we want to generate a random variable V taking values 1, 2,⋯ , , according to a target distribution , ∈ (the vertices of the given network)
– , all are strictly positive, is large, and normalizing constant ∑ is hard to compute
• Solution: Simulate a Markov chain such that stationary distribution of the chain coincides with the target distribution
– Construct a Markov chain , 0, 1,⋯ on using an arbitrary transition probability matrix, ( is also called the proposal distribution)
– If , generate such that , , ∈
– Now, withprobability, min , 1
withprobability1
– The stationary distribution of the above Markov chain is
8/12/2013
25
Uniform node sampling with Metropolis‐Hastings method [Henzinger ’00]
• It works like random walk sampling (RWS), but it applies a correction so that high‐degree bias of RWS is eliminated systematically
– The proposal distribution, chooses one of the neighbor (say, of the current node (say, uniformly, thus proposal distribution works as RWS, 1/
• Now, withprobability, min , 1
withprobability1
– For uniform target , for RWS like proposal distribution, and
– Thus, min , 1 if , the choice is accepted only with probability 1,
otherwise with probability
• If a graph have vertices, using the above MH variant, every node is sampled with probability
ANALYSIS
8/12/2013
26
Task 1 performance summary over all the sampling method
Sampling Direct sampling ofNodes or edges
Exploration (walk or traversal)
Uniform node sampling RN MH (uniform target distribution)
Almost Uniform node sampling RNE
Exactly Degree proportional RDN RWS
Apparently degree proportional (when sample size is small)
RE BFS, FF, Snowball
PageRank proportional Sampling RPN RWJ
Comparison Method Vertex Selection Probability, | | , | | ,
RN, MH‐uniform target
1
RDN, RWS 2
RPN, RWJ
⋅ 1 ⋅ (undirected)
⋅ 1 ⋅ (directed)
RE ∼
RNE 1 1
1
∈
• Node property estimation
– Uniform node selection (RN) is the best as it selects each node uniformly
– Average degree and degree distribution is unbiased
• Edge property estimation
– Uniform edge selection (RE) is the best as it select each edge uniformly
– For example, we can obtain an unbiased estimate of assortativityby RE method
8/12/2013
27
Expected average degree and degree distribution
• For a degree value assume is the fraction of vertices with that degree
• Uniform Sampling, node sampling probability,
– Expected value of :
– Expected observed node degree:
• Biased Sampling (degree proportional),
– Expected value of :
– Expected observed node degree:
0
1kk
p
( )
1( ) 1 | |
| |k d v k k kv V
q v p V pV
0 0k k
k k
k q k p d
kpk
1 1( )
| |v
V n
( ) ( )( )
2 | | 2
d v d vv
E m
( )( ) 1 | |2 | |
kk d v k k
v V
k pkq v p V
E d
2
20
0
kk
kk
k pd
k qd d
kp
kp
Overestimate high‐degree vertices, underestimate low‐degree vertices
Overestimate average degree
Example
• Actual degree distribution
• Average degree = 3
• RWS degree distribution
• RWS average degree
– 3.39
1 3 4 3 1, , , ,
12 12 12 12 12kq
1 6 12 12 5, , , ,
36 36 36 36 36kq
8/12/2013
28
Expected average degree for traversal based methods [Kurant ‘10]
• Note that, for walk based method, the sampling is with replacement, so their analysis does not chance with the fraction of sampling
• Traversal based method behaves like RWS when the sample size is small, but as the sample size is increases its estimation quickly converges towards the true estimation
• Behavior of all the traversal method is almost identical. For traversal methods, it is better to have one sample with a large fraction than many samples of short fractions
d
2d
d
When sampling algorithm selects nodes non‐uniformly…
Node weights can be adjusted to remove the bias
8/12/2013
29
Correction for bias in biased sampling [eg. Kurant ’10]
• From ⊂ , we can compute estimated degree distribution, (which is biased)
• We can derive unbiased estimated degree distribution ̂ from
• For RWS, if ∈ , the probability of sampling is proportional to u’s degree, and we have, ⋅
• So, ̂ ∝ , which implies ̂ ⋅
• C is a normalizing constant so that ∑ ̂ 1 , thus, ∑ | |/∑ ∈
• Corrected average degree,
0 0
ˆ | |ˆ 11/ ( )
kk
k ku S
q Sd k p k C C
k d u
Correction of arbitrary nodal attributes for biased sampling (cont.)
• Assume, are the nodes sampled to compute expectation of nodal attribute,
• is a distribution that we achieve by a biased sampling, and is a distribution which is unbiased
• Let’s define a weight function : → such that /
• For degree proportional sampling, / ⋅ , ∈
• Unbiased expectation of is:
• Correction of nodal attribute if n and m are unknown
– So, we use weight which is correct up to a multiplicative constant:
– Then the un‐biasing works as below:
1(
))
(w
di
i
[1.. ]
[1.. ]
/ ( )ˆ ( )
( )ˆ ( ) 1/ ( )
i ttu
i t
f d iwf
fw d i
E
1
1ˆ ( ) ( ) ( ) ( ) ( )
t
t s s us
wf w u f u E wf E ft
8/12/2013
30
EMPIRICAL RESULTS
Comparison between different sampling strategies [Gjoka ’10, Gjoka ‘11]
• Gjoka et al. sampled Facebook network and compared the above sampling methods
• They confirmed that MH sampling and reweighted RWS can estimate degree distribution almost perfectly.
Average Degree Estimation
BFS 285.9
RWS 338
MCMC 95.2
Actual 94.1
RWS (corrected)
93.9
MH RWS‐weighted
BFSRWS
8/12/2013
31
Average degree estimation (astro‐phynetwork)
0
20
40
60
80
100
5 10 15 20 25 30 35 40
Percentage of nodes sampled
Sam
pled
Ave
rage
Deg
ree
RWS-weightedMCMC Walk
Forest FireBFS
Snow BallRandom Walk
d
2d
d
Average degree estimation (skitter network)
d
2d
d
0
20
40
60
80
100
5 10 15 20 25 30 35 40
Percentage of nodes sampled
RWS-weightedMCMC Walk
Forest FireBFS
Snow BallRandom Walk
1400
1420
1440
1460
1480
1500
Sam
pled
Ave
rage
Deg
ree
8/12/2013
32
Task 1 Task 2 Task 3
Full Access
Restricted Access
Data Stream Access
Sampling Objective
Data Access
Assumption
Interaction networks can be extracted from dynamic communications
8/12/2013
33
Sampling under data streaming access assumption
• Previous approaches assume:
• Full access of the graph or Restricted access of the graph –access to only node’s neighbors
• Data streaming access assumption:
• A graph is accessed only sequentially as a stream of edges
• Massive stream of edges that cannot fit in main memory
• Efficient/Real-time processing is important
. . .
Stream of edges over time Graph with timestamps
Sampling under data streaming access assumption
• The complexity of sampling under streaming access assumption defined by:
– Number of sequential passes over the stream
– Space required to store the intermediate state of the sampling algorithm and the output
• Usually in the order of the output sample
• No. Passes
• Space
. . .
Stream of edges over time
8/12/2013
34
Sampling methods under data streaming access assumption
• Most stream sampling algorithms are based on random reservoir sampling
• Random Reservoir Sampling [Vitter’85]
– A family of random algorithms for sampling from data streams
– Choosing a set of n records from a large stream of N records• n << N
– N is too large to fit in main memory and usually unknown
– One pass, O(N) algorithm
Reservoir Sampling Algorithm
Step 1: Add first n records to reservoir
2315 22 261813 19 14 17stream … …2315 22 261813 19 14 17stream … …
reservoir
n
13 18 19 14 17 15reservoir
n
8/12/2013
35
Reservoir Sampling Algorithm
Step 2: Select the next record with probability
2315 22 261813 19 14 17stream … …
13 18 19 14 17 15reservoir
n
23
X
• When a record is chosen for the reservoir, it becomes a candidate and replaces one of the former candidates
Reservoir Sampling Algorithm
Step 2: Select the next record with probability
2315 22 261813 19 14 17stream … …
13 23 19 14 17 15reservoir
n
8/12/2013
36
Reservoir Sampling Algorithm
After repeating Step 2 …
2315 22 261813 19 14 17stream … …
13 26 19 22 17 15reservoir
n
• at the end of the sequential pass, the current set of n candidates is output as the final sample
Sampling methods under data streaming access assumption
• Streaming Uniform Edge Sampling
– Extends traditional random edge sampling (RE) to streaming access
• Streaming Uniform Node Sampling
– Extends traditional random node sampling (RN) to streaming access
8/12/2013
37
Sampling methods under data streaming access assumption
• Streaming Uniform Edge Sampling
– Extends traditional random edge sampling (RE) to streaming access
• Streaming Uniform Node Sampling
– Extends traditional random node sampling (RN) to streaming access
Stream of edges
Streaming Edge Sampling (RE)
Step 0: Start with an empty reservoir of size m = 2
reservoir
m
8/12/2013
38
Stream of edges
Step 1: Add first m edges to reservoir
reservoir
m
Streaming Edge Sampling (RE)
Stream of edges
Step 2: Sample next edge with prob.
reservoir
m
Streaming Edge Sampling (RE)
8/12/2013
39
Stream of edges
Edge is selected
reservoir
m
Streaming Edge Sampling (RE)
Stream of edges
At the end of the sequential pass . . .
reservoir
m
Streaming Edge Sampling (RE)
• at the end of the sequential pass, the current set of m edges is output as the final sample
8/12/2013
40
Min‐Wise Sampling briefly …
• A random “tag" drawn independently from the Uniform(0, 1) distribution
• This “tag” is called the “hash value” associated with each arriving item
• The sample consists of the items with the n smallest tags/hash values seen thus far
• Uniformity follows by symmetry: every size‐n subset of the stream has the same chance of having the smallest hash values
Sampling methods under data streaming access assumption
• Streaming Uniform Edge Sampling
– Extends traditional random edge sampling (RE) to streaming access
• Streaming Uniform Node Sampling
– Extends traditional random node sampling (RN) to streaming access
8/12/2013
41
Sampling from the stream directly selects nodes proportional to their degree
Not suitable for sampling nodes uniformly
Use Min‐Wise Sampling
Streaming Uniform Node Sampling (RN)
• Assumption: nodes arrive into the system only when an edge that contains the new node is streaming into the system
• It is difficult to identify which n nodes to select a priori with uniform probability
• Use min‐wise sampling
• Maintain a reservoir of nodes with top‐n minimum hash values
• a node with smaller hash value may arrive late in the stream and replace a node that was sampled earlier
8/12/2013
42
Stream of edges
reservoir
n
Streaming Uniform Node Sampling (RN) – Example
Step 0: Start with an empty reservoir
reservoir
n
Stream of edges
Step 1: Add first n nodes to reservoir
• Apply a uniform random Hash Function to the node id
• defines a true random permutation on the node id’s
Streaming Uniform Node Sampling (RN) – Example
8/12/2013
43
Stream of edges
Step 1: Add first n nodes to reservoir
• Apply a uniform random Hash Function to the node id
• defines a true random permutation on the node id’s
Streaming Uniform Node Sampling (RN) – Example
reservoir
n
Streaming Uniform Node Sampling (RN) – Example
reservoir
n
Step 2: Sample the next node . . .
Stream of edges
8/12/2013
44
reservoir
n
Step 2: Sample the next node . . .
Streaming Uniform Node Sampling (RN) – Example
Stream of edges
reservoir
n
At the end of the sequential pass . . .
Streaming Uniform Node Sampling (RN) – Example
Stream of edges
• at the end of the sequential pass, the current set of n nodes is output as the final sample
8/12/2013
45
Sampled Set of Nodes
TASK 2
Representative Sub‐network Sampling
8/12/2013
46
Real‐life example: People believe what they want to believe!
Linking pattern among conservative (red) and liberal (blue) blogs (from Lada Adamic and Natalie Glance, LinkKDD 2005)
Researchers conclusion:
• people seek out information consistent with their world views.
• In doing so, they are less exposed to new ideas that do not comport with their ideology.
• The tight clustering suggests that “many of these writers are simply echoing each other”
How to sample a representative set of communities from MMOG networks?
http://forum.gephi.org/viewtopic.php?t=2314
8/12/2013
47
Representative Sub‐network SamplingSampled Sub-network
‐ Select a representative sampled sub‐network
‐ The sample is representative if its structural properties are similar to the full network
Full Network
Task 1 Task 2 Task 3
Full Access
Restricted Access
Data Stream Access
Sampling Objective
Data Access
Assumption
8/12/2013
48
Sampling methods under full access assumptions
• Node sampling
– Starts by sampling nodes
– Add all edges between sampled nodes (Induced subgraph)
• Edge Sampling
– Uniform edge sampling
– Uniform Edge Sampling with graph induction
• Exploration Sampling
– Graph traversal techniques
– Random walk techniques
• Sampling the network community structure
– Graph traversal to maximize the sample expansion
• Metropolis‐Hastings sampling
– Search the space for the best representative sub‐network
• Assume we have the following network
• Goal: sample a representative sub‐network of size = 4 nodes
Original Network
8/12/2013
49
Sampling methods under full access assumptions
• Node sampling
– Starts by sampling nodes
– Add all edges between sampled nodes (Induced subgraph)
• Edge Sampling
– Uniform edge sampling
– Uniform Edge Sampling with graph induction
• Exploration Sampling
– Graph traversal techniques
– Random walk techniques
• Sampling the network community structure
– Graph traversal to maximize the sample expansion
• Metropolis‐Hastings sampling
– Search the space for the best representative sub‐network
Node Sampling
Original Network Sampled Nodes Sampled Sub‐network
‐ ‐ ‐ ‐ Edges added by graph induction
8/12/2013
50
Node Sampling (NS)
– Node sampling samples nodes uniformly
– Generally biased to low degree nodes
– Sampled sub‐network may include isolated nodes (zero degree)
– The sampled sub‐network typically fails to preserve many properties of the original network
• diameter, hop‐plot, clustering co‐efficient, and centrality values of vertices.
Sampling methods under full access assumptions
• Node sampling
– Starts by sampling nodes
– Add all edges between sampled nodes (Induced subgraph)
• Edge Sampling
– Uniform edge sampling
– Uniform Edge Sampling with graph induction
• Exploration Sampling
– Graph traversal techniques
– Random walk techniques
• Sampling the network community structure
– Graph traversal to maximize the sample expansion
• Metropolis‐Hastings sampling
– Search the space for the best representative sub‐network
8/12/2013
51
Uniform Edge Sampling (ES)
Original Network Sampled Edges Sampled Sub‐Network
Edge Sampling (ES)
– Edge sampling samples nodes proportional to their degree
– Generally biased to high degree nodes (hub nodes)
– The sampled sub‐network typically fails to preserve many properties of the original network
• diameter, hop‐plot, clustering co‐efficient, and centrality values of vertices.
8/12/2013
52
Uniform Edge Sampling with graph induction (ES‐i) [Ahmed ’13]
Original Network Sampled Edges Sampled Sub‐Network
‐ ‐ ‐ ‐ Edges added by graph induction
Edge Sampling with graph induction (ES‐i)
• Similar to the edge sampling (ES)– ES‐i samples nodes proportional to their degree
– Generally biased to high degree nodes (hub nodes)
• Due to the additional graph induction step– The sampled sub‐network preserves many properties of the original
network
• degree, diameter, hop‐plot, clustering co‐efficient, and centrality values of vertices.
8/12/2013
53
Sampling methods under full access assumptions
• Node sampling
– Starts by sampling nodes
– Add all edges between sampled nodes (Induced subgraph)
• Edge Sampling
– Uniform edge sampling
– Uniform Edge Sampling with graph induction
• Exploration Sampling
– Graph traversal techniques
– Random walk techniques
• Sampling the network community structure
– Graph traversal to maximize the sample expansion
• Metropolis‐Hastings sampling
– Search the space for the best representative sub‐network
Exploration Sampling – Breadth First
Original Network
Sampled sub‐network consists of all nodes/edges visited
Sampled Sub‐Network
Start
8/12/2013
54
Exploration Sampling
– Generally biased to high degree nodes (hub nodes)
– The sampled sub‐network can preserve the diameter of the network if enough nodes are visited
– The sampled sub‐network usually fails to preserve clustering co‐efficient
– One solution: induce the sampled sub‐network
Sampling methods under full access assumptions
• Node sampling
– Starts by sampling nodes
– Add all edges between sampled nodes (Induced subgraph)
• Edge Sampling
– Uniform edge sampling
– Uniform Edge Sampling with graph induction
• Exploration Sampling
– Graph traversal techniques
– Random walk techniques
• Sampling the network community structure
– Graph traversal to maximize the sample expansion
• Metropolis‐Hastings sampling
– Search the space for the best representative sub‐network
8/12/2013
55
Sampling Community Structure [Maiya’10]
• Sampling a sub‐network representative of the original network community structure
– Expansion Sampling (XS)
• Expansion Sampling Algorithm:
– Step1: Starts with a random vertex u, add u to the sample set S
S = S U {u}
– Step2: Choose v from S’s neighborhood, s.t. vmaximizes
|Neigh(v) – {Neigh(S) U S}|
– Step3: Add v to S
S = S U {v}
– Repeat Step 2 and 3 until |S| = k
Three Candidates for next step
Start with random node
8/12/2013
56
Select the one with maximum expansion
Four Candidates for next step
Select the one with maximum expansion
Four Candidates for next step
8/12/2013
57
Select the one with maximum expansion
Sampled Sub‐network
8/12/2013
58
Sampling methods under full access assumptions
• Node sampling
– Starts by sampling nodes
– Add all edges between sampled nodes (Induced subgraph)
• Edge Sampling
– Uniform edge sampling
– Uniform Edge Sampling with graph induction
• Exploration Sampling
– Graph traversal techniques
– Random walk techniques
• Sampling the network community structure
– Graph traversal to maximize the sample expansion
• Metropolis‐Hastings sampling
– Search the space for the best representative sub‐network
Representative subgraph sampling by Metropolis Algorithm [Hubler’08]
• Given a graph, we want to generate a smaller graph which is representative to the given graph
– We assume that the given graph is entirely visible
– We can compute some characteristics of the given network that we want to compute
– If defines a value (singular or multi‐dimension) defining a topological property of , and ⊂ is an induced subgraph of , our objective is to find a subgraph that satisfies the following optimization problem
argmin⊂ :
Δ ,
• Characteristics that are generally considered
– Degree distribution, clustering co‐efficient, hop‐plot distribution, small subgraph concentration
8/12/2013
59
Task 1 Task 2 Task 3
Full Access
Restricted Access
Data Stream Access
Sampling Objective
Data Access
Assumption
Sampling methods under restricted access assumptions
• Exploration Sampling
– Graph traversal techniques
– Random walk techniques
• Sampling nodes or edges is not feasible under restricted access assumption
8/12/2013
60
Task 1 Task 2 Task 3
Full Access
Restricted Access
Data Stream Access
Sampling Objective
Data Access
Assumption
Sampling under data streaming access assumption
• Sampling a subgraph (sub-network) while the graph is streaming as a sequence of edges of any arbitrary order
Graph with timestamps
. . .
Stream of edges over time
Subgraph
8/12/2013
61
Sampling methods under data streaming access assumption
• Streaming Uniform Node Sampling
• Streaming Uniform Edge Sampling
• Partially Induced Edge sampling (PIES)
• Exploration Sampling
– Streaming Breadth First Search
Sampling methods under data streaming access assumption
• Streaming Uniform Node Sampling
– Similar to Task 1 but adding the edges from the graph induction
• Streaming Uniform Edge Sampling
• Partially Induced Edge sampling (PIES)
• Exploration Sampling
– Streaming Breadth First Search
8/12/2013
62
Sampling methods under data streaming access assumption
• Streaming Uniform Node Sampling
– Similar to Task 1 but adding the induced edges
• Streaming Uniform Edge Sampling
– Similar to Task 1
• Partially Induced Edge sampling (PIES)
• Exploration Sampling
– Streaming Breadth First Search
Sampling methods under data streaming access assumption
• Streaming Uniform Node Sampling
– Similar to Task 1 but adding the induced edges
• Streaming Uniform Edge Sampling
– Similar to Task 1
• Partially Induced Edge sampling (PIES)
• Exploration Sampling
– Streaming Breadth First Search
8/12/2013
63
Stream of edges
Step 0: Start with an empty reservoir
Example Partially Induced Edge Sampling (PIES) [Ahmed’13]
reservoirn
Node List
Edge List
Stream of edges
Step 1: Add the subgraph with first n nodes to reservoir
Example Partially Induced Edge Sampling (PIES) [Ahmed’13]
reservoirn
Node List
Edge List
8/12/2013
64
Step 3: Add the induced edge . . .
Example Partially Induced Edge Sampling (PIES) [Ahmed’13]
reservoirn
Node List
Edge List
Stream of edges
Step 3: Sample the next edge . . . With prob.
Example Partially Induced Edge Sampling (PIES) [Ahmed’13]
reservoirn
Node List
Edge List
Stream of edges
8/12/2013
65
Step 3: Sample the next edge . . . With prob.
Example Partially Induced Edge Sampling (PIES) [Ahmed’13]
reservoirn
Node List
Edge List
Stream of edges
Flip a coin with prob.
Step 3: Sample the next edge . . . With prob.
Edge is selected
Example Partially Induced Edge Sampling (PIES) [Ahmed’13]
reservoirn
Node List
Edge List
Stream of edges
8/12/2013
66
Step 3: Sample the next edge . . . With prob.
Example Partially Induced Edge Sampling (PIES) [Ahmed’13]
reservoirn
Node List
Edge List
Stream of edges
At the end of the sequential pass . . .
Example Partially Induced Edge Sampling (PIES) [Ahmed’13]
reservoirn
Node List
Edge List
Stream of edges
8/12/2013
67
Sampled Sub‐network
Example Partially Induced Edge Sampling (PIES) [Ahmed’13]
Stream of edges
Sampling methods under data streaming access assumption
• Streaming Uniform Node Sampling
– Similar to Task 1 but adding the induced edges
• Streaming Uniform Edge Sampling
– Similar to Task 1
• Partially Induced Edge sampling (PIES)
• Exploration Sampling
– Streaming Breadth First Search
8/12/2013
68
Streaming Breadth First Search
• Streaming Breadth First Search in one pass
• Implements breadth first search in a window of consecutive edges
• Slide the window each time an edge is streaming in
ANALYSIS & RESULTS
8/12/2013
69
Degree Distribution – ArXiv Citation Graph
ArXiv Hep‐PH
34K nodes
421K Edges
Avg. Degree = 24.4
How different sampling methods sample from the degree distribution?
Uniform Node Sampling
Low Degree High Degree
Edge/Exploration Sampling
Ahmed’13
More Analysis
• Expected value of degree k in a sampled sub‐network using uniform node sampling
• Expected value of degree k in a sampled sub‐network using edge sampling with graph induction
8/12/2013
70
CondMAT Network – No.Nodes = 23K – No.Edges = 93k – Sample Size = 20% / Nodes Edge sampling with graph induction (Es‐i) performs better than other methods
Ahmed’13
Preserving Max K‐core number kmax
• Max K‐core number kmax
– Largest subgraph in the network with min degree kmax
• Sample size = 20% /Nodes
8/12/2013
71
Youtube Network – No.Nodes = 1Millions– No.Edges = 3Millions – Sample Size = 20% / Nodes Partially Induced Edge Sampling (PIES) performs better than other methods
Ahmed’13
More conclusions …
• Sampling from the stream in one pass is a hard task
– Exploration methods (Breadth‐First) generally don’t perform well
• Solution: Perform more than one sequential pass over the stream
– Edge‐based methods generally perform well
• But sometimes fail to preserve clustering coefficient
• Stream sampling perform better for sparse graphs
8/12/2013
72
Differentiating species via network analysis of protein‐protein interactions
http://bioinformatics.oxfordjournals.org/content/27/2/245/F1.expansion.html
Task 3 (Sampling sub‐structures from networks)
• In this case, our objective is to sample small substructure of interest
– Sampling graphlets (Bhuiyan et al., 2012)
• Is used for build graphlet degree distribution which characterize biological networks
– Sampling triangles and triples (Rahman et al., 2013, Buriolet al., 2006)
• Is used for estimating triangle count
– Sampling network motifs (Kashtan et al., 2004, Wernicke 2006)
– Sampling frequent patterns (Hasan et al. 2009)
8/12/2013
73
Triple sampling from a network: Motivation
• A triple, at a vertex , is a path of length two for which is the center vertex.
• A Triple can be closed (triangle), or open (wedge)
• It is useful for approximating triangle counting and transitivity
• is a set of uniformly sampled triples, and ⊆ , is the set of closed triples
• Approximate transitivity is
• Approximate triangle count in a network is:
Open triple
closed triple
( , , )u v w vv
-( )
2v V
d vtriple cnt
| | / | |C T
-3
triple count
Induced ‐Subgraph Sampling: motivation
• Network motifs are induced subgraphs that occur in the network far more often than in randomized networks with identical degree distribution
• To identify network motif, we first construct a set of random networks (typically few hundreds) with identical degree distributions.
• If the concentration of some subgraph in the given network is significantly higher than the concentration of that subgraph in the random networks, those subgraphs are called network motif
• Concentration of a size‐ subgraph is defined as:
• Exact counting of ‐subgraphs are costly, so sampling methods are used to approximate ‐subgraph concentrations
( )kiS
( )( )
( )
: size( )
))
( )(
(
kk i
i kj
j kj
f G
fC
GG
8/12/2013
74
Task 1 Task 2 Task 3
Full Access
Restricted Access
Data Stream Access
Sampling Objective
Data Access
Assumption
Triple sampling with full access [Seshadhri ’13]
• Take a vertex uniformly, choose two of its neighbors, and ; return the triple
• This does not yield uniform sampling, as the number of triples centered to a vertex is non‐uniform,
• For triangle counting from a set of sampled triples (say, , we can apply the same un‐biasing trick that we applied for nodal characteristics estimation,
here is the probability of selecting a triple, and
triple ( , , ) is sampled}( )
2
1P{ u v w
d vn
{ is closed}
-1
3 t tt T
triple cntwTC
W
v u w( , , )u v w
tt T
W w
8/12/2013
75
Uniform Triple sampling with full access
• Method
– Select a node v in proportion to the number of triples that are centered at v, which is equal to
– Return one of the triple centered at v uniformly
– For the first step, we need to know the degree of all the vertices
• Triangle count estimate is easy, we just need to use the formula
A
{ is closed}
-1
3 | | tt T
triple cnt
TTC
( )( ( ) 1)
2
d u d u
Node A has degree 3, it has 3 triples centered around it
• Method
1. Select an edge uniformly2. Populate Neighborhood3. Choose one neighbor uniformly4. If size ≠ k , goto 2 else return the
pattern
k=4
Sampling of subgraphs of size [Kashtan ‘04
Probability of generating this pattern (in this edge order) :
1
m
1 1
7m
1 1 1
7 7m
1
4 9 m
8/12/2013
76
k=4
Sampling of subgraphs of size KashProbability of generating this pattern (in this edge order) :
1
m
1 1
3m
1 1 1
3 3m
1
9 m
• Sampling probability is not uniform, but we can compute the probability
• We can apply the same unbiasing trick that we have used before to obtain uniform estimate of the concentration of some subgraph of size
• Assume, we take samples of size‐ subgraph, Then the concentration of some subgraph type is:,
where , and wj 1{ jtypesi }j1
B
W
wj 1
Pj
W wjj1
B
Uniform Sampling of k‐subgraph by probabilistic enumeration [Wernicke ’06]
• Complete enumeration generates each k‐subgraph exactly once.
• For sampling, for every child of a node at level‐i, enumerate the child with probability pi.
• Each pattern is generated with probability (uniform)1
k
ii
p
8/12/2013
77
Task 1 Task 2 Task 3
Full Access
Restricted Access
Data Stream Access
Sampling Objective
Data Access
Assumption
Sampling to detect internet attacks and/or web spam
http://www3.nd.edu/~networks/Image%20Gallery/gallery.htm
8/12/2013
78
Triple sampling by exploration
• Perform a random walk over the triple space
– Given a seed node, form a triple from its neighborhood information
– Define a neighborhood in the triple space that can be instantiated online using local information
– Choose the next triple from the neighborhood
• The stationary distribution of the walk is proportional to the number of neighbors of a triple in the sampling space
j
Neighborhood
3
2
4
6
5
1
9
7
8
10
1112
Two neighboring triples share two vertices among them, entire neighborhood graph is not shown
1 2
3
1 2
6
1 3
4
1 3
5
2 3
6 2 3
5
4 5
3 4 7
3
2 3
4
8/12/2013
79
Triple sampling by exploration based solution for restricted access
• If we want to obtain unbiased estimate of the concentration of triangle (transitivity), we can use the same un‐biasing strategy as before:
– The concentration of triangle is: ,
}
1
{1
1
, here, W= and 1/ ( )
B
j j type clos
j
ed
jj
jB
w
Ww w d j
Neighborhood
3
2
4
6
5
1
9
7
8
10
1112
1 2
3
1 2
6
1 3
4
1 3
5
2 3
6 2 3
5
4 5
3 4 7
3
2 3
4
Degree of the triple 1, 2, 3 is 6 as can be seen in the above graph, so while unbiasing,
(1,2,3)
1
6w
8/12/2013
80
Another exploration‐based triple sampling that uses MH algorithm
• MH algorithm uses simple random walk as proposal distribution
• Then is uses the MH correction (accept/reject) so that the target distribution is uniform over all the triples
• Like before, if triple is currently visiting state, and the triple is chosen by the proposal move, then this move is accepted with probability
• MH algorithm thus confirm uniform sampling, so the transitivity can be easily computed as
Ti
Tj
min 1,d(i)
d( j)
| |
{ is closed}1
11
| | k
T
TkT
Neighborhood
3
2
4
6
5
1
9
7
8
10
1112
1 2
3
1 2
6
1 3
4
1 3
5
2 3
6 2 3
5
4 5
3 4 7
3
2 3
4
• For uniform sampling, Every move from triple to triple is
accepted with probability /
• If 1, 2, 3 and 1,3,4 , move from to is accepted
with probability max 1, 1, but a move from to is
accepted with probability, max 1, 4/6
8/12/2013
81
Sampling subgraphs of size using exploration
• Both random walk and M‐H algorithm can also be used for sampling subgraph of size
• As we did in triangle sampling, we simply perform a random walk on the space of all subgraphs of size‐
• We also define an appropriate neighborhood
– For example, we can assume that two ‐subgraphs are neighbors, if they share 1 vertices
• This idea has recently been used in GUISE algorithm, which uniformly sampled all subgraphs of size 3, 4, and 5. It used M‐H methods
– Motivation of GUISE was to build a fingerprint for graphs that can be used for characterizing networks from different domains.
Task 1 Task 2 Task 3
Full Access
Restricted Access
Data Stream Access
Sampling Objective
Data Access
Assumption
8/12/2013
82
Triple Sampling from edge stream
• Sampling subgraphs using streaming method is not easy, existing works only covered subgraphs with three and four vertices
• Large number of works exist that aim to sample triples for approximating triangle counting from stream data
• For graph steam, there are two scenarios:
– Arbitrary edge stream: Stream of the edges appear in arbitrary order
– Incident edge stream: Stream of the edges incident to some vertices come together, whereas the ordering of vertices is arbitrary
• There is also another stream scenario for dynamic graph, which is known as turns as “turnstile model”
– Each edge appears as a pair where the second term is either + (addition) or –(deletion) like below:
1 2 3 2{( , ),( , ), ( , ), ( , ), }e e e e
W
Y
R
K
P
X
Q
U
N
M
ZT
Arbitrary edge stream
W
Y
R
K
P
X
Q
U
N
M
ZT
Incident edge stream
8/12/2013
83
3‐pass triangle counting (arbitrary edge stream) [Buriol ’06]
W
Y
R
K
P
X
Q
U
N
M
ZT
• First Pass: Count edges = 18
• Second Pass: Uniformly sample an edge , and a vertex
• 3rd Pass: Look for existence of , and , in the stream, if exist set 1,
else 0
2 2 332 3
32
⋅2
3
One pass arbitrary edge stream
• First pass only finds an edge uniformly, which can be done online along with subsequent pass by reservoir sampling
• Second pass constructs the triples and the third pass confirms its type ( , or )
• If we combine the second and the third pass
– Sampling population still remains the same, which is 2 3
– However, we can detect a triangle , , only if its edge , is sampled before the other two edges appear in the downstream.
– Since, the appearance of edges are taken from uniform permutation, Only for one third of the triangles , the 1 will be correctly recorded.
– So, ⋅ ⋅ 2
• To obtain the , we need to have samples, then, ∑
• All the samples can be obtained in one pass by maintaining a storage of with time complexity log
8/12/2013
84
3‐pass triangle counting (incident edge stream) [Buriol ’06]
W
Y
R
K
P
X
Q
U
N
M
ZT
• First Pass: Count number of triples,
∑2
• Second Pass: Uniformly sample a triple , , centered at P
• 3rd Pass: Look for existence of , edge in the stream, if exist set 1, else 0
23
33
3
3
RESULTS
8/12/2013
85
Triangle count approximation using uniform triple sampling
Dataset Incident edge stream(15k sample)
Arbitrary edge stream(1500k samples)
Exact Counting Time
Error (%) Time (sec)
Error (%) Time (sec) Time (sec)
Flicker(V=1.6M, E=15.5M)
0.29 2.42 1.97 11.09 33.00
Orkut(V=3.1M, E=117.2M)
0.51 2.92 10.74 9.03 124.80
Soc‐LiveJournal(V=4.8M, E=42.9M)
0.31 2.77 10.6 7.28 24.76
Wikipedia 2005(V=1.6M, E=18.5M)
1.0 2.29 7.4 7.36 22.8
Wikipedia 2006(V=3.1M, E=37.0M)
1.38 2.54 10.2 7.21 72.57
Wikipedia 2007(V=3.5M, E=42.4M)
2.31 2.57 13.54 7.42 88.98
Graphlet sampling [Bhuiyan ’12]
• Given a large graph, we want to count the occurrences of each of the above graphets
• Using the graphlet counts, we want to build a frequency histogram to characterize the graph
8/12/2013
86
Graphlet Frequency Distribution (GFD)
• Graphlet Frequency Distribution ( GFD) is a histogram using the concentration of various graphlets in the graph G.
• is the frequency of each graphlet. Then the concentration of a graphlet is
Sampling graphlets
• GFD only needs relative frequencies of each of the graphlets
• Using uniform sampling of graphlets, we can approximate GFD
• A method similar to triple sampling will work
– The sampling space is all the graphlets
– The random walk moves from one graphlet to another graphlet
– It accepts of reject a move based on the degree of a graphlet state.
8/12/2013
87
Exact Vs App GFD
Comparison with Actual and Approximate GFD for (a) ca‐GrQc (b) ca‐Hepth (c) yeast (d) Jazz
Exact counting vs Sampling to constricut GFD
8/12/2013
88
Use of GFD [Rahman ’13]
• GFD can represent a group of graphs constructed using same mechanism.
• We used GFD in agglomerative hierarchical clustering which produced good quality (purity close to 90%) cluster.
Sampling Objective
Data Access
Assumption
Full Access
Restricted Access
Data Stream Access
Algorithms covered in tutorial
Conclusion
8/12/2013
89
Concluding remarks
• Generally for many tasks, you will need uniform sampling of “objects” from the network (eg. Task 1 and Task 3 in our slides)
– Uniform sampling with full access is generally easier for node or edge sampling,
– sampling higher order structure is still hard because the sampling space is not instantly realizable without brute‐force enumeration.
– For uniform sampling under access restriction, both M‐H, and SRW‐weighted (with bias correction) works; both guaranty uniformity by theory, but SRW‐weighted provides marginally better empirical sampling accuracy.
Concluding remarks
• For sampling a representative sub‐networks (Task 2), unbiased sampling is typically NOT a good choice. From empirical observation, biased sampling helps obtain better representative structure
– It is difficult to select a subgraph that preserves ALL properties, so sampling mechanism should focus on properties of interest
– Theory has generally focused on kind of nodes that are sampled. More works are needed to show the connection between types of nodes sampled and resulting higher order topological structure
8/12/2013
90
Concluding remarks (cont.)
• In present days, streaming is becoming more natural data access mechanism for graphs. So, currently it is an active area of research
– It presents unique challenges to sample topology in an unbiased way, because local views of stream do not necessarily reflect connectivity.
– Existing works have solved some isolated problems exactly (eg. triangle sampling, pagerank computation), and some other empirically (eg. representative sub‐network)
Given a new domain,
how should you sample the network for analysis?
8/12/2013
91
Define your task
Select sampling objective
Consider data access
restrictions
Choose sampling method
Evaluate sampling quality
References
• Jure Leskovec and Christos Faloutsos. 2006. “Sampling from Large Graphs”. In proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Miing, KDD ’06, pp. 631‐636
• Bhuiyan, M.A.; Rahman, M.; Rahman, M.; Al Hasan, M., "GUISE: Uniform Sampling of Graphlets for Large Graph Analysis," Data Mining (ICDM), 2012 IEEE 12th International Conference on , pp.91‐100
• N. Kashtan, S. Itzkovitz, R. Milo, and U. Alon. 2004. “Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs”. Bioinformatics 20, 11 (July 2004), 1746‐1758.
• Sebastian Wernicke and Florian Rasche. 2006. “FANMOD: a tool for fast network motif detection”. Bioinformatics 22, 9 (May 2006), 1152‐1153.
• Mohammad Al Hasan and Mohammed J. Zaki. 2009. “Output space sampling for graph patterns”. Proc. VLDB Endow. 2, 1 (August 2009), 730‐741.
• Christian Hubler, hans‐Peter Kriegel, Karsten Borgwardt, and Zoubin Ghahramani. 2008. “Metropolis algorithms for representative subgraph mining”. Data Mining (ICDM), 2008 IEEE 8th International Conference on , pp.283‐292
8/12/2013
92
References
• Arun Maiya and Tanya Berger‐Wolf. 2010. “Sampling Community Structure”. In Proc. of the 19th
International Conference on World wide web, WWW ’10, pp. 701‐710• Minas Gjoka, Maciej Kurant, Carter Butts, Athina Markopoulou. 2009. “Walking in Facebook: A case study
of unbiased sampling of OSN”, INFOCOM, 2010: 2498‐2506• Bruno Ribeiro and Don Towsley. 2012. “On the estimation accuracy of degree distributions from graph
sampling”, IEEE 51st Annual Conference on Decision and Control, pp. 5240‐5247• Maciej Kurant, Athina Markopoulou, and Patrick Thiran. 2011. “Towards unbiased BFS Sampling”, In IEEE
Journal on Selected Areas in Communications, Vol 29, Issue 9, 2011• Nesreen K. Ahmed, Jennifer Neville, Ramana Kompella. 2013. Network Sampling: From Static to Streaming
Graphs, Accepted to appear in TKDD 2013• D.D. Heckatorn. 1997. “Respondent‐driven Sampling: a new approach to the study of hidden populations”.
Social problems, pp174‐199, 1997.• Luciana S. Buriol, Gereon Frahling, Stefano Leonardi, Alberto marchetti‐Spaccamela, Christian Sohler. 2006.
“Counting triangles in Data Streams”. PODS ’06 Proc. of the 25th ACM SIGMOD‐SIGACT‐SIGART Symposium on Principles of Database Systems. Pp. 253‐262
• Monika R. Henzinger, Allan Heydon, Michael Mitzenmacher, and Marc Najork. 2000. On near‐uniform URL sampling. In Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications netowrking, 295‐308.
• Jeffrey S. Vitter. 1985. Random sampling with a reservoir. ACM Trans. Math. Softw. 11, 1• Gjoka, M.; Kurant, M.; Butts, C.T.; Markopoulou, A., "Practical Recommendations on Crawling Online Social
Networks," Selected Areas in Communications, IEEE Journal on , vol.29, no.9, pp.1872,1892, October 2011• C. Seshadhri† Ali Pinar‡ Tamara G. Kolda, Triadic Measure on Graphs: The power of Wedge Sapling, SDM
2013
Thank you!
Questions?