Reza Motamedi, Reza Rejaie, Walter Willinger, Daniel Lowd, Roberto Gonzalez
http://onrg.cs.uoregon.edu/WalkAbout
Inferring Coarse Views of Connectivity in Very Large Graphs
10/8/14 1
Introduction
! Large-scale networked systems (e.g. OSNs) are often represented as graphs
! Characterizing the connectivity structure of such a graph provides deeper insights about the system
! A coarse view of a graph allows a top-down analysis • Identify a few tightly connected regions along with their inter- and intra-region connectivity • If needed/desired, zoom in on an individual region and recurse
! How can one capture a coarse view of large graphs?
Obtaining coarse view of a graph
! Community detection techniques optimize an objective function • Detects communities with 100s of nodes in real-world graphs • Some techniques have limited scalability
! Graph partitioning techniques divide the graph into strongly connected partitions • May produce balanced partitions • May require seeds for each partition or the number of partitions as input
This paper presents
! The design of a scalable technique (WalkAbout) to infer coarse (regional) views of a graph
! An illustration of WalkAbout “in action” for inferring the regional connectivity of Flickr, Twitter, Google+
! A study of the relationship between regional- and community-level views of a large graph
! An initial attempt at answering the question “Are (Flickr) regions meaningful?”
Random Walks (RW)
! Consider an undirected, connected, non-bipartite graph $G = [V, E]$
! The probability that a very long RW visits node x converges to $\frac{\deg(x)}{2|E|}$
! The mixing time $T_G(\varepsilon)$ is the walk length at which the probability of being at node x is within $\varepsilon$ of the stationary distribution • We use mixing time rather informally, not specifying $\varepsilon$
Behavior of Many RWs
! Starting |V| RWs in parallel (one from each node)
! V(x,wl): the expected number of RWs that are at node x after wl steps
! As wl reaches the mixing time, the number of walkers at node x converges to $V(x, wl) \approx |V| \frac{\deg(x)}{2|E|}$
! degree/visit ratio (dvr) converges to the average node degree: $\frac{\deg(x)}{V(x, wl)} \approx \frac{2|E|}{|V|}$
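The convergence of dvr to the average degree can be checked numerically. The following is a minimal Python sketch, not the authors' code: it propagates the expected number of walkers step by step instead of sampling actual walks, and the function names and the small test graph are illustrative assumptions.

```python
def build_adj(edges):
    """Adjacency lists of an undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def expected_visits(adj, wl):
    """Expected number of walkers at each node after wl steps,
    starting one walker at every node."""
    visits = {x: 1.0 for x in adj}
    for _ in range(wl):
        nxt = dict.fromkeys(adj, 0.0)
        for x, mass in visits.items():
            share = mass / len(adj[x])   # walkers split evenly over neighbors
            for y in adj[x]:
                nxt[y] += share
        visits = nxt
    return visits

def dvr(adj, wl):
    """degree/visit ratio of every node for walk length wl."""
    visits = expected_visits(adj, wl)
    return {x: len(adj[x]) / visits[x] for x in adj}

# Hypothetical toy graph: connected, contains a triangle (non-bipartite),
# 5 nodes and 7 edges, so the average degree is 2 * 7 / 5 = 2.8.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (2, 4), (0, 3)]
adj = build_adj(edges)
ratios = dvr(adj, 1000)
```

For a walk length well beyond the mixing time of this toy graph, every entry of `ratios` should be close to the average degree, regardless of the degree of the individual node.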
Validation through Simulations
! Use simulation over synthetic graphs to explore the dependency of dvr on different parameters • More results in the paper
[Figure: (1) PDF of dvr for graphs with average degree 24.74, 33.94, 44.30, and 53.29; (2) PDF of dvr for wl = 10, 20, 50, and 100; (3) dvr versus wl for nodes with deg < 50 and deg > 100.]
Detecting Regions – Key Idea
! Suppose a graph consists of a few weakly connected regions
! Start RWs from randomly selected nodes on a graph $G = [V, E]$ that has multiple regions • Region i is $G_i = [V_i, E_i]$
! If wl is close to the mixing time of the regions, a majority of RWs remain in their starting region • The graph can be viewed as disconnected regions
! $dvr_i(x) = \frac{\deg(x)}{E[V(x, wl)]}$ converges to the average node degree of region i, $\frac{2|E_i|}{|V_i|}$
Key Idea (cont’d)
! Regions with different average degree form separate peaks in the dvr histogram • Region: a non-overlapping range of dvr values
! Formation of peaks is a transient phenomenon • As wl increases beyond the mixing time of regions, dvr for all nodes converges to a single value
Ø The similarity of dvr implies tighter connectivity among nodes in a region Ø dvr signal is indirect and efficient => scalable
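The transient nature of the peaks can be illustrated on a toy graph. This is a hedged Python sketch, not the WalkAbout implementation: the two regions are assumed to be cliques of 20 and 8 nodes joined by two bridge edges, one walker starts at every node, and expected walker counts are propagated instead of sampling walks. All names are illustrative.

```python
from itertools import combinations

def build_adj(edges):
    """Adjacency lists of an undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj

def dvr(adj, wl):
    """deg(x) divided by the expected number of walkers at x after wl steps."""
    visits = {x: 1.0 for x in adj}
    for _ in range(wl):
        nxt = dict.fromkeys(adj, 0.0)
        for x, mass in visits.items():
            share = mass / len(adj[x])
            for y in adj[x]:
                nxt[y] += share
        visits = nxt
    return {x: len(adj[x]) / visits[x] for x in adj}

region_a = list(range(20))       # clique, average degree about 19
region_b = list(range(20, 28))   # clique, average degree about 7
bridges = [(0, 20), (1, 21)]     # weak inter-region connectivity
edges = list(combinations(region_a, 2)) + list(combinations(region_b, 2)) + bridges
adj = build_adj(edges)

short_walk = dvr(adj, 5)      # walk length near the mixing time of the regions
long_walk = dvr(adj, 2000)    # walk length far beyond the mixing time of the graph
```

With the short walk length, the dvr values of the two regions stay in two separate ranges (two peaks in a histogram). With the long walk length, all values collapse to the global average degree, 2 * 220 / 28 in this toy graph.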
Validation on Synthetic Graphs
! A graph with two regions (average degree of 70, 60) connected with b bridge edges.
! Each experiment changes only one thing: a single region (its average degree or size) or the bridge size
[Figure: three 3-D plots of the dvr distribution as one parameter varies: average degree of a region, region size, and bridge size.]
WalkAbout
! Using many short RWs to infer/explore regional connectivity of large graphs • Determining the number of regions, the nodes per region, and the inter- and intra-region connectivity
! Basic challenges • The variation and rate of convergence of dvr are inversely proportional to node degree (i.e. noise of low degree nodes) • Regions having similar average deg. & different mixing times
! Identifying regions in two steps: • Detecting the core (high degree) nodes of each region • Mapping low degree nodes to the detected cores per region
WalkAbout – Main Steps
! Emulating RWs and generating the dvr histogram • Removing low degree nodes ($D_{min}$) to reduce noise
! Identifying the core of each region • Search for the walk length that leads to pronounced peaks • Detect a peak & its associated dvr range => nodes per region
! Mapping low degree nodes to cores • Based on the relative reachability (using multiple RWs)
! Producing the regional view
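The core-detection step can be sketched as follows. This is only an illustration under stated assumptions, not the peak detection used by WalkAbout: nodes with degree below `d_min` are dropped (whether the threshold is inclusive is an assumption), the dvr values are binned with a hypothetical `bin_width`, and each run of consecutive bins holding at least `min_count` nodes is treated as one peak whose dvr range defines a core.

```python
def dvr_ranges(values, bin_width, min_count):
    """Merge consecutive histogram bins with at least min_count values
    into (low, high) dvr ranges, one range per peak."""
    counts = {}
    for v in values:
        b = int(v // bin_width)
        counts[b] = counts.get(b, 0) + 1
    ranges, start = [], None
    if not counts:
        return ranges
    # Iterate one bin past the last occupied bin so an open range gets closed.
    for b in range(min(counts), max(counts) + 2):
        if counts.get(b, 0) >= min_count:
            if start is None:
                start = b
        elif start is not None:
            ranges.append((start * bin_width, b * bin_width))
            start = None
    return ranges

def core_regions(deg, dvr, d_min, bin_width=1.0, min_count=1):
    """Keep nodes with degree >= d_min and group them by dvr range."""
    core = {x: dvr[x] for x in deg if deg[x] >= d_min}
    ranges = dvr_ranges(core.values(), bin_width, min_count)
    regions = [[x for x, v in core.items() if lo <= v < hi] for lo, hi in ranges]
    return ranges, regions

# Hypothetical input: node f is a low degree node and is filtered out.
deg = {"a": 600, "b": 700, "c": 650, "d": 900, "e": 800, "f": 10}
dvr = {"a": 20.2, "b": 20.7, "c": 21.1, "d": 30.4, "e": 30.9, "f": 25.0}
ranges, regions = core_regions(deg, dvr, d_min=500)
```

In this toy input the high degree nodes fall into two non-overlapping dvr ranges, which yields two cores; the low degree node would be mapped to one of them in the later mapping step.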
Inferring vs Exploring
! WalkAbout provides a few parameters that affect the resulting regional view ($D_{min}$, wl) • Parameters can be set based on domain knowledge
! Sensitivity to these parameters offers insight about the graph structure
! Developing WalkAbout as an interactive tool with GUI • Publicly available at http://onrg.cs.uoregon.edu/WalkAbout
WalkAbout in Action
! Inferring the regional view of connectivity of the largest connected component (LCC) of Flickr, Twitter and Google+
! To contrast: apply the Louvain community detection method
! Default setting: $D_{min} = 500$ • See the tech report for results on the sensitivity to $D_{min}$

              Flickr   Twitter   GPlus
Nodes         1.6M     41.6M     51.7M
Edges         31.M     1,468M    869.4M
Communities   28K      39K       24K
Regional View of Flickr
[Figure: dvr histogram of Flickr with peaks labeled R0 to R4, and a 3-D plot of the dvr distribution as a function of wl.]

Setting: wl = 30, $D_{min} = 500$

      Cores   Regions
      Size    %Nodes   %Edges   Avg.Deg   Mod.
R0    4000    92.8     58.2     11.9      0.4
R1    569     1.2      3.2      50.1      0.5
R2    3010    4        17.6     83.7      0.7
R3    2120    1.8      16.6     174.2     0.6
R4    1140    0.2      4.4      431       0.3
Lessons Learned
! Regions with closer dvr tend to have stronger inter-region connectivity • Incorrectly placed high degree nodes • Regions with different sizes and mixing times
! The number of peaks changes with walk length • The number/selection of peaks affect the regional view
! Identified regions could be very imbalanced in size • Detecting possible sub-regions in a hierarchical manner
Regions & Communities
! Comparing/relating the regional and community views • Typical community is much smaller and more modular • Largest communities have sizes comparable to regions Ø Orders of magnitude more communities
! The highest degree nodes per region are placed in a few communities with size & modularity comparable to regions!
[Figure: three panels comparing Louvain, Large Louv, and WA across TW, G+, FL, and OR: modularity, average degree, and size.]
Mapping Communities to Regions
! Community c is mapped to region R that contains most of its nodes • Mapping confidence: fraction of c’s nodes located in R
! Across regions of all OSNs • For 75% of communities, the confidence is 100% • For 90% of communities, the confidence is more than 80%
Ø Regions can be viewed as a collection of communities Ø A coarser view of the graph
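The mapping rule above can be expressed in a few lines. This is a minimal Python sketch with hypothetical data; it assumes every node already has a region label in `region_of`, and the names are illustrative.

```python
from collections import Counter

def map_community(nodes, region_of):
    """Return (region, confidence): the region holding most of the
    community's nodes and the fraction of its nodes located there."""
    counts = Counter(region_of[x] for x in nodes)
    region, hits = counts.most_common(1)[0]
    return region, hits / len(nodes)

# Hypothetical node-to-region labels and two communities.
region_of = {1: "R0", 2: "R0", 3: "R1", 4: "R0", 5: "R2", 6: "R2"}
communities = {"c1": [1, 2, 3, 4], "c2": [5, 6]}
mapping = {c: map_community(nodes, region_of) for c, nodes in communities.items()}
```

Here community c1 maps to R0 with confidence 0.75 and c2 maps to R2 with confidence 1.0. A confidence of 100% means the community lies entirely inside one region.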
Per-Region Analysis of Communities
! Do the characteristics of communities generally reveal the features of their region? • No strong relation between the modularity of communities in a region and the modularity of the region
! The inter-connectivity among communities is critical to determine features of each region
[Figure: per-region distributions of community average degree, modularity, and size for FL (R0 to R4), TW (R0 to R5), and G+ (R0 to R5).]
Run-time
! Comparing the run times of WalkAbout and the Louvain community detection technique • On Intel X5650 (2.66GHz) computer with 72GB RAM
! Splitting WalkAbout run time into • dvr calculations to detect the cores, and • Mapping of low degree nodes to those cores
! WalkAbout exhibits a shorter run time for large graphs
[Figure: run time in seconds on G+, TW, and FL for Louvain, WA: Map to Core, and WA: dvr.]
A New Kind of Validation
! Do users in a region exhibit similar social attributes? • Need social context for users
! 99K social groups in Flickr: group name, users/group • Group name provides info about group interest or context • Map each group to the region where most of its users are located • Mapping confidence for R1-R4 is high even for large groups • e.g. group names in R1 are related to male nudity
Ø Social forces appear to drive the formation of regions
[Figure: group mapping confidence for regions R0 to R4.]
Conclusion & Outlook
! WalkAbout, a new technique to infer/explore coarse views of large graphs
! Applying WalkAbout to three major OSNs
! Are regions meaningful? • Relating the regional- and community-level views • Showing social cohesion of regions in Flickr
! Future plans • Exploring the recursive application of WalkAbout • Multi-scale characterization of graph connectivity and its application to examine graph evolution