subsampling graphs 1. recap of pagerank-nibble 2

34
Subsampling Graphs 1

Upload: austin-lawson

Post on 13-Dec-2015

225 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

1

Subsampling Graphs

Page 2: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

2

RECAP OF PAGERANK-NIBBLE

Page 3: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

3

Why I’m talking about graphs• Lots of large data is graphs– Facebook, Twitter, citation data, and other social networks– The web, the blogosphere, the semantic web, Freebase, Wikipedia, Twitter, and other information networks– Text corpora (like RCV1), large datasets with discrete feature values, and other bipartite networks

• nodes = documents or words• links connect document word or word document

– Computer networks, biological networks (proteins, ecosystems, brains, …), …– Heterogeneous networks with multiple types of nodes

• people, groups, documents

Page 4: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

4

Our first operation on graphs

• Local graph partitioning• Why?– it’s about the best we can do in terms of subsampling/exploratory analysis of a graph

Page 5: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

5

What is Local Graph Partitioning?

Global Local

Page 6: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

6

What is Local Graph Partitioning?

Page 7: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

7

Main results of the paper (Reid et al)

1. An approximate personalized PageRank computation that only touches nodes “near” the seed– but has small error relative to the true PageRank vector2. A proof that a sweep over the approximate personalized PageRank vector finds a cut with conductance sqrt(α ln m)– unless no good cut exists

• no subset S contains significantly more mass in the approximate PageRank than in a uniform distribution

Page 8: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

8

Approximate PageRank: Key Idea

• p is current approximation (start at 0)

• r is set of “recursive calls to make”• residual error• start with all mass on s

• u is the node picked for the next call

Page 9: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

9

ANALYSIS OF PAGERANK-NIBBLE

Page 10: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

10

Analysis

linearity

re-group & linearity

Page 11: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

11

Approximate PageRank: Key Idea

By definition PageRank is fixed point of:Claim:

Proof:

Page 12: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

12

Approximate PageRank: Algorithm

Page 13: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

13

Analysis

So, at every point in the apr algorithm:

Also, at each point, |r|1 decreases by α*ε*degree(u), so: after T push operations where degree(i-th u)=di, we know

which bounds the size of r and p

Page 14: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

14

Analysis

With the invariant:This bounds the error of p relative to the PageRank vector.

Page 15: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

15

Comments – API

p,r are hash tables – they are small (1/εα)

push just needs p, r, and neighbors of u

Could implement with API:• List<Node> neighbor(Node u)• int degree(Node u)

d(v) = api.degree(v)

Page 16: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

16

Comments - Ordering

might pick the largest r(u)/d(u) … or…

Page 17: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

17

Comments – Ordering for Scanning

Scan repeatedly through an adjacency-list encoding of the graph

For every line you read u, v1,…,vd(u) such that r(u)/d(u) > ε:

benefit: storage is O(1/εα) for the hash tables, avoids any seeking

Page 18: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

18

Possible optimizations?• Much faster than doing random access the first few scans, but then slower the last few

• …there will be only a few ‘pushes’ per scan• Optimizations you might imagine:

– Parallelize?– Hybrid seek/scan:

• Index the nodes in the graph on the first scan• Start seeking when you expect too few pushes to justify a scan

– Say, less than one push/megabyte of scanning– Hotspots:

• Save adjacency-list representation for nodes with a large r(u)/d(u) in a separate file of “hot spots” as you scan• Then rescan that smaller list of “hot spots” until their score drops below threshold.

Page 19: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

19

After computing apr…• Given a graph– that’s too big for memory, and/or– that’s only accessible via API

• …we can extract a sample in an interesting area–Run the apr/rwr from a seed node– Sweep to find a low-conductance subset

• Then– compute statistics– test out some ideas– visualize it…

Page 20: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

20

Key idea: a “sweep”• Order vertices v1, v2, …. by highest apr(s)• Pick a prefix S={ v1, v2, …. vk }:

– S={}; volS =0; B={}– For k=1,…n:

• S += {vk}• volS += degree(vk)• B += {u: vku and u not in S}• Φ[k] = |B|/volS

– Pick k: Φ[k] is smallest

Page 21: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

21

Scoring a graph prefix

the edges leaving S

• vol(S) is sum of deg(x) for x in S• for small S: Prob(random edge leaves S)

Page 22: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

22

Putting this together• Given a graph– that’s too big for memory, and/or– that’s only accessible via API

• …we can extract a sample in an interesting area– Run the apr/rwr from a seed node– Sweep to find a low-conductance subset

• Then– compute statistics– test out some ideas– visualize it…

Page 23: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

23

Visualizing a Graph with Gephi

Page 24: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

24

Screen shots/demo

• Gelphi – java tool• Reads several inputs– .csv file– [demo]

Page 25: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

25

Page 26: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

26

Page 27: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

27

Page 28: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

28

Page 29: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

29

Page 30: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

30

Page 31: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

31

Page 32: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

32

Page 33: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

33

Page 34: Subsampling Graphs 1. RECAP OF PAGERANK-NIBBLE 2

34