Size matters: 1) Cluster structure of large networks 2) Searching the world’s social network
Jure Leskovec (jure@cs.stanford.edu)
Computer Science Department, Cornell University / Stanford University
Joint work with: Eric Horvitz, Michael Mahoney, Kevin Lang, Anirban Dasgupta


Post on 28-Dec-2015




Rich data: Networks

Large on-line computing applications have detailed records of human activity:
▪ On-line communities: Facebook (120 million)
▪ Communication: Instant Messenger (~1 billion)
▪ News and social media: Blogging (250 million)

We model the data as a network (an interaction graph)

Can observe and study phenomena at scales not possible before

[Figure: Communication network]


Outline

The Small-world experiment:
▪ On a 240 million node communication network of Microsoft Instant Messenger

Small vs. large networks:
▪ Modeling community (cluster) structure of large networks

[Figures: Zachary’s karate club (N=34); tiny part of a large social network]


How expressed are communities?

How community-like is a set of nodes?

Idea: Use approximation algorithms for NP-hard graph partitioning problems as experimental probes of network structure.

Conductance (normalized cut):
Φ(S) = # edges cut / # edges inside
▪ Small Φ(S) corresponds to more community-like sets of nodes

[Figure: node sets S and S’]
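The score above can be sketched in a few lines of Python. This is a minimal illustration of the ratio exactly as the slide states it (# edges cut / # edges inside); the function name `community_score` and the toy graph are hypothetical. Note that the commonly used definition of conductance normalizes the cut by the volume (sum of degrees) of the smaller side rather than by the internal edge count, so values may differ from the slide's simplified ratio.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[int, int]


def community_score(edges: Iterable[Edge], S: Set[int]) -> float:
    """Slide's score: (# edges cut) / (# edges inside S). Lower is better."""
    cut = 0
    inside = 0
    for u, v in edges:
        in_u, in_v = u in S, v in S
        if in_u and in_v:
            inside += 1          # both endpoints in S
        elif in_u != in_v:
            cut += 1             # exactly one endpoint in S
    if inside == 0:
        return float("inf")      # no internal edges: worst possible score
    return cut / inside


# Hypothetical toy graph: two triangles joined by a single edge.
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

if __name__ == "__main__":
    # One triangle: 1 cut edge, 3 inside edges.
    print(community_score(EDGES, {0, 1, 2}))
```

A single triangle scores 1/3 here, while an arbitrary pair of adjacent nodes scores much worse because most of their edges leave the set.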


Community score (quality)

Score: Φ(S) = # edges cut / # edges inside

What is the “best” community of 5 nodes?


Community score (quality)

Score: Φ(S) = # edges cut / # edges inside

Bad community: Φ = 5/6 = 0.83

What is the “best” community of 5 nodes?


Community score (quality)

Score: Φ(S) = # edges cut / # edges inside

Bad community: Φ = 5/7 = 0.7
Better community: Φ = 2/5 = 0.4

What is the “best” community of 5 nodes?


Community score (quality)

Score: Φ(S) = # edges cut / # edges inside

Bad community: Φ = 5/7 = 0.7
Better community: Φ = 2/5 = 0.4
Best community: Φ = 2/8 = 0.25

What is the “best” community of 5 nodes?


Network Community Profile Plot

We define the network community profile (NCP) plot:
▪ Plot the score of the best community of size k

[Plot: log Φ(k) vs. community size, log k; marked points Φ(5)=0.25 at k=5 and Φ(7)=0.18 at k=7]
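The NCP definition can be illustrated by brute force on a tiny graph. The sketch below enumerates every node set of size k and keeps the lowest score; the names `score`, `ncp`, and the toy graph are hypothetical. Exhaustive enumeration is exponential in the graph size, which is why the talk relies on approximation algorithms for real networks; this is only a way to see what the plot measures.

```python
from itertools import combinations


def score(edges, S):
    """Slide's score: (# edges cut) / (# edges inside S)."""
    cut = sum(1 for u, v in edges if (u in S) != (v in S))
    inside = sum(1 for u, v in edges if u in S and v in S)
    return cut / inside if inside else float("inf")


def ncp(nodes, edges, max_k):
    """Return {k: lowest score over all node sets of size k} for k = 1..max_k."""
    profile = {}
    for k in range(1, max_k + 1):
        profile[k] = min(score(edges, set(S)) for S in combinations(nodes, k))
    return profile


# Hypothetical toy graph: two triangles joined by a single edge.
NODES = range(6)
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
PROFILE = ncp(NODES, EDGES, 5)

if __name__ == "__main__":
    for k, phi in PROFILE.items():
        print(k, phi)
```

On this toy graph the profile dips at k=3 (one whole triangle) and rises again for k=4, a small-scale analogue of a plot with a best community size.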


NCP plot: Low-dimensional and random graphs

[Plots: d-dimensional meshes; hierarchically nested clusters]


NCP plot: Zachary’s karate club

▪ Zachary’s university karate club social network
▪ During the study the club split into 2
▪ The split (squares vs. circles) corresponds to cut B


NCP plot: Network Science

▪ Collaborations between scientists in Networks [Newman, 2005]


Present work: Large networks

Previous work mostly focused on community structure of small networks (~100 nodes)

We examined 108 different large networks


Example of a large network

Typical example: General relativity collaboration network (4,158 nodes, 13,422 edges)


More NCP plots of networks


NCP: LiveJournal (N=5M, E=42M)

[Plot: Φ(k) (conductance) vs. k (community size)]
▪ Better and better communities
▪ Best community has ~100 nodes
▪ Communities get worse and worse


Explanation: Downward part

Small clusters on the edge of the network are responsible for the downward part of the NCP plot

[Figure labels: NCP plot; Best cluster]


Explanation: Upward part

Each additional edge inside the cluster costs more:
▪ Φ = 1/3 = 0.33
▪ Φ = 2/4 = 0.5
▪ Φ = 8/6 = 1.3
▪ Φ = 64/14 = 4.5

Each node has twice as many children

[Figure: NCP plot]


Suggested network structure

Network structure: Core-periphery (jellyfish, octopus)
▪ Whiskers are responsible for good communities
▪ Denser and denser core of the network
▪ Core contains ~60% of the nodes and ~80% of the edges


What is a good model?

What is a good model that explains such network structure?

▪ Pref. attachment: Flat
▪ Small World: Down and Flat
▪ Geometric Pref. Attachment: Flat and Down


Forest Fire model works

Forest Fire [LKF05]: connections spread like a fire
▪ New node joins the network
▪ Selects a seed node
▪ Connects to some of its neighbors
▪ Continue recursively

Notes:
• Preferential attachment flavor - second neighbor is not uniform at random.
• Copying flavor - since we burn the seed’s neighbors.
• Hierarchical flavor - seed is parent.
• “Local” flavor - burn “near” -- in a diffusion sense -- the seed vertex.

As community grows it blends into the core of the network
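The four steps above can be sketched as a short generator. This is a simplified illustration, not the exact model from [LKF05]: the graph is undirected, and the burning rule (each unburned neighbor catches fire independently with probability `p`) is an assumption chosen for brevity. The function name `forest_fire` and its parameters are hypothetical.

```python
import random


def forest_fire(n, p=0.35, seed=0):
    """Grow an undirected graph with n nodes; return {node: set of neighbors}."""
    rng = random.Random(seed)
    adj = {0: set()}
    for new in range(1, n):
        # Step 1-2: the new node picks a seed node uniformly at random.
        start = rng.randrange(new)
        burned = {start}
        frontier = [start]
        # Step 3-4: the fire spreads recursively through neighbors.
        while frontier:
            node = frontier.pop()
            for nbr in adj[node]:
                # Assumed rule: each unburned neighbor burns with probability p.
                if nbr not in burned and rng.random() < p:
                    burned.add(nbr)
                    frontier.append(nbr)
        # The new node links to every burned node.
        adj[new] = set(burned)
        for b in burned:
            adj[b].add(new)
    return adj


if __name__ == "__main__":
    g = forest_fire(200)
    print(sum(len(nbrs) for nbrs in g.values()) // 2, "edges")
```

Because the fire starts at the seed and spreads only along existing edges, new links land "near" the seed, which is the local flavor noted above.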


Forest Fire NCP plot

[Plot label: rewired network]


Typical cluster size

How does the size of best cluster scale with the size of the network?


Size of best cluster over time

Cluster size remains constant (even if one allows nesting) over time

LinkedIn network over time


Cluster size vs. network size

Each dot is a different network


Connections

The Dunbar number
▪ 150 individuals is maximum community size

What edges “mean” and community identification
▪ Using node and edge types/attributes

Implications for machine learning
▪ No large clusters
▪ No/little (assortative) hierarchical structure
▪ Can’t be well embedded: no underlying geometry


The small-world of the MSN Instant Messenger

Joint work with Eric Horvitz, Microsoft Research


The Small-world experiment

Milgram’s small world experiment [Milgram ’67, Dodds-Muhamad-Watts ’03]
▪ People send letters from Nebraska to Boston
▪ How many steps does it take?
▪ 6.2 on the average, thus “6 degrees of separation”


The Small-world experiment

1) Short paths exist in a social network
2) People are able to find them (using only partial knowledge of the network)

Local search: forwarding a message

[Figure: source s and target t with d(s,t)=h; good nodes: d=h-1; bad nodes: d≥h]
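The good/bad distinction in the figure can be made concrete with breadth-first search. The sketch below assumes an undirected, connected graph given as an adjacency dict; the names `bfs_distances`, `split_neighbors`, and the toy graph are hypothetical. The fraction of good neighbors is also the chance that forwarding to a uniformly random neighbor makes progress toward the target.

```python
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def split_neighbors(adj, s, t):
    """Return (h, good, bad): good neighbors have d = h-1, bad ones d >= h."""
    dist_to_t = bfs_distances(adj, t)  # undirected graph: d(v, t) = d(t, v)
    h = dist_to_t[s]
    good = sorted(v for v in adj[s] if dist_to_t[v] == h - 1)
    bad = sorted(v for v in adj[s] if dist_to_t[v] >= h)
    return h, good, bad


# Hypothetical toy graph: the only short route from s to t goes through a.
ADJ = {
    "s": ["a", "b", "c"],
    "a": ["s", "t"],
    "b": ["s", "c"],
    "c": ["s", "b"],
    "t": ["a"],
}

if __name__ == "__main__":
    h, good, bad = split_neighbors(ADJ, "s", "t")
    print(h, good, bad, len(good) / len(ADJ["s"]))
```

In this toy graph only one of the three neighbors of s is good, so a random forward succeeds one time in three.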


Our dataset: Instant Messaging

Contact (buddy) list Messaging window


MSN communication

We collected the data for June 2006
4.5 TB of compressed data:
▪ 245 million users logged in
▪ 180 million users engaged in conversations
▪ 255 billion exchanged messages
▪ 1 billion conversations / day


MSN network

The network: 180M nodes, 1.3B undirected edges


MSN: path lengths

MSN Messenger network: number of steps between pairs of people
▪ Avg. path length 6.6
▪ 90% of the people can be reached in < 8 hops

Hops   Nodes
0      1
1      10
2      78
3      3,96
4      8,648
5      3,299,252
6      28,395,849
7      79,059,497
8      52,995,778
9      10,321,008
10     1,955,007
11     518,410
12     149,945
13     44,616
14     13,740
15     4,476
16     1,542
17     536
18     167
19     71
20     29
21     16
22     10
23     3
24     2
25     3
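Summary numbers such as an average path length or "share of nodes within h hops" can be derived from a hops/nodes histogram like the one above. The sketch below shows the arithmetic on a small made-up histogram (`TOY` is hypothetical data, not the MSN table); the function name `hop_stats` is also hypothetical.

```python
def hop_stats(hist, max_hops):
    """hist maps hop count -> number of nodes first reached at that distance.

    Returns (mean hop count, fraction of nodes within max_hops hops).
    """
    total = sum(hist.values())
    mean = sum(h * n for h, n in hist.items()) / total
    within = sum(n for h, n in hist.items() if h <= max_hops) / total
    return mean, within


# Hypothetical histogram for illustration only.
TOY = {0: 1, 1: 3, 2: 4, 3: 2}

if __name__ == "__main__":
    mean, within = hop_stats(TOY, 2)
    print(mean, within)
```

A table from a single starting node gives that node's view only; a network-wide average would aggregate such histograms over many starting nodes.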


Degree distribution:

A node that exchanged messages with ~2 million people


Robustness of shortest paths

Short paths exist and they are robust

Randomized network (same degree distr.)

All links

Both way links


Learning to search in a network

What is the decision function that makes me forward the message to the target?

[Figure: source s and target t with d(s,t)=h; good nodes: d=h-1; bad nodes: d≥h]

What are the characteristics of shortest paths? How hard is it to find them?


Does geography help?


How hard is it to find a good node?


Probability of success if we forward to a random neighbor


Algorithm accuracy at hops


Use a decision tree to learn a classifier:
▪ Model: 0.4128
▪ Random: 0.0207


The learned model

Green bar is prob. that node is good


Comparing search heuristics

▪ Pick a pair of nodes: start at s
▪ Walk until hit the target t, where next node is chosen:

Search alg.   % found   Mean path length
Random        0.0008    3,709
MinGeoDist    0.0282    778
MaxDeg        0.0158    4,964
Deg/Geo2      0.1446    2,676
Cntry         0.0108    402
Cntry*Deg     0.1313    3,114
Lang          0.0055    1,699
Lang*Deg      0.0496    3,163
Age           0.0012    2,890
Age*Deg       0.0203    5,324

It works! (in a network with 180 million nodes)
▪ Milgram’s path completion is 29%
▪ Dodds, Muhamad, Watts: 0.015% completion
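The walk procedure described above (start at s, repeatedly pick the next node by a heuristic until t is hit) can be sketched for one heuristic. The version below is an assumed reading of "MinGeoDist": always step to the neighbor geographically closest to the target. The function name `greedy_walk`, the step cap, and the toy graph with coordinates are all hypothetical; the exact rules used in the study are not given on the slide.

```python
import math


def greedy_walk(adj, pos, s, t, max_steps=100):
    """Walk from s toward t, always moving to the neighbor closest to t.

    Returns the path as a list of nodes, or None if t is not reached
    within max_steps steps (a greedy walk can loop forever otherwise).
    """
    path = [s]
    current = s
    while current != t and len(path) <= max_steps:
        if not adj[current]:
            return None  # dead end: no neighbor to forward to
        current = min(adj[current], key=lambda v: math.dist(pos[v], pos[t]))
        path.append(current)
    return path if current == t else None


# Hypothetical toy graph: a chain 0-1-2-3-4-5 plus a shortcut 0-3.
POS = {i: (float(i), 0.0) for i in range(6)}
ADJ = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2, 4], 4: [3, 5], 5: [4]}

if __name__ == "__main__":
    print(greedy_walk(ADJ, POS, 0, 5))
```

Other rows of the table would swap in a different `key` function (for example node degree for a MaxDeg-style rule); the step cap matters because a purely local rule can bounce between the same nodes without ever reaching t.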


Conclusions and reflections

Why are networks the way they are?

Only recently have basic properties been observed on a large scale
▪ Confirms social science intuitions; calls others into question

Benefits of working with large data
▪ Observe structures not visible at smaller scales