tools for large graph mining www 2008 tutorial part 4: case studies

113
Tools for large graph mining WWW 2008 tutorial Part 4: Case studies Jure Leskovec and Christos Faloutsos Machine Learning Department Joint work with: Lada Adamic, Deepay Chakrabarti, Natalie Glance, Carlos Guestrin, Bernardo Huberman, Jon Kleinberg, Andreas Krause, Mary McGlohon, Ajit Singh, and Jeanne VanBriesen.

Upload: falala

Post on 19-Jan-2016

27 views

Category:

Documents


0 download

DESCRIPTION

Tools for large graph mining WWW 2008 tutorial Part 4: Case studies. Jure Leskovec and Christos Faloutsos Machine Learning Department. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Tools for large graph miningWWW 2008 tutorial

Part 4: Case studies

Jure Leskovec and Christos Faloutsos Machine Learning Department

Joint work with: Lada Adamic, Deepay Chakrabarti, Natalie Glance, Carlos Guestrin, Bernardo Huberman, Jon Kleinberg, Andreas Krause, Mary McGlohon,

Ajit Singh, and Jeanne VanBriesen.

Page 2: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Tutorial outline

Part 1: Structure and models for networks What are properties of large graphs? How do we model them?

Part 2: Dynamics of networks Diffusion and cascading behavior How do viruses and information propagate?

Part 3: Case studies 240 million MSN instant messenger network Graph projections: how does the web look like

Leskovec&Faloutsos, WWW 2008 Part 4-2

Page 3: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Part 3: Outline

Case studies– Co-clustering– Microsoft Instant Messenger Communication

network• How does the world communicate

– Web projections• How to do learning from contextual subgraphs

– Finding fraudsters on eBay– Center piece subgraphs

• How to find best path between the query nodes

Leskovec&Faloutsos, WWW 2008 Part 4-3

Page 4: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-4

Co-clustering

Given data matrix and the number of row and column groups k and l

Simultaneously Cluster rows of p(X, Y) into k disjoint groups Cluster columns of p(X, Y) into l disjoint groups

Page 5: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-5

Co-clustering

Let X and Y be discrete random variables X and Y take values in {1, 2, …, m} and {1, 2, …, n} p(X, Y) denotes the joint probability distribution—if not

known, it is often estimated based on co-occurrence data Application areas: text mining, market-basket analysis,

analysis of browsing behavior, etc. Key Obstacles in Clustering Contingency Tables

High Dimensionality, Sparsity, Noise Need for robust and scalable algorithms

Reference:1. Dhillon et al. Information-Theoretic Co-clustering, KDD’03

Page 6: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-6

04.04.004.04.04.

04.04.04.004.04.

05.05.05.000

05.05.05.000

00005.05.05.

00005.05.05.

036.036.028.028.036.036.

036.036.028.028036.036.

054.054.042.000

054.054.042.000

000042.054.054.

000042.054.054.

5.00

5.00

05.0

05.0

005.

005.

2.2.

3.0

03. 36.36.28.000

00028.36.36.

m

m

n

nl

k

k

l

eg, terms x documents

Page 7: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-7

04.04.004.04.04.

04.04.04.004.04.

05.05.05.000

05.05.05.000

00005.05.05.

00005.05.05.

036.036.028.028.036.036.

036.036.028.028036.036.

054.054.042.000

054.054.042.000

000042.054.054.

000042.054.054.

5.00

5.00

05.0

05.0

005.

005.

2.2.

3.0

03. 36.36.28.000

00028.36.36.

term xterm-group

doc xdoc group

term group xdoc. group

med. terms

cs terms

common terms

med. doccs doc

Page 8: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-8

Co-clustering

Observations uses KL divergence, instead of L2 the middle matrix is not diagonal

we’ll see that again in the Tucker tensor decomposition

Page 9: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-9

Problem with Information Theoretic Co-clustering

Number of row and column groups must be specified

Desiderata:

Simultaneously discover row and column groups

Fully Automatic: No “magic numbers”

Scalable to large graphs

Page 10: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-10

Cross-association

Desiderata:

Simultaneously discover row and column groups

Fully Automatic: No “magic numbers”

Scalable to large matrices

Reference:1. Chakrabarti et al. Fully Automatic Cross-Associations, KDD’04

Page 11: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-11

What makes a cross-association “good”?

versus

Column groups Column groups

Row

gro

ups

Row

gro

ups

Why is this better?

Page 12: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-12

What makes a cross-association “good”?

versus

Column groups Column groups

Row

gro

ups

Row

gro

ups

Why is this better?

simpler; easier to describeeasier to compress!

Page 13: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-13

What makes a cross-association “good”?

Problem definition: given an encoding scheme• decide on the # of col. and row groups k and l• and reorder rows and columns,• to achieve best compression

Page 14: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-14

Main Idea

sizei * H(xi) +Cost of describing cross-associations

Code Cost Description Cost

Σi Total Encoding Cost =

Good Compression

Better Clustering

Minimize the total cost (# bits)

for lossless compression

details

Page 15: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-15

Algorithmk = 5 row

groups

k=1, l=2

k=2, l=2

k=2, l=3

k=3, l=3

k=3, l=4

k=4, l=4

k=4, l=5

l = 5 col groups

Page 16: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Leskovec&Faloutsos, WWW 2008 2-16

Algorithm

Code for cross-associations (matlab):www.cs.cmu.edu

/~deepay/mywww/software/CrossAssociations-01-27-2005.tgz

Variations and extensions: ‘Autopart’ [Chakrabarti, PKDD’04] www.cs.cmu.edu/~deepay

Page 17: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Microsoft Instant Messenger Communication Network

How does the whole world communicate?

Leskovec and Horvitz: Worldwide Buzz: Planetary-Scale Views on an Instant-Messaging Network, WWW 2008

Page 18: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

The Largest Social Network

Leskovec&Faloutsos, WWW 2008 Part 4-18

Page 19: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Instant Messaging

Leskovec&Faloutsos, WWW 2008

• Contact (buddy) list• Messaging window

Part 4-19

Page 20: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

IM – Phenomena at planetary scale

Observe social phenomena at planetary scale: How does communication change with user

demographics (distance, age, sex)? How does geography affect communication? What is the structure of the communication

network?

Leskovec&Faloutsos, WWW 2008 Part 4-20

Page 21: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Communication data

The record of communication Presence data

user status events (login, status change) Communication data

who talks to whom Demographics data

user age, sex, location

Leskovec&Faloutsos, WWW 2008 Part 4-21

Page 22: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Data description: Presence

Events: Login, Logout Is this first ever login Add/Remove/Block buddy Add unregistered buddy (invite new user) Change of status (busy, away, BRB, Idle,…)

For each event: User Id Time

Leskovec&Faloutsos, WWW 2008 Part 4-22

Page 23: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Data description: Communication

For every conversation (session) we have a list of users who participated in the conversation

There can be multiple people per conversation For each conversation and each user:

User Id Time Joined Time Left Number of Messages Sent Number of Messages Received

Leskovec&Faloutsos, WWW 2008 Part 4-23

Page 24: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Data description: Demographics

For every user (self reported): Age Gender Location (Country, ZIP) Language IP address (we can do reverse geo IP lookup)

Leskovec&Faloutsos, WWW 2008 Part 4-24

Page 25: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Data collection

Log size: 150Gb/day Just copying over the network takes 8 to 10h Parsing and processing takes another 4 to 6h After parsing and compressing ~ 45 Gb/day Collected data for 30 days of June 2006:

Total: 1.3Tb of compressed data

Leskovec&Faloutsos, WWW 2008 Part 4-25

Page 26: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Data statistics

Activity over June 2006 (30 days) 245 million users logged in 180 million users engaged in conversations 17,5 million new accounts activated More than 30 billion conversations

Leskovec&Faloutsos, WWW 2008 Part 4-26

Page 27: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Data statistics per day

Activity on June 1 2006 1 billion conversations 93 million users login 65 million different users talk (exchange

messages) 1.5 million invitations for new accounts sent

Leskovec&Faloutsos, WWW 2008 Part 4-27

Page 28: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

User characteristics: age

Leskovec&Faloutsos, WWW 2008 Part 4-28

Page 29: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Age piramid: MSN vs. the world

Leskovec&Faloutsos, WWW 2008 Part 4-29

Page 30: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Conversation: Who talks to whom? Cross gender edges:

300 male-male and 235 female-female edges 640 million female-male edges

Leskovec&Faloutsos, WWW 2008 Part 4-30

Page 31: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Number of people per conversation

Max number of people simultaneously talking is 20, but conversation can have more people

Leskovec&Faloutsos, WWW 2008 Part 4-31

Page 32: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Conversation duration

Most conversations are short

Leskovec&Faloutsos, WWW 2008 Part 4-32

Page 33: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Conversations: number of messages

Sessions between fewer people run out of steam

Leskovec&Faloutsos, WWW 2008 Part 4-33

Page 34: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Time between conversations Individuals are highly

diverse What is probability to

login into the system after t minutes?

Power-law with exponent 1.5

Task queuing model [Barabasi ’05]

Leskovec&Faloutsos, WWW 2008 Part 4-34

Page 35: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Age: Number of conversationsU

ser s

elf r

epor

ted

age

High

LowLeskovec&Faloutsos, WWW 2008 Part 4-35

Page 36: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Age: Total conversation durationU

ser s

elf r

epor

ted

age

High

LowLeskovec&Faloutsos, WWW 2008 Part 4-36

Page 37: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Age: Messages per conversationU

ser s

elf r

epor

ted

age

High

LowLeskovec&Faloutsos, WWW 2008 Part 4-37

Page 38: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Age: Messages per unit timeU

ser s

elf r

epor

ted

age

High

LowLeskovec&Faloutsos, WWW 2008 Part 4-38

Page 39: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Who talks to whom: Number of conversations

Leskovec&Faloutsos, WWW 2008 Part 4-39

Page 40: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Who talks to whom: Conversation duration

Leskovec&Faloutsos, WWW 2008 Part 4-40

Page 41: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Geography and communication

Count the number of users logging in from particular location on the earth

Leskovec&Faloutsos, WWW 2008 Part 4-41

Page 42: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

How is Europe talking Logins from Europe

Leskovec&Faloutsos, WWW 2008 Part 4-42

Page 43: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Users per geo location

Blue circles have more than 1 million

logins.

Blue circles have more than 1 million

logins.

Leskovec&Faloutsos, WWW 2008 Part 4-43

Page 44: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Users per capita

Fraction of population using MSN:•Iceland: 35%•Spain: 28%•Netherlands, Canada, Sweden, Norway: 26%•France, UK: 18%•USA, Brazil: 8%

Fraction of population using MSN:•Iceland: 35%•Spain: 28%•Netherlands, Canada, Sweden, Norway: 26%•France, UK: 18%•USA, Brazil: 8%

Leskovec&Faloutsos, WWW 2008 Part 4-44

Page 45: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Communication heat map

For each conversation between geo points (A,B) we increase the intensity on the line between A and B

Leskovec&Faloutsos, WWW 2008 Part 4-45

Page 46: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Correlation: Probability:

Age

vs. A

ge

Leskovec&Faloutsos, WWW 2008 Part 4-46

Page 47: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

IM Communication Network

Buddy graph: 240 million people (people that login in June ’06) 9.1 billion edges (friendship links)

Communication graph: There is an edge if the users exchanged at least

one message in June 2006 180 million people 1.3 billion edges 30 billion conversations

Leskovec&Faloutsos, WWW 2008 Part 4-47

Page 48: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Buddy network: Number of buddies

Buddy graph: 240 million nodes, 9.1 billion edges (~40 buddies per user)

Leskovec&Faloutsos, WWW 2008 Part 4-48

Page 49: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Communication Network: Degree

Number of people a users talks to in a month

Leskovec&Faloutsos, WWW 2008 Part 4-49

Page 50: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Communication Network: Small-world

6 degrees of separation [Milgram ’60s] Average distance 5.5 90% of nodes can be reached in < 8 hops

Hops Nodes1 10

2 78

3 396

4 8648

5 3299252

6 28395849

7 79059497

8 52995778

9 10321008

10 1955007

11 518410

12 149945

13 44616

14 13740

15 4476

16 1542

17 536

18 167

19 71

20 29

21 16

22 10

23 3

24 2

25 3Leskovec&Faloutsos, WWW 2008 Part 4-50

Page 51: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Communication network: Clustering

How many triangles are closed?

Clustering normally decays as k-1

Communication network is highly clustered: k-0.37

High clustering Low clustering

Leskovec&Faloutsos, WWW 2008 Part 4-51

Page 52: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Communication Network Connectivity

Leskovec&Faloutsos, WWW 2008 Part 4-52

Page 53: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

k-Cores decomposition

What is the structure of the core of the network?

Leskovec&Faloutsos, WWW 2008 Part 4-53

Page 54: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

k-Cores: core of the network

People with k<20 are the periphery Core is composed of 79 people, each having 68

edges among themLeskovec&Faloutsos, WWW 2008 Part 4-54

Page 55: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Node deletion: Nodes vs. Edges

Leskovec&Faloutsos, WWW 2008 Part 4-55

Page 56: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Node deletion: Connectivity

Leskovec&Faloutsos, WWW 2008 Part 4-56

Page 57: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Web ProjectionsLearning from contextual

graphs of the web

How to predict user intention from the web graph?

Page 58: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Motivation Information retrieval traditionally considered

documents as independent Web retrieval incorporates global hyperlink

relationships to enhance ranking (e.g., PageRank, HITS) Operates on the entire graph Uses just one feature (principal eigenvector) of the

graph Our work on Web projections focuses on

contextual subsets of the web graph; in-between the independent and global consideration of the documents

a rich set of graph theoretic properties

Leskovec&Faloutsos, WWW 2008 Part 4-58

Page 59: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Web projections

Web projections: How they work? Project a set of web pages of interest onto the web

graph This creates a subgraph of the web called projection

graph Use the graph-theoretic properties of the subgraph for

tasks of interest Query projections

Query results give the context (set of web pages) Use characteristics of the resulting graphs for

predictions about search quality and user behavior

Leskovec&Faloutsos, WWW 2008 Part 4-59

Page 60: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Q

Query Results Projection on the web graph

Query projection graph

Generate graphical features

Query connection graph

• -- -- ----• --- --- ----• ------ ---• ----- --- --• ------ -----• ------ -----

Construct case library

Predictions

Query projections

Leskovec&Faloutsos, WWW 2008 Part 3-60

Page 61: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Questions we explore

How do query search results project onto the underlying web graph?

Can we predict the quality of search results from the projection on the web graph?

Can we predict users’ behaviors with issuing and reformulating queries?

Leskovec&Faloutsos, WWW 2008 Part 4-61

Page 62: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Is this a good set of search results?

Leskovec&Faloutsos, WWW 2008 Part 4-62

Page 63: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Will the user reformulate the query?

Leskovec&Faloutsos, WWW 2008 Part 4-63

Page 64: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Resources and concepts Web as a graph

URL graph: Nodes are web pages, edges are hyper-links March 2006 Graph: 22 million nodes, 355 million edges

Domain graph: Nodes are domains (cmu.edu, bbc.co.uk). Directed edge (u,v)

if there exists a webpage at domain u pointing to v February 2006 Graph: 40 million nodes, 720 million edges

Contextual subgraphs for queries Projection graph Connection graph

Compute graph-theoretic features

Leskovec&Faloutsos, WWW 2008 Part 4-64

Page 65: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

“Projection” graph

Example query: Subaru

Project top 20 results by the search engine

Number in the node denotes the search engine rank

Color indicates relevancy as assigned by human:– Perfect– Excellent– Good– Fair– Poor– Irrelevant

Leskovec&Faloutsos, WWW 2008 Part 3-65

Page 66: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

“Connection” graph Projection graph is

generally disconnected Find connector nodes Connector nodes are

existing nodes that are not part of the original result set

Ideally, we would like to introduce fewest possible nodes to make projection graph connected

Connector nodesProjection

nodes

Leskovec&Faloutsos, WWW 2008 Part 3-66

Page 67: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Finding connector nodes Find connector nodes is a Steiner tree problem which is NP

hard Our heuristic:

Connect 2nd largest connected component via shortest path to the largest

This makes a new largest component Repeat until the graph is connected

Largest component

2nd largest component

2nd largest component

Leskovec&Faloutsos, WWW 2008 Part 4-67

Page 68: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Extracting graph features The idea

Find features that describe the structure of the graph

Then use the features for machine learning

Want features that describe Connectivity of the graph Centrality of projection and

connector nodes Clustering and density of the core

of the graph

vs.

Leskovec&Faloutsos, WWW 2008 Part 4-68

Page 69: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Examples of graph features Projection graph

Number of nodes/edges Number of connected components Size and density of the largest

connected component Number of triads in the graph

Connection graph Number of connector nodes Maximal connector node degree Mean path length between

projection/connector nodes Triads on connector nodes

We consider 55 features total

vs.

Leskovec&Faloutsos, WWW 2008 Part 4-69

Page 70: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Q

Query Results Projection on the web graph

Query projection graph

Generate graphical features

Query connection graph

• -- -- ----• --- --- ----• ------ ---• ----- --- --• ------ -----• ------ -----

Construct case library

Predictions

Experimental setup

Leskovec&Faloutsos, WWW 2008 Part 3-70

Page 71: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Constructing case library for machine learning

Given a task of interest Generate contextual subgraph and extract

features Each graph is labeled by target outcome Learn statistical model that relates the

features with the outcome Make prediction on unseen graphs

Leskovec&Faloutsos, WWW 2008 Part 4-71

Page 72: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Experiments overview Given a set of search results generate projection

and connection graphs and their features

Predict quality of a search result set Discriminate top20 vs. top40to60 results Predict rating of highest rated document in the set

Predict user behavior Predict queries with high vs. low reformulation

probability Predict query transition (generalization vs. specialization) Predict direction of the transition

Leskovec&Faloutsos, WWW 2008 Part 4-72

Page 73: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Experimental details Features

55 graphical features Note we use only graph features, no content

Learning We use probabilistic decision trees (“DNet”)

Report classification accuracy using 10-fold cross validation

Compare against 2 baselines Marginals: Predict most common class RankNet: use 350 traditional features (document, anchor

text, and basic hyperlink features)

Leskovec&Faloutsos, WWW 2008 Part 4-73

Page 74: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Search results quality

Dataset: 30,000 queries Top 20 results for each

Each result is labeled by a human judge using a 6-point scale from "Perfect" to "Bad"

Task: Predict the highest rating in the set of results

6-class problem 2-class problem: “Good” (top 3 ratings) vs. “Poor”

(bottom 3 ratings)

Leskovec&Faloutsos, WWW 2008 Part 4-74

Page 75: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Search quality: the task

Predict the rating of the top result in the set

Predict “Good” Predict “Poor”

Leskovec&Faloutsos, WWW 2008 Part 4-75

Page 76: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Search quality: results Predict top human rating in

the set– Binary classification: Good vs.

Poor 10-fold cross validation

classification accuracy

Observations:– Web Projections outperform

both baseline methods– Just projection graph already

performs quite well– Projections on the URL graph

perform better

AttributesURL

GraphDomain Graph

Marginals 0.55 0.55

RankNet 0.63 0.60

Projection 0.80 0.64

Connection 0.79 0.66

Projection + Connection

0.82 0.69

All 0.83 0.71

Leskovec&Faloutsos, WWW 2008 Part 3-76

Page 77: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Search quality: the model The learned model shows

graph properties of good result sets

Good result sets have:– Search result nodes are hub

nodes in the graph (have large degrees)

– Small connector node degrees

– Big connected component– Few isolated nodes in

projection graph– Few connector nodes

Leskovec&Faloutsos, WWW 2008 Part 3-77

Page 78: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Predict user behavior

Dataset Query logs for 6 weeks 35 million unique queries, 80 million total query

reformulations We only take queries that occur at least 10 times This gives us 50,000 queries and 120,000 query

reformulations Task

Predict whether the query is going to be reformulated

Leskovec&Faloutsos, WWW 2008 Part 4-78

Page 79: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Query reformulation: the task Given a query and corresponding projection and

connection graphs Predict whether query is likely to be reformulated

Query not likely to be reformulated Query likely to be reformulated

Leskovec&Faloutsos, WWW 2008 Part 4-79

Page 80: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Query reformulation: results Observations:

– Gradual improvement as using more features

– Using Connection graph features helps

– URL graph gives better performance

We can also predict type of reformulation (specialization vs. generalization) with 0.80 accuracy

AttributesURL

GraphDomain Graph

Marginals 0.54 0.54

Projection 0.59 0.58

Connection 0.63 0.59

Projection + Connection

0.63 0.60

All 0.71 0.67

Leskovec&Faloutsos, WWW 2008 Part 3-80

Page 81: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Query reformulation: the model

Queries likely to be reformulated have:– Search result nodes have

low degree– Connector nodes are

hubs– Many connector nodes– Results came from many

different domains– Results are sparsely knit

Leskovec&Faloutsos, WWW 2008 Part 3-81

Page 82: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Query transitions

Predict if and how will user transform the query

Q: Strawberry shortcake

Q: Strawberry shortcake pictures

transition

Leskovec&Faloutsos, WWW 2008 Part 4-82

Page 83: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Query transition

With 75% accuracy we can say whether a query is likely to be reformulated: Def: Likely reformulated p(reformulated) > 0.6

With 87% accuracy we can predict whether observed transition is specialization or generalization

With 76% we can predict whether the user will specialize or generalize

Leskovec&Faloutsos, WWW 2008 Part 4-83

Page 84: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Conclusion We introduced Web projections

A general approach of using context-sensitive sets of web pages to focus attention on relevant subset of the web graph

And then using rich graph-theoretic features of the subgraph as input to statistical models to learn predictive models

We demonstrated Web projections using search result graphs for Predicting result set quality Predicting user behavior when reformulating queries

Leskovec&Faloutsos, WWW 2008 Part 4-84

Page 85: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Fraud detection on e-bay

How to find fraudsters on e-bay?

Page 86: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

E-bay Fraud detection

Polo Chau & Shashank Pandit, CMU

•“non-delivery” fraud:seller takes $$ and disappears

Leskovec&Faloutsos, WWW 2008 Part 3-86

Page 87: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Potential Buyer CPotential Buyer B

Potential Buyer A

$$$

Seller

$$$

Buyer

A TransactionWhat if something goes BAD?

Non-delivery fraud

Online Auctions: How They Work

Leskovec&Faloutsos, WWW 2008 Part 3-87

Page 88: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Modeling Fraudulent Behavior (contd.)

How would fraudsters behave in this graph? interact closely with other fraudsters fool reputation-based systems

Wow! This should lead to nice and easily detectable cliques of fraudsters …

Not quite experiments with a real eBay dataset showed they

rarely form cliques

092453 0112149Reputation

Leskovec&Faloutsos, WWW 2008 Part 4-88

Page 89: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Modeling Fraudulent Behavior

So how do fraudsters operate?

= fraudster

= accomplice= honest

Bipart

ite Co

re

Leskovec&Faloutsos, WWW 2008 Part 4-89

Page 90: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Modeling Fraudulent Behavior The 3 roles

Honest people like you and me

Fraudsters those who actually commit fraud

Accomplices erstwhile behave like honest users accumulate feedback via low-cost transactions secretly boost reputation of fraudsters (e.g.,

occasionally trading expensive items)

Leskovec&Faloutsos, WWW 2008 Part 4-90

Page 91: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

A

C

B

E

D

Belief Propagation

Leskovec&Faloutsos, WWW 200891 of 40

Page 92: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Center piece subgraphs

What is the best explanatory path between the nodes in a graph?

Hanghang Tong, KDD [email protected]

Page 93: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Center-Piece Subgraph(Ceps) Given Q query nodes Find Center-piece ( )

App.– Social Networks– Law Inforcement, …

Idea:– Proximity -> random walk

with restarts

A C

B

A C

B

A C

B

b

Leskovec&Faloutsos, WWW 2008 93

Page 94: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Case Study: AND query

R. Agrawal Jiawei Han

V. Vapnik M. Jordan

Leskovec&Faloutsos, WWW 2008 Part 4-94

Page 95: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Case Study: AND query

R. Agrawal Jiawei Han

V. Vapnik M. Jordan

H.V. Jagadish

Laks V.S. Lakshmanan

Heikki Mannila

Christos Faloutsos

Padhraic Smyth

Corinna Cortes

15 1013

1 1

6

1 1

4 Daryl Pregibon

10

2

11

3

16

Leskovec&Faloutsos, WWW 2008 Part 4-95

Page 96: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

R. Agrawal Jiawei Han

V. Vapnik M. Jordan

H.V. Jagadish

Laks V.S. Lakshmanan

Umeshwar Dayal

Bernhard Scholkopf

Peter L. Bartlett

Alex J. Smola

1510

13

3 3

5 2 2

327

42_SoftAnd query

ML/Statistics

databases

Leskovec&Faloutsos, WWW 2008 Part 4-96

Page 97: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Details

Main idea: use random walk with restarts, to measure ‘proximity’ p(i,j) of node j to node i

Leskovec&Faloutsos, WWW 2008 Part 4-97

Page 98: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Example

1

2

3

4

5

6

789

11

10 13

12•Starting from 1

•Randomly to neighbor

•Some p to return to 1

Prob (RW will finally stay at j)

Leskovec&Faloutsos, WWW 2008 Part 4-98

Page 99: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Individual Score Calculation

Q1 Q2 Q3

Node 1Node 2Node 3Node 4Node 5Node 6Node 7Node 8Node 9Node 10Node 11Node 12Node 13

0.5767 0.0088 0.0088 0.1235 0.0076 0.0076 0.0283 0.0283 0.0283 0.0076 0.1235 0.0076 0.0088 0.5767 0.0088 0.0076 0.0076 0.1235 0.0088 0.0088 0.5767 0.0333 0.0024 0.1260 0.1260 0.0024 0.0333 0.1260 0.0333 0.0024 0.0333 0.1260 0.0024 0.0024 0.1260 0.0333 0.0024 0.0333 0.12601

10

11

9 8

12

13

4

3

62

0.5767

0.1260

0.1235

0.1260

0.0283

0.0333

0.0024

0.0088

0.0076

0.00760.00240.0333

0.0088

7

5

Leskovec&Faloutsos, WWW 2008 99

Page 100: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Individual Score Calculation

Q1 Q2 Q3

Node 1Node 2Node 3Node 4Node 5Node 6Node 7Node 8Node 9Node 10Node 11Node 12Node 13

0.5767 0.0088 0.0088 0.1235 0.0076 0.0076 0.0283 0.0283 0.0283 0.0076 0.1235 0.0076 0.0088 0.5767 0.0088 0.0076 0.0076 0.1235 0.0088 0.0088 0.5767 0.0333 0.0024 0.1260 0.1260 0.0024 0.0333 0.1260 0.0333 0.0024 0.0333 0.1260 0.0024 0.0024 0.1260 0.0333 0.0024 0.0333 0.1260

Individual Score matrix

1

10

11

9 8

12

13

4

3

62

0.5767

0.1260

0.1235

0.1260

0.0283

0.0333

0.0024

0.0088

0.0076

0.00760.00240.0333

0.0088

7

5

Leskovec&Faloutsos, WWW 2008 100

Page 101: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

AND: Combining Scores

Q: How to combine scores?

A: Multiply …= prob. 3 random

particles coincide on node j

Leskovec&Faloutsos, WWW 2008 101

Page 102: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

K_SoftAnd: Combining Scores

Generalization – SoftAND:We want nodes close to k

of Q (k<Q) query nodes.

Q: How to do that?

details

Leskovec&Faloutsos, WWW 2008 102

Page 103: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

K_SoftAnd: Combining Scores

Generalization – softAND:We want nodes close to k

of Q (k<Q) query nodes.

Q: How to do that? A: Prob(at least k-out-of-

Q will meet each other at j)

details

Leskovec&Faloutsos, WWW 2008 103

Page 104: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

AND query vs. K_SoftAnd query

And Query 2_SoftAnd Query

x 1e-4

1 7

5

10

11

9 8

12

13

4

3

62

0.4505

0.1010

0.0710

0.1010

0.2267

0.1010

0.1010

0.4505

0.0710

0.07100.10100.1010

0.4505

1 7

5

10

11

9 8

12

13

4

3

62

0.0103

0.0046

0.0019

0.0046

0.0024

0.0046

0.0046

0.0103

0.0019

0.00190.00460.0046

0.0103

Leskovec&Faloutsos, WWW 2008 104

Page 105: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

1 7

5

10

11

9 8

12

13

4

3

62

0.0103

0.1617

0.1387

0.1617

0.0849

0.1617

0.1617

0.0103

0.1387

0.13870.16170.1617

0.0103

1_SoftAnd query = OR query

Leskovec&Faloutsos, WWW 2008 Part 4-105

Page 106: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Challenges in Ceps

Q1: How to measure the importance?– A: RWR

Q2: How to do it efficiently?

Leskovec&Faloutsos, WWW 2008 106

Page 107: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Graph Partition: Efficiency Issue Straightforward way

– solve a linear system: – time: linear to # of edges

Observation– Skewed dist.– communities

How to exploit them?– Graph partition

Leskovec&Faloutsos, WWW 2008 107

Page 108: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Even better:

We can correct for the deleted edges (Tong+, ICDM’06, best paper award)

Leskovec&Faloutsos, WWW 2008 Part 4-108

Page 109: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Experimental Setup

Dataset– DBLP/authorship– Author-Paper– 315k nodes– 1.8M edges

Leskovec&Faloutsos, WWW 2008 Part 3-109

Page 110: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Query Time vs. Pre-Computation Time

Log Query Time

Log Pre-computation Time

•Quality: 90%+ •On-line:

•Up to 150x speedup•Pre-computation:

•Two orders saving

Leskovec&Faloutsos, WWW 2008 Part 4-110

Page 111: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Query Time vs. Storage

Log Query Time

Log Storage

•Quality: 90%+ •On-line:

•Up to 150x speedup•Pre-storage:

•Three orders saving

Leskovec&Faloutsos, WWW 2008 Part 4-111

Page 112: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

Conclusions Q1:How to measure the importance? A1: RWR+K_SoftAnd Q2: How to find connection subgraph? A2:”Extract” Alg.) Q3:How to do it efficiently? A3:Graph Partition and Sherman-Morrison

– ~90% quality– 6:1 speedup; 150x speedup (ICDM’06, b.p.

award)

A C

B

Leskovec&Faloutsos, WWW 2008 Part 3-112

Page 113: Tools for large graph mining WWW 2008 tutorial Part 4: Case studies

References Leskovec and Horvitz: Worldwide Buzz: Planetary-Scale

Views on an Instant-Messaging Network, 2007 Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan Fast

Random Walk with Restart and Its Applications ICDM 2006.

Hanghang Tong, Christos Faloutsos Center-Piece Subgraphs: Problem Definition and Fast Solutions, KDD 2006

Shashank Pandit, Duen Horng Chau, Samuel Wang, and Christos Faloutsos: NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks, WWW 2007.

Leskovec&Faloutsos, WWW 2008 Part 4-113