weiren yu 1 , xuemin lin 1 , wenjie zhang 1 , ying zhang 1 jiajin le 2 ,

27
Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1 Jiajin Le 2 , SimFusion+: Extending SimFusion Towards Efficient Estimation on Large and Dynamic Networks 1 University of New South Wales & NICTA, Australia 2 Donghua University, China SIGIR 2012

Upload: erik

Post on 30-Jan-2016

39 views

Category:

Documents


0 download

DESCRIPTION

SIGIR 2012. SimFusion+: Extending SimFusion Towards Efficient Estimation on Large and Dynamic Networks. Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1 Jiajin Le 2 ,. 1 University of New South Wales & NICTA, Australia 2 Donghua University, China. Contents. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

Weiren Yu1, Xuemin Lin1, Wenjie Zhang1, Ying Zhang1 Jiajin Le2,

SimFusion+: Extending SimFusion Towards Efficient Estimation on Large and Dynamic Networks

1 University of New South Wales & NICTA, Australia

2 Donghua University, China

SIGIR 2012

Page 2: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

2

2. Problem Definition

Contents

4. Experimental Results

1. Introduction

3. Optimization Techniques

Page 3: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

3

1. Introduction

Many applications require a measure of “similarity” between objects.

similaritysearch

Citation of Scientific Papers

(citeseer.com)(amazon.co

m)

Recommender System

Graph Clustering

Web Search Engine

(google.com)

Page 4: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

4

SimFusion: A New Link-based Similarity Measure

Structural Similarity Measure

PageRank [Page et. al, 99]

SimRank [Jeh and Widom, KDD 02]

SimFusion similarity

A new promising structural measure [Xi et. al, SIGIR 05]

Extension of Co-Citation and Coupling metrics

Basic Philosophy

Following the Reinforcement Assumption:

The similarity between objects is reinforced by the similarity of

their related objects.

Page 5: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

5

SimFusion Overview

Features Using a Unified Relationship Matrix (URM) to represent

relationships among heterogeneous data Defined recursively and is computed iteratively Applicable to any domain with object-to-object relationships

Challenges URM may incur trivial solution or divergence issue of SimFusion. Rather costly to compute SimFusion on large graphs

Naïve Iteration: matrix-matrix multiplication Requiring O(Kn3) time, O(n2) space [Xi et. al. , SIGIR 05]

No incremental algorithms when edges update

Page 6: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

6

Existing SimFusion: URM and USM

Data Space: a finite set of data objects (vertices)

Data Relation (edges) Given an entire space

Intra-type Relation carrying info. within one space

Inter-type Relation carrying info. between spaces

Unified Relationship Matrix (URM):

λi,j is the weighting factor between Di and Dj

Unified Similarity Matrix (USM):

1 2{ , , }o oD

,i i i i R D D

,i j i j R D D

1

N

iiD D

1

1, ,

, if ;

, , if , ;

0, otherwise.

j

j

jn

i j i jx

x

x y x y

LN

N

R

1,1 1,1 1,2 1,2 1, 1,

2,1 2,1 2,2 2,2 2, 2,URM

,1 ,1 ,2 ,2 , ,

N N

N N

N N N N N N N N

L L L

L L LL

L L L

1,1 1,

,1 ,

. . .n

n n T

n n n

s s

s t

s s

S S L S LR

Page 7: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

7

Example.

1D 2D

3D

in tra - typ ere la tio n sh ip

in te r- typ ere la tio n sh ip

d a tasp a ce

d a ta o b je ct

1v

2v 3v

4v 5v6v

1 2 3

1 1 11 4 4 2

51 12 8 4 8

31 13 5 5 5

Λ

D D D

D

D

D

1 1 1 1 18 8 4 4 4

1 1 14 4 2

5 5 51 1 116 16 4 24 24 24

URM 31 1 1 1 110 10 5 15 15 15

31 1 1 1 110 10 5 15 15 15

31 1 1 110 10 5 10 10

0

0 0 0

0

L

High complexity

!!!

O(Kn3) time

O(n2) space

. . .n n Ts t S S L S LR

1,2 1,

2,1 2,USM

,1 ,2

1

1

1

n

n

n n

s s

s s

s s

S

SimFusion Similarity on Heterogeneous Domain

Trivial

Solution !!!

S=[1]nxn

Page 8: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

8

Contributions

Revising the existing SimFusion model, avoiding

non-semantic convergence

divergence issue

Optimizing the computation of SimFusion+

O(Km) pre-computation time, plus O(1) time and O(n) space

Better accuracy guarantee

Incremental computation on edge updates

O(δn) time and O(n) space for handling δ edge updates

Page 9: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

9

Revised SimFusion

Motivation: Two issues of the existing SimFusion model

Trivial Solution on Heterogeneous Domain

Divergent Solution on Homogeneous Domain

Root cause: row normalization of URM !!!

Page 10: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

10

From URM to UAM

Unified Adjacency Matrix (UAM)

Example

1

, ,

, if ;

, if , ;

0, otherwi e

1

s

,

.

j jn

i j i j

x

x y x y

A

N

R

Page 11: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

11

Revised SimFusion+

Basic Intuition

replace URM with UAM to postpone “row normalization”

in a delayed fashion while preserving the reinforcement

assumption of the original SimFusion

Revised SimFusion+ Model Original SimFusion

squeeze similarity scores in S into [0, 1].squeeze similarity scores in S into [0, 1].

Page 12: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

12

Optimizing SimFusion+ Computation

Conventional Iterative Paradigm

Matrix-matrix multiplication, requiring O(kn3) time and O(n2) space

Our approach: To convert SimFusion+ computation into

finding the dominant eigenvector of the UAM A.

Matrix-vector multiplication, requiring O(km) time and O(n) space

Pre-compute σmax(A) only once, and cache it for later reusePre-compute σmax(A) only once, and cache it for later reuse

Page 13: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

13

Example

Conventional Iteration:

Our approach:

Assume with

Page 14: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

14

Key Observation

Kroneckor product “ ”:⊗

e.g.

Vec operator:

e.g.

Two important Properties:

5 6 5 6 5 6 10 121 2

7 8 7 81 2 5 6 7 8 14 16, ,

3 4 7 8 15 18 20 245 6 5 63 4

21 24 28 327 8 7 8

X Y X Y

( ) [1 3 2 4]Tvec X

Page 15: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

15

Key Observation

Two important Properties:

P1.

P2.

Our main idea:

(1)

(2)Power Iteration

Page 16: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

16

Accuracy Guarantee

Conventional Iterations: No accuracy guarantee !!!

Question: || S(k+1) – S || ≤ ?

Our Method: Utilize Arnoldi decomposition to build an

order-k orthogonal subspace for the UAM A.

Due to Tk small size and almost “upper-triangularity”, Computing σmax(Tk) is less costly than σmax(A).

Due to Tk small size and almost “upper-triangularity”, Computing σmax(Tk) is less costly than σmax(A).

Page 17: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

17

Accuracy Guarantee

Arnoldi Decomposition:

k-th iterative similarity

Estimate Error:

Page 18: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

18

Example

Arnoldi Decomposition:

Assume with

Given

(1)

(2)

(3)

Page 19: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

19

Edge Update on Dynamic Graphs

Incremental UAM

Given old G =(D,R) and a new G’=(D,R’), the incremental UAM is

a list of edge updates, i.e.,

Main idea

To reuse and the eigen-pair (αp, ξp) of the old A to compute

is a sparse matrix when the number δ of edge updates is small.

Incrementally computing SimFusion+

O(δn) time

O(n) space

Page 20: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

20

ExampleSuppose edges (P1,P2) and (P2,P1) are removed.

Page 21: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

21

Experimental Setting Datasets

Synthetic data (RAND 0.5M-3.5M) Real data (DBLP, WEBKB)

Compared Algorithms

SimFusion+ and IncSimFusion+ ;

SF, a SimFusion algorithm via matrix iteration [Xi et. al, SIGIR 05];

CSF, a variant SF, using PageRank distribution [Cai et. al, SIGIR

10];

SR, a SimRank algorithm via partial sums [Lizorkin et. al, VLDBJ 10];

PR, a P-Rank encoding both in- and out-links [Zhao et. al, CIKM 09];

DBLP

WEBKB

Page 22: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

22

Experiment (1): Accuracy

On DBLP and WEBKB

SF+ accuracy is consistently stable on different datasets.SF+ accuracy is consistently stable on different datasets.

SF seems hardly to get sensible similarities as all its similarities asymptotically approach the same value as K grows.

SF seems hardly to get sensible similarities as all its similarities asymptotically approach the same value as K grows.

Page 23: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

23

Experiment (2): CPU Time and Space

On DBLP

On WEBKB

SF+ outperforms the other approaches, due to the use of σmax(Tk)SF+ outperforms the other approaches, due to the use of σmax(Tk)

Page 24: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

24

Experiment (3): Edge Updates

IncSF+ outperformed SF+ when δ is small.IncSF+ outperformed SF+ when δ is small.

For larger δ, IncSF+ is not that good because the small value of δ preserves the sparseness of the incremental UAM.

For larger δ, IncSF+ is not that good because the small value of δ preserves the sparseness of the incremental UAM.

Varying δ

Page 25: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

25

Experiment (4) : Effects of

The small choice of imposes more iterations on computing Tk and vk, and hence increases the estimation costs.

The small choice of imposes more iterations on computing Tk and vk, and hence increases the estimation costs.

Page 26: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

26

Conclusions

A revision of SimFusion+, for preventing the trivial solution

and the divergence issue of the original model.

Efficient techniques to improve the time and space of

SimFusion+ with accuracy guarantees.

An incremental algorithm to compute SimFusion+ on

dynamic graphs when edges are updated.

Devise vertex-updating methods for incrementally

computing SimFusion+.

Extend to parallelize SimFusion+ computing on GPU.

Future Work

Page 27: Weiren Yu 1 , Xuemin Lin 1 , Wenjie Zhang 1 , Ying Zhang 1  Jiajin Le 2 ,

27