
DiffusionRank: A Possible Penicillin for Web Spamming

Haixuan Yang

Group Meeting, Jan. 16, 2006


Outline

Introduction
DiffusionRank
    Model establishment
    Computation consideration
    Discussion on γ
Results
Conclusions


Introduction

PageRank
    Tries to find the importance of a Web page based on the link structure.
    The importance of a page i is defined recursively in terms of the pages that point to it.
    It has proved effective for ranking Web pages.
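A standard way to write the recursive definition of PageRank is given below. The notation is assumed here rather than taken from the slides: E is the set of links, d_j is the number of out-links of page j, and x_i is the rank of page i.

```latex
x_i = \sum_{j:\,(j,i)\in E} \frac{x_j}{d_j}
```

Each page j splits its own rank evenly among the d_j pages it links to, which is why the definition is recursive.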


Introduction: PageRank

Two problems:
    Incomplete information about the Web structure.
        Solution: predict the Web structure as a random graph.
    Web pages manipulated by people for commercial interests.
        About 70% of all pages in the .biz domain are spam.
        About 35% of the pages in the .us domain belong to the spam category.
        Two methods are used to build spam pages: link stuffing and keyword stuffing.
        Solution: DiffusionRank.


An example of manipulation

The rank value of node 1 can be increased greatly!
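As a hypothetical illustration of link stuffing (not the example from the slide), the sketch below adds 20 spam pages that all link to node 1 and compares node 1's PageRank before and after. The three-page graph, the damping factor 0.85, and the names `pagerank` and `add_link_farm` are assumptions made for this sketch.

```python
def pagerank(out_links, damping=0.85, iters=200):
    """Power iteration for PageRank on a dict {node: [targets]}.

    Every node must appear as a key. A node without out-links spreads
    its rank uniformly over all nodes (one common convention).
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for j in nodes:
            targets = out_links[j] or nodes
            share = damping * rank[j] / len(targets)
            for i in targets:
                new[i] += share
        rank = new
    return rank


def add_link_farm(out_links, target, farm_size):
    """Return a copy of the graph with farm_size new pages linking to target."""
    farmed = {v: list(t) for v, t in out_links.items()}
    for k in range(farm_size):
        farmed[f"spam{k}"] = [target]
    return farmed


# Node "1" has no in-links in the original graph.
graph = {"1": ["2"], "2": ["3"], "3": ["2"]}
before = pagerank(graph)["1"]
after = pagerank(add_link_farm(graph, "1", 20))["1"]
print(f"rank of node 1: {before:.4f} -> {after:.4f}")
```

Node 1's score more than doubles even though the total rank is now shared among 23 pages instead of 3, and the manipulator only had to create pages it fully controls.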


Why? Two reasons

Over-democratic
    All pages are born equal. Every page has the same voting ability: the sum of each column of the transition matrix is equal to one.
Input-independent
    For any given non-zero initial input, the iteration converges to the same stable distribution.
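Both properties can be seen on a small matrix. The sketch below is an illustration under assumptions: the 3-page transition matrix `A` and the helper `power_iterate` are made up here, and the graph is chosen to be strongly connected and aperiodic so that plain iteration converges without damping.

```python
def power_iterate(matrix, x0, iters=200):
    """Repeatedly apply x <- matrix @ x using plain lists."""
    x = list(x0)
    n = len(x)
    for _ in range(iters):
        x = [sum(matrix[i][j] * x[j] for j in range(n)) for i in range(n)]
    return x


# Column j holds the votes cast by page j (links 1->2, 2->3, 3->1, 3->2).
A = [
    [0.0, 0.0, 0.5],
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]

# Over-democratic: every page hands out exactly one unit of vote.
column_sums = [sum(A[i][j] for i in range(3)) for j in range(3)]

# Input-independent: two very different starts give the same result.
from_uniform = power_iterate(A, [1 / 3, 1 / 3, 1 / 3])
from_skewed = power_iterate(A, [1.0, 0.0, 0.0])
print(column_sums, from_uniform, from_skewed)
```

Because the initial vector is forgotten, it cannot be used to encode which pages are trusted. That is the gap a heat-based model can fill.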

Heat diffusion model: a natural way to avoid these two factors
    Pages are not equal: some pages are born with high temperatures while others are born with low temperatures.
    Different initial temperature distributions give rise to different temperature distributions after a fixed time period.


DiffusionRank

On an undirected graph
    Assumption: the amount of heat flowing from j to i is proportional to the heat difference between i and j.
    Solution: [equation not preserved]
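The undirected-graph assumption leads to a closed form. The derivation below is a sketch with notation assumed here rather than taken from the slides: f_i(t) is the heat at node i at time t, d_i is the degree of node i, and γ is the proportionality constant.

```latex
\begin{aligned}
f_i(t+\Delta t) - f_i(t) &= \gamma \sum_{j:\,(j,i)\in E} \bigl(f_j(t) - f_i(t)\bigr)\,\Delta t \\
\frac{d}{dt}\,\mathbf{f}(t) &= \gamma H\,\mathbf{f}(t), \qquad
H_{ij} = \begin{cases} -d_i, & i = j \\ 1, & (j,i)\in E \\ 0, & \text{otherwise} \end{cases} \\
\mathbf{f}(t) &= e^{\gamma t H}\,\mathbf{f}(0)
\end{aligned}
```

Here H is the negative of the graph Laplacian. Its columns sum to zero, so the total heat in the graph is conserved while it spreads.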


On a directed graph
    Assumption: there is extra energy imposed on the link (j, i) such that heat flows only from j to i if there is no link (i, j).
    Solution: [equation not preserved]

On a random directed graph
    Assumption: the heat flow is proportional to the probability of the link (j, i).
    Solution: [equation not preserved]


DiffusionRank on a random directed graph

Solution: [equation not preserved]

The initial value f(i, 0) in f(0) is set to 1 if page i is trusted and 0 otherwise, where trusted pages are selected according to the inverse PageRank.


Computation consideration

Approximation of the heat kernel: [formula not preserved]

Choice of N: when N >= 30, the real eigenvalues of [matrix not preserved] are less than 0.01; when N >= 100, they are less than 0.005. We use N = 100 in the paper.

When N tends to infinity: [formula not preserved]
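One common way to approximate a heat kernel with an iteration count N is the product form exp(γH) ≈ (I + (γ/N)H)^N, which becomes exact as N tends to infinity. The sketch below assumes that form. The helper `diffuse` and the three-node path graph are made up for illustration, with H taken as adjacency minus degree for an undirected graph.

```python
def diffuse(H, f0, gamma=1.0, steps=100):
    """Approximate exp(gamma * H) @ f0 by (I + gamma/steps * H)^steps @ f0."""
    n = len(f0)
    f = list(f0)
    h = gamma / steps
    for _ in range(steps):
        flow = [sum(H[i][j] * f[j] for j in range(n)) for i in range(n)]
        f = [f[i] + h * flow[i] for i in range(n)]
    return f


# Undirected path graph 0 - 1 - 2: H = adjacency - degree.
H = [
    [-1.0, 1.0, 0.0],
    [1.0, -2.0, 1.0],
    [0.0, 1.0, -1.0],
]
f0 = [1.0, 0.0, 0.0]  # all heat starts on node 0

coarse = diffuse(H, f0, steps=30)
paper_n = diffuse(H, f0, steps=100)
fine = diffuse(H, f0, steps=10000)  # stands in for the limit

err_30 = max(abs(a - b) for a, b in zip(coarse, fine))
err_100 = max(abs(a - b) for a, b in zip(paper_n, fine))
print(err_30, err_100)
```

Each step costs one matrix-vector product, so N = 100 means 100 sparse multiplications on a real Web graph. The error figures on this toy graph are not the eigenvalue bounds quoted on the slide; they only show the gap shrinking as N grows.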


Discussion on γ

γ can be understood as the thermal conductivity.
When γ = 0, the ranking value is most robust to manipulation since no heat is diffused, but the Web structure is completely ignored.
When γ = ∞, DiffusionRank becomes PageRank, which can be manipulated easily.
When γ = 1, DiffusionRank works well in practice.
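The two limiting cases can be checked numerically. The sketch below assumes the heat kernel has the form exp(γ(A - I)) for a column-stochastic matrix A; this form, the matrix entries, the seed vector, and the name `heat_rank` are assumptions for illustration. It uses the identity exp(γ(A - I)) = Σ_k e^{-γ} γ^k / k! · A^k.

```python
import math


def heat_rank(A, f0, gamma, terms=400):
    """exp(gamma * (A - I)) @ f0 via a Poisson-weighted sum of A^k @ f0."""
    n = len(f0)
    power = list(f0)            # A^k f0, starting at k = 0
    weight = math.exp(-gamma)   # e^{-gamma} gamma^k / k! for k = 0
    result = [weight * p for p in power]
    for k in range(1, terms):
        power = [sum(A[i][j] * power[j] for j in range(n)) for i in range(n)]
        weight *= gamma / k
        result = [r + weight * p for r, p in zip(result, power)]
    return result


# Column-stochastic matrix for links 1->2, 2->3, 3->1, 3->2.
A = [
    [0.0, 0.0, 0.5],
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.0],
]
seed = [1.0, 0.0, 0.0]  # only page 1 is trusted

no_diffusion = heat_rank(A, seed, gamma=0.0)   # stays at the seed
moderate = heat_rank(A, seed, gamma=1.0)       # between the two extremes
long_run = heat_rank(A, seed, gamma=50.0)      # close to the stationary vector
print(no_diffusion, moderate, long_run)
```

With γ = 0 the output is just the trusted seed, so links play no role. With a large γ the seed is forgotten and the result approaches the stationary distribution of A, which is what plain power iteration on A would return.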


DiffusionRank: Advantages

Can detect group-group relations
Can cut graphs
Anti-manipulation

[Figure: nodes labeled +1 and -1; γ = 0.5 or 1]


DiffusionRank: Experiments

Data:
    a toy graph (6 nodes)
    a middle-size real-world graph (18,542 nodes)
    a large-size real-world graph crawled from CUHK (607,170 nodes)
Compared with TrustRank and PageRank


Results

The tendency of DiffusionRank as γ becomes larger, on the toy graph


Anti-manipulation on the toy graph


Anti-manipulation on the middle graph and the large graph


Stability: the order difference between an algorithm's ranking results before it is manipulated and its results afterwards
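The slide does not spell out how the order difference is computed. One plausible reading, assumed here, is to count the pairs of pages whose relative order flips between the two rankings. The function name and the scores below are made up for illustration.

```python
from itertools import combinations


def pairwise_order_difference(before, after):
    """Count node pairs whose relative order differs between two score dicts.

    Only nodes present in both rankings are compared, so pages added by
    the manipulator are ignored. Ties are not counted as flips.
    """
    common = [v for v in before if v in after]
    flips = 0
    for a, b in combinations(common, 2):
        d_before = before[a] - before[b]
        d_after = after[a] - after[b]
        if d_before * d_after < 0:
            flips += 1
    return flips


scores_before = {"a": 0.5, "b": 0.3, "c": 0.2}
scores_after = {"a": 0.25, "b": 0.35, "c": 0.3, "spam": 0.1}
flips = pairwise_order_difference(scores_before, scores_after)
print(flips)
```

A smaller count means the algorithm's ordering of the original pages changed less under manipulation, which is the sense in which a ranking method is called stable here.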


Conclusions

Its anti-manipulation feature makes DiffusionRank a candidate penicillin for Web spamming.

DiffusionRank is a generalization of PageRank (PageRank is the case γ = ∞).

DiffusionRank can be employed to detect group-group relations.

DiffusionRank can be used to cut graphs.