Transcript
  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Link-based Spam Detection

    Luca Becchetti1, Carlos Castillo1,Debora Donato1,Stefano Leonardi1 and Ricardo Baeza-Yates2

    1. Università di Roma “La Sapienza” – Rome, Italy2. Yahoo! Research – Barcelona, Spain and Santiago, Chile

    Aug 11th, AirWeb 2006

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    1 Motivation

    2 Degree-based measures

    3 PageRank

    4 TrustRank

    5 Truncated PageRank

    6 Counting supporters

    7 Conclusions

    http://www.dis.uniroma1.it/~becchet/http://www.chato.cl/http://www.dis.uniroma1.it/~donato/http://www.dis.uniroma1.it/~leon/http://www.baeza.cl/http://www.uniroma1.it/http://research.yahoo.com/

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    What is on the Web?

    Information + Porn + On-line casinos + Free movies +Cheap software + Buy a MBA diploma + Prescription -freedrugs + V!-4-gra + Get rich now now now!!!

    Graphic: www.milliondollarhomepage.com

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Web spam (keywords + links)

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Web spam (mostly keywords)

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Search engine?

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Fake search engine

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Problem: “normal” pages that are spam

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Link farms

    Single-level farms can be detected by searching groups of nodessharing their out-links [Gibson et al., 2005]

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Motivation

    [Fetterly et al., 2004] hypothesized that studying thedistribution of statistics about pages could be a good way ofdetecting spam pages:

    “in a number of these distributions, outlier values areassociated with web spam”

    Research goal

    Statistical analysis of link-based spam

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Metrics

    Graph algorithms

    Streamed algorithms

    Symmetric algorithms

    All shortest paths, centrality, betweenness, clustering coefficient...

    (Strongly) connected componentsApproximate count of neighborsPageRank, Truncated PageRank, Linear RankHITS, Salsa, TrustRank

    Breadth-first and depth-first searchCount of neighbors

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Test collection

    U.K. collection

    18.5 million pages downloaded from the .UK domain

    5,344 hosts manually classified (6% of the hosts)

    Classified entire hosts:

    V A few hosts are mixed: spam and non-spam pagesX More coverage: sample covers 32% of the pages

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Degree

    1 100 100000

    0.1

    0.2

    0.3

    0.4

    Number of in−links

    In−degree δ = 0.35

    Normal Spam

    (δ = max. difference in C.D.F. plot)

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Degree

    1 10 50 1000

    0.1

    0.2

    0.3

    Number of out−links

    Out−degree δ = 0.28

    Normal Spam

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Edge reciprocity

    0 0.2 0.4 0.6 0.8 10

    0.1

    0.2

    0.3

    0.4

    0.5

    Fraction of reciprocal links

    Reciprocity of max. PR page δ = 0.35

    Normal Spam

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Assortativity

    0.001 0.01 0.1 1 10 100 10000

    0.1

    0.2

    0.3

    0.4

    Degree/Degree ratio of home page

    Degree / Degree of neighbors δ = 0.31

    Normal Spam

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Automatic classifier

    All of the following attributes in the home page and the pagewith the maximum PageRank, plus a binary variableindicating if they are the same page:

    - In-degree, out-degree- Fraction of reciprocal edges- Degree divided by degree of direct neighbors- Average and sum of in-degree of out-neighbors- Average and sum of out-degree of in-neighbors

    Decision tree

    72.6% of detection rate, with 3.1% false positives

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    PageRank

    Let PN×N be the normalized link matrix of a graph

    Row-normalized

    No “sinks”

    Definition (PageRank)

    Stationary state of:

    αP +(1− α)

    N1N×N

    Follow links with probability α

    Random jump with probability 1− α

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Maximum PageRank in the Host

    1e−08 1e−07 1e−06 1e−05 0.00010

    0.1

    0.2

    0.3

    PageRank

    Maximum PageRank of the site δ = 0.23

    Normal Spam

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Variance of PageRank

    Suggested in [Benczúr et al., 2005]

    PageRank PageRank

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Variance of PageRank of in-neighbors

    0 0.2 0.4 0.6 0.8 10

    0.1

    0.2

    0.3

    σ2 of the logarithm of PageRank

    Stdev. of PR of Neighbors (Home) δ = 0.41

    Normal Spam

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Automatic classifier

    Features: degree-based plus the following in the home pageand the page with maximum PageRank:

    - PageRank- In-degree/PageRank- Out-degree/PageRank- Standard deviation of PageRank of in-neighbors = σ2

    - σ2/PageRank

    Plus the PageRank of the home page divided by thePageRank of the page with the maximum PageRank.

    Decision tree

    74.4% of detection rate, with 2.6% false positives(Degree-based: 72.6% of detection, 3.1% false positives)

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    TrustRank

    TrustRank [Gyöngyi et al., 2004]

    A node with high PageRank, but far away from a core set of“trusted nodes” is suspicious

    Start from a set of trusted nodes, then do a random walk,returning to the set of trusted nodes with probability 1− α ateach step

    i Trusted nodes: data from http://www.dmoz.org/

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    TrustRank score

    1e−06 0.0010

    0.1

    0.2

    0.3

    0.4

    TrustRank

    TrustRank score of home page δ = 0.59

    Normal Spam

    http://www.dmoz.org/

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    TrustRank / PageRank

    0.3 1 10 1000

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    TrustRank score/PageRank

    Estimated relative non−spam mass δ = 0.59

    Normal Spam

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Automatic classifier

    PageRank attributes, plus the following in the home page andthe page with maximum PageRank:

    - TrustScore- TrustScore/PageRank (estimated relative non-spam mass)- TrustScore/In-degree

    Plus the TrustScore in the home page divided by theTrustScore in the page with the maximum PageRank.

    Decision tree

    77.3% of detection rate, with 3.0% false positives(PageRank-based: 74.4% of detection, 2.6% false positives)

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Path-based formula for PageRank

    Given a path p = 〈x1, x2, . . . , xt〉 of length t = |p|

    branching(p) =1

    d1d2 · · · dt−1where di are the out-degrees of the members of the path

    Explicit formula for PageRank [Newman et al., 2001]

    ri (α) =∑

    p∈Path(−,i)

    (1− α)α|p|

    Nbranching(p)

    Path(−, i) are incoming paths in node i

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    General functional ranking

    In general:

    ri (α) =∑

    p∈Path(−,i)

    damping(|p|)N

    branching(p)

    There are many choices for damping(|p|), including a simplelinear function that is as good as PageRank in practice[Baeza-Yates et al., 2006]

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Truncated PageRank

    Reduce the direct contribution of the first levels of links:

    damping(t) =

    {0 t ≤ TCαt t > T

    V No extra reading of the graph after PageRank

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Truncated PageRank(T=2) / PageRank

    0.2 0.4 0.6 0.8 1 1.2 1.4 1.60

    0.1

    0.2

    0.3

    TruncatedPageRank(T=2) / PageRank

    TruncatedPageRank T=2 / PageRank δ = 0.30

    Normal Spam

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Max. change of Truncated PageRank

    0.85 0.9 0.95 1 1.05 1.10

    0.1

    0.2

    max(TrPRi+1

    /TrPri)

    Maximum change of Truncated PageRank δ = 0.29

    Normal Spam

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Automatic classifier

    PageRank attributes, plus the following in the home page andthe page with maximum PageRank:

    - TruncPageRank(T = 1 . . . 4)- TruncPageRank(T = 4) / TruncPageRank(T = 3)- TruncPageRank(T = 3) / TruncPageRank(T = 2)- TruncPageRank(T = 2) / TruncPageRank(T = 1)- TruncPageRank(T = 1 . . . 4) / PageRank- Minimum, maximum and average of:TruncPageRank(T = i)/TruncPageRank(T = i − 1)Plus the TruncatedPageRank(T = 1 . . . 4) of the home pagedivided by the same value in the page with the maximumPageRank.

    Decision tree

    76.9% of detection rate, with 2.5% false positives(TrustRank-based: 77.3% of detection, 3.0% false positives)

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Idea: count “supporters” at different distances

    Number of different nodes at a given distance:

    .UK 18 mill. nodes .EU.INT 860,000 nodes

    0.0

    0.1

    0.2

    0.3

    5 10 15 20 25 30

    Freq

    uenc

    y

    Distance

    0.0

    0.1

    0.2

    0.3

    5 10 15 20 25 30

    Freq

    uenc

    y

    Distance

    Average distance Average distance14.9 clicks 10.0 clicks

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    High and low-ranked pages are different

    1 5 10 15 200

    2

    4

    6

    8

    10

    12

    x 104

    Distance

    Num

    ber o

    f Nod

    es

    Top 0%−10%Top 40%−50%Top 60%−70%

    Areas below the curves are equal if we are in the samestrongly-connected component

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Probabilistic counting

    100010

    100010

    110000

    000110

    000011

    100010

    100011

    111100111111

    100011

    Count bits setto estimatesupporters

    Target page

    Propagation ofbits using the

    “OR” operation

    100010

    Improvement of ANF algorithm [Palmer et al., 2002] based onprobabilistic counting [Flajolet and Martin, 1985]

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    General algorithm

    Require: N: number of nodes, d: distance, k: bits1: for node : 1 . . . N, bit: 1 . . . k do2: INIT(node,bit)3: end for4: for distance : 1 . . . d do {Iteration step}5: Aux ← 0k6: for src : 1 . . . N do {Follow links in the graph}7: for all links from src to dest do8: Aux[dest] ← Aux[dest] OR V[src,·]9: end for

    10: end for11: V ← Aux12: end for13: for node: 1 . . .N do {Estimate supporters}14: Supporters[node] ← ESTIMATE( V[node,·] )15: end for16: return Supporters

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Our estimator

    Initialize all bits to one with probability �

    Estimator: neighbors(node) = log(1−�)

    (1− ones(node)k

    )Adaptive estimation

    Repeat the above process for � = 1/2, 1/4, 1/8, . . . , and lookfor the transitions from more than (1− 1/e)k ones to lessthan (1− 1/e)k ones.

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Convergence

    5 10 15 200%

    10%

    20%

    30%

    40%

    50%

    60%

    70%

    80%

    90%

    100%

    Iteration

    Frac

    tion

    of n

    odes

    with

    est

    imat

    es

    d=1d=2d=3d=4d=5d=6d=7d=8

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Error rate

    1 2 3 4 5 6 7 80

    0.1

    0.2

    0.3

    0.4

    0.5

    512 b×i 768 b×i 832 b×i 960 b×i 1152 b×i 1216 b×i 1344 b×i 1408 b×i

    576 b×i1152 b×i

    512 b×i768 b×i

    832 b×i

    960 b×i

    1152 b×i1216 b×i

    1344 b×i 1408 b×i

    Distance

    Ave

    rage

    Rel

    ativ

    e E

    rror

    Ours 64 bits, epsilon−only estimatorOurs 64 bits, combined estimatorANF 24 bits × 24 iterations (576 b×i)ANF 24 bits × 48 iterations (1152 b×i)

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Hosts at distance 4

    1 100 10000

    0.1

    0.2

    0.3

    0.4

    S4 − S

    3

    Hosts at Distance Exactly 4 δ = 0.39

    Normal Spam

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Minimum change of supporters

    1 5 100

    0.1

    0.2

    0.3

    0.4

    min(S2/S

    1, S

    3/S

    2, S

    4/S

    3)

    Minimum change of supporters δ = 0.39

    Normal Spam

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Automatic classifier

    PageRank attributes, plus the following in the home page andthe page with maximum PageRank:

    - Supporters at 2 . . . 4- Supporters at 2 . . . 4 / PageRank- Supporters at i / Supporters at i − 1 (for i = 1..4)- Minimum, maximum and average of: Supporters at i /Supporters at i − 1 (for i = 1..4)- (Supporters at i - Supporters at i − 1) / PageRankPlus the number of supporters at distance 2 . . . 4 in the homepage divided by the same feature in the page with themaximum PageRank.

    Decision tree

    78.9% of detection rate, with 2.5% false positives(TruncPR: 76.9% of detection, 2.5% false positives)

    (TrustRank: 77.7% of detection, 3.0% false positives)

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    1 Motivation2 Degree-based measures3 PageRank4 TrustRank5 Truncated PageRank6 Counting supporters7 Conclusions

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Summary of classifiers

    Detection FalseMetrics rate positives

    Degree (D) 72.6% 3.1%D + PageRank (P) 74.4% 2.6%D + P + TrustRank 77.7% 3.0%D + P + Trunc. PageRank 76.9% 2.5%D + P + Est. of Supporters 78.9% 2.5%

    All attributes 81.4% 2.8%All attributes (more rules) 80.8% 1.1%

    Comparison

    Content-based analysis [Ntoulas et al., 2006] has shown86.2% detection rate with 2.2% false positives

    Classifier based on TrustRank [Gyöngyi et al., 2004]:49%-50% of detection rate with 2.3%-2.1% error in oursample – SpamRank [Benczúr et al., 2005] reports similardetection rates

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Top 10 metrics

    1. Binary variable indicating if homepage is the page withmaximum PageRank of the site2. Edge reciprocity3. Different hosts at distance 44. Different hosts at distance 35. Minimum change of supporters (different hosts)6. Different hosts at distance 27. TruncatedPagerank (T=1) / PageRank8. TrustRank score divided by PageRank9. Different hosts at distance 110. TruncatedPagerank (T=2) / PageRank

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Conclusions

    V Link-based statistics to detect 80% of spamX No magic bullet in link analysisX Precision still low compared to e-mail spam filtersV Measure both home page and max. PageRank pageV Host-based counts are important

    Further results at WebKDD 2006 Next step: combine linkanalysis and content analysis

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    If you want privacy, spamize your data!

    Thank you!

    Questions?

    Soon: Spam detection test collection: 11,000 classified hosts

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Baeza-Yates, R., Boldi, P., and Castillo, C. (2006).

    Generalizing PageRank: Damping functions for link-basedranking algorithms.

    In Proceedings of SIGIR, Seattle, Washington, USA. ACMPress.

    Becchetti, L., Castillo, C., Donato, D., Leonardi, S., andBaeza-Yates, R. (2006).

    Using rank propagation and probabilistic counting forlink-based spam detection.

    In Proceedings of the Workshop on Web Mining and WebUsage Analysis (WebKDD), Pennsylvania, USA. ACM Press.

    Benczúr, A. A., Csalogány, K., Sarlós, T., and Uher, M.(2005).

    Spamrank: fully automatic link spam detection.

    In Proceedings of the First International Workshop onAdversarial Information Retrieval on the Web, Chiba, Japan.

    http://www.dcc.uchile.cl/~ccastill/papers/baeza06_general_pagerank_damping_functions_link_ranking.pdfhttp://www.dcc.uchile.cl/~ccastill/papers/baeza06_general_pagerank_damping_functions_link_ranking.pdfhttp://www.dcc.uchile.cl/~ccastill/papers/becchetti_06_automatic_link_spam_detection_rank_propagation.pdfhttp://www.dcc.uchile.cl/~ccastill/papers/becchetti_06_automatic_link_spam_detection_rank_propagation.pdf

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Fetterly, D., Manasse, M., and Najork, M. (2004).

    Spam, damn spam, and statistics: Using statistical analysis tolocate spam web pages.

    In Proceedings of the seventh workshop on the Web anddatabases (WebDB), pages 1–6, Paris, France.

    Flajolet, P. and Martin, N. G. (1985).

    Probabilistic counting algorithms for data base applications.

    Journal of Computer and System Sciences, 31(2):182–209.

    Gibson, D., Kumar, R., and Tomkins, A. (2005).

    Discovering large dense subgraphs in massive graphs.

    In VLDB ’05: Proceedings of the 31st international conferenceon Very large data bases, pages 721–732. VLDB Endowment.

    Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Gyöngyi, Z., Molina, H. G., and Pedersen, J. (2004).

    Combating web spam with trustrank.

    In Proceedings of the Thirtieth International Conference onVery Large Data Bases (VLDB), pages 576–587, Toronto,Canada. Morgan Kaufmann.

    Newman, M. E., Strogatz, S. H., and Watts, D. J. (2001).

    Random graphs with arbitrary degree distributions and theirapplications.

    Phys Rev E Stat Nonlin Soft Matter Phys, 64(2 Pt 2).

    Ntoulas, A., Najork, M., Manasse, M., and Fetterly, D. (2006).

    Detecting spam web pages through content analysis.

    In Proceedings of the World Wide Web conference, pages83–92, Edinburgh, Scotland.

    http://dx.doi.org/http://doi.acm.org/10.1145/1017074.1017077http://dx.doi.org/http://doi.acm.org/10.1145/1017074.1017077http://citeseer.ist.psu.edu/flajolet85probabilistic.htmlhttp://portal.acm.org/citation.cfm?id=1083676http://dbpubs.stanford.edu:8090/pub/2004-17http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed% &dopt=Abstract&list_uids=11497662http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed% &dopt=Abstract&list_uids=11497662http://dx.doi.org/10.1145/1135777.1135794

  • Link-BasedSpam Detection

    L. Becchetti,C. Castillo,D. Donato,

    S. Leonardi andR. Baeza-Yates

    Motivation

    Degree-basedmeasures

    PageRank

    TrustRank

    TruncatedPageRank

    Countingsupporters

    Conclusions

    Palmer, C. R., Gibbons, P. B., and Faloutsos, C. (2002).

    ANF: a fast and scalable tool for data mining in massivegraphs.

    In Proceedings of the eighth ACM SIGKDD internationalconference on Knowledge discovery and data mining, pages81–90, New York, NY, USA. ACM Press.

    http://dx.doi.org/10.1145/775047.775059http://dx.doi.org/10.1145/775047.775059

    MotivationDegree-based measuresPageRankTrustRankTruncated PageRankCounting supportersConclusions


Top Related