adversarial id - social spam recognition


Upload: nicola-miotto

Post on 08-May-2015


DESCRIPTION

Spam recognition in social systems using supervised machine learning.

TRANSCRIPT

Page 1: Adversarial ID - Social spam recognition

Introduction Tag-spam detection Youtube Video Spamming Conclusions References

Adversarial IR - Social Spamming

Nicola Miotto

Unipd - Computer Science

January 22, 2011

Nicola Miotto (Unipd - Computer Science) Adversarial IR - Social Spamming January 22, 2011 1 / 39


Outline

1 Introduction
    Spam
    Adversarial IR

2 Tag-spam detection in Social Bookmarking systems
    Problem description
    Features
    Classification

3 Youtube Video Spamming
    Problem description
    Features
    Classification

4 Conclusions

5 References


Introduction


Spam - History

1970: the BBC broadcasts the Spam sketch by Monty Python's Flying Circus, from which the current meaning of the term derives;

1978: advertising message sent to 393 ARPANET users, the earliest documented spam;

1990s: "Make Money Fast" messages flood many newsgroups. First association of the term spam with an IT-related field;

1998: new definition of the term spam in the New Oxford Dictionary of English:

Definition

Irrelevant or inappropriate messages sent on the Internet to a large number of newsgroups or users.


Spam - Fields

E-mail

Instant Messaging: Messaging spam

Web-Search: Spamdexing

Social systems: Social spam

And so on...


Spam - Spammer

Earn money on the web!

Services like Google AdSense or Heyos let users place automatically generated ads on their web pages in order to earn money from clicks and page impressions.

Legitimate advertiser:

Builds websites that host content-related ads;

Improves the PageRank of the websites for the relevant keywords;

Tries to lead potential customers to his websites;


Spam - Spammer

Spammer:

Website content is used only to attract users and improve the PageRank;

No distinction between interested and uninterested users;

Automatic spam-network generation programs:

they find the relevant keywords (e.g. via AdWords);

they register domain names containing those keywords;

they create complete websites with fake content built around the keywords found;

they link the generated websites together in order to improve their PageRank;


Spam - Social Spamming

Spam campaigns directed at social network users

Social bookmarking systems: Delicious;

Video social network: YouTube;

General purpose social network: Facebook;

and so on..


Spam - Social Spamming

Features:

Lots of user related information;

Easier to target a specific demographic segment;

Cheaper (usually);

Most commonly adopted solution: "Report abuse" → a generic solution, but less effective than ad-hoc ones.


Spam - Consequences

Users are diverted to content outside their information needs;

Unfair competition with legitimate advertisers;

Information poisoning due to the spam noise


Adversarial IR - Definition

Adversarial: “Assumes competing parties trying to affect the outcome of a system (system could be an algorithm, a market, etc)”

Adversarial IR: “Information retrieval, ranking, or classification system affected by multiple parties acting in their own interest”


Adversarial IR - AIRWeb

AIRWeb

Adversarial Information Retrieval on the Web

Annual workshop about Adversarial IR

Researchers and industry practitioners gather to present and discuss advances in the state of the art of Adversarial IR

First workshop in 2005 (Japan)


Discussed techniques

AIRWeb papers: social spam recognition techniques discussed during the AIRWeb workshops

Supervised Machine Learning:

1 Feature modelling

2 Training dataset retrieval

3 Machine learning (e.g. SVM)

4 Result evaluation


Tag-spam detection in Social Bookmarking systems


Problem description - Tag-spam

Social bookmarking system:

Users can associate meta-information (tags) with resources (links);

One or more words can be associated with any resource;

Advertiser:

Social tagging: posts links to his websites and tags them with content-related keywords

Spammer:

Uses the most popular keywords (e.g. music) to tag unrelated websites (e.g. his spam websites);


Figure: Delicious.com Screenshot (2011)


Figure: Example: Tag-spam on Delicious.com (2008)


Problem description - Folksonomy

Data structure to represent a social tagging system;

Hyper-graph connecting users, resources and tags;

Symbols:

$u \in U$, $U$ set of users;

$r \in R$, $R$ set of resources;

$t \in T$, $T$ set of tags;

$\mathrm{post} = \{(u, r, t_1), \dots, (u, r, t_n)\} = \{(u, r, (t_1, \dots, t_n))\}$

$F = \{\mathrm{post}_1, \dots, \mathrm{post}_n\}$
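As a rough illustration of the definitions above, a folksonomy can be held in memory as a set of (user, resource, tag) triples and grouped into posts. This is only a sketch: the user names, URLs and the helper names `posts` and `tags_of` are made up for the example and do not come from the slides.

```python
from collections import defaultdict

# Folksonomy F as a set of (user, resource, tag) triples (toy data).
F = {
    ("alice", "http://example.org/jazz", "music"),
    ("alice", "http://example.org/jazz", "jazz"),
    ("bob", "http://example.org/pills", "music"),
    ("bob", "http://example.org/pills", "free"),
    ("bob", "http://example.org/pills", "news"),
}

def posts(folksonomy):
    """Group triples into posts: (user, resource) -> set of tags."""
    grouped = defaultdict(set)
    for user, resource, tag in folksonomy:
        grouped[(user, resource)].add(tag)
    return dict(grouped)

def tags_of(folksonomy, user, resource):
    """Tags that `user` attached to `resource`."""
    return {t for (u, r, t) in folksonomy if u == user and r == resource}

print(posts(F))
```

Storing flat triples keeps membership tests such as "(u, r, t) in F" cheap, while the grouped view matches the post notation.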


Figure: Folksonomy graphical representation example


Features - Tag based

Which tags do spammers use?

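TagSpam

$U_t = \{u : \exists r : (u, r, t) \in F\}$, the users who used tag $t$

$S_t \subseteq U_t$, the users of $t$ identified as spammers

$\Pr(t) = \frac{|S_t|}{|U_t|}$

$T(u, r) = \{t : (u, r, t) \in F\}$

$f_{\mathrm{TagSpam}}(u, r) = \frac{1}{|T(u, r)|} \sum_{t \in T(u, r)} \Pr(t)$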
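The TagSpam feature can be sketched in a few lines of Python. The sketch assumes the folksonomy is a set of (user, resource, tag) triples and that a set of already labelled spammers is available. The function names, the toy data and the choice of returning 0 for a post without tags are assumptions of this example, not part of the slides.

```python
def tag_spam_probability(folksonomy, spammers):
    """Pr(t) = |S_t| / |U_t| for every tag t in the folksonomy."""
    users_by_tag = {}
    for user, _resource, tag in folksonomy:
        users_by_tag.setdefault(tag, set()).add(user)
    return {
        tag: len(users & spammers) / len(users)
        for tag, users in users_by_tag.items()
    }

def f_tag_spam(folksonomy, spammers, user, resource):
    """Average Pr(t) over the tags T(u, r) of one post."""
    pr = tag_spam_probability(folksonomy, spammers)
    tags = {t for (u, r, t) in folksonomy if u == user and r == resource}
    if not tags:
        return 0.0  # assumption: a post without tags gets score 0
    return sum(pr[t] for t in tags) / len(tags)

# Toy folksonomy: "bob" is the only known spammer.
F = {
    ("alice", "r1", "music"), ("alice", "r1", "jazz"),
    ("bob", "r2", "music"), ("bob", "r2", "free"),
}
known_spammers = {"bob"}

print(f_tag_spam(F, known_spammers, "alice", "r1"))
print(f_tag_spam(F, known_spammers, "bob", "r2"))
```

In the toy data the tag "music" is shared by one spammer and one regular user, so it contributes 0.5 to both posts, while "free" is used only by the spammer and pushes his post's score up.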


Features - Tag based

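Is there a semantic relationship between tags?

TagBlur

$\sigma(t_1, t_2) \in [0, 1]$, normalized tag similarity between $t_1$ and $t_2$

$Z$ = number of tag pairs in $T(u, r)$

$f_{\mathrm{TagBlur}}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u, r)} \left[ \frac{1}{\sigma(t_1, t_2) + \varepsilon} - \frac{1}{1 + \varepsilon} \right]$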
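A minimal sketch of the TagBlur computation follows. It assumes a symmetric similarity function, iterates over unordered tag pairs, and uses an arbitrary default for ε. The similarity table `SIM`, the helper `toy_sigma` and the value of `eps` are hypothetical, since the slides do not say how σ or ε are chosen.

```python
from itertools import combinations

def f_tag_blur(tags, sigma, eps=0.01):
    """Average of 1/(sigma + eps) - 1/(1 + eps) over all tag pairs of a post."""
    pairs = list(combinations(sorted(tags), 2))
    if not pairs:
        return 0.0  # assumption: a single-tag post has no blur
    total = sum(
        1.0 / (sigma(t1, t2) + eps) - 1.0 / (1.0 + eps)
        for t1, t2 in pairs
    )
    return total / len(pairs)

# Hypothetical similarity table for the toy tags.
SIM = {
    frozenset(("jazz", "music")): 1.0,
    frozenset(("jazz", "pills")): 0.0,
    frozenset(("music", "pills")): 0.0,
}

def toy_sigma(t1, t2):
    return SIM[frozenset((t1, t2))]

print(f_tag_blur({"jazz", "music"}, toy_sigma))
print(f_tag_blur({"jazz", "music", "pills"}, toy_sigma))
```

Perfectly similar tags contribute zero, so a coherent post scores close to 0, while unrelated tags make the score grow roughly like 1/ε.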


Features - Resource based I

DomFP: spammers use programs to generate pages → spam pages share the same content

We know the fingerprints of some spam pages

Compute the likelihood that r is spam by comparing r's fingerprint to the known ones

NumAds: spammers usually just serve lots of ads

NumAds application example: count the occurrences of googlesyndication.com in the resource's HTML code
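The NumAds example can be sketched as a plain substring count over the page source. The function name `num_ads` and the sample HTML are invented for illustration, and a real crawler would first have to download the page.

```python
def num_ads(html, marker="googlesyndication.com"):
    """Count case-insensitive occurrences of the ad-server marker in the HTML."""
    return html.lower().count(marker)

# Made-up page source containing two ad scripts.
sample_page = """
<html><body>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>
<p>Cheap pills</p>
<script src="http://pagead2.GoogleSyndication.com/pagead/show_ads.js"></script>
</body></html>
"""

print(num_ads(sample_page))
```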


Features - Resource based

Plagiarism: spammers usually copy content from high-ranked websites

Compare the content of r to other web pages

ValidLinks: spammer websites are frequently taken down

Many invalid links posted by u imply a greater likelihood of u being a spammer
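One possible way to turn the ValidLinks idea into a number is the fraction of a user's links that still resolve. The slides do not give the exact formula, so the ratio below, the function name `valid_links_ratio` and the injected `is_valid` checker are assumptions of this sketch. In practice the checker would issue an HTTP request; here it is a lookup in a toy set.

```python
def valid_links_ratio(links, is_valid):
    """Fraction of the user's posted links for which is_valid(link) is true."""
    links = list(links)
    if not links:
        return 1.0  # assumption: no links means nothing invalid was posted
    return sum(1 for link in links if is_valid(link)) / len(links)

# Toy stand-in for a network check: the set of links that still resolve.
alive = {"http://example.org/a", "http://example.org/b"}
posted = [
    "http://example.org/a",
    "http://example.org/b",
    "http://example.org/gone1",
    "http://example.org/gone2",
]

print(valid_links_ratio(posted, alive.__contains__))
```

A low ratio would then count as evidence that the user is a spammer.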


Classification - Training dataset

BibSonomy.org:

public dataset

27,000 users and their posts

hand-made classification → 25,000 spammers and 2,000 legitimate users

Classification:

Binary classification into either spammer or not spammer


Classification - Results

                   SVM                          AdaBoost
Features           Accuracy   FP      F1        Accuracy   FP      F1
TagSpam            95.82%     .061    .957      94.66%     .048    .943
+ TagBlur          96.75%     .048    .966      96.06%     .044    .958
+ DomFp            96.75%     .048    .966      96.06%     .044    .958
+ ValidLinks       96.52%     .048    .964      96.75%     .026    .965
+ NumAds           96.52%     .048    .964      97.22%     .026    .970
+ Plagiarism       96.75%     .048    .966      98.38%     0.22    .983


YouTube Video Spamming


Description - Youtube video spam

Video response: a user answers a video with another, related video

Spammer: a user answering with unrelated videos

Reasons:

    increase video popularity
    marketing campaign
    pornography distribution
    system poisoning

Issue: automatic content-based spam recognition is hard to implement


Description - Techniques

Content-based recognition:

    video content analysis
    requires too many computational resources
    hard to generalize the idea of spam in a video, unless it has textual content

Video and user relationship analysis:

    lots of information publicly available
    spammers have specific social features (they're lonely)
    user behaviour towards spammers can be automatically analysed


Features - User-based

For each user:

# posted videos

# friends

# watched videos

# favourite videos

# video responses

# responded videos

# subscriptions

# subscribers


Features - Video-based

2 categories of videos per user:

All posted videos

Just video responses

7 attributes for each of them:

# views

duration

# votes

# comments

# favourites

# youtube honours

# external links

Total and average of each attribute for each category, so 2 × 7 × 2 = 28 features in total.
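The count of 28 can be checked with a small sketch that aggregates the seven attributes over both video categories. The attribute keys, the `is_response` flag and the helper names are hypothetical; the slides only list the attributes and the two categories.

```python
ATTRIBUTES = (
    "views", "duration", "votes", "comments",
    "favourites", "honours", "external_links",
)

def make_video(is_response, **attrs):
    """Build a toy video record; unspecified attributes default to 0."""
    video = {name: 0 for name in ATTRIBUTES}
    video.update(attrs, is_response=is_response)
    return video

def aggregate(videos):
    """Total and average of every attribute over a list of videos."""
    feats = {}
    for name in ATTRIBUTES:
        values = [v[name] for v in videos]
        total = sum(values)
        feats[f"total_{name}"] = total
        feats[f"avg_{name}"] = total / len(values) if values else 0.0
    return feats

def video_features(all_videos):
    """2 categories x 7 attributes x (total, average) = 28 features."""
    responses = [v for v in all_videos if v["is_response"]]
    feats = {f"all_{k}": v for k, v in aggregate(all_videos).items()}
    feats.update({f"resp_{k}": v for k, v in aggregate(responses).items()})
    return feats

sample = [
    make_video(False, views=100, duration=60),
    make_video(True, views=20, duration=30),
]
features = video_features(sample)
print(len(features))
```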


Features - Social network

Based on the video-response user graph:

directed graph (X, Y), with X the set of users and Y the set of edges

each user is a node in the graph

(x1, x2) ∈ Y is a directed edge from x1 ∈ X to x2 ∈ X if x1 responded to a video of x2

Analysis:

in/out degree for each “user”

assortativity: degree(n) / avg( degree(neighbours(n)) )

userrank: depending on quantity and quality of in links

clustering coefficient, betweenness, reciprocity
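The simplest of these measures can be sketched directly from an edge list. The sketch below computes the in-degree, the out-degree and the assortativity ratio as defined on the slide. It assumes that "degree" means in-degree plus out-degree and that neighbours are taken in both directions. The function name and the toy edges are made up.

```python
from collections import defaultdict

def degree_features(edges):
    """Per-user in/out degree and degree(n) / avg(degree(neighbours(n)))."""
    out_nb = defaultdict(set)
    in_nb = defaultdict(set)
    for src, dst in edges:  # src responded to a video of dst
        out_nb[src].add(dst)
        in_nb[dst].add(src)
    nodes = set(out_nb) | set(in_nb)
    # Assumption: total degree = number of out-neighbours + in-neighbours.
    degree = {n: len(out_nb[n]) + len(in_nb[n]) for n in nodes}
    feats = {}
    for n in nodes:
        neighbours = out_nb[n] | in_nb[n]
        avg_nb = sum(degree[m] for m in neighbours) / len(neighbours)
        feats[n] = {
            "in": len(in_nb[n]),
            "out": len(out_nb[n]),
            "assortativity": degree[n] / avg_nb,
        }
    return feats

# Toy graph: "s" responds to everybody, nobody responds to "s".
edges = [("s", "a"), ("s", "b"), ("s", "c"), ("a", "b")]
graph_features = degree_features(edges)
print(graph_features["s"])
```

A user with many outgoing responses and no incoming ones, like "s" here, is the kind of pattern these features are meant to expose.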


Classification - Dataset

Data crawling: starting from the top-100 most responded videos, retrieve the connected data concerning video responses, responded videos and users.

Hand-made classification: each user with at least one video response unrelated to the responded video is classified as a spammer.

Test set: 473 legitimate users + 119 spammers = 592 users


Classification - Training

Support Vector Machine

5-fold cross-validation

Adopted features:

    user-based
    video-based
    social-network
    all together
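The 5-fold cross-validation step can be illustrated with a small index splitter: the users are shuffled once and divided into five folds, and each fold serves as the test set exactly once. The function name, the fixed seed and the round-robin split are choices of this sketch; the slides do not say how the folds were built, and the SVM training itself is left out.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Round-robin split so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 592 is the number of hand-labelled users in the dataset.
splits = list(k_fold_indices(592, k=5))
for train, test in splits:
    print(len(train), len(test))
```

Each classifier would be trained on `train` and evaluated on `test`, and the reported measures would be averaged over the five rounds.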


Classification - Results

Measure     User     Video    SN       All
TP          0.054    0.426    0.375    0.439
TN          0.998    0.922    1        0.981
FP          0.002    0.078    0        0.019
FN          0.946    0.574    0.625    0.561
Accuracy    0.821    0.821    0.874    0.870
F           0.094    0.484    0.540    0.558

TP = users correctly classified as spammers

FP = legal users classified as spammers
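As a rough consistency check, the accuracy can be recomputed from the per-class rates and the class sizes of the dataset (119 spammers, 473 legal users). This sketch assumes that TP and TN are rates relative to their own class. Because the reported figures are presumably averages over cross-validation folds, small deviations are expected and not every column is reproduced exactly.

```python
def accuracy_from_rates(tp_rate, tn_rate, n_spammers=119, n_legit=473):
    """Overall accuracy from per-class rates, weighted by class size."""
    correct = tp_rate * n_spammers + tn_rate * n_legit
    return correct / (n_spammers + n_legit)

# (TP rate, TN rate) pairs taken from the SN and All columns.
for name, tp, tn in (("SN", 0.375, 1.0), ("All", 0.439, 0.981)):
    print(name, round(accuracy_from_rates(tp, tn), 3))
```

The computation also shows why accuracy stays high even when fewer than half of the spammers are caught: legal users make up about 80% of the dataset and are almost always classified correctly.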


Conclusions


Conclusions

Classifications

Tag-spam recognition:

    Accuracy > 98%
    False positives < 2%

Youtube-video spam recognition:

    True positives ≈ 44%
    False positives < 2%


Conclusions

Pros:

Few legitimate users classified as spammers

Tag-spam recognition finds most of the spammers

Dataset built from publicly available information

Cons:

Social system already poisoned by spam

Hand-made classification of training examples


References I

Brian D. Davison, The Potential for Research and Development in Adversarial Information Retrieval, Computer Science and Engr., Lehigh University, Cambridge, 2009, available at http://airweb.cse.lehigh.edu/2009/slides/Davison-AIRWeb2009-Keynote.pdf.

B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme, Evaluating similarity measures for emergent semantics of social tagging, In Proc. 18th Intl. WWW Conf., 2009, available at http://www2009.org/proceedings/pdf/p641.pdf.

Benjamin Markines, Ciro Cattuto, Filippo Menczer, Social Spam Detection, AIRWeb ’09, April 21, 2009, Madrid, Spain, available at http://airweb.cse.lehigh.edu/2009/papers/p41-markines.pdf.


References II

Fabricio Benevenuto, Tiago Rodrigues, Virgilio Almeida, Jussara Almeida, Chao Zhang, Keith Ross, Identifying Video Spammers in Online Social Networks, AIRWeb ’08, April 22, 2008, Beijing, China, available at http://airweb.cse.lehigh.edu/2008/submissions/benevenuto_2008_spam_video.pdf.
