adversarial id - social spam recognition


Adversarial IR - Social Spamming

Nicola Miotto

Unipd - Computer Science

January 22, 2011



Outline

1 Introduction
    Spam
    Adversarial IR

2 Tag-spam detection in Social Bookmarking systems
    Problem description
    Features
    Classification

3 YouTube Video Spamming
    Problem description
    Features
    Classification

4 Conclusions

5 References



Introduction



Spam - History

1970: the BBC broadcasts the Spam sketch by Monty Python’s Flying Circus, from which the current meaning of the term derives;

1978: an advertising message is sent to 393 ARPANET users, the earliest documented spam;

’90s: “Make Money Fast” messages flood many newsgroups. First association of the term spam with an IT-related field;

1998: a new definition of the term spam appears in the New Oxford Dictionary of English:

Definition

Irrelevant or inappropriate messages sent on the Internet to a large number of newsgroups or users.



Spam - Fields

E-mail

Instant Messaging: Messaging spam

Web-Search: Spamdexing

Social systems: Social spam

And so on...



Spam - Spammer

Earn money on the web!

Services such as Google AdSense or Heyos let users place automatically generated ads on their web pages and earn money from clicks and page impressions.

Legitimate advertiser:

Builds websites that host content-related ads;

Improves the PageRank of the website for the relevant keywords;

Tries to lead potential customers to his websites;



Spam - Spammer

Spammer:

Website content is used only to attract users and improve the PageRank;

No distinction between interested and uninterested users;

Automatic spam-network generation programs:

they find the relevant keywords (e.g. via AdWords);

they register domain names containing those keywords;

they create complete websites with fake content built around the keywords found;

they link the generated websites together in order to improve their PageRank;



Spam - Social Spamming

Spam campaigns directed at social network users

Social bookmarking systems: Delicious;

Video social network: YouTube;

General purpose social network: Facebook;

and so on..



Spam - Social Spamming

Features:

Lots of user-related information;

Easier to target a specific demographic segment;

Cheaper (usually);

Adopted solution (most of the time): Report abuse → a generic solution, but less effective than ad-hoc ones.



Spam - Consequences

Users are diverted towards content outside their information needs;

Unfair competition with legitimate advertisers;

Information poisoning due to spam noise.



Adversarial IR - Definition

Adversarial: “Assumes competing parties trying to affect the outcome of a system (system could be an algorithm, a market, etc)”

Adversarial IR: “Information retrieval, ranking, or classification system affected by multiple parties acting in their own interest”



Adversarial IR - AIRWeb

AIRWeb

Adversarial Information Retrieval on the Web

Annual workshop about Adversarial IR

Researchers and industry practitioners gather to present and discuss advances in the state of the art of Adversarial IR

First workshop in 2005 (Japan)



Discussed techniques

AIRWeb papers: social spam recognition techniques discussed during the AIRWeb workshops

Supervised machine learning:

1 Feature modelling

2 Training dataset retrieval

3 Machine learning (e.g. SVM)

4 Result evaluation



Tag-spam detection in Social Bookmarking systems



Problem description - Tag-spam

Social bookmarking system:

Users can associate meta-information (tags) with resources (links);

One or more words can be associated with any resource;

Advertiser:

Social tagging: posts links to his website, tagging them with content-related keywords;

Spammer:

Uses the most popular keywords (e.g. music) to tag unrelated websites (e.g. his spam websites);



Figure: Delicious.com Screenshot (2011)



Figure: Example: Tag-spam on Delicious.com (2008)



Problem description - Folksonomy

Data structure to represent a social tagging system;

Hyper-graph connecting users, resources and tags;

Symbols:

$u \in U$, where $U$ is the set of users;

$r \in R$, where $R$ is the set of resources;

$t \in T$, where $T$ is the set of tags;

$post = \{(u, r, t_1), \dots, (u, r, t_n)\} = \{(u, r, (t_1, \dots, t_n))\}$

$F = \{post_1, \dots, post_n\}$
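As an illustration only, the folksonomy can be sketched in Python as a flat set of (user, resource, tag) triples. The user names, URLs and helper names below are hypothetical and do not come from the slides.

```python
# Hypothetical toy posts: each post is (user, resource, set of tags).
posts = [
    ("alice", "http://example.org/jazz", {"music", "jazz"}),
    ("bob", "http://example.org/jazz", {"music"}),
    ("spam42", "http://example.com/pills", {"music", "news", "free"}),
]


def build_folksonomy(posts):
    """Flatten the posts into the set F of (user, resource, tag) triples."""
    return {(u, r, t) for u, r, tags in posts for t in tags}


def tags_of(F, u, r):
    """Tags that user u attached to resource r."""
    return {t for (u2, r2, t) in F if u2 == u and r2 == r}


def users_of(F, t):
    """Users that attached tag t to at least one resource."""
    return {u for (u, _r, t2) in F if t2 == t}


F = build_folksonomy(posts)
```

Storing F as a set of triples keeps membership tests such as `(u, r, t) in F` direct, which is how the features on the following slides are phrased.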



Figure: Folksonomy graphical representation example



Features - Tag based

Which tags do spammers use?

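TagSpam:

$U_t = \{u : \exists r : (u, r, t) \in F\}$, the users who used tag $t$

$S_t \subseteq U_t$, the users of tag $t$ identified as spammers

$\Pr(t) = \frac{|S_t|}{|U_t|}$

$T(u, r) = \{t : (u, r, t) \in F\}$

$f_{TagSpam}(u, r) = \frac{1}{|T(u, r)|} \sum_{t \in T(u, r)} \Pr(t)$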
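A minimal sketch of the TagSpam feature, assuming the folksonomy F is available as a set of (user, resource, tag) triples and that a set of already labelled spammers is given. The function names and the toy data are hypothetical.

```python
def tag_spam_probabilities(F, spammers):
    """Pr(t) = |S_t| / |U_t| for every tag t that appears in F."""
    users_by_tag = {}
    for u, _r, t in F:
        users_by_tag.setdefault(t, set()).add(u)
    return {t: len(users & spammers) / len(users)
            for t, users in users_by_tag.items()}


def f_tag_spam(F, pr, u, r):
    """Average Pr(t) over the tags T(u, r) of one post."""
    tags = {t for (u2, r2, t) in F if u2 == u and r2 == r}
    if not tags:
        return 0.0  # assumption: a post without tags scores 0
    return sum(pr[t] for t in tags) / len(tags)


# Hypothetical toy folksonomy and spammer labels.
F = {
    ("alice", "r1", "jazz"),
    ("alice", "r1", "music"),
    ("spam1", "r2", "music"),
    ("spam1", "r2", "free"),
    ("spam2", "r3", "free"),
}
spammers = {"spam1", "spam2"}

pr = tag_spam_probabilities(F, spammers)
score = f_tag_spam(F, pr, "spam1", "r2")
```

In this toy data "free" is used only by spammers and "music" by one spammer out of two users, so the post by spam1 scores higher than the post by alice.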



Features - Tag based

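Is there a semantic relationship between the tags?

TagBlur:

$\sigma(t_1, t_2) \in [0, 1]$, normalized tag similarity between $t_1$ and $t_2$

$Z$ = number of tag pairs in $T(u, r)$

$f_{TagBlur}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u, r)} \left[ \frac{1}{\sigma(t_1, t_2) + \epsilon} - \frac{1}{1 + \epsilon} \right]$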
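A rough sketch of the TagBlur feature, under several assumptions that the slides do not state: the similarity σ is symmetric, so each unordered tag pair is counted once; ε is a small positive constant (0.01 here) that keeps the terms finite when σ = 0; and the similarity lookup table is a hypothetical stand-in for a real tag similarity measure.

```python
from itertools import combinations


def f_tag_blur(tags, sigma, eps=0.01):
    """Average of 1/(sigma + eps) - 1/(1 + eps) over all tag pairs of a post."""
    pairs = list(combinations(sorted(tags), 2))
    if not pairs:
        return 0.0  # assumption: a post with fewer than two tags scores 0
    total = sum(1.0 / (sigma(t1, t2) + eps) - 1.0 / (1.0 + eps)
                for t1, t2 in pairs)
    return total / len(pairs)


# Hypothetical similarity table; unknown pairs are treated as unrelated.
SIMILARITY = {frozenset({"jazz", "music"}): 0.8}


def toy_sigma(t1, t2):
    return SIMILARITY.get(frozenset({t1, t2}), 0.0)


focused = f_tag_blur({"jazz", "music"}, toy_sigma)
blurred = f_tag_blur({"jazz", "pills"}, toy_sigma)
```

A post whose tags are all maximally similar (σ = 1) scores 0, while unrelated tags push the score up, which is the behaviour the feature relies on.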



Features - Resource based I

DomFP: Spammers use programs to generate pages → spam pages share the same content

We know the fingerprints of some spam pages

Compute the likelihood that r is spam by comparing the fingerprint of r to the known ones

NumAds: Spammers' pages usually just offer lots of ads

NumAds application example: count the occurrences of googlesyndication.com in the HTML code of the resource
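The NumAds example can be sketched as a plain substring count. Counting raw, case-insensitive occurrences of the marker is an assumption made here for illustration; the function name and the sample HTML are hypothetical.

```python
def num_ads(html, marker="googlesyndication.com"):
    """Count case-insensitive occurrences of the ad-server marker in the HTML."""
    return html.lower().count(marker)


# Hypothetical page with three ad script tags.
ad_tag = '<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>'
spam_page = "<html><body>" + ad_tag * 3 + "<p>text</p></body></html>"

ads_on_spam_page = num_ads(spam_page)
ads_on_clean_page = num_ads("<html><body><p>no ads here</p></body></html>")
```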



Features - Resource based II

Plagiarism: Spammers usually copy content from high-ranked websites

Compare the content of r to other web pages

ValidLinks: Spammer websites are frequently taken down

Many invalid links posted by u imply a greater likelihood that u is a spammer
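One possible reading of ValidLinks is the share of a user's posted links that no longer resolve. In the sketch below the validity check is injected as a function, so no network access is needed; the helper name, the URLs and the ratio itself are illustrative assumptions, not the definition used in the referenced paper.

```python
def invalid_link_ratio(urls, is_valid):
    """Fraction of the given URLs for which is_valid(url) is False."""
    urls = list(urls)
    if not urls:
        return 0.0  # assumption: a user with no links scores 0
    invalid = sum(1 for url in urls if not is_valid(url))
    return invalid / len(urls)


# Hypothetical data: two of the four posted links are dead.
posted = [
    "http://a.example/1",
    "http://b.example/2",
    "http://c.example/3",
    "http://d.example/4",
]
dead = {"http://b.example/2", "http://d.example/4"}

ratio = invalid_link_ratio(posted, lambda url: url not in dead)
```

In practice `is_valid` would issue an HTTP request and inspect the response, which is left out here on purpose.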



Classification - Training dataset

BibSonomy.org:

public dataset

27,000 users and their posts

manual classification → 25,000 spammers and 2,000 legitimate users

Classification:

Binary classification into either spammer or non-spammer



Classification - Results

                SVM                       AdaBoost
Features        Accuracy   FP     F1      Accuracy   FP     F1

TagSpam         95.82%     .061   .957    94.66%     .048   .943
+ TagBlur       96.75%     .048   .966    96.06%     .044   .958
+ DomFp         96.75%     .048   .966    96.06%     .044   .958
+ ValidLinks    96.52%     .048   .964    96.75%     .026   .965
+ NumAds        96.52%     .048   .964    97.22%     .026   .970
+ Plagiarism    96.75%     .048   .966    98.38%     .022   .983



YouTube Video Spamming



Description - YouTube video spam

Video response: a user answers a video with another, related video

Spammer: a user who answers with unrelated videos

Reasons:

increase video popularity

marketing campaigns

pornography distribution

system poisoning

Issue: automatic content-based spam recognition is hard to implement



Description - Techniques

Content-based recognition:

video content analysis

requires too many computational resources

hard to generalize the notion of spam in a video, unless the video has textual content

Video and user relationship analysis:

lots of publicly available information

spammers have specific social features (they’re lonely)

user behaviour towards spammers can be analysed automatically



Features - User-based

For each user:

# posted videos

# friends

# watched videos

# favourite videos

# video responses

# responded videos

# subscriptions

# subscribers



Features - Video-based

2 categories of videos per user:

All posted videos

Just video responses

7 attributes for each category:

# views

duration

# votes

# comments

# favourites

# youtube honours

# external links

Total and average of each attribute in each category, so 2 × 7 × 2 = 28 features in total.
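The feature count can be checked with a small sketch: two video groups, seven attributes, and a total plus an average for each. The dictionary keys and the sample videos are hypothetical names chosen for illustration.

```python
ATTRIBUTES = ("views", "duration", "votes", "comments",
              "favourites", "honours", "external_links")


def video_features(videos):
    """Total and average of each attribute over all videos and over responses only."""
    groups = {
        "all": videos,
        "responses": [v for v in videos if v["is_response"]],
    }
    features = {}
    for group_name, group in groups.items():
        for attr in ATTRIBUTES:
            total = sum(v[attr] for v in group)
            features[f"{group_name}_{attr}_total"] = total
            # assumption: an empty group gets an average of 0
            features[f"{group_name}_{attr}_avg"] = total / len(group) if group else 0.0
    return features


# Hypothetical user with one regular video and one video response.
videos = [
    {"views": 100, "duration": 60, "votes": 4, "comments": 2,
     "favourites": 1, "honours": 0, "external_links": 0, "is_response": False},
    {"views": 300, "duration": 30, "votes": 0, "comments": 0,
     "favourites": 0, "honours": 0, "external_links": 5, "is_response": True},
]
features = video_features(videos)
```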



Features - Social network

Based on the video response user graph:

directed graph (X, Y)

each user is a node in X

(x1, x2) is a directed edge in Y from x1 ∈ X to x2 ∈ X if x1 responded to a video of x2

Analysis:

in/out degree of each user

assortativity: degree(n) / avg(degree(neighbours(n)))

UserRank: depends on the quantity and quality of incoming links

clustering coefficient, betweenness, reciprocity
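A minimal sketch of some of these graph measures on a toy video response graph. Several choices here are assumptions, since the slides leave them open: "degree" is taken as in-degree plus out-degree, neighbours are the union of predecessors and successors, and reciprocity is the fraction of users a node responded to who also responded back. The function names and the edges are hypothetical.

```python
from collections import defaultdict


def build_graph(edges):
    """Return (successors, predecessors) maps for directed edges (x1, x2)."""
    succ, pred = defaultdict(set), defaultdict(set)
    for x1, x2 in edges:
        succ[x1].add(x2)  # x1 responded to a video of x2
        pred[x2].add(x1)
    return dict(succ), dict(pred)


def degree(n, succ, pred):
    """In-degree plus out-degree of node n."""
    return len(succ.get(n, set())) + len(pred.get(n, set()))


def degree_ratio(n, succ, pred):
    """degree(n) / avg(degree(neighbours(n))), the ratio given on the slide."""
    neighbours = (succ.get(n, set()) | pred.get(n, set())) - {n}
    if not neighbours:
        return 0.0
    avg = sum(degree(m, succ, pred) for m in neighbours) / len(neighbours)
    return degree(n, succ, pred) / avg


def reciprocity(n, succ, pred):
    """Share of the users n responded to who also responded to n."""
    out = succ.get(n, set())
    if not out:
        return 0.0
    return len(out & pred.get(n, set())) / len(out)


# Hypothetical graph: "s" responds to everyone, nobody responds to "s".
edges = {("s", "a"), ("s", "b"), ("s", "c"), ("a", "b"), ("b", "a")}
succ, pred = build_graph(edges)
```

In this toy graph the node "s" has out-degree 3, in-degree 0 and reciprocity 0, which matches the intuition on the slides that spammers are socially isolated.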



Classification - Dataset

Data crawling:
Starting from the top-100 most responded videos, retrieve the connected data about video responses, responded videos and users.

Manual classification:
Each user with at least one video response unrelated to the responded video is classified as a spammer.

Test set:
473 legitimate users + 119 spammers = 592 users



Classification - Training

Support Vector Machine

5-fold cross-validation

Adopted features:

user-based

video-based

social-network

all together
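The slides do not say how the five folds were built. As a sketch only, a shuffled 5-fold split over the 592 users could look like the following; the random shuffle, the seed and the function name are assumptions, and the classifier itself is left out.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for a shuffled k-fold split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal slices
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_indices(592, k=5))
```

Each user appears in exactly one test fold, so every user is classified once by a model that was not trained on it.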



Classification - Results

Measure     User    Video   SN      All

TP          0.054   0.426   0.375   0.439
TN          0.998   0.922   1       0.981
FP          0.002   0.078   0       0.019
FN          0.946   0.574   0.625   0.561

Accuracy    0.821   0.821   0.874   0.870
F           0.094   0.484   0.540   0.558

TP = users correctly classified as spammers
FP = legitimate users classified as spammers
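To show how the rates relate to accuracy and F, here is a sketch that recomputes both from the TP and FP rates and the class sizes (119 spammers, 473 legitimate users). It assumes that F is the usual F1 score on the spammer class and that the rates are per-class fractions; the reported values may differ slightly, for instance if they were averaged over cross-validation folds. The function name is hypothetical.

```python
def metrics_from_rates(tp_rate, fp_rate, n_spammers, n_legit):
    """Accuracy and F1 (spammer class) from per-class rates and class sizes."""
    tp = tp_rate * n_spammers   # spammers correctly flagged
    fp = fp_rate * n_legit      # legitimate users wrongly flagged
    tn = n_legit - fp
    accuracy = (tp + tn) / (n_spammers + n_legit)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp_rate
    if precision + recall == 0:
        return accuracy, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, f1


# SN column of the table: TP = 0.375, FP = 0.
acc, f1 = metrics_from_rates(0.375, 0.0, n_spammers=119, n_legit=473)
```

For the SN column this gives an accuracy of about 0.874 and an F1 of about 0.545, close to the reported 0.874 and 0.540.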



Conclusions



Conclusions

Classifications

Tag-spam recognition:
Accuracy > 98%
False positives ≈ 2%

YouTube video spam recognition:
True positives ≈ 44%
False positives < 2%



Conclusions

Pros:

Few legitimate users are classified as spammers

Tag-spam recognition finds most of the spammers

Datasets built from publicly available information

Cons:

The social system is already poisoned by spam

Training examples must be classified by hand



References I

Brian D. Davison,
The Potential for Research and Development in Adversarial Information Retrieval,
Computer Science and Engr., Lehigh University, Cambridge, 2009,
available at http://airweb.cse.lehigh.edu/2009/slides/Davison-AIRWeb2009-Keynote.pdf.

B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme,
Evaluating similarity measures for emergent semantics of social tagging,
In Proc. 18th Intl. WWW Conf., 2009,
available at http://www2009.org/proceedings/pdf/p641.pdf.

Benjamin Markines, Ciro Cattuto, Filippo Menczer,
Social Spam Detection,
AIRWeb ’09, April 21, 2009, Madrid, Spain,
available at http://airweb.cse.lehigh.edu/2009/papers/p41-markines.pdf.




References II

Fabricio Benevenuto, Tiago Rodrigues, Virgilio Almeida, Jussara Almeida, Chao Zhang, Keith Ross,
Identifying Video Spammers in Online Social Networks,
AIRWeb ’08, April 22, 2008, Beijing, China,
available at http://airweb.cse.lehigh.edu/2008/submissions/benevenuto_2008_spam_video.pdf.

