Adversarial IR - Social Spam Recognition
Spam recognition in social systems using supervised machine learning.
Introduction Tag-spam detection Youtube Video Spamming Conclusions References
Adversarial IR - Social Spamming
Nicola Miotto
Unipd - Computer Science
January 22, 2011
Nicola Miotto (Unipd - Computer Science) Adversarial IR - Social Spamming January 22, 2011 1 / 39
Outline
1 Introduction
    Spam
    Adversarial IR
2 Tag-spam detection in Social Bookmarking systems
    Problem description
    Features
    Classification
3 Youtube Video Spamming
    Problem description
    Features
    Classification
4 Conclusions
5 References
Introduction
Spam - History
1970: the BBC broadcasts the Spam sketch in Monty Python’s Flying Circus, from which the current meaning of the term derives;
1978: an advertising message is sent to 393 ARPANET users, the earliest documented spam;
1990s: "Make Money Fast" messages flood many newsgroups; first association of the term spam with an IT-related field;
1998: a new definition of the term spam appears in the New Oxford Dictionary of English:
Definition
Irrelevant or inappropriate messages sent on the Internet to a large number of newsgroups or users.
Spam - Fields
Instant Messaging: Messaging spam
Web-Search: Spamdexing
Social systems: Social spam
And so on...
Spam - Spammer
Earn money on the web!
Services such as Google AdSense or Heyos let users place automatically generated ads on their web pages and earn money from clicks and page impressions.
Legitimate advertiser:
Builds websites on which to place content-related ads;
Improves the PageRank of the website for the relevant keywords;
Tries to lead potential customers to his websites;
Spam - Spammer
Spammer:
Website content is used only to attract users and improve the PageRank;
No distinction between interested and uninterested users;
Automatic spam-network generation programs:
they find the relevant keywords (e.g. via AdWords);
they register domain names containing those keywords;
they create complete websites with fake content built around the keywords found;
they link the generated websites together in order to improve the PageRank;
Spam - Social Spamming
Spam campaigns directed at social network users
Social bookmarking systems: Delicious;
Video social network: YouTube;
General purpose social network: Facebook;
and so on..
Spam - Social Spamming
Features:
Lots of user-related information;
Easier to target a specific demographic segment;
Cheaper (usually);
Adopted solution (most of the time): Report abuse → a generic solution, but less effective than ad-hoc ones.
Spam - Consequences
Users are diverted to areas outside their information needs;
Unfair competition with legitimate advertisers;
Information poisoning due to spam noise.
Adversarial IR - Definition
Adversarial: “Assumes competing parties trying to affect the outcome of a system (system could be an algorithm, a market, etc)”
Adversarial IR: “Information retrieval, ranking, or classification system affected by multiple parties acting in their own interest”
Adversarial IR - AIRWeb
AIRWeb
Adversarial Information Retrieval on the Web
Annual workshop about Adversarial IR
Researchers and industry practitioners gather to present and discuss advances in the state of the art of Adversarial IR
First workshop in 2005 (Japan)
Discussed techniques
AIRWeb papers: social spam recognition techniques discussed during the AIRWeb workshops
Supervised machine learning:
1 Feature modelling
2 Training dataset retrieval
3 Machine learning (e.g. SVM)
4 Result evaluation
Tag-spam detection in Social Bookmarking systems
Problem description - Tag-spam
Social bookmarking system:
Users can associate meta-information (tags) with resources (links);
One or more words can be associated with any resource;
Advertiser:
Social tagging: posts links to his website, tagging them with content-related keywords;
Spammer:
Uses the most popular keywords (e.g. music) to tag unrelated websites (e.g. his spam websites);
Figure: Delicious.com Screenshot (2011)
Figure: Example: Tag-spam on Delicious.com (2008)
Problem description - Folksonomy
Data structure to represent a social tagging system;
Hyper-graph connecting users, resources and tags;
Symbols:
$u \in U$, $U$ set of users;
$r \in R$, $R$ set of resources;
$t \in T$, $T$ set of tags;
$post = \{(u, r, t_1), \dots, (u, r, t_n)\} = \{(u, r, (t_1, \dots, t_n))\}$
$F = \{post_1, \dots, post_n\}$
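As a rough illustration of the folksonomy, the sketch below stores F as a flat set of (u, r, t) triples and builds a lookup of the users of each tag. The toy posts, the URLs and the helper names `build_folksonomy` and `index_by_tag` are made up for this example and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical toy posts: (user, resource, tags). Not real data.
posts = [
    ("alice", "http://example.org/jazz", ("music", "jazz")),
    ("bob", "http://example.org/jazz", ("music",)),
    ("spam01", "http://example.com/ads", ("music", "free", "money")),
]


def build_folksonomy(posts):
    """Flatten the posts into a set of (u, r, t) triples."""
    return {(u, r, t) for u, r, tags in posts for t in tags}


def index_by_tag(F):
    """Map each tag to the set of users who used it at least once."""
    users_by_tag = defaultdict(set)
    for u, _r, t in F:
        users_by_tag[t].add(u)
    return dict(users_by_tag)


F = build_folksonomy(posts)
users_by_tag = index_by_tag(F)
```

Keeping the triples in a set makes membership tests such as `(u, r, t) in F` cheap, which is the form the feature definitions rely on.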
Figure: Folksonomy graphical representation example
Features - Tag based
Which tags do spammers use?
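TagSpam:
$U_t = \{ u : \exists r : (u, r, t) \in F \}$, the users who used tag $t$
$S_t \subseteq U_t$, the users of $t$ identified as spammers
$\Pr(t) = \frac{|S_t|}{|U_t|}$
$T(u, r) = \{ t : (u, r, t) \in F \}$
$f_{TagSpam}(u, r) = \frac{1}{|T(u, r)|} \sum_{t \in T(u, r)} \Pr(t)$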
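The TagSpam feature can be sketched in a few lines, assuming F is available as a set of (u, r, t) triples and the labelled spammers as a set of user names. The toy triples, the function names and the choice to return 0.0 for a post without tags are assumptions of this sketch, not part of the slides.

```python
def tag_spam_probabilities(F, spammers):
    """Pr(t) = |S_t| / |U_t| for every tag t that occurs in F."""
    users_by_tag = {}
    for u, _r, t in F:
        users_by_tag.setdefault(t, set()).add(u)
    return {t: len(users & spammers) / len(users)
            for t, users in users_by_tag.items()}


def f_tag_spam(F, pr, u, r):
    """Average Pr(t) over the tags T(u, r) of the post (u, r)."""
    tags = {t for (u2, r2, t) in F if u2 == u and r2 == r}
    if not tags:
        return 0.0  # assumption: a post without tags scores 0
    return sum(pr[t] for t in tags) / len(tags)


# Hypothetical toy folksonomy and labels.
F = {
    ("alice", "r1", "music"),
    ("bob", "r1", "music"),
    ("spam01", "r2", "music"),
    ("spam01", "r2", "money"),
}
spammers = {"spam01"}

pr = tag_spam_probabilities(F, spammers)
score_spam = f_tag_spam(F, pr, "spam01", "r2")
score_legit = f_tag_spam(F, pr, "alice", "r1")
```

In this toy data "money" is used only by a spammer, so posts carrying it score higher than posts tagged only with "music".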
Features - Tag based
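Is there a semantic relationship between tags?
TagBlur: $\sigma(t_1, t_2) \in [0, 1]$, normalized similarity between tags $t_1$ and $t_2$
$Z$ = number of tag pairs in $T(u, r)$
$f_{TagBlur}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u, r)} \left( \frac{1}{\sigma(t_1, t_2) + \varepsilon} - \frac{1}{1 + \varepsilon} \right)$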
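A minimal sketch of TagBlur follows, assuming the similarity σ is supplied as a Python function and that the sum runs over unordered tag pairs, so that Z is the number of such pairs. The default `eps=0.1`, the toy similarity table and the zero score for posts with fewer than two tags are assumptions made for illustration.

```python
from itertools import combinations


def f_tag_blur(tags, sigma, eps=0.1):
    """Average of 1/(sigma + eps) - 1/(1 + eps) over unordered tag pairs."""
    pairs = list(combinations(sorted(set(tags)), 2))
    if not pairs:
        return 0.0  # assumption: fewer than two tags means no blur
    total = sum(1.0 / (sigma(t1, t2) + eps) - 1.0 / (1.0 + eps)
                for t1, t2 in pairs)
    return total / len(pairs)


# Hypothetical similarity table; pairs that are not listed count as unrelated.
SIMILARITY = {frozenset(("music", "jazz")): 0.8}


def toy_sigma(t1, t2):
    return SIMILARITY.get(frozenset((t1, t2)), 0.0)


related = f_tag_blur(["music", "jazz"], toy_sigma)
unrelated = f_tag_blur(["music", "mortgage"], toy_sigma)
```

Each term is 0 when two tags are maximally similar (σ = 1) and grows as σ approaches 0, so a post mixing unrelated tags gets a high score. The constant ε keeps the term finite when σ = 0.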
Features - Resource based I
DomFP: spammers use programs to generate pages → spam pages share the same content
We know the fingerprints of some spam pages
Compute the likelihood that r is spam by comparing r's fingerprint with the known ones
NumAds: spammers usually just offer lots of ads
NumAds example: count the occurrences of googlesyndication.com in the resource's HTML code
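The NumAds example amounts to a substring count. The sketch below assumes the page's HTML has already been downloaded into a string; the sample markup and the function name `num_ads` are illustrative only.

```python
def num_ads(html, marker="googlesyndication.com"):
    """Count case-insensitive occurrences of the ad-server marker in the HTML."""
    return html.lower().count(marker)


# Hypothetical page source with two ad script inclusions.
sample_html = """
<html><body>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>
<p>Some filler text</p>
<script src="http://pagead2.googlesyndication.com/pagead/show_ads.js"></script>
</body></html>
"""

ads = num_ads(sample_html)
```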
Features - Resource based
Plagiarism: spammers usually copy content from high-ranked websites
Compare the content of r with other web pages
ValidLinks: spammer websites are frequently taken down
Many invalid links posted by u imply a greater likelihood that u is a spammer
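The slides do not give a formula for ValidLinks. One plausible reading is the share of a user's posted links that no longer resolve. The sketch below assumes link validity has already been checked elsewhere and is passed in as a dictionary; the function name and the toy URLs are hypothetical.

```python
def invalid_link_ratio(link_is_valid):
    """Fraction of a user's links that are invalid (0.0 if no links)."""
    if not link_is_valid:
        return 0.0
    invalid = sum(1 for ok in link_is_valid.values() if not ok)
    return invalid / len(link_is_valid)


# Hypothetical result of checking the links posted by one user.
checked_links = {
    "http://example.com/a": False,
    "http://example.com/b": False,
    "http://example.org/c": True,
}

ratio = invalid_link_ratio(checked_links)
```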
Classification - Training dataset
BibSonomy.org:
public dataset
27,000 users and their posts
manual classification → 25,000 spammers and 2,000 legitimate users
Classification:
Binary classification into either spammer or not spammer
Classification - Results
                 SVM                        AdaBoost
Features         Accuracy   FP     F1       Accuracy   FP     F1
TagSpam          95.82%     .061   .957     94.66%     .048   .943
+ TagBlur        96.75%     .048   .966     96.06%     .044   .958
+ DomFp          96.75%     .048   .966     96.06%     .044   .958
+ ValidLinks     96.52%     .048   .964     96.75%     .026   .965
+ NumAds         96.52%     .048   .964     97.22%     .026   .970
+ Plagiarism     96.75%     .048   .966     98.38%     0.22   .983
YouTube Video Spamming
Description - Youtube video spam
Video response: a user answers a video with another, related video
Spammer: a user who answers with unrelated videos
Reasons:
increase video popularity
marketing campaign
pornography distribution
system poisoning
Issue: automatic content-based spam recognition is hard to implement
Description - Techniques
Content-based recognition:
video content analysis
requires too many computational resources
hard to generalize the idea of spam in a video, unless it has textual content
Video and user relationship analysis:
lots of information is publicly available
spammers have specific social features (they’re lonely)
user behaviour towards spammers can be analysed automatically
Features - User-based
For each user:
# posted videos
# friends
# watched videos
# favourite videos
# video responses
# responded videos
# subscriptions
# subscribers
Features - Video-based
2 categories of videos per user:
All posted videos
Just video responses
7 attributes for each category:
# views
duration
# votes
# comments
# favourites
# youtube honours
# external links
Total and average of each attribute, so 2 × 7 × 2 = 28 features in total.
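The count of 28 can be checked with a short sketch that computes the total and the average of the 7 attributes for both video categories. The dictionary keys, the helper `make_video` and the toy numbers are assumptions for illustration; the slides only name the attributes.

```python
# Attribute names are hypothetical keys chosen for this sketch.
ATTRIBUTES = ("views", "duration", "votes", "comments",
              "favourites", "honours", "external_links")


def make_video(*values):
    """Build a video record from 7 values in ATTRIBUTES order."""
    return dict(zip(ATTRIBUTES, values))


def video_features(all_videos, responses):
    """Total and average of each attribute for both video categories."""
    features = {}
    for group, videos in (("all", all_videos), ("responses", responses)):
        for attr in ATTRIBUTES:
            total = sum(v[attr] for v in videos)
            features[f"{group}_{attr}_total"] = total
            # assumption: an empty category averages to 0.0
            features[f"{group}_{attr}_avg"] = total / len(videos) if videos else 0.0
    return features


# Toy user: two posted videos, one of which is a video response.
all_videos = [make_video(100, 60, 10, 5, 2, 0, 1),
              make_video(300, 120, 30, 15, 4, 1, 3)]
responses = [all_videos[1]]

features = video_features(all_videos, responses)
```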
Features - Social network
Based on the video response user graph:
directed graph (X, Y): X is the set of users (nodes), Y the set of directed edges
each user is a node in the graph
(x1, x2) ∈ Y is a directed edge from x1 to x2 if user x1 ∈ X responded to a video of user x2 ∈ X
Analysis:
in/out degree for each “user”
assortativity: degree(n) / avg( degree(neighbours(n)) )
userrank: depending on quantity and quality of in links
clustering coefficient, betweenness, reciprocity
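A small sketch of the simpler graph measures, assuming the graph is given as a list of (responder, video owner) edges. Taking degree(n) as in-degree plus out-degree, neighbours as the union of in- and out-neighbours, and reciprocity as the share of out-neighbours that link back are all assumptions of this sketch; the slides do not fix these details.

```python
from collections import defaultdict


def build_graph(edges):
    """Return (out_neighbours, in_neighbours) for directed edges (x1, x2)."""
    out_nb, in_nb = defaultdict(set), defaultdict(set)
    for x1, x2 in edges:
        out_nb[x1].add(x2)
        in_nb[x2].add(x1)
    return out_nb, in_nb


def degree(n, out_nb, in_nb):
    """Assumed definition: out-degree plus in-degree."""
    return len(out_nb.get(n, ())) + len(in_nb.get(n, ()))


def assortativity(n, out_nb, in_nb):
    """degree(n) divided by the average degree of n's neighbours."""
    neighbours = set(out_nb.get(n, ())) | set(in_nb.get(n, ()))
    if not neighbours:
        return 0.0
    avg = sum(degree(m, out_nb, in_nb) for m in neighbours) / len(neighbours)
    return degree(n, out_nb, in_nb) / avg


def reciprocity(n, out_nb, in_nb):
    """Assumed definition: share of n's out-neighbours that link back to n."""
    out = set(out_nb.get(n, ()))
    if not out:
        return 0.0
    return len(out & set(in_nb.get(n, ()))) / len(out)


# Toy graph: "s" responds to three users and nobody responds to "s".
edges = [("s", "a"), ("s", "b"), ("s", "c"), ("a", "b"), ("b", "a")]
out_nb, in_nb = build_graph(edges)
```

In the toy graph "s" has out-degree 3, in-degree 0 and reciprocity 0, the kind of one-way pattern the "lonely" spammer description suggests.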
Classification - Dataset
Data crawling:
Starting from the top 100 most-responded videos, retrieve the connected data about video responses, responded videos and users.
Manual classification:
Each user with at least one video response unrelated to the responded video is classified as a spammer.
Test set:
473 legitimate users + 119 spammers = 592 users
Classification - Training
Support Vector Machine
5-fold cross-validation
Adopted features:
user-based
video-based
social-network
all together
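The 5-fold split can be sketched without any library, assuming it is applied to the 592 labelled users. The function name, the fixed seed and the plain (non-stratified) shuffle are choices of this sketch; the slides do not say how the folds were drawn, and no SVM is trained here.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(592, k=5))
fold_sizes = sorted(len(test) for _train, test in splits)
```

Each user lands in exactly one test fold, so every labelled example is used for evaluation once and for training four times.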
Classification - Results
Measure     User    Video   SN      All
TP          0.054   0.426   0.375   0.439
TN          0.998   0.922   1       0.981
FP          0.002   0.078   0       0.019
FN          0.946   0.574   0.625   0.561
Accuracy    0.821   0.821   0.874   0.870
F           0.094   0.484   0.540   0.558
TP = users correctly classified as spammers
FP = legitimate users classified as spammers
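The measures in the table can be derived from confusion-matrix counts. The sketch below normalizes TP and FN by the number of spammers and TN and FP by the number of legitimate users, which matches the table's layout, where TP + FN = 1 and TN + FP = 1 in every column. The counts used here are hypothetical and are not the counts behind the table, so the resulting values are not expected to reproduce it.

```python
def classification_scores(tp, fn, fp, tn):
    """Per-class rates, accuracy and F-measure from raw counts."""
    spammers = tp + fn      # actual spammers
    legit = fp + tn         # actual legitimate users
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / spammers if spammers else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "tp_rate": recall,
        "fn_rate": fn / spammers if spammers else 0.0,
        "tn_rate": tn / legit if legit else 0.0,
        "fp_rate": fp / legit if legit else 0.0,
        "accuracy": (tp + tn) / (spammers + legit),
        "f": f_measure,
    }


# Hypothetical counts for 119 spammers and 473 legitimate users.
scores = classification_scores(tp=50, fn=69, fp=9, tn=464)
```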
Conclusions
Conclusions
Classifications
Tag-spam recognition:
Accuracy > 98%
False positives < 2%
YouTube video spam recognition:
True positives ≈ 44%
False positives < 2%
Conclusions
Pros:
Few legitimate users are classified as spammers
Tag-spam recognition finds most of the spammers
Datasets are built from publicly available information
Cons:
The social system is already poisoned by spam
Training examples must be classified by hand
References I
Brian D. Davison, The Potential for Research and Development in Adversarial Information Retrieval, Computer Science and Engr., Lehigh University, Cambridge, 2009, available at http://airweb.cse.lehigh.edu/2009/slides/Davison-AIRWeb2009-Keynote.pdf.
B. Markines, C. Cattuto, F. Menczer, D. Benz, A. Hotho, and G. Stumme, Evaluating similarity measures for emergent semantics of social tagging, In Proc. 18th Intl. WWW Conf., 2009, available at http://www2009.org/proceedings/pdf/p641.pdf.
Benjamin Markines, Ciro Cattuto, Filippo Menczer, Social Spam Detection, AIRWeb ’09, April 21, 2009, Madrid, Spain, available at http://airweb.cse.lehigh.edu/2009/papers/p41-markines.pdf.
References II
Fabricio Benevenuto, Tiago Rodrigues, Virgilio Almeida, Jussara Almeida, Chao Zhang, Keith Ross, Identifying Video Spammers in Online Social Networks, AIRWeb ’08, April 22, 2008, Beijing, China, available at http://airweb.cse.lehigh.edu/2008/submissions/benevenuto_2008_spam_video.pdf.