CS 599: Social Media Analysis

University of Southern California

Social Spam

Kristina Lerman
University of Southern California

Manipulation of social media
• Spam: use of electronic messaging systems to send unsolicited bulk messages indiscriminately, for financial gain
  – Malware
    • pages that host malicious software or attempt to exploit a user’s browser
  – Phishing
    • any website attempting to solicit a user’s account credentials
  – Scam
    • any website advertising pharmaceuticals, software, adult content, and a multitude of other solicitations
• Deception

Motivations for spam

• Abusers drive traffic to a web site
  – Malicious sites
    • phishing, malware, sell products
    • Compromised accounts then sold to other spammers
  – “Click fraud”
    • Gain financially from showing ads to visitors

What is the cost of spam?

• Are users harmed by click fraud?
  – Advertiser gains, because real users click on ads
  – Intermediary gains fees from the advertiser
  – Spammer gains its cut from the clicked ads
  – User gains, since she learns about products from ads

• No harm is done?

What is the cost of spam?
• Costs to consumers
  – Information pollution: good content is hard to find
  – Search engines and folksonomies direct traffic in the wrong directions
  – Users end up with less relevant resources
• Costs to content producers
  – Less revenue for producers of relevant content
• Costs to search engines
  – Develop algorithms to combat spam

Everybody pays for the cost of information pollution

Combatting spam
• Social media spam is successful
  – 8% of URLs posted on Twitter are spam [2010]
  – Much higher click-through rates than email

Strategies designed to make spam more costly to spammers
• Search engine spam
  – Algorithms to combat rank manipulations, e.g., link farms
  – Blacklists of suspected malware and phishing (e.g., Google’s SafeBrowsing API)
• Email spam
  – Filters on servers and clients
  – Blacklists: IP, domain and URL

• Social spam?

Social Spam Detection

Benjamin Markines, Ciro Cattuto, Filippo Menczer

Presented by Yue Cai

Introduction

• Web 2.0: social annotation
  – user-driven
  – simplicity and open-ended nature
• Folksonomy: set of triples (u, r, t)
  – user u annotates resource r with tag t
• Problem: social spam
  – malicious users exploit collaborative tagging

Focus of paper

• Six features of social spam in collaborative tagging systems
  – limited to social bookmarking systems (delicious.com)
• Show that each feature has predictive power

• Evaluate various supervised learning algorithms using these features

Background

• Why?
  – financial gains
• How?
  – create content (generated by NLP or by plagiarizing)
  – place ads
  – misleading tagging in social sites to attract traffic: “Gossip Search Engine”
• Outcome?
  – pollution of the web environment

Levels of spam

• Content of tagged resources
  – subjectivity
• Posts: associate resources with tags
  – create artificial links between resources and unrelated tags
  – for questionable content, how the user annotates it reveals intent
• User account
  – flag users as spammers (BibSonomy)
  – broad brush: exceedingly strict

TagSpam

• Spammers may use tags and tag combinations that are statistically unlikely in legitimate posts
• Pr(t): probability that a given tag t is associated with spam
  – users with tag t: $U_t = \{u : \exists r : (u, r, t) \in F\}$, with spammers $S_t \subseteq U_t$
  – $\Pr(t) = \frac{|S_t|}{|U_t|}$
• For a post (u, r) with tag set $T(u, r) = \{t : (u, r, t) \in F\}$:
  $f_{TagSpam}(u, r) = \frac{1}{|T(u, r)|} \sum_{t \in T(u, r)} \Pr(t)$
• Time complexity: constant time for any post
• Cold start problem: needs a body of labeled annotations to bootstrap tag probabilities
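The TagSpam definitions on this slide can be sketched in a few lines of Python. This is only an illustration on toy data: the function names, the sample triples, and the choice to score unseen tags as 0.0 are assumptions, not part of the original paper.

```python
from collections import defaultdict


def tag_spam_probabilities(triples, spammers):
    """Estimate Pr(t) = |S_t| / |U_t| from labeled (user, resource, tag) triples."""
    users_by_tag = defaultdict(set)  # U_t: users who used tag t
    for user, _resource, tag in triples:
        users_by_tag[tag].add(user)
    # S_t is the subset of U_t that is labeled as spammers.
    return {
        tag: len(users & spammers) / len(users)
        for tag, users in users_by_tag.items()
    }


def tag_spam(post_tags, pr, default=0.0):
    """f_TagSpam for one post: mean Pr(t) over the post's tag set T(u, r)."""
    tags = set(post_tags)
    if not tags:
        return 0.0  # assumption: a post without tags scores 0
    # Assumption: tags never seen in the labeled data get `default`.
    return sum(pr.get(t, default) for t in tags) / len(tags)


# Hypothetical toy folksonomy F; u1 and u2 are labeled spammers.
F = [
    ("u1", "r1", "viagra"),
    ("u2", "r2", "viagra"),
    ("u1", "r1", "web"),
    ("u3", "r3", "web"),
    ("u3", "r4", "python"),
]
pr = tag_spam_probabilities(F, spammers={"u1", "u2"})
score = tag_spam(["viagra", "web"], pr)
```

Once `pr` is precomputed, scoring a post is a lookup per tag, which is why the slide treats the per-post cost as constant; the labeled triples needed to build `pr` are the cold start problem.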

TagBlur
• Spam posts associate spam resources with popular tags that are often semantically unrelated to each other
• Semantic similarity of tags: $\sigma(t_1, t_2) \in [0, 1]$, based on prior work
• For a post (u, r) with tag set $T(u, r) = \{t : (u, r, t) \in F\}$:
  $f_{TagBlur}(u, r) = \frac{1}{Z} \sum_{t_1 \neq t_2 \in T(u, r)} \frac{1}{\sigma(t_1, t_2) + \epsilon} - \frac{1}{1 + \epsilon}$
  – Z: number of tag pairs in T(u, r)
  – ε: tuning constant
• Time complexity: quadratic in the number of tags per post, treated as constant time
• Needs precomputed similarity for any two tags
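A minimal Python sketch of the TagBlur score follows. The similarity table, the value of ε, and the convention that a post with fewer than two tags scores 0 are all assumptions made for illustration; the slide only says the similarity comes from prior work.

```python
from itertools import combinations


def tag_blur(post_tags, similarity, eps=0.1):
    """Mean of 1 / (sigma(t1, t2) + eps) over tag pairs, minus 1 / (1 + eps).

    The score is 0 when every pair is maximally similar (sigma = 1) and
    grows as the tags of a post become semantically unrelated.
    """
    pairs = list(combinations(sorted(set(post_tags)), 2))
    if not pairs:
        return 0.0  # assumption: fewer than two tags means no blur
    z = len(pairs)  # Z: number of tag pairs in T(u, r)
    mean_inverse = sum(1.0 / (similarity(a, b) + eps) for a, b in pairs) / z
    return mean_inverse - 1.0 / (1.0 + eps)


# Hypothetical precomputed tag-tag similarities in [0, 1].
SIMILARITY = {
    frozenset(("python", "programming")): 1.0,
    frozenset(("python", "viagra")): 0.0,
}


def toy_similarity(a, b):
    """Look up sigma(a, b); unknown pairs are assumed unrelated (0.0)."""
    return SIMILARITY.get(frozenset((a, b)), 0.0)


focused = tag_blur(["python", "programming"], toy_similarity)
blurred = tag_blur(["python", "viagra"], toy_similarity)
```

Unordered pairs are used here; averaging over ordered pairs gives the same value as long as the similarity is symmetric.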

DomFp
• Spam webpages tend to have similar document structure
• Estimate likelihood of r being spam by structure similarity with spam pages
• Fingerprint k(r): string containing all HTML 4.0 elements of the page, with order preserved
  – K: fingerprints of spam pages, each with its frequency Pr(k)
  – Shingles method: fingerprint similarity $\sigma(k_1, k_2) \in [0, 1]$
  $f_{DomFp}(r) = \frac{\sum_{k \in K} \sigma(k(r), k) \cdot \Pr(k)}{\sum_{k \in K} \sigma(k(r), k)}$
• Time complexity: grows linearly with size of labeled spam collection
• Needs to crawl each resource and precompute spam fingerprint probabilities
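The DomFp score can be sketched as below. Several details are assumptions, since the slide does not give them: the fingerprint keeps only start tags, the shingle width is 3, and similarity is the Jaccard overlap of shingle sets. All function names are hypothetical.

```python
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects start-tag names in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def fingerprint(html):
    """k(r): the ordered sequence of element names in the page."""
    parser = _TagCollector()
    parser.feed(html)
    parser.close()
    return tuple(parser.tags)


def shingles(fp, width=3):
    """Set of contiguous tag subsequences of length `width`."""
    if len(fp) < width:
        return {fp}
    return {fp[i:i + width] for i in range(len(fp) - width + 1)}


def shingle_similarity(fp_a, fp_b, width=3):
    """Assumed similarity: Jaccard overlap of the two shingle sets."""
    a, b = shingles(fp_a, width), shingles(fp_b, width)
    return len(a & b) / len(a | b)


def dom_fp(page_fp, spam_fps):
    """Similarity-weighted average of Pr(k) over known fingerprints K.

    `spam_fps` maps each fingerprint k to its Pr(k).
    """
    weights = {k: shingle_similarity(page_fp, k) for k in spam_fps}
    total = sum(weights.values())
    if total == 0:
        return 0.0  # assumption: no structural overlap means score 0
    return sum(w * spam_fps[k] for k, w in weights.items()) / total


# Hypothetical labeled fingerprints with made-up probabilities.
spam_like = fingerprint("<html><body><div><a href='x'>buy</a><a href='y'>now</a></div></body></html>")
other = fingerprint("<html><body><p>hi</p><p>there</p><ul><li>x</li></ul></body></html>")
K = {spam_like: 0.9, other: 0.1}
score = dom_fp(spam_like, K)
```

Each call compares the page against every fingerprint in `K`, which matches the slide's note that the cost grows linearly with the labeled spam collection.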

Plagiarism

• Spammers often copy original content from all over the Web

• Estimate likelihood of content of r not being genuine

• Random sequence of 10 words from the page
  – submit to the Yahoo API, get the number of results y(r)
  $f_{Plagiarism}(r) = y(r) / y_{max}, \quad y_{max} = 10$
• Most expensive feature: page download, query limit

NumAds

• Spammers create pages for serving ads
• g(r): number of times googlesyndication.com appears in page r
  $f_{NumAds}(r) = g(r) / g_{max}, \quad g_{max} = 113$
• Needs complete download of a web page

ValidLinks
• Many spam resources may be taken offline when detected
• A high portion of links posted by a spam user are invalid after some time
  $f_{ValidLinks}(u) = |V_u| / |R_u|, \quad R_u = \{r : \exists t : (u, r, t) \in F\}, \quad V_u \subseteq R_u$
• Expensive: send an HTTP HEAD request for each resource
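Both ratios on this slide are simple to compute once the page text and link status are available. The sketch below is an illustration only: `is_valid` is a hypothetical callback standing in for the HTTP HEAD check, so the example runs offline, and no clipping is applied because the slide does not mention any.

```python
G_MAX = 113  # g_max as given on the slide


def num_ads(page_html, g_max=G_MAX):
    """f_NumAds: occurrences of googlesyndication.com, normalized by g_max."""
    g = page_html.count("googlesyndication.com")
    return g / g_max


def valid_links(resources, is_valid):
    """f_ValidLinks: fraction |V_u| / |R_u| of a user's resources still valid.

    `is_valid` is a caller-supplied predicate (hypothetical); in practice it
    would issue an HTTP HEAD request per resource.
    """
    r_u = set(resources)
    if not r_u:
        return 0.0  # assumption: a user with no resources scores 0
    v_u = {r for r in r_u if is_valid(r)}
    return len(v_u) / len(r_u)


# Toy inputs (made up for illustration).
page = "<script src='//pagead2.googlesyndication.com/a.js'></script>" * 2
ads_score = num_ads(page)
still_online = {"http://example.org/a"}
links_score = valid_links(
    ["http://example.org/a", "http://example.org/b",
     "http://example.org/c", "http://example.org/d"],
    lambda url: url in still_online,
)
```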

Evaluation
• Public dataset by BibSonomy.org
  – annotations of 27,000 users, 25,000 of which are spammers
• Training dataset: 500 users, half spammers, half legitimate users
• Another training dataset of the same size for precomputing features like TagSpam, TagBlur and DomFp
• Aggregation of features on user level:
  – TagSpam, TagBlur: post level
  – DomFp, Plagiarism, NumAds: resource level
  – Simple average works most effectively across all features
  $f(u) = \frac{1}{|P(u)|} \sum_{(u, r) \in P(u)} f(u, r), \quad f(u) \in [0, 1]$
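The user-level aggregation is a plain average of the post-level (or resource-level) scores over a user's posts P(u). A small sketch with made-up scores, where the function name and input layout are assumptions:

```python
from collections import defaultdict


def aggregate_by_user(post_scores):
    """Average f(u, r) over each user's posts P(u).

    `post_scores` is an iterable of ((user, resource), score) pairs.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for (user, _resource), score in post_scores:
        totals[user] += score
        counts[user] += 1
    return {user: totals[user] / counts[user] for user in totals}


# Hypothetical post-level TagSpam scores.
scores = [
    (("u1", "r1"), 0.8),
    (("u1", "r2"), 0.6),
    (("u2", "r3"), 0.1),
]
user_scores = aggregate_by_user(scores)
```

Because each input score lies in [0, 1], the average does too, which matches the range stated on the slide.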

Each feature has predictive power

• Each feature: contingency matrix n(l, f)

TagSpam works the best

Classification

All classifiers perform very well, with accuracy over 96% and false positive rate below 5%.

Effect of feature selection (SVM):
• A modest improvement in accuracy and decrease in false positive rate by using both TagSpam and TagBlur
• Performance is hindered by the addition of the ValidLinks feature (not for linear separation)

Conclusion

• Features are strong
  – single use: 96% accuracy, 5% false positive
  – combining: 98% accuracy, 2% false positive
• TagBlur feature looks promising
  – the tag-tag similarity it relies on can be updated
  – other features rely on resource content or a search engine, so they are not reliable
• Bootstrap is still an open issue
  – features like TagSpam and DomFp need spam labels
• Whether unsupervised features are still needed
  – like ValidLinks and Plagiarism

Questions?

@spam: The Underground on 140 Characters or Less

Chris Grier, Kurt Thomas, Vern Paxson, Michael Zhang

Presented by Renjie Zhao

Focus of the Paper
• Categorization and measurement of Twitter spam
  – Spammers’ strategies, accounts and tools
  – How good are they? (Much better than junk emails)
• Identification of spam campaigns
  – URL clustering
  – Extraction of distinct spam behaviors and targets
• Performance of URL blacklists against Twitter spam
  – Temporal effectiveness (lead/lag)
  – Spammers’ counter-measures

Preparation
• Data Collection
  – Tapping into Twitter’s Streaming API
    • 7 million tweets per day
    • Over the course of one month (January 2010 – February 2010)
  – Total: 200 million tweets gathered
• Spam Identification
  – Focus on tweets with URLs (25 million URLs)
  – Check URLs against 3 blacklists: Google Safebrowsing API, URIBL, Joewein
  – Result: 2 million URLs are flagged as spam
    • Challenged by manual inspection!

Spam Breakdown

• Call outs: “Win an iTouch AND a $150 Apple gift card @victim! http://spam.com”
• Retweets: “RT @scammer: check out the iPads there having a giveaway http://spam.com”
• Tweet hijacking: “http://spam.com RT @barackobama A great battle is ahead of us”
• Trend setting: “Buy more followers! http://spam.com #fwlr”
• Trend hijacking: “Help donate to #haiti relief: http://spam.com”

Clickthrough Analysis

• According to clickthrough data of 245,000 URLs:
  – Only 2.3% have traffic
  – They had over 1.6 million visitors
• Clickthrough rate
  – For a given spam URL, CR = (# of clicks) / (# of exposures of the URL)
  – Result: 0.13% of spam tweets generate a visit (compared to junk emails’ CR of 0.0003%-0.0006%)
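To put the two clickthrough figures side by side, the short sketch below computes CR from its definition and the ratio between the reported Twitter and email rates. The function name and the sample counts are illustrative assumptions; the percentages are the ones on the slide.

```python
def clickthrough_rate(clicks, exposures):
    """CR = number of clicks / number of times the URL was exposed."""
    if exposures <= 0:
        raise ValueError("exposures must be positive")
    return clicks / exposures


# Hypothetical counts: 13 clicks on a URL shown 10,000 times gives 0.13%.
example_cr = clickthrough_rate(13, 10_000)

# Rates reported on the slide, converted from percent to fractions.
twitter_cr = 0.13 / 100
email_cr_low, email_cr_high = 0.0003 / 100, 0.0006 / 100

# Twitter spam is clicked roughly 217 to 433 times more often than email spam.
ratio_vs_best_email = twitter_cr / email_cr_high
ratio_vs_worst_email = twitter_cr / email_cr_low
```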

Spam Accounts
• 2 tests to identify career spamming accounts
  – χ2 test on timestamps: consistency with a uniform distribution
  – Tweet entropy: whether content is repeated throughout tweets
• Result: in a sample of 43,000 spam accounts:
  – 16% are identified as career spammers
  – What about the remaining 84%?
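The two tests can be sketched with the standard library alone. This is an illustration under stated assumptions, not the paper's exact procedure: the χ2 statistic is computed over the seconds-of-minute of each timestamp in 6 equal bins, and tweet entropy is the Shannon entropy (in bits) of the distribution of exact tweet texts. With 6 bins there are 5 degrees of freedom, for which the 5% critical value is about 11.07.

```python
import math
from collections import Counter


def chi_square_seconds(timestamps, n_bins=6):
    """Chi-square statistic of seconds-of-minute against a uniform distribution."""
    if not timestamps:
        raise ValueError("need at least one timestamp")
    counts = Counter((int(ts) % 60) * n_bins // 60 for ts in timestamps)
    expected = len(timestamps) / n_bins
    return sum(
        (counts.get(b, 0) - expected) ** 2 / expected for b in range(n_bins)
    )


def tweet_entropy(tweets):
    """Shannon entropy (bits) of tweet texts; 0 means every tweet is identical."""
    n = len(tweets)
    counts = Counter(tweets)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


# Hypothetical accounts: a scheduler posting at second 5 of every minute,
# and an account whose posting seconds are spread evenly.
bot_times = [i * 60 + 5 for i in range(60)]
spread_times = [i * 61 for i in range(60)]

bot_stat = chi_square_seconds(bot_times)
spread_stat = chi_square_seconds(spread_times)
repeated = tweet_entropy(["Buy more followers!"] * 8)
varied = tweet_entropy(["a", "b", "c", "d"])
```

A large statistic means the timestamps are far from uniform, and a low entropy means the account repeats itself. How the two signals are thresholded and combined is left open here.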

Spam Accounts
• Compromised (non-career) spamming accounts
  – Phishing sites
    • 86% of 20,000 victims passed career spammer tests
  – Malware botnet: Koobface

Spam Campaigns
• Multiple spamming accounts may co-operate to advertise a spam website
• URL clustering
  – Define a spam campaign as a binary feature vector $c = \{0, 1\}^n$
  – For two accounts i and j, if $c_i \cap c_j \neq \emptyset$, then i and j are clustered
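One way to realize the clustering rule is a union-find pass over the URLs each account posted. In this sketch each account's binary vector is represented as the set of URLs it advertised, and clusters are merged transitively; both choices, along with the names and the toy data, are assumptions.

```python
def cluster_accounts(urls_by_account):
    """Group accounts that share at least one URL (transitively)."""
    parent = {account: account for account in urls_by_account}

    def find(a):
        # Follow parent links to the cluster root, compressing the path.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    first_seen = {}  # URL -> first account observed posting it
    for account, urls in urls_by_account.items():
        for url in urls:
            if url in first_seen:
                root_a, root_b = find(account), find(first_seen[url])
                if root_a != root_b:
                    parent[root_a] = root_b  # merge the two clusters
            else:
                first_seen[url] = account

    clusters = {}
    for account in urls_by_account:
        clusters.setdefault(find(account), set()).add(account)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical accounts and the spam URLs they posted.
campaigns = cluster_accounts({
    "a": {"x.com", "y.com"},
    "b": {"y.com"},
    "c": {"z.com"},
    "d": {"z.com", "w.com"},
    "e": {"q.com"},
})
```

Here accounts a and b form one campaign through y.com, c and d another through z.com, and e stays alone.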

Spam Campaigns
• Phishing for followers
  – A pyramid scheme
  – Most spammers are compromised users advertising the service
• Personalized mentions
  – twitprize.com/<user name>
  – Unique, victim-specific landing pages shortened with tinyurl
  – Most relevant tweets are just RT or mentions

Spam Campaigns
• Buying retweets
  – retweet.it
  – Usually employed by spammers to spread malware and scams
  – Most accounts are career spammers (by χ2 test)
• Distributing malware
  – ‘Free’ software, drive-by download
  – Use multiple hops of redirect to mask landing pages

URL Blacklists
• Currently (2010), Twitter relies on the Google Safebrowsing API to block malicious URLs.
  – Blacklists usually lag behind spam tweets
  – No retroactive blocking!

Evading URL Blacklists
• URL shortening service

– bit.ly goo.gl ow.ly spam.com

• What about domain-wise blacklists?

Conclusion
• 8% of URLs on Twitter are spam
• 16% of spam accounts are automated bots
• Spam clickthrough rate = 0.13%
• Spammers may coordinate thousands of accounts in a campaign
• URL blacklists don’t work very well
  – because of delayed response
  – unable to reveal shortened URLs
• Advice
  – Dig deeper into redirect chains
  – Retroactive blacklisting to increase spammers’ cost
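The advice to dig into redirect chains can be illustrated with an offline sketch: every hop of a chain is checked against a domain blacklist instead of only the first, shortened URL. The `redirects` mapping is a hypothetical stand-in for what HTTP responses would report, and the blacklist contents are made up.

```python
from urllib.parse import urlsplit


def domain(url):
    """Lower-cased host name of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def resolve_chain(url, redirects, max_hops=10):
    """Follow url -> redirects[url] -> ... and return every URL visited."""
    chain, seen = [url], {url}
    while chain[-1] in redirects and len(chain) <= max_hops:
        nxt = redirects[chain[-1]]
        if nxt in seen:  # stop on redirect loops
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


def is_blacklisted(url, redirects, blacklisted_domains):
    """True if any hop of the redirect chain lands on a blacklisted domain."""
    return any(
        domain(hop) in blacklisted_domains
        for hop in resolve_chain(url, redirects)
    )


# Hypothetical chain of shorteners hiding the landing page.
redirects = {
    "http://bit.ly/a": "http://goo.gl/b",
    "http://goo.gl/b": "http://ow.ly/c",
    "http://ow.ly/c": "http://spam.com/win",
}
chain = resolve_chain("http://bit.ly/a", redirects)
flagged = is_blacklisted("http://bit.ly/a", redirects, {"spam.com"})
first_hop_only = domain("http://bit.ly/a") in {"spam.com"}
```

Checking only the posted URL misses the landing page (`first_hop_only` is false), while walking the chain flags it.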

Follow-ups
• More research on spammers’ behaviors
• Twitter added a feature for users to report spam
• ‘BotMaker’ launched in August

Entropy-based Classification of ‘Retweeting’ Activity [Ghosh et al.]
• Question
  – Given the time series of ‘retweeting’ activity on some user-generated content or tweet, how do we meaningfully categorize it as organic or spam?
• Contributions
  – Use information theory-based features to categorize tweeting activity
    • Time interval entropy
    • User entropy

Dynamics of Retweeting Activity
[Figure panels: (i) Popular news website (nytimes); (ii) Popular celebrity (billgates); (iii) Politician (silva_marina); (iv) An aspiring artist (youngdizzy); (v) Post by a fan site (AnnieBeiber); (vi) Advertisement using social media (onstrategy)]

Measuring time interval and user diversity
• Measure the time interval ti between consecutive retweets
• Count distinct tweeting users

Time Interval Diversity
• Frequency of time intervals of duration ti
• Time Interval Entropy
[Figure panels: (i) many different time intervals; (ii) few time intervals observed]

User Diversity
• Frequency of retweets by distinct user fi
• User Entropy
[Figure panels: many different users retweet a few times each; few users retweet many times each]
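Both diversity measures can be read as the Shannon entropy of an empirical distribution. The sketch below assumes that reading (the slides show the formulas only as figures), uses base-2 logarithms, and bins time intervals to whole seconds; the function names and sample data are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(items):
    """Entropy (bits) of the empirical distribution of `items`."""
    counts = Counter(items)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def time_interval_entropy(timestamps, resolution=1):
    """Entropy of the intervals between consecutive retweets."""
    ordered = sorted(timestamps)
    intervals = [(b - a) // resolution for a, b in zip(ordered, ordered[1:])]
    return shannon_entropy(intervals)


def user_entropy(retweeters):
    """Entropy of the distribution of retweets over distinct users."""
    return shannon_entropy(retweeters)


# Hypothetical retweet streams.
regular = time_interval_entropy([0, 10, 20, 30, 40])    # one interval value
irregular = time_interval_entropy([0, 1, 3, 7, 15])     # four distinct intervals
one_user = user_entropy(["bot"] * 4)                    # a single account retweets
many_users = user_entropy(["a", "b", "c", "d"])         # four distinct accounts
```

Low values on both axes (few interval lengths, few users) correspond to the automated pattern described above, while high values on both correspond to many different users retweeting at irregular times.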

Bloggers and News Website
[Figure: Dynamics of Retweeting Activity. (i) Popular news website (nytimes); (ii) Popular celebrity (billgates)]

Campaigners
[Figure: Dynamics of Retweeting Activity. (iii) Politician (silva_marina); (vi) Animal Rights Activist (nokillanimalist)]

Performers and their fans
[Figure: Dynamics of Retweeting Activity. (iv) An aspiring artist (youngdizzy); (v) Post by a fan site (AnnieBeiber)]

Advertisers and spammers
[Figure panels: (vii) Advertisement using social media (onstrategy); (viii) Account eventually suspended by Twitter (EasyCash435); (ix) Advertisement by a Japanese user (nikotono)]

Validation
Manually annotated URLs shown in the entropy plane
[Figure: entropy plane with annotated accounts (nytimes, billgates, silva_marina, DonnaCCasteel, animalist, EasyCash, onstrategy, AnnieBieber) and labeled regions (news and blogs; campaigns; advertisements & spams; bot activity)]

Conclusion
– Novel information theoretic approach to activity recognition
  • Content independent
  • Scalable and efficient
  • Robust to sampling
– Results
  • Sophisticated tools for marketing and spamming
  • Twitter is exploited for promotional and spam-like activities
  • Able to identify distinct classes of dynamic activities in Twitter and associated content
  • Separation of popular from unpopular content
– Applications: spam detection, trend identification, trust management, user-modeling, social search, content classification