The Fight against Spam - A Machine Learning Approach
Jiri Hynek, Karel Jezek
ELPUB 2007, Vienna
www.textmining.cz

Page 1: The Fight against Spam - A Machine Learning Approach

The Fight against Spam - A Machine Learning Approach

Jiri Hynek
Karel Jezek

ELPUB 2007, Vienna

www.textmining.cz

Page 2

Contents:

• Stats 101
• Today's Spam Types
• Spammer Tricks
• Text-Based Spam Filter Implementation
• Results

Page 3

Contents:

Spamming is publishing:

• Web spam ("comment spam") on blogs, (unmoderated) forums, and wikis
Why: to trigger a higher page ranking!

• Unsolicited marketing spam in our e-mail: information dissemination to the public
Why: to sell products!

Page 4

A bit of Terminology:

"Canned meat made largely from pork"

• Ham vs. Spam (Spam mail)
• UCE (Unsolicited Commercial Email)
• UBM (Unsolicited Bulk Mail)
• EMP (Excessive Multi-Posting)
• Junk mail
• Bulk email

Page 5

Stats 101

Top five spam categories:

• Online pharmacies: 20.0%
• Mortgage refinancing: 9.7%
• Investment/financial services: 9.0%
• Male products (\/i@gra, CI@1i$): 8.7%
• Discount computer software: 6.9%

Source: Communications of the ACM, February 2007, Vol. 50, No. 2

Page 6

Stats 101

• 1998: spam was a mere 10% of overall mail volume
• Now: 80%

Source: Communications of the ACM, February 2007, Vol. 50, No. 2

Average spammer's revenue: $1 per 45,000 spam messages dispatched

A database of 100 million e-mails costs 100 dollars, spam software included

(www.symantec.com)
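The two figures above can be combined into a rough estimate of why spamming pays. This is only a back-of-the-envelope sketch: it assumes (the slide does not say so) that the database holds e-mail addresses and that one message is sent to each entry. The function names are illustrative.

```python
# Figures quoted on the slide.
MESSAGES_PER_DOLLAR = 45_000     # average revenue: $1 per 45,000 spams
DATABASE_SIZE = 100_000_000      # entries in the purchased database
DATABASE_COST = 100              # dollars, spam software included


def campaign_revenue(messages_sent: int) -> float:
    """Expected revenue in dollars for a given number of dispatched messages."""
    return messages_sent / MESSAGES_PER_DOLLAR


def break_even_messages(cost_dollars: int) -> int:
    """Number of messages whose expected revenue equals the given cost."""
    return cost_dollars * MESSAGES_PER_DOLLAR


# Assumption: one message per database entry.
revenue = campaign_revenue(DATABASE_SIZE)
print(f"Revenue from one pass over the database: ${revenue:,.2f}")
print(f"Messages needed to recover the cost: {break_even_messages(DATABASE_COST):,}")
```

Under these assumptions a single pass over the database earns roughly twenty times its purchase price.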

Page 7

Today's Spam Types

Text Spam

Page 8

Today's Spam Types

Text Spam

Commonly used phrases filtered out by antispam filters (and words to avoid, of course):

• Free!
• 50% off!
• Click Here
• Call now!
• Subscribe
• Earn $
• Discount!
• Eliminate Debt
• Double your income
• You're a Winner!
• Reverses Aging
• Hidden
• Information you requested
• Stop / Stops
• Lose Weight
• Multi level Marketing
• Million Dollars
• Opportunity
• Compare
• Removes
• Collect
• Amazing
• Cash Bonus
• Promise You
• Credit
• Loans
• Satisfaction Guaranteed
• Serious Cash
• Search Engine Listings
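A phrase list like this one is typically applied as a simple substring check. The sketch below is a minimal illustration, not the filter described in this talk; the phrase subset and the function name `matched_phrases` are chosen here for the example.

```python
import re

# A small subset of the phrases listed on the slide, lower-cased.
SPAM_PHRASES = [
    "free!",
    "50% off!",
    "click here",
    "call now!",
    "lose weight",
    "eliminate debt",
    "satisfaction guaranteed",
]


def matched_phrases(text: str) -> list[str]:
    """Return the listed phrases that occur in the text, ignoring case and extra spaces."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return [phrase for phrase in SPAM_PHRASES if phrase in normalized]


print(matched_phrases("Click   HERE to lose weight"))
print(matched_phrases("Cl1ck h3re to l0se we1ght"))
```

The second call finds nothing, which is exactly the weakness that obfuscated spellings exploit.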

Page 9

Today's Spam Types

Image-Based Spam

Page 10

Today's Spam Types

Image-Based Spam in our mailboxes

                                    June 2005        June 2006
Overall share in spam               1%               12%
New spam domain originating every   48 hours         4 hours
Daily spam volume                   30,000 million   55,000 million

Page 11

Today's Spam Types

Phishing

Page 12

Today's Spam Types

Captcha - fighting web spam

Page 13

Common Spammer Tricks

Tricks to fool statistical spam filters:

• Avoidance of keywords (such as stock, Viagra, etc.)
• Frequent changes of the sender's address
• Message encoding (such as base64, commonly used to transfer binary data as text)
• Hashing (e.g. insertion of HTML tags into messages)
• Use of images instead of plain text (namely GIF, JPEG, and PNG)
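Two of these tricks, base64 encoding and inserted HTML tags, can be undone before the text reaches a statistical filter. The following is a minimal sketch under simplifying assumptions: the body is assumed to be valid base64 of UTF-8 text, and tags are removed with a plain regular expression rather than a real HTML parser. The function names are illustrative.

```python
import base64
import re

# Matches anything that looks like an HTML tag, e.g. <b> or </font>.
TAG_RE = re.compile(r"<[^>]+>")


def decode_base64_body(body: str) -> str:
    """Decode a base64-encoded message body into text."""
    return base64.b64decode(body).decode("utf-8", errors="replace")


def strip_html_tags(text: str) -> str:
    """Remove HTML tags so that words split by empty tags are rejoined."""
    return TAG_RE.sub("", text)


# A keyword split by an empty tag pair and then base64-encoded.
encoded = base64.b64encode(b"Buy Vi<b></b>agra").decode("ascii")
print(encoded)
print(strip_html_tags(decode_base64_body(encoded)))
```

After both steps the keyword is visible to the filter again.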

Page 14

New Spammer Tricks

Character Hashing:

I finlaly was able to lsoe the wieght I have been sturggling to lose for years! And I couldn't bileeve how simple it was! Amizang pacth makes you shed the ponuds! It's Guanarteed to work or your menoy back!
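In the sample above each scrambled word keeps its first and last letter and only shuffles the letters in between. One possible countermeasure, sketched here as an assumption rather than as the method used in this talk, is to compare words by a signature that ignores the order of the inner letters. The name `scramble_signature` is illustrative.

```python
def scramble_signature(word: str) -> str:
    """First letter + sorted inner letters + last letter, lower-cased."""
    w = word.lower()
    if len(w) <= 3:
        return w  # nothing to shuffle in very short words
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]


# Pairs taken from the sample message: scrambled form, intended word.
pairs = [
    ("finlaly", "finally"),
    ("wieght", "weight"),
    ("Guanarteed", "Guaranteed"),
    ("menoy", "money"),
]
for scrambled, intended in pairs:
    same = scramble_signature(scrambled) == scramble_signature(intended)
    print(scrambled, "->", intended, same)
```

Distinct words can share a signature (for example "form" and "from"), so this is a normalization aid and not a classifier on its own.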

Page 15

New Spammer Tricks

Keyword masking by repeating characters:

Buuuyyyy cheeeeaaap viaaagraaa

Word obfuscations:

• \/laGr@
• Need a{} Dpiloma?
• sh1pp1ng //orldwide
• S0ft T4bs
• Ci@li$
• repl1ca w4tches from r0lex
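Both tricks on this slide can be partly reversed with simple text normalization. The sketch below is illustrative only: the lookalike table covers just the substitutions needed for a few of the examples above and is an assumption, not a complete mapping.

```python
import re

# Assumed lookalike substitutions (digit or symbol -> letter).
LOOKALIKES = str.maketrans({"0": "o", "1": "i", "4": "a", "@": "a", "$": "s"})


def collapse_repeats(text: str) -> str:
    """Reduce every run of a repeated character to a single character."""
    return re.sub(r"(.)\1+", r"\1", text)


def undo_lookalikes(text: str) -> str:
    """Replace digits and symbols with the letters they imitate."""
    return text.translate(LOOKALIKES)


print(collapse_repeats("Buuuyyyy cheeeeaaap viaaagraaa"))
print(undo_lookalikes("repl1ca w4tches from r0lex"))
print(undo_lookalikes("Ci@li$"))
```

Collapsing repeats also damages legitimate double letters ("free" becomes "fre"), so a filter would have to apply the same normalization to its keyword list.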

Page 16

New Spammer Tricks

Word obfuscations:

Letter   Variants                              Count
V        V v \/                                3 variations
I        I i 1 l | ï ì : Ì Î Í Ï               12 variations
A        A a @ /\ á à â ã ä å æ À Á Â Ã Ä Å    17 variations
G        G g                                   2 variations
R        R r ®                                 3 variations
A        A a @ /\ á à â ã ä å æ À Á Â Ã Ä Å    17 variations

There are 62,424 (3 x 12 x 17 x 2 x 3 x 17) ways to portray the name Viagra.

In fact, there are 600,426,974,379,824,381,952 ways to spell Viagra.

Source: http://cockeyed.com/lessons/viagra/viagra.html
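The count of 62,424 is the product of the number of variants available for each letter. The sketch below reproduces that arithmetic from the per-letter variants shown on this slide and, as an illustrative extension that is not part of the talk, builds a regular expression that matches any of those spellings.

```python
import math
import re

# Per-letter variants as listed on the slide.
VARIANTS = {
    "V": ["V", "v", "\\/"],
    "I": ["I", "i", "1", "l", "|", "ï", "ì", ":", "Ì", "Î", "Í", "Ï"],
    "A": ["A", "a", "@", "/\\", "á", "à", "â", "ã", "ä", "å", "æ",
          "À", "Á", "Â", "Ã", "Ä", "Å"],
    "G": ["G", "g"],
    "R": ["R", "r", "®"],
}
WORD = "VIAGRA"

# Number of spellings: one independent choice per letter position.
spelling_count = math.prod(len(VARIANTS[letter]) for letter in WORD)

# One alternation group per letter position; variants are escaped literally.
pattern = re.compile(
    "".join(
        "(?:" + "|".join(re.escape(v) for v in VARIANTS[letter]) + ")"
        for letter in WORD
    )
)

print(spelling_count)
print(bool(pattern.fullmatch("\\/i@gra")))
```

A single pattern covers all of these spellings, but it does nothing against the far larger space of spellings that also insert spaces or punctuation between letters.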

Page 17

New Spammer Tricks

ASCII Art:

     \|||||/
    ( o   o )
-ooO--(_)--Ooo-
      / \

Page 18

New Spammer Tricks

ASCII Art:

Page 19

New Spammer Tricks

Good word attacks (Bayesian poisoning)

Russa says McGwire belongs in Hall AP - 35 minutes ago One year on, the face live! EDITORS' BLOG CNN.com AP Action on Elder Abuse Politics My Sources Weather Alerts Back Security SPACE.com The council is now proposing to increase the annual fee to nurses Freeman dies AFP Pope calls for Islam dialogue "There's a lot of theoreticalCSMonitor.com Last Updated: Tuesday, 28 November 2006, 23:13 GMT Bad rapto top ^^ Five girls killed in Iraqi clash This is where a little bit of help28, 6:33 AM ET Wales Lottery Video: Bush Praises Estonia As War on Terror AllyANALYSIS Mucking about? Hazards Podcasts ELSEWHERE ON THE BBC At the same timeVictims Were Asleep Fashion Wire Daily AFP Football's elite Baby beluga dies athands-on situation." 'My mother was assaulted' Entertainment Search World Radio 2 Google together Mr Litvinenko's movements on 1 November, the day he fell...
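Padding a message with innocuous news text like the sample above works because a Bayesian filter combines evidence from every token. The sketch below illustrates the effect with a minimal naive Bayes score; all token probabilities are made-up numbers chosen for the example, not values measured in this work.

```python
import math

# Hypothetical likelihoods: token -> (P(token | spam), P(token | ham)).
TOKEN_PROBS = {
    "viagra": (0.20, 0.001),
    "cheap": (0.10, 0.01),
    "weather": (0.002, 0.03),
    "council": (0.001, 0.02),
    "football": (0.002, 0.03),
}


def spam_probability(tokens, prior_spam=0.5):
    """Naive Bayes posterior P(spam | tokens), skipping unknown tokens."""
    log_odds = math.log(prior_spam / (1 - prior_spam))
    for token in tokens:
        if token in TOKEN_PROBS:
            p_spam, p_ham = TOKEN_PROBS[token]
            log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


plain = ["cheap", "viagra"]
poisoned = plain + ["weather", "council", "football"]

print(f"plain message:    {spam_probability(plain):.4f}")
print(f"poisoned message: {spam_probability(poisoned):.4f}")
```

With these illustrative numbers, three "good" words are enough to pull a near-certain spam score below one half.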

Page 20

New Spammer Tricks

Good word attacks

Page 21

A Filter to Fight Text-Based Spam

It's just another Short Document Classification Problem:

• The Itemsets Filter
• Plain Bayes Filter
• LSI Filter
• SVM Filter
• GZip (Compression-based) Filter
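The slides do not describe how the compression-based filter is built. One common approach, sketched here as an assumption, assigns a message to the class whose training text it compresses best with: a message that shares long substrings with a corpus adds few bytes when appended to it. The sketch uses Python's `zlib` (the DEFLATE algorithm that gzip also uses); the two tiny corpora are invented for the example.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus: str, message: str) -> int:
    """How many compressed bytes the message adds when appended to the corpus."""
    return compressed_size(corpus + " " + message) - compressed_size(corpus)


def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Pick the class whose corpus 'explains' the message with fewer extra bytes."""
    if extra_bytes(spam_corpus, message) < extra_bytes(ham_corpus, message):
        return "spam"
    return "ham"


# Invented toy corpora; a real filter would use thousands of messages.
SPAM = ("buy cheap viagra now click here to lose weight fast "
        "earn serious cash with this amazing cash bonus")
HAM = ("please find the meeting agenda attached and review the "
       "quarterly report before the project call on friday")

print(classify("click here to lose weight fast", SPAM, HAM))
print(classify("review the quarterly report before the project call", SPAM, HAM))
```

No tokenization or feature selection is needed, which makes the approach attractive as a baseline for short documents.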

Page 22

Standard Spam Testing Collections

• PU1: A mixture of 481 spam messages and 618 legitimate messages
• PU123A: Four corpora, based on private mailboxes
• Enron Corpus: 200,399 unique messages collected by 158 users (mostly managers)

Page 23

Itemsets Spam Filter: Results

          100 spams, 100 hams   400 spams, 400 hams   Avg (100, 400)
FPI (%)   19.00                 20.61                 19.81
FNI (%)   2.53                  3.55                  3.04
TNI (%)   97.47                 97.80                 97.64
TPI (%)   80.99                 89.91                 85.45

FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake.

FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter.
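The two definitions translate directly into code. The sketch below computes both rates as percentages from raw misclassification counts; the counts in the usage example are hypothetical and are not taken from the experiments reported here.

```python
def error_rates(ham_as_spam: int, ham_total: int,
                spam_as_ham: int, spam_total: int) -> tuple[float, float]:
    """Return (FPI, FNI) in percent.

    FPI = (#ham as spam) / #ham   -> legitimate mail deleted by mistake
    FNI = (#spam as ham) / #spam  -> spam passing through the filter
    """
    fpi = 100.0 * ham_as_spam / ham_total
    fni = 100.0 * spam_as_ham / spam_total
    return fpi, fni


# Hypothetical counts for a test set of 100 hams and 100 spams.
fpi, fni = error_rates(ham_as_spam=19, ham_total=100,
                       spam_as_ham=3, spam_total=100)
print(f"FPI = {fpi:.2f}%, FNI = {fni:.2f}%")
```

For a mail filter the two errors are not equally costly: a high FPI means lost legitimate mail, whereas a high FNI only means more spam in the inbox.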

Page 24

SVM Spam Filter: Results

FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake.

FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter.

          100 spams, 100 hams   400 spams, 400 hams   Avg (100, 400)
FPI (%)   13.91                 10.10                 12.01
FNI (%)   1.18                  2.19                  1.69
TNI (%)   98.82                 97.80                 98.31
TPI (%)   86.09                 89.91                 88.00

Page 25

GZip Spam Filter: Results

FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake.

FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter.

…We will look into this in the near future

          100 spams, 100 hams   400 spams, 400 hams   Avg (100, 400)
FPI (%)   1.72                  2.33                  2.03
FNI (%)   30.28                 27.31                 28.80
TNI (%)   69.72                 72.69                 71.21
TPI (%)   98.28                 97.67                 97.98

Page 26

Light at the end of the tunnel?

Payment per e-mail?

Quite unlikely…

E-mail authentication by SIDF

• Sender ID Framework (by Microsoft)

• … registered list of servers of domain owners

• Confirmation of e-mail source domain (automatically, by ISPs)

• Protects 40% of legitimate email sent worldwide

• Helps combat phishing scams / domain spoofing (forging a sender's address)
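The core idea in the bullets above is a lookup: does the server that delivered the message appear on the list registered by the owner of the claimed sender domain? The sketch below illustrates only that check. The in-memory dictionary is a hypothetical stand-in for the records that domain owners actually publish, and the domain and addresses are reserved documentation values.

```python
import ipaddress

# Hypothetical registry: sender domain -> networks allowed to send its mail.
AUTHORIZED_SENDERS = {
    "example.com": ["192.0.2.0/24", "198.51.100.25/32"],
}


def is_authorized(sender_domain: str, connecting_ip: str) -> bool:
    """True if the connecting server is on the domain's registered list."""
    ip = ipaddress.ip_address(connecting_ip)
    networks = AUTHORIZED_SENDERS.get(sender_domain.lower(), [])
    return any(ip in ipaddress.ip_network(network) for network in networks)


print(is_authorized("example.com", "192.0.2.17"))    # registered server
print(is_authorized("example.com", "203.0.113.5"))   # forged sender domain
```

A message that fails the check claims a domain its delivering server is not registered for, which is the signature of domain spoofing.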

Page 27

Light at the end of the tunnel?

DomainKeys Identified Mail (DKIM)

• Similar technology by Yahoo, Cisco Systems, Sendmail, PGP

• Based on digital signatures

• An official proposed standard of the Internet Engineering Task Force (IETF)

Page 28

Thank You For Your Attention

Questions?
