The Fight against Spam: A Machine Learning Approach

Jiri Hynek, Karel Jezek

ELPUB 2007, Vienna

www.textmining.cz

Contents:

• Stats 101

• Today's Spam Types

• Spammer Tricks

• Text-Based Spam Filter

• Implementation Results

Contents:

Spamming is publishing:

Web spam (“comment spam”): blogs, (unmoderated) forums, wikis

Why: to trigger higher page ranking!

Unsolicited marketing spam in our e-mails – info dissemination to the public

Why: sell products!

A Bit of Terminology:

“Canned meat made largely from pork”

• Ham vs. Spam (spam mail)

• UCE (Unsolicited Commercial Email)

• UBM (Unsolicited Bulk Mail)

• EMP (Excessive Multi-Posting)

• Junk mail

• Bulk email

Stats 101

Top five spam categories:

• Online pharmacies: 20.0%

• Mortgage refinancing: 9.7%

• Investment/financial services: 9.0%

• Male products (\/i@gra, CI@1i$): 8.7%

• Discount computer software: 6.9%

Communications of the ACM, February 2007/Vol. 50 No.2

Stats 101

1998: a mere 10% of overall mail volume

Now: 80%

Communications of the ACM, February 2007/Vol. 50 No.2

Average spammers' revenue: $1 per 45,000 spams dispatched

A database of 100 million e-mails costs 100 dollars, spam software included

(www.symantec.com)
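The two figures above can be combined into a rough back-of-the-envelope calculation. The sketch below assumes (this is not stated on the slide) that the whole purchased list is mailed exactly once and that the average revenue rate applies to it; the constant and function names are illustrative only.

```python
# Figures quoted on the slide.
SPAMS_PER_DOLLAR = 45_000      # $1 of revenue per 45,000 spams dispatched
LIST_SIZE = 100_000_000        # addresses in the purchased database
LIST_COST_USD = 100.0          # price of the database, software included


def expected_revenue(messages_sent: int) -> float:
    """Revenue in dollars, assuming the average rate holds for every message."""
    return messages_sent / SPAMS_PER_DOLLAR


# Assumption: one message is sent to every address on the list.
revenue = expected_revenue(LIST_SIZE)
print(f"Revenue for one mailing: ${revenue:,.2f}")
print(f"Revenue relative to list cost: {revenue / LIST_COST_USD:.1f}x")
```

Under these assumptions a single mailing already returns the cost of the list many times over, which helps explain the volumes reported above.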

Today's Spam Types

Text Spam

Today's Spam Types

Text Spam

Commonly used phrases filtered out by antispam filters (and words to avoid, of course):

Free!                      50% off!          Click Here
Call now!                  Subscribe         Earn $
Discount!                  Eliminate Debt    Double your income
You're a Winner!           Reverses Aging    Hidden
Information you requested  Stop / Stops      Lose Weight
Multi level Marketing      Million Dollars   Opportunity
Compare                    Removes           Collect
Amazing                    Cash Bonus        Promise You
Credit                     Loans             Satisfaction Guaranteed
Serious Cash               Search Engine Listings

Today's Spam Types

Image-Based Spam

Today's Spam Types

Image-Based Spam in our mailboxes

                                   June 2005         June 2006
Overall share in spam              1 %               12 %
New spam domain originating every  48 hours          4 hours
Daily spam volume                  30,000 million    55,000 million

Phishing

Captcha - fighting web spam

Common Spammer Tricks

Tricks to fool statistical spam filters:

• Avoidance of keywords (such as stock, Viagra, etc.)

• Frequent change in sender's address

• Message encoding (such as base64, commonly used to carry binary data over text-only channels)

• Hashing (e.g. insertion of HTML tags into messages)

• Use of images instead of plain text (namely GIF, JPEG, and PNG)
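Two of the tricks listed above, message encoding and HTML tag insertion, can be undone before a message reaches a statistical filter. The following is a minimal sketch using only the Python standard library; the function names and the sample message are illustrative assumptions, and a real mail pipeline would parse MIME headers rather than assume the body is base64.

```python
import base64
import re

# Matches any HTML tag, including empty pairs inserted only to split a keyword.
TAG_RE = re.compile(r"<[^>]*>")


def decode_base64_body(body: str) -> str:
    """Decode a base64-encoded body into text (UTF-8 is assumed here)."""
    return base64.b64decode(body).decode("utf-8", errors="replace")


def strip_html_tags(text: str) -> str:
    """Remove HTML tags so that 'Via<b></b>gra' reads as 'Viagra' again."""
    return TAG_RE.sub("", text)


# A hypothetical spam body: the keyword is split by tags, then base64-encoded.
encoded = base64.b64encode(b"Buy Via<b></b>gra <i>now</i>").decode("ascii")
plain = strip_html_tags(decode_base64_body(encoded))
print(plain)  # Buy Viagra now
```

Only after this kind of normalization does the keyword become visible to a word-based filter again.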

New Spammer Tricks

Character Hashing:

I finlaly was able to lsoe the wieght I have been sturggling to lose for years! And I couldn't bileeve how simple it was! Amizang pacth makes you shed the ponuds! It's Guanarteed to work or your menoy back!
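In the example above, each scrambled word keeps its first and last letter and only shuffles the letters in between. One possible countermeasure, sketched here as an assumption rather than as the authors' method, is to index a vocabulary by (first letter, sorted inner letters, last letter) and map scrambled words back to known ones. The small vocabulary is hypothetical.

```python
import re


def signature(word: str) -> tuple:
    """First letter, sorted inner letters, last letter (case-insensitive)."""
    w = word.lower()
    if len(w) <= 3:
        return (w,)  # too short to scramble
    return (w[0], "".join(sorted(w[1:-1])), w[-1])


def build_index(vocabulary):
    """Map each signature to a vocabulary word (later words win on collisions)."""
    return {signature(w): w for w in vocabulary}


def unscramble(text: str, index: dict) -> str:
    """Replace every word whose signature is known; leave other words untouched."""
    def fix(match):
        word = match.group(0)
        return index.get(signature(word), word)
    return re.sub(r"[A-Za-z]+", fix, text)


# Hypothetical vocabulary, chosen to cover the sample message.
VOCAB = ["finally", "lose", "weight", "struggling", "believe",
         "amazing", "patch", "pounds", "guaranteed", "money"]
index = build_index(VOCAB)
print(unscramble("finlaly lsoe the wieght", index))  # finally lose the weight
```

Distinct words can share a signature, so a real filter would need a way to resolve such collisions.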

New Spammer Tricks

Keyword masking by repeating characters: Buuuyyyy cheeeeaaap viaaagraaa

Word obfuscations:

• \/laGr@

• Need a{} Dpiloma?

• sh1pp1ng //orldwide

• S0ft T4bs

• Ci@li$

• repl1ca w4tches from r0lex
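Both tricks on this slide, repeated characters and look-alike substitutions, can be partly reversed by a normalization pass. The sketch below is an illustration under stated assumptions: the substitution table is hypothetical and far from complete, and collapsing runs of three or more characters is a heuristic chosen so that ordinary double letters (as in "shipping") survive.

```python
import re

# Hypothetical look-alike table; multi-character patterns are listed first.
LOOKALIKES = [
    ("//", "w"),
    ("\\/", "v"),
    ("@", "a"),
    ("$", "s"),
    ("0", "o"),
    ("1", "i"),
    ("4", "a"),
]


def collapse_repeats(text: str) -> str:
    """Reduce any run of three or more identical characters to one."""
    return re.sub(r"(.)\1{2,}", r"\1", text)


def undo_lookalikes(text: str) -> str:
    """Replace look-alike characters with the letters they imitate."""
    for fake, real in LOOKALIKES:
        text = text.replace(fake, real)
    return text


def normalize(text: str) -> str:
    return undo_lookalikes(collapse_repeats(text.lower()))


print(normalize("Buuuyyyy cheeeeaaap viaaagraaa"))  # buy cheap viagra
print(normalize("repl1ca w4tches from r0lex"))      # replica watches from rolex
print(normalize("sh1pp1ng //orldwide"))             # shipping worldwide
```

The digit substitutions also rewrite genuine numbers, so in practice such a pass would be applied to individual tokens that fail a dictionary lookup rather than to the whole message.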

New Spammer Tricks

Word obfuscations:

V I A G R A

V: V v \/ (3 variations)

I: I i 1 l | ï ì : Ì Î Í Ï (12 variations)

A: A a @ /\ á à â ã ä å æ À Á Â Ã Ä Å (17 variations)

G: G g (2 variations)

R: R r ® (3 variations)

A: A a @ /\ á à â ã ä å æ À Á Â Ã Ä Å (17 variations)

There are 62,424 (3 x 12 x 17 x 2 x 3 x 17) ways to portray the name Viagra.

In fact, there are 600,426,974,379,824,381,952 ways to spell

Source: http://cockeyed.com/lessons/viagra/viagra.html
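The figure of 62,424 is simply the product of the number of variants available at each letter position. A short check, using the per-letter counts from the slide (the dictionary name is illustrative):

```python
from math import prod

# Number of look-alike variants per letter, as listed on the slide.
VARIANTS_PER_LETTER = {"V": 3, "I": 12, "A": 17, "G": 2, "R": 3}

word = "VIAGRA"
# Each position is chosen independently, so the counts multiply.
total = prod(VARIANTS_PER_LETTER[ch] for ch in word)
print(total)  # 62424
```

Because the counts multiply, adding even a few more variants per letter makes keyword blacklists impractical.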

New Spammer Tricks

ASCII Art:

   \|||||/
   ( o o )
-ooO--(_)--Ooo--
     / \

New Spammer Tricks

ASCII Art:

New Spammer Tricks

Good word attacks (Bayesian poisoning)

Russa says McGwire belongs in Hall AP - 35 minutes ago One year on, the face live! EDITORS' BLOG CNN.com AP Action on Elder Abuse Politics My Sources Weather Alerts Back Security SPACE.com The council is now proposing to increase the annual fee to nurses Freeman dies AFP Pope calls for Islam dialogue "There's a lot of theoreticalCSMonitor.com Last Updated: Tuesday, 28 November 2006, 23:13 GMT Bad rapto top ^^ Five girls killed in Iraqi clash This is where a little bit of help28, 6:33 AM ET Wales Lottery Video: Bush Praises Estonia As War on Terror AllyANALYSIS Mucking about? Hazards Podcasts ELSEWHERE ON THE BBC At the same timeVictims Were Asleep Fashion Wire Daily AFP Football's elite Baby beluga dies athands-on situation." 'My mother was assaulted' Entertainment Search World Radio 2 Google together Mr Litvinenko's movements on 1 November, the day he fell...
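Padding of this kind works because a Bayesian filter combines evidence from every token it recognizes, so enough "innocent" words can outweigh a few incriminating ones. The toy sketch below illustrates the effect; the per-token probabilities are invented for illustration, and the combination rule (summing log-odds with equal priors) is one common naive-Bayes-style formulation, not necessarily the one used by any particular filter.

```python
import math

# Hypothetical values of P(spam | token); not taken from the slides.
TOKEN_SPAM_PROB = {
    "viagra": 0.99, "cheap": 0.90, "buy": 0.80,
    "council": 0.10, "nurses": 0.05, "weather": 0.10, "football": 0.10,
}


def spam_probability(tokens) -> float:
    """Combine per-token evidence by summing log-odds (equal priors assumed)."""
    log_odds = 0.0
    for token in tokens:
        p = TOKEN_SPAM_PROB.get(token)
        if p is not None:  # unknown tokens carry no evidence in this sketch
            log_odds += math.log(p) - math.log(1.0 - p)
    return 1.0 / (1.0 + math.exp(-log_odds))


spam = ["buy", "cheap", "viagra"]
padding = ["council", "nurses", "weather", "football"]

print(f"spam only:    {spam_probability(spam):.3f}")
print(f"with padding: {spam_probability(spam + padding):.3f}")
```

With these invented numbers, four news-like words are enough to pull the score of an obvious spam message below 0.5.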

New Spammer Tricks

Good word attacks

A Filter to Fight Text-Based Spam

It's just another Short Document Classification Problem:

• The Itemsets Filter

• Plain Bayes Filter

• LSI Filter

• SVM Filter

• GZip (Compression-based) Filter

Standard Spam Testing Collections

PU1: A mixture of 481 spam messages and 618 legitimate messages

PU123A: Four corpora, based on private mailboxes

Enron Corpus: 200,399 unique messages collected by 158 users (mostly managers)

Itemsets Spam Filter: Results

            100 spams, 100 hams    400 spams, 400 hams    Avg (100, 400)
FPI (%)     19,00                  20,61                  19,81
FNI (%)     2,53                   3,55                   3,04
TNI (%)     97,47                  97,80                  97,64
TPI (%)     80,99                  89,91                  85,45

FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake.

FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter.
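The two definitions translate directly into code. In the sketch below the counts are made up for illustration; TPI and TNI are not defined on the slides, so treating them as the complements of FPI and FNI is an assumption.

```python
def error_rates(ham_as_spam: int, ham_total: int,
                spam_as_ham: int, spam_total: int) -> dict:
    """Compute the indicators from raw classification counts."""
    fpi = ham_as_spam / ham_total    # legitimate mail deleted by mistake
    fni = spam_as_ham / spam_total   # spam passing through the filter
    return {
        "FPI": fpi,
        "FNI": fni,
        "TPI": 1.0 - fpi,  # assumption: ham correctly kept
        "TNI": 1.0 - fni,  # assumption: spam correctly caught
    }


# Hypothetical counts for a test set of 100 hams and 100 spams.
rates = error_rates(ham_as_spam=19, ham_total=100, spam_as_ham=3, spam_total=100)
for name, value in rates.items():
    print(f"{name}: {value * 100:.2f} %")
```

Since deleting a legitimate message usually costs more than letting a spam through, FPI is the indicator to watch most closely when comparing filters.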

SVM Spam Filter: Results



            100 spams, 100 hams    400 spams, 400 hams    Avg (100, 400)
FPI (%)     13,91                  10,10                  12,01
FNI (%)     1,18                   2,19                   1,69
TNI (%)     98,82                  97,80                  98,31
TPI (%)     86,09                  89,91                  88,00

GZip Spam Filter: Results



…We will look into this in the near future

            100 spams, 100 hams    400 spams, 400 hams    Avg (100, 400)
FPI (%)     1,72                   2,33                   2,03
FNI (%)     30,28                  27,31                  28,80
TNI (%)     69,72                  72,69                  71,21
TPI (%)     98,28                  97,67                  97,98
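The slides do not describe how the GZip filter works internally. A common way to build a compression-based classifier, sketched here purely as an assumption, is to append the message to a spam corpus and to a ham corpus and see which one it compresses better with: the corpus that needs fewer extra bytes is the closer match. The tiny corpora and function names below are illustrative.

```python
import gzip


def compressed_size(text: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 text."""
    return len(gzip.compress(text.encode("utf-8")))


def extra_bytes(corpus: str, message: str) -> int:
    """Additional compressed bytes needed when the message follows the corpus."""
    return compressed_size(corpus + " " + message) - compressed_size(corpus)


def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Label the message by the corpus it compresses better with (ties go to ham)."""
    if extra_bytes(spam_corpus, message) < extra_bytes(ham_corpus, message):
        return "spam"
    return "ham"


# Hypothetical training text; real corpora would hold many messages per class.
SPAM = "buy cheap viagra now, click here for a free discount, earn cash bonus today"
HAM = "the meeting is moved to monday, please send the quarterly report to the team"

print(classify("click here for a free discount", SPAM, HAM))
print(classify("please send the quarterly report", SPAM, HAM))
```

Such a filter needs no tokenization or feature selection, which makes it easy to try, but its decisions depend heavily on how much training text each class provides.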

Light at the end of the tunnel?

Payment per e-mail?

Quite unlikely…

E-mail authentication by SIDF

• Sender ID Framework (by Microsoft)

• … registered list of servers of domain owners

• Confirmation of e-mail source domain (automatically, by ISPs)

• Protects 40% of legitimate email sent worldwide

• Helps combat phishing scams / domain spoofing (forging a sender's address)
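The core check behind this kind of authentication can be pictured as a lookup: does the server that delivered the message appear on the list registered for the domain it claims to come from? The sketch below is a toy model only. The registry is a hard-coded, hypothetical dictionary using documentation address ranges, whereas a real deployment obtains the list from records published by the domain owner.

```python
import ipaddress

# Hypothetical registry: domain -> networks allowed to send mail for it.
REGISTERED_SENDERS = {
    "example.com": ["192.0.2.0/24"],
    "example.org": ["198.51.100.17/32"],
}


def sender_is_registered(claimed_domain: str, sending_ip: str) -> bool:
    """True if the sending server is on the list registered for the claimed domain."""
    ip = ipaddress.ip_address(sending_ip)
    networks = REGISTERED_SENDERS.get(claimed_domain.lower(), [])
    return any(ip in ipaddress.ip_network(net) for net in networks)


print(sender_is_registered("example.com", "192.0.2.25"))   # True
print(sender_is_registered("example.com", "203.0.113.9"))  # False: possible spoofing
```

A message that fails the check is not necessarily spam, but a forged sender domain is a strong signal for phishing detection.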

Light at the end of the tunnel?

DomainKeys Identified Mail (DKIM)

• Similar technology by Yahoo, Cisco Systems, Sendmail, PGP

• Based on digital signatures

• An official proposed standard of the Internet Engineering Task Force (IETF)

Thank You For Your Attention

Questions?
