The Fight against Spam - A Machine Learning Approach
Jiri Hynek, Karel Jezek
ELPUB 2007, Vienna
www.textmining.cz
2
Contents:
• Stats 101
• Today's Spam Types
• Spammer Tricks
• Text-Based Spam Filter
• Implementation Results
3
Spamming is publishing:
• Web spam (“comment spam”) in blogs, (unmoderated) forums, and wikis
  Why: to trigger higher page ranking!
• Unsolicited marketing spam in our e-mails: info dissemination to the public
  Why: to sell products!
4
A bit of Terminology:
“Canned meat made largely from pork”
• Ham vs. Spam (Spam mail)
• UCE (Unsolicited Commercial Email)
• UBM (Unsolicited Bulk Mail)
• EMP (Excessive Multi-Posting)
• Junk mail
• Bulk email
5
Stats 101
Top five spam categories:
• Online pharmacies: 20.0%
• Mortgage refinancing: 9.7%
• Investment/financial services: 9.0%
• Male products (\/i@gra, CI@1i$): 8.7%
• Discount computer software: 6.9%
Source: Communications of the ACM, February 2007, Vol. 50, No. 2
6
Stats 101
• 1998: a mere 10% of overall mail volume
• Now: 80%
  (Communications of the ACM, February 2007, Vol. 50, No. 2)
• Average spammers' revenue: $1 per 45,000 spams dispatched
• A database of 100 million e-mail addresses costs 100 dollars, spam software included
  (www.symantec.com)
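The two figures above can be combined into a rough back-of-the-envelope estimate. The sketch below assumes that every address in the purchased database is mailed exactly once and that there are no other costs; both assumptions, and all names in the code, are illustrative only.

```python
# Rough economics of a single spam run, using the figures quoted on the slide.
# Assumption (not from the source): each purchased address is mailed once
# and sending itself costs nothing.

REVENUE_PER_MESSAGE = 1 / 45_000   # $1 per 45,000 spams dispatched
DATABASE_SIZE = 100_000_000        # addresses in the purchased database
DATABASE_COST = 100                # dollars, spam software included


def expected_revenue(messages: int) -> float:
    """Expected revenue in dollars for the given number of dispatched spams."""
    return messages * REVENUE_PER_MESSAGE


revenue = expected_revenue(DATABASE_SIZE)
profit = revenue - DATABASE_COST
print(f"revenue: ${revenue:,.2f}, profit: ${profit:,.2f}")
```

Under these assumptions a single run already returns roughly twenty times the cost of the database.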
7
Today's Spam Types
Text Spam
8
Today's Spam Types
Text Spam
Commonly used phrases filtered out by antispam filters (and words to avoid, of course):
Free!                       50% off!            Click Here
Call now!                   Subscribe           Earn $
Discount!                   Eliminate Debt      Double your income
You're a Winner!            Reverses Aging      Hidden
Information you requested   Stop / Stops        Lose Weight
Multi level Marketing       Million Dollars     Opportunity
Compare                     Removes             Collect
Amazing                     Cash Bonus          Promise You
Credit                      Loans               Satisfaction Guaranteed
Serious Cash                Search Engine Listings
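A phrase list like the one above can be checked with a plain substring match. This is a minimal sketch of the idea, not the filter evaluated later in the talk; the phrase subset and the function name are chosen here for illustration.

```python
# Naive phrase-based check: report which listed phrases occur in a message.
# PHRASES is a small, illustrative subset of the list on the slide.

PHRASES = [
    "free!",
    "50% off",
    "click here",
    "call now",
    "lose weight",
    "eliminate debt",
    "double your income",
]


def flagged_phrases(text: str) -> list[str]:
    """Return the listed phrases found in `text`, case-insensitively."""
    lowered = text.lower()
    return [phrase for phrase in PHRASES if phrase in lowered]


print(flagged_phrases("Click here to LOSE WEIGHT fast"))
```

Such a check is trivially evaded by the obfuscation tricks shown on the later slides, which is why statistical filters are used instead.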
9
Today's Spam Types
Image-Based Spam
10
Today's Spam Types
Image-Based Spam in our mailboxes
                                    June 2005         June 2006
Overall share in spam               1%                12%
New spam domain originating every   48 hours          4 hours
Daily spam volume                   30,000 million    55,000 million
11
Today's Spam Types
Phishing
12
Today's Spam Types
Captcha - fighting web spam
13
Common Spammer Tricks
Tricks to fool statistical spam filters:
• Avoidance of keywords (such as stock, Viagra, etc.)
• Frequent change of the sender's address
• Message encoding (such as base64, commonly used to transfer binary content over text-only channels)
• Hashing (e.g. insertion of HTML tags into messages)
• Use of images instead of plain text (namely GIF, JPEG, and PNG)
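Two of the tricks above, base64 encoding and inserted HTML tags, can be undone before the text reaches a filter. The following is a minimal sketch using only the Python standard library; the function names are hypothetical and the regular expression is a deliberate simplification, not a full HTML parser.

```python
import base64
import re


def decode_base64_body(body: str) -> str:
    """Decode a base64-encoded message body into text."""
    return base64.b64decode(body).decode("utf-8", errors="replace")


def strip_html_tags(text: str) -> str:
    """Remove anything that looks like an HTML tag or comment (simplified)."""
    return re.sub(r"<[^>]*>", "", text)


encoded = base64.b64encode(b"cheap st<b></b>ock").decode("ascii")
print(strip_html_tags(decode_base64_body(encoded)))
```

After this preprocessing the keyword split by empty tags is visible to the filter again.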
14
New Spammer Tricks
Character Hashing:
I finlaly was able to lsoe the wieght I have been sturggling to lose for years! And I couldn't bileeve how simple it was! Amizang pacth makes you shed the ponuds! It's Guanarteed to work or your menoy back!
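In the sample above each word keeps its first and last letter while the inner letters are shuffled. One way to map such words back to a dictionary, sketched here under that assumption, is to compare a signature made of the first letter, the sorted inner letters, and the last letter. The dictionary and all names are illustrative, and punctuation handling is left out.

```python
def signature(word: str) -> str:
    """First letter + sorted inner letters + last letter, lowercased."""
    w = word.lower()
    if len(w) <= 3:
        return w
    return w[0] + "".join(sorted(w[1:-1])) + w[-1]


def build_index(dictionary: list[str]) -> dict[str, str]:
    """Map each signature to its dictionary word (later words win on clashes)."""
    return {signature(word): word for word in dictionary}


def unscramble(text: str, index: dict[str, str]) -> str:
    """Replace every token whose signature is known by the dictionary word."""
    return " ".join(index.get(signature(tok), tok) for tok in text.split())


index = build_index(["finally", "lose", "weight", "amazing", "guaranteed"])
print(unscramble("finlaly lsoe wieght", index))
```

Different words can share a signature, so a real filter would need a way to resolve such clashes.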
15
New Spammer Tricks
Keyword masking by repeating characters:
• Buuuyyyy cheeeeaaap viaaagraaa
Word obfuscations:
• \/laGr@
• Need a{} Dpiloma?
• sh1pp1ng //orldwide
• S0ft T4bs
• Ci@li$
• repl1ca w4tches from r0lex
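Some of these maskings can be reversed with a simple normalization pass before classification. The sketch below is one possible approach, assuming a small hand-made substitution table and collapsing any character repeated three or more times; the table and function name are illustrative, and the pass will also distort legitimate text that contains digits.

```python
import re

# Illustrative substitution tables; real spam uses far more variants.
MULTI_CHAR = {"\\/": "v", "//": "w"}
SINGLE_CHAR = str.maketrans({"1": "i", "0": "o", "4": "a", "@": "a", "$": "s"})


def normalize(text: str) -> str:
    """Lowercase, undo common character substitutions, collapse long repeats."""
    text = text.lower()
    for masked, plain in MULTI_CHAR.items():
        text = text.replace(masked, plain)
    text = text.translate(SINGLE_CHAR)
    # A run of three or more identical characters becomes a single character.
    return re.sub(r"(.)\1{2,}", r"\1", text)


print(normalize("Buuuyyyy cheeeeaaap viaaagraaa"))
print(normalize("sh1pp1ng //orldwide"))
```

Doubled letters such as the "pp" in "shipping" survive because only runs of three or more are collapsed.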
16
New Spammer Tricks
Word obfuscations:
Letter   Variants                                Count
V        V v \/                                  3 variations
I        I i 1 l | ï ì : Ì Î Í Ï                 12 variations
A        A a @ /\ á à â ã ä å æ À Á Â Ã Ä Å      17 variations
G        G g                                     2 variations
R        R r ®                                   3 variations
A        A a @ /\ á à â ã ä å æ À Á Â Ã Ä Å      17 variations
There are 62,424 (3 × 12 × 17 × 2 × 3 × 17) ways to portray the name Viagra.
In fact, there are 600,426,974,379,824,381,952 ways to spell Viagra.
Source: http://cockeyed.com/lessons/viagra/viagra.html
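The figure of 62,424 is simply the product of the per-letter variation counts. A short check, using the counts from the slide (the dictionary name is chosen here for illustration):

```python
import math

# Number of look-alike variants per letter, as listed on the slide.
VARIANTS = {"V": 3, "I": 12, "A": 17, "G": 2, "R": 3}

# One independent choice per letter position of the word.
total = math.prod(VARIANTS[letter] for letter in "VIAGRA")
print(total)
```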
17
New Spammer Tricks
ASCII Art:
     \|||||/
     ( o o )
 -ooO--(_)--Ooo-
      /   \
18
New Spammer Tricks
ASCII Art:
19
New Spammer Tricks
Good word attacks (Bayesian poisoning)
Russa says McGwire belongs in Hall AP - 35 minutes ago One year on, the face live! EDITORS' BLOG CNN.com AP Action on Elder Abuse Politics My Sources Weather Alerts Back Security SPACE.com The council is now proposing to increase the annual fee to nurses Freeman dies AFP Pope calls for Islam dialogue "There's a lot of theoreticalCSMonitor.com Last Updated: Tuesday, 28 November 2006, 23:13 GMT Bad rapto top ^^ Five girls killed in Iraqi clash This is where a little bit of help28, 6:33 AM ET Wales Lottery Video: Bush Praises Estonia As War on Terror AllyANALYSIS Mucking about? Hazards Podcasts ELSEWHERE ON THE BBC At the same timeVictims Were Asleep Fashion Wire Daily AFP Football's elite Baby beluga dies athands-on situation." 'My mother was assaulted' Entertainment Search World Radio 2 Google together Mr Litvinenko's movements on 1 November, the day he fell...
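The effect of such padding on a Bayesian filter can be illustrated with a toy model. The sketch below is a bare-bones word-count classifier with add-one smoothing, trained on invented examples; it is not the Plain Bayes filter referred to in this talk, and all data and names are assumptions made for the illustration.

```python
import math
from collections import Counter


def train(documents: list[str]) -> Counter:
    """Count word occurrences over a list of training documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def spam_log_odds(message: str, spam: Counter, ham: Counter) -> float:
    """Sum of log(P(word|spam) / P(word|ham)) over known words, add-one smoothed."""
    vocab = set(spam) | set(ham)
    n_spam, n_ham = sum(spam.values()), sum(ham.values())
    score = 0.0
    for word in message.lower().split():
        if word not in vocab:
            continue  # unknown words carry no evidence in this sketch
        p_spam = (spam[word] + 1) / (n_spam + len(vocab))
        p_ham = (ham[word] + 1) / (n_ham + len(vocab))
        score += math.log(p_spam / p_ham)
    return score


# Invented toy training data.
spam_counts = train(["buy cheap pills now", "cheap pills online"])
ham_counts = train(["weather alerts for vienna", "football news and weather"])

plain = "buy cheap pills"
poisoned = plain + " weather football news alerts"
print(spam_log_odds(plain, spam_counts, ham_counts))
print(spam_log_odds(poisoned, spam_counts, ham_counts))
```

Every appended "good" word pulls the score towards ham, so with enough padding the message can slip under the filter's threshold.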
20
New Spammer Tricks
Good word attacks
21
A Filter to Fight Text-Based Spam
It's just another Short Document Classification Problem:
• The Itemsets Filter
• Plain Bayes Filter
• LSI Filter
• SVM Filter
• GZip (compression-based) Filter
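To give a flavour of the compression-based idea, the sketch below assigns a message to the class whose training text compresses it best, i.e. where appending the message adds the fewest compressed bytes. It uses `zlib` (the DEFLATE algorithm also used by gzip) and invented toy corpora; it is an assumption about how such a filter might look, not the implementation evaluated in this talk.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the DEFLATE-compressed UTF-8 text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_bytes(corpus: str, message: str) -> int:
    """How many compressed bytes the message adds on top of the corpus."""
    return compressed_size(corpus + " " + message) - compressed_size(corpus)


def classify(message: str, spam_corpus: str, ham_corpus: str) -> str:
    """Pick the class whose corpus 'explains' the message more cheaply."""
    spam_cost = extra_bytes(spam_corpus, message)
    ham_cost = extra_bytes(ham_corpus, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Invented toy corpora.
SPAM = "buy cheap pills online now. limited offer, click here to order cheap pills today."
HAM = "the meeting is moved to friday at ten. please send the quarterly report to the finance team."

print(classify("click here to order cheap pills", SPAM, HAM))
print(classify("please send the quarterly report", SPAM, HAM))
```

The appeal of this approach is that it needs no tokenization or feature selection; the compressor finds shared substrings on its own.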
22
Standard Spam Testing Collections
• PU1: a mixture of 481 spam messages and 618 legitimate messages
• PU123A: four corpora, based on private mailboxes
• Enron Corpus: 200,399 unique messages collected by 158 users (mostly managers)
23
Itemsets Spam Filter: Results
            100 spams, 100 hams   400 spams, 400 hams   Avg (100, 400)
FPI (%)     19.00                 20.61                 19.81
FNI (%)     2.53                  3.55                  3.04
TNI (%)     97.47                 97.80                 97.64
TPI (%)     80.99                 89.91                 85.45
FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake.
FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter.
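The two error measures defined above can be computed directly from the classification counts. A minimal sketch, with hypothetical counts that are not taken from the experiments:

```python
def error_rates(ham_as_spam: int, n_ham: int, spam_as_ham: int, n_spam: int) -> tuple[float, float]:
    """Return (FPI, FNI) in percent, following the definitions on the slide."""
    fpi = 100.0 * ham_as_spam / n_ham    # legitimate mail deleted by mistake
    fni = 100.0 * spam_as_ham / n_spam   # spam passing through the filter
    return fpi, fni


# Hypothetical example: 100 hams with 19 misfiled, 100 spams with 3 missed.
fpi, fni = error_rates(ham_as_spam=19, n_ham=100, spam_as_ham=3, n_spam=100)
print(f"FPI = {fpi:.2f}%, FNI = {fni:.2f}%")
```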
24
SVM Spam Filter: Results
FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake.
FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter.
            100 spams, 100 hams   400 spams, 400 hams   Avg (100, 400)
FPI (%)     13.91                 10.10                 12.01
FNI (%)     1.18                  2.19                  1.69
TNI (%)     98.82                 97.80                 98.31
TPI (%)     86.09                 89.91                 88.00
25
GZip Spam Filter: Results
FPI = (#ham as spam) / #ham, i.e. the proportion of legitimate messages deleted by mistake.
FNI = (#spam as ham) / #spam, i.e. the proportion of spam passing through the filter.
…We will look into this in the near future
            100 spams, 100 hams   400 spams, 400 hams   Avg (100, 400)
FPI (%)     1.72                  2.33                  2.03
FNI (%)     30.28                 27.31                 28.80
TNI (%)     69.72                 72.69                 71.21
TPI (%)     98.28                 97.67                 97.98
26
Light at the end of the tunnel?
Payment per e-mail?
Quite unlikely…
E-mail authentication by SIDF
• Sender ID Framework (by Microsoft)
• … registered list of servers of domain owners
• Confirmation of e-mail source domain (automatically, by ISPs)
• Protects 40% of legitimate email sent worldwide
• Helps combat phishing scams / domain spoofing (forging a sender's address)
27
Light at the end of the tunnel?
DomainKeys Identified Mail (DKIM)
• Similar technology by Yahoo, Cisco Systems, Sendmail, PGP
• Based on digital signatures
• An official proposed standard of the Internet Engineering Task Force (IETF)
28
Thank You For Your Attention
Questions?