SCAVENGER: A JUNK MAIL CLASSIFICATION PROGRAM
Rohan Malkhare
Committee:
Dr. Eugene Fink
Dr. Dewey Rundus
Dr. Alan Hevner
Introduction

Anti-spam efforts
• Legislation
• Technology
– White-listing of email addresses
– Black-listing of email addresses/domains
– Challenge-response mechanisms
– Content filtering
• Learning techniques

Learning techniques for spam classification
• Feature extraction
• Assignment of weights to individual features, representing the predictive strength of each feature
• Combining the weights of the extracted features during classification to numerically determine whether a mail is spam or legitimate

Current algorithms
• Words or phrases as features
• Probabilities of occurrence in the spam/legitimate collections as weights
• Bayes' rule or one of its variants for combining weights
Related Work
• Cohen (1996):
– RIPPER, a rule-learning system
– Rules in a human-comprehensible format
• Pantel & Lin (1998):
– Naïve Bayes with words as features
• Microsoft Research (1998):
– Naïve Bayes with the mutual-information measure to select the features with the strongest resolving power
– Words and domain-specific attributes of spam used as features
• Paul Graham (2002): "A Plan for Spam"
– Very popular algorithm, credited with starting the craze for Bayesian filters
– Uses naïve Bayes with words as features
• Bill Yerazunis (2002): the CRM114 sparse binary polynomial hashing algorithm
– Most accurate algorithm to date (over 99.7% accuracy)
– Distinctive because of its powerful feature-extraction technique
– Uses the Bayesian chain rule for combining weights
• CRM114 feature extraction
– Slide a window of 5 words over the incoming text
– Generate order-preserving sub-phrases containing all combinations of the windowed words
– For one window, 2^4 = 16 features are generated
– Very high computational complexity
– E.g. "Click here to buy Viagra": the features generated would be "Click", "Click here", "Click to", "Click buy", "Click Viagra", "Click here to", "Click here buy", etc.
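The windowed sub-phrase generation above can be sketched as follows. This is a hypothetical illustration of the combinatorial idea only (function name invented), not CRM114's actual hash-based implementation:

```python
from itertools import combinations

def sbph_features(words, window=5):
    """Generate CRM114-style order-preserving sub-phrases.

    The first word of each window is always kept; each of the
    remaining window-1 words is either included or skipped,
    giving 2**(window-1) = 16 features per 5-word window.
    """
    features = []
    for start in range(max(len(words) - window + 1, 1)):
        w = words[start:start + window]
        rest = range(1, len(w))            # indices that may be skipped
        for r in range(len(w)):
            for subset in combinations(rest, r):
                features.append(" ".join([w[0]] + [w[i] for i in subset]))
    return features

# The slide's example: one 5-word window yields 16 features,
# among them "Click", "Click here to", and "Click Viagra".
feats = sbph_features("Click here to buy Viagra".split())
```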
Algorithm
• Feature extraction
– Sentences in a message are identified by the delimiting characters '.', '?', '!', ';', '<', '>'
– All possible word-pairings are formed from each sentence
– Commonly occurring words are skipped
– These word-pairings serve as the features used for classification
• Feature extraction (continued)
– If the number of words grows beyond a constant K, each series of K words is treated as a sentence
– The value of K is set to 20
– E.g. for "There is a problem in the tables that have been copied to the database", the features formed from the sentence would be "problem tables", "tables problem", "problem copied", "copied problem", "problem database", "database problem", etc.
• Feature extraction (continued)
– The entire subject line is treated as one sentence
– For HTML, all content within '<' and '>' is treated as one sentence
– For a sentence of n words, 'scavenger' creates (n-1)*(n-2) features, compared to the 2^(n-1) created by CRM114
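A minimal sketch of the word-pairing extraction described above. The stop-word list and helper name are hypothetical (the talk only says that commonly occurring words are skipped), and this version simply forms every ordered pair of remaining words within a sentence:

```python
import re

# Hypothetical stop-word list; the algorithm only specifies that
# commonly occurring words are skipped.
STOP_WORDS = {"there", "is", "a", "in", "the", "that", "have", "been", "to"}

def pair_features(text, k=20):
    """Split text into sentences on the delimiters . ? ! ; < >,
    drop common words, break long runs into pseudo-sentences of
    at most k words, and emit all ordered word-pairings."""
    features = []
    for raw in re.split(r"[.?!;<>]", text):
        words = [w for w in raw.split() if w.lower() not in STOP_WORDS]
        for i in range(0, len(words), k):   # K-word pseudo-sentences
            sent = words[i:i + k]
            features.extend((a, b)
                            for x, a in enumerate(sent)
                            for y, b in enumerate(sent) if x != y)
    return features
```

On the slide's example sentence this yields pairings such as ("problem", "tables"), ("tables", "problem"), and ("problem", "database").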
• Weight assignment
– Weights represent the predictive strength of features
– Discretized values are assigned as weights to features, depending on whether the feature is a 'strong' piece of evidence or a 'weak' one
– 'Strong' pieces of evidence should have a high impact on the classification decision and 'weak' pieces of evidence a low impact
• Weight assignment (continued)
– Categorization of features into 'strong' and 'weak' pieces of evidence is done on the basis of the frequency of occurrence of the feature in the spam/legitimate collections, the exclusivity of occurrence, and heuristic rules such as the distance between the words in the word-pairing and whether the feature is from the subject or the body
– Only exclusively occurring features are assigned weights
– Features occurring in both the spam and legitimate collections are ignored
• Weight assignment (continued)
– What weights should be selected for the 'strong' and the 'weak' evidence?
– During classification, the class having more pieces of 'strong' evidence should 'win', regardless of the number of pieces of 'weak' evidence on either side
– In the absence of 'strong' evidence on either side, the class having more pieces of 'weak' evidence should 'win'
• Weight assignment (continued)
– Intuitively, we would like as much 'distance' as possible between the values chosen for the 'strong' and the 'weak' evidence
– We select 0.9 as the weight for 'strong' evidence and 0.1 as the weight for 'weak' evidence
• Combining of weights
– Total spam evidence = sum of the spam weights of the matching features
– Total legitimate evidence = sum of the legitimate weights of the matching features
– If total spam evidence >= M * total legitimate evidence, then the message is spam
– M is the threshold parameter, which can be used as a 'tuning knob'
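Putting the weight scheme and the threshold together, the decision rule might look like this sketch. The weight dictionaries stand in for the trained model, and the feature names are invented for illustration:

```python
STRONG, WEAK = 0.9, 0.1  # weights from the slides

def classify(features, spam_weights, legit_weights, m=1.0):
    """Sum the per-class weights of the features that match,
    then apply the threshold M: the message is spam if its total
    evidence is at least M times the legitimate evidence."""
    spam_evidence = sum(spam_weights.get(f, 0.0) for f in features)
    legit_evidence = sum(legit_weights.get(f, 0.0) for f in features)
    return "spam" if spam_evidence >= m * legit_evidence else "legitimate"

# Hypothetical trained weights: each feature occurs exclusively in
# one collection, so it carries only that class's weight.
spam_w = {("buy", "viagra"): STRONG, ("click", "here"): WEAK}
legit_w = {("meeting", "tomorrow"): STRONG}
```

Raising M trades recall for precision: a message needs proportionally more spam evidence before it is flagged.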
Measurements
• Precision and recall are used as the parameters of measurement
– Spam precision = messages correctly classified as spam / total messages classified as spam
– Spam recall = messages correctly classified as spam / total spam messages in the testing set
– Precision gives the accuracy with respect to false positives
– Recall gives the capacity of the filter to catch spam
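The two measures reduce to simple ratios over the confusion counts; a small sketch with invented example counts:

```python
def spam_precision(true_pos, false_pos):
    """Fraction of messages classified as spam that really are spam."""
    return true_pos / (true_pos + false_pos)

def spam_recall(true_pos, false_neg):
    """Fraction of the actual spam that the filter caught."""
    return true_pos / (true_pos + false_neg)

# Hypothetical run: 997 spam caught, 3 legitimate mails flagged
# as spam (false positives), 5 spam missed (false negatives).
precision = spam_precision(997, 3)   # 0.997
recall = spam_recall(997, 5)
```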
• Testing data
– Downloaded around 5600 spam messages from http://www.spamarchive.org
– Used around 960 legitimate mails from Dr. Fink's mailbox
• Cross-validation
– K-fold cross-validation for two values of K: K=2 and K=5
– K=2: dividing the data into 2 equal-sized sets
– K=5: dividing the data into 5 equal-sized sets
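The K-fold scheme above can be sketched as a generator; this is a generic illustration (names invented), not the thesis's actual test harness:

```python
def k_fold_splits(data, k):
    """Split data into k equal-sized folds; each fold in turn
    serves as the test set while the rest form the training set."""
    fold_size = len(data) // k
    folds = [data[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

In practice the messages would be shuffled before splitting so each fold mixes spam and legitimate mail.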
• Comparison with Paul Graham's naïve-Bayes algorithm
• Implemented Graham's algorithm with two methods of feature extraction
– Words + phrases as features
– Feature extraction similar to 'scavenger'
ALGORITHM                                         K=5 SPAM     K=5 SPAM   K=2 SPAM     K=2 SPAM
                                                  PRECISION    RECALL     PRECISION    RECALL
                                                  (AVERAGE)    (AVERAGE)  (AVERAGE)    (AVERAGE)
Scavenger (M=1)                                   100%         99.85%     99.92%       99.72%
Naïve-Bayes (words+phrases)                       100%         98.87%     99.80%       97.03%
Naïve-Bayes (with scavenger feature extraction)   100%         99.15%     99.65%       98.68%
M      Missed Spam (%)   False Positives (%)   Spam Recall (%)
0.25   0                 13.23                 100
0.5    0.07              6.17                  99.93
0.75   0.11              2.05                  99.89
1      0.28              0.58                  99.72
1.25   0.3               0.58                  99.70
1.5    0.35              0.58                  99.66
1.75   0.41              0.29                  99.59
2      0.49              0                     99.51
2.25   0.6               0                     99.4
2.5    0.74              0                     99.36
[Figure: ROC curve plotting missed spam rate (%) on the x-axis against false positive rate (%) on the y-axis, across the threshold values of M]
• Why does 'scavenger' perform better than naïve Bayes?
– Powerful feature extraction (as powerful as CRM114's)
– Calculates predictive strength on the basis of frequency of occurrence as well as heuristic rules
Implementation
• Windows PC-based filter
• Runs for individual email accounts on IMAP mail servers
• Three modules
– Configuration
– Training
– Classification
• The classifier runs as a Windows service
• Connects to the mail server every ten minutes
• Downloads new messages and classifies them
• Moves messages classified as spam to a pre-configured folder on the server
Future Work
• Incorporating message headers during feature extraction step
• Incorporating domain-specific attributes of spam during weight combination step
Publications
• Dr. William Yerazunis (inventor of CRM114) mentioned the 'scavenger' algorithm at the MIT Spam Conference on Jan 16, 2004
• To be published at the First Conference on Email and Anti-Spam in Palo Alto, California, in July 2004