SCAVENGER: A JUNK MAIL CLASSIFICATION PROGRAM
Rohan Malkhare
Committee:
Dr. Eugene Fink
Dr. Dewey Rundus
Dr. Alan Hevner
Introduction

Anti-spam efforts
• Legislation
• Technology
– White-listing of email addresses
– Black-listing of email addresses/domains
– Challenge-response mechanisms
– Content filtering
• Learning techniques

Learning techniques for spam classification
• Feature extraction
• Assignment of weights to individual features, representing the predictive strength of each feature
• Combining the weights of the extracted features during classification to numerically determine whether a mail is spam or legitimate

Current algorithms
• Words or phrases as features
• Probabilities of occurrence in the spam/legitimate collections as weights
• Bayes' rule or one of its variants for combining weights
Related Work
• Cohen (1996):
– RIPPER, a rule-learning system
– Rules in a human-comprehensible format
• Pantel & Lin (1998):
– Naïve Bayes with words as features
• Microsoft Research (1998):
– Naïve Bayes with the mutual-information measure to select the features with the strongest resolving power
– Words and domain-specific attributes of spam used as features
• Paul Graham (2002): "A Plan for Spam"
– Very popular algorithm, credited with starting the craze for Bayesian filters
– Uses naïve Bayes with words as features
• Bill Yerazunis (2002): the CRM114 sparse binary polynomial hashing algorithm
– Most accurate algorithm to date (over 99.7% accuracy)
– Distinctive because of its powerful feature-extraction technique
– Uses the Bayesian chain rule for combining weights
• CRM114 feature extraction
– Slide a window of 5 words over the incoming text
– Generate order-preserving sub-phrases containing all combinations of the windowed words
– For one window, 2^4 = 16 features are generated
– Very high computational complexity
– E.g. "Click here to buy Viagra": the features generated would be "Click", "Click here", "Click to", "Click buy", "Click Viagra", "Click here to", "Click here buy", etc.
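The windowed sub-phrase generation above can be sketched as follows. This is a hypothetical illustration of the combinatorial idea only (function name invented), not CRM114's actual hash-based implementation:

```python
from itertools import combinations

def sbph_features(words, window=5):
    """Generate CRM114-style order-preserving sub-phrases.

    The first word of each window is always kept; each of the
    remaining window-1 words is either included or skipped,
    giving 2**(window-1) = 16 features per 5-word window.
    """
    features = []
    for start in range(max(len(words) - window + 1, 1)):
        w = words[start:start + window]
        rest = range(1, len(w))            # indices that may be skipped
        for r in range(len(w)):
            for subset in combinations(rest, r):
                features.append(" ".join([w[0]] + [w[i] for i in subset]))
    return features

# The slide's example: one 5-word window yields 16 features,
# among them "Click", "Click here to", and "Click Viagra".
feats = sbph_features("Click here to buy Viagra".split())
```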
Algorithm
• Feature extraction
– Sentences in a message are identified by the delimiting characters '.', '?', '!', ';', '<', '>'
– All possible word-pairings are formed from each sentence
– Commonly occurring words are skipped
– These word-pairings serve as the features used for classification
• Feature extraction (continued)
– If the number of words grows beyond a constant K, each series of K words is treated as a sentence
– The value of K is set to 20
– E.g. for "There is a problem in the tables that have been copied to the database", the features formed from the sentence would be "problem tables", "tables problem", "problem copied", "copied problem", "problem database", "database problem", etc.
• Feature extraction (continued)
– The entire subject line is treated as one sentence
– For HTML, all content within '<' and '>' is treated as one sentence
– For a sentence of n words, 'scavenger' creates (n-1)*(n-2) features, compared to the 2^(n-1) created by CRM114
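A minimal sketch of the word-pairing extraction described above. The stop-word list and helper name are hypothetical (the talk only says that commonly occurring words are skipped), and this version simply forms every ordered pair of remaining words within a sentence:

```python
import re

# Hypothetical stop-word list; the algorithm only specifies that
# commonly occurring words are skipped.
STOP_WORDS = {"there", "is", "a", "in", "the", "that", "have", "been", "to"}

def pair_features(text, k=20):
    """Split text into sentences on the delimiters . ? ! ; < >,
    drop common words, break long runs into pseudo-sentences of
    at most k words, and emit all ordered word-pairings."""
    features = []
    for raw in re.split(r"[.?!;<>]", text):
        words = [w for w in raw.split() if w.lower() not in STOP_WORDS]
        for i in range(0, len(words), k):   # K-word pseudo-sentences
            sent = words[i:i + k]
            features.extend((a, b)
                            for x, a in enumerate(sent)
                            for y, b in enumerate(sent) if x != y)
    return features
```

On the slide's example sentence this yields pairings such as ("problem", "tables"), ("tables", "problem"), and ("problem", "database").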
• Weight assignment
– Weights represent the predictive strength of features
– Discretized values are assigned as weights to features, depending on whether the feature is a 'strong' piece of evidence or a 'weak' one
– 'Strong' pieces of evidence should have a high impact on the classification decision and 'weak' pieces of evidence a low impact
• Weight assignment (continued)
– Categorization of features into 'strong' and 'weak' pieces of evidence is done on the basis of the frequency of occurrence of the feature in the spam/legitimate collections, the exclusivity of occurrence, and heuristic rules such as the distance between the words in the word-pairing and whether the feature is from the subject or the body
– Only exclusively occurring features are assigned weights
– Features occurring in both the spam and legitimate collections are ignored
• Weight assignment (continued)
– What weights should be selected for the 'strong' and the 'weak' evidence?
– During classification, the class having more pieces of 'strong' evidence should 'win', regardless of the number of pieces of 'weak' evidence on either side
– In the absence of 'strong' evidence on either side, the class having more pieces of 'weak' evidence should 'win'
• Weight assignment (continued)
– Intuitively, we would like as much 'distance' as possible between the values chosen for the 'strong' and the 'weak' evidence
– We select 0.9 as the weight for 'strong' evidence and 0.1 as the weight for 'weak' evidence
• Combining of weights
– Total spam evidence = sum of the spam weights of the matching features
– Total legitimate evidence = sum of the legitimate weights of the matching features
– If total spam evidence >= M * total legitimate evidence, then the message is spam
– M is the threshold parameter, which can be used as a 'tuning knob'
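Putting the weight scheme and the threshold together, the decision rule might look like this sketch. The weight dictionaries stand in for the trained model, and the feature names are invented for illustration:

```python
STRONG, WEAK = 0.9, 0.1  # weights from the slides

def classify(features, spam_weights, legit_weights, m=1.0):
    """Sum the per-class weights of the features that match,
    then apply the threshold M: the message is spam if its total
    evidence is at least M times the legitimate evidence."""
    spam_evidence = sum(spam_weights.get(f, 0.0) for f in features)
    legit_evidence = sum(legit_weights.get(f, 0.0) for f in features)
    return "spam" if spam_evidence >= m * legit_evidence else "legitimate"

# Hypothetical trained weights: each feature occurs exclusively in
# one collection, so it carries only that class's weight.
spam_w = {("buy", "viagra"): STRONG, ("click", "here"): WEAK}
legit_w = {("meeting", "tomorrow"): STRONG}
```

Raising M trades recall for precision: a message needs proportionally more spam evidence before it is flagged.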
Measurements
• Precision and recall are used as the parameters of measurement
– Spam precision = messages correctly classified as spam / total messages classified as spam
– Spam recall = messages correctly classified as spam / total spam messages in the testing set
– Precision gives the accuracy with respect to false positives
– Recall gives the capacity of the filter to catch spam
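The two measures reduce to simple ratios over the confusion counts; a small sketch with invented example counts:

```python
def spam_precision(true_pos, false_pos):
    """Fraction of messages classified as spam that really are spam."""
    return true_pos / (true_pos + false_pos)

def spam_recall(true_pos, false_neg):
    """Fraction of the actual spam that the filter caught."""
    return true_pos / (true_pos + false_neg)

# Hypothetical run: 997 spam caught, 3 legitimate mails flagged
# as spam (false positives), 5 spam missed (false negatives).
precision = spam_precision(997, 3)   # 0.997
recall = spam_recall(997, 5)
```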
• Testing data
– Downloaded around 5600 spam messages from http://www.spamarchive.org
– Used around 960 legitimate mails from Dr. Fink's mailbox
• Cross-validation
– K-fold cross-validation for two values of K: K=2 and K=5
– K=2: dividing the data into 2 equal-sized sets
– K=5: dividing the data into 5 equal-sized sets
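The K-fold scheme above can be sketched as a generator; this is a generic illustration (names invented), not the thesis's actual test harness:

```python
def k_fold_splits(data, k):
    """Split data into k equal-sized folds; each fold in turn
    serves as the test set while the rest form the training set."""
    fold_size = len(data) // k
    folds = [data[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test
```

In practice the messages would be shuffled before splitting so each fold mixes spam and legitimate mail.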
• Comparison with Paul Graham's naïve-Bayes algorithm
• Implemented Graham's algorithm with two methods of feature extraction
– Words + phrases as features
– Feature extraction similar to 'scavenger'
ALGORITHM                                         K=5 SPAM     K=5 SPAM   K=2 SPAM     K=2 SPAM
                                                  PRECISION    RECALL     PRECISION    RECALL
                                                  (AVERAGE)    (AVERAGE)  (AVERAGE)    (AVERAGE)
Scavenger (M=1)                                   100%         99.85%     99.92%       99.72%
Naïve-Bayes (words+phrases)                       100%         98.87%     99.80%       97.03%
Naïve-Bayes (with scavenger feature extraction)   100%         99.15%     99.65%       98.68%
M      Missed Spam (%)   False Positives (%)   Spam Recall (%)
0.25   0                 13.23                 100
0.5    0.07              6.17                  99.93
0.75   0.11              2.05                  99.89
1      0.28              0.58                  99.72
1.25   0.3               0.58                  99.70
1.5    0.35              0.58                  99.66
1.75   0.41              0.29                  99.59
2      0.49              0                     99.51
2.25   0.6               0                     99.4
2.5    0.74              0                     99.36
[Figure: ROC curve plotting missed spam rate (%) on the x-axis against false positive rate (%) on the y-axis, across the threshold values of M]
• Why does 'scavenger' perform better than naïve Bayes?
– Powerful feature extraction (as powerful as CRM114's)
– Calculates predictive strength on the basis of frequency of occurrence as well as heuristic rules
Implementation
• Windows PC-based filter
• Runs for individual email accounts on IMAP mail servers
• Three modules
– Configuration
– Training
– Classification
• The classifier runs as a Windows service
• Connects to the mail server every ten minutes
• Downloads new messages and classifies them
• Moves messages classified as spam to a pre-configured folder on the server
Future Work
• Incorporating message headers during feature extraction step
• Incorporating domain-specific attributes of spam during weight combination step
Publications
• Dr. William Yerazunis (inventor of CRM114) mentioned the 'scavenger' algorithm at the MIT Spam Conference on Jan 16, 2004
• To be published at the First Conference on Email and Anti-Spam in Palo Alto, California, in July 2004