SPAM FILTERING
By Ankur Khator (01005028), Gaurav Sharma (01005029), Arpit Mathur (01D05014)


Page 1:

SPAM FILTERING
By Ankur Khator (01005028), Gaurav Sharma (01005029), Arpit Mathur (01D05014)

Page 2:

What is Spam Email?

"Junk email" or "unsolicited commercial email".

Spam filtering is a special case of email classification.

Only 2 classes: spam and non-spam.

Page 3:

Various Approaches

Bayesian learning: probabilistic model for spam filtering; bag-of-words representation.

Ripper algorithm: context-sensitive learning.

Boosting algorithm: improving accuracy by combining weaker hypotheses.

Page 4:

Term Vectors
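As a minimal sketch of the bag-of-words representation used in the following slides, an email can be mapped to a binary term vector over a fixed vocabulary. The vocabulary, tokenizer, and example message here are illustrative assumptions, not taken from the slides.

import re

# Illustrative vocabulary; a real filter would build this from the training corpus.
VOCABULARY = ["free", "gift", "click", "meeting", "quick", "rabbit"]

def term_vector(message):
    """Map a message to a binary term vector: Xi = 1 if word i occurs, else 0."""
    tokens = set(re.findall(r"[a-z]+", message.lower()))
    return [1 if word in tokens else 0 for word in VOCABULARY]

print(term_vector("Click here for a FREE gift"))  # [1, 1, 1, 0, 0, 0]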

Page 5:

Naive Bayes for Spam

Seeking a model for

  P(Y=1 | X1=x1, X2=x2, ..., Xd=xd)

From Bayes' theorem:

  P(Y=1 | X1=x1, ..., Xd=xd) = P(Y=1) * P(X1=x1, ..., Xd=xd | Y=1) / P(X1=x1, ..., Xd=xd)

  P(Y=0 | X1=x1, ..., Xd=xd) = P(Y=0) * P(X1=x1, ..., Xd=xd | Y=0) / P(X1=x1, ..., Xd=xd)

Page 6:

Justification of using Bayes' theorem

Sparseness of data: P(B | A) can be determined easily and accurately, as compared to P(A | B).

Page 7:

Naive Bayes for Spam (contd.)

Assume conditional independence:

  P(X1=x1, ..., Xd=xd | Y=k) = ∏_i P(Xi=xi | Y=k)

Also assume binary features:

  Xi = 1 if the number of occurrences of word i >= 1
  Xi = 0 otherwise

Page 8:

Naive Bayes for Spam (contd.)

The per-word log ratios (presumably log P(Xi=xi | Y=1) / P(Xi=xi | Y=0)) are referred to as weights of evidence.

There is an inconsistency when some estimated probability is zero.

Smooth the estimates by adding a small positive constant to both the numerator and denominator of each probability estimate.
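A minimal sketch of this model in Python, assuming binary (Bernoulli) word features and add-one smoothing; the constant alpha and the helper names are illustrative, not from the slides.

import math

def train_naive_bayes(X, y, alpha=1.0):
    """X: list of binary term vectors; y: list of labels (1 = spam, 0 = legitimate)."""
    n = len(y)
    n_spam = sum(y)
    priors = {1: n_spam / n, 0: (n - n_spam) / n}
    d = len(X[0])
    cond = {}
    for k in (0, 1):
        rows = [x for x, label in zip(X, y) if label == k]
        # Smoothed estimate of P(Xi = 1 | Y = k): (count + alpha) / (n_k + 2 * alpha)
        cond[k] = [(sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                   for i in range(d)]
    return priors, cond

def log_odds(x, priors, cond):
    """Prior log-odds plus the per-word weights of evidence."""
    score = math.log(priors[1] / priors[0])
    for i, xi in enumerate(x):
        p1 = cond[1][i] if xi else 1.0 - cond[1][i]
        p0 = cond[0][i] if xi else 1.0 - cond[0][i]
        score += math.log(p1 / p0)
    return score

def spam_probability(x, priors, cond):
    """Recover P(Y = 1 | X = x) from the log-odds via the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds(x, priors, cond)))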

Page 9:

Classifying

Assume a new mail with the text "The quick rabbit rests".

Summing the per-word weights of evidence: 0.51 + 0.51 + 0 + 0.51 + 1.10 + 0 = 2.63

Probability = 0.93
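The 0.93 presumably comes from passing the summed score through the logistic function; a one-line check:

import math
print(1 / (1 + math.exp(-2.63)))   # ≈ 0.933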

Page 10:

Threshold

Lower threshold → higher false positive rate.

Higher threshold → higher false negative rate; this is preferred, since flagging legitimate mail as spam is costlier than letting some spam through.

Page 11:

Non-Linear Classification

A linear classifier ignores the effect of a word's context on its meaning, which is unrealistic.

Building a linear classifier that tests for more complex features, such as simultaneous occurrences, has a high computation cost.

Non-linear classification is the solution.

Page 12:

Ripper

A rule set is a disjunction of different contexts; each context is a conjunction of simple terms.

The context of w1 is: w2 belongs to the data and w3 belongs to the data,
i.e., for the context to be true, w1 must occur together with w2 and w3.

Three components of the Ripper algorithm:

Page 13:

Rule Learning:

  spam ← spam ∈ Subject
  spam ← Free ∈ Subject, spam ∈ Subject
  spam ← Gift!! ∈ Subject, Click ∈ Subject

The rule is the disjunction of the three statements above. There is an initial set of rules too.
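A minimal sketch of how such a rule set (a disjunction of conjunctive contexts) might be represented and applied; the tokenizer and the exact rule encoding are illustrative assumptions.

def tokens(text):
    return set(text.lower().split())

# Each rule is a conjunction of words that must all appear in the Subject;
# the rule set fires (predicts spam) if any rule's conditions are all met.
RULES = [
    {"spam"},
    {"free", "spam"},
    {"gift!!", "click"},
]

def predict_spam(subject):
    words = tokens(subject)
    return any(rule <= words for rule in RULES)

print(predict_spam("Free spam offer"))   # True
print(predict_spam("Project meeting"))   # False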

Page 14:

Constructing Rule Set

The initial rule set is constructed using a greedy strategy, based on IREP (Incremental Reduced Error Pruning).

To construct a new rule, the dataset is partitioned into two parts: a training set and a pruning set.

A single condition is added to the rule at each step.

Page 15:

Simplification and Optimization

At every step, the density of positive examples covered is increased.

Conditions are added until the clause covers no negative examples or there is no positive gain.

After this, pruning (i.e., simplification) is done, again following a greedy strategy at every stage.
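A rough sketch of the grow-then-prune loop from the last two slides. The gain criterion (precision of covered examples) and the pruning score ((p - n) / (p + n)) are simplified stand-ins chosen for illustration, not necessarily the exact functions RIPPER uses; vocabulary is assumed to be a set of candidate words and examples are sets of words.

def covers(rule, example):
    # A rule (set of required words) covers an example if every condition holds.
    return rule <= example

def precision(rule, pos, neg):
    p = sum(covers(rule, e) for e in pos)
    n = sum(covers(rule, e) for e in neg)
    return p / (p + n) if (p + n) else 0.0

def grow_rule(grow_pos, grow_neg, vocabulary):
    # Greedily add one condition at a time while it raises the density of positives covered.
    rule = set()
    while any(covers(rule, e) for e in grow_neg):
        best, best_score = None, precision(rule, grow_pos, grow_neg)
        for w in vocabulary - rule:
            score = precision(rule | {w}, grow_pos, grow_neg)
            if score > best_score:
                best, best_score = w, score
        if best is None:        # no positive gain: stop
            break
        rule.add(best)
    return rule

def prune_rule(rule, prune_pos, prune_neg):
    # Greedily delete conditions while the pruning score (p - n) / (p + n) improves.
    def score(r):
        p = sum(covers(r, e) for e in prune_pos)
        n = sum(covers(r, e) for e in prune_neg)
        return (p - n) / (p + n) if (p + n) else -1.0
    improved = True
    while improved and rule:
        improved = False
        for w in list(rule):
            if score(rule - {w}) > score(rule):
                rule = rule - {w}
                improved = True
    return rule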

Page 16:

Reaching Sufficient Rules

The clause whose deletion maximizes the pruning function is removed, where U+ and U- are the covered positive and negative examples.

Termination is governed by the information gain being non-zero, i.e., every rule covers some positive examples.

But if the data is noisy, the number of rules increases.

Page 17:

MDL

Several heuristics are applied to solve this problem; MDL (Minimum Description Length) is one of them.

After each rule is added, the total description length of the current rule set and the examples is calculated.

Rule addition stops when this length is d bits larger than the shortest length found so far.

Page 18:

AdaBoost

It is easy to find rules of thumb that are often correct, e.g.:

  if "buy now" occurs in the message, then predict "spam".

It is hard to find a single rule that is very accurate. AdaBoost helps here:
a general method for converting rough rules of thumb into a highly accurate prediction rule,
by concentrating on the hard examples.

Page 19:

Pictorially

Page 20:

Algorithm

Input: S = {(x_i, y_i)}, i = 1, ..., m

Initialize D_1(i) = 1/m for all i

For t = 1 to T:
  h_t = WeakLearner(S, D_t)
  Choose β_t = (1/2) ln((1 - ε_t) / ε_t) (proven to minimize the error for the 2-class case) [2]
  Update D_{t+1}(i) = D_t(i) exp(-β_t y_i h_t(x_i)) and normalize

Final hypothesis: f(x) = Σ_t β_t h_t(x)
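A compact sketch of this loop in Python, using one-feature decision stumps over binary term vectors as the weak learner; labels are assumed to be ±1 and the tiny corpus at the end is purely illustrative.

import math

def stump_learner(X, y, D):
    """Weak learner: pick the single binary feature (and polarity) with lowest weighted error."""
    m, d = len(X), len(X[0])
    best = None
    for j in range(d):
        for polarity in (1, -1):
            predict = lambda x, j=j, s=polarity: s if x[j] == 1 else -s
            err = sum(D[i] for i in range(m) if predict(X[i]) != y[i])
            if best is None or err < best[0]:
                best = (err, predict)
    return best[1], best[0]

def adaboost(X, y, T=10):
    m = len(X)
    D = [1.0 / m] * m                              # initialize D_1(i) = 1/m
    hypotheses = []
    for _ in range(T):
        h, eps = stump_learner(X, y, D)
        eps = min(max(eps, 1e-10), 1 - 1e-10)      # guard against log(0) / division by zero
        beta = 0.5 * math.log((1 - eps) / eps)
        # Re-weight: increase weight on examples the weak hypothesis got wrong
        D = [D[i] * math.exp(-beta * y[i] * h(X[i])) for i in range(m)]
        Z = sum(D)
        D = [w / Z for w in D]
        hypotheses.append((beta, h))
    def f(x):
        return sum(beta * h(x) for beta, h in hypotheses)
    return f

# Illustrative corpus: feature 0 = "buy now" present, feature 1 = "meeting" present
X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, -1, -1]
f = adaboost(X, y, T=5)
print([1 if f(x) > 0 else -1 for x in X])   # expected: [1, 1, -1, -1]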

Page 21:

Example

Page 22:

Example

Page 23:

Accuracy

Weighted accuracy measure:

  WAcc = (λ·L⁻ + S⁺) / (λ·L + S)

where
  λ  : strictness measure
  L  : number of legitimate messages
  S  : number of spam messages
  L⁻ : number of legitimate messages classified as legitimate
  S⁺ : number of spam messages classified as spam

Improving accuracy:
  Increase λ.
  Introduce a threshold θ: an example is classified positive only if f(x) > θ (the default is zero).

Recall: correctly predicted spam out of the number of spam messages in the corpus.

Precision: correctly classified spam out of the number of messages predicted as spam.
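A small sketch of these measures, assuming the counts are already available; the numbers in the final call are made up for illustration.

def weighted_accuracy(lam, n_legit, n_spam, legit_as_legit, spam_as_spam):
    """WAcc = (λ·L⁻ + S⁺) / (λ·L + S)."""
    return (lam * legit_as_legit + spam_as_spam) / (lam * n_legit + n_spam)

def recall(spam_as_spam, n_spam):
    return spam_as_spam / n_spam

def precision(spam_as_spam, predicted_spam):
    return spam_as_spam / predicted_spam

# Illustrative counts only.
print(weighted_accuracy(lam=9, n_legit=500, n_spam=200, legit_as_legit=498, spam_as_spam=180))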

Page 24:

Results on corpus PU1 [1]

Weak learner    θ       λ     T     Recall   Precision   Acc
Tree depth 1    10.2    9     525   93.55    98.71       98.59
Tree depth 1    46.9    999   550   74.43    100.00      99.98
Tree depth 5    37.4    9     525   93.97    99.12       98.92
Tree depth 5    178     999   550   66.53    100.00      99.97

Page 25:

Pros and Cons

Pros:
  Fast and simple
  No parameters to tune
  Flexible: can combine with any learning algorithm
  No knowledge needed of the WeakLearner
  Error reduces exponentially
  Robust to overfitting

Cons:
  Data driven: requires lots of data
  Performance depends on the WeakLearner
  May fail if the WeakLearner is too weak

Page 26:

Conclusion

RIPPER as a text categorization algorithm works better than Naive Bayes (the gap is larger with more classes).

The two are comparable for spam filtering (2 classes).

Boosting is better than any weak learner it works on.

Page 27:

References

[1] Xavier Carreras and Lluís Màrquez. Boosting Trees for Anti-Spam Email Filtering. 2001.

[2] Robert E. Schapire. The Boosting Approach to Machine Learning: An Overview. MSRI Workshop on Nonlinear Estimation and Classification, 2002.

[3] David Madigan. Statistics and the War on Spam. 2004.

[4] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos. An Evaluation of Naive Bayesian Anti-Spam Filtering. In Proc. of the Workshop on Machine Learning in the New Information Age, 2000. http://citeseer.ist.psu.edu/androutsopoulos00evaluation.html

[5] William W. Cohen and Yoram Singer. Context-Sensitive Learning Methods for Text Categorization. SIGIR 1996: 307-315.