Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN Model



Yun-Nung Chen, Che-An Lu, Chao-Yu Huang
AI Term Project, Group 14, 2009
National Taiwan University
E-mail: [email protected], [email protected], [email protected]

Abstract: Naïve Bayes spam email filters are a well-known and powerful type of filter. We construct different filters using three types of classifiers: Naïve Bayes, SVM, and KNN. We compare the pros and cons of these three types and apply several approaches to improve them and obtain a better spam filter.

Index Terms: Spam filter, Naïve Bayes, SVM, KNN

    1 INTRODUCTION

Mass unsolicited electronic mail, often known as spam, has recently increased enormously and has become a serious threat not only to the Internet but also to society. Over the past few years, many approaches have been proposed to filter spam, and many of them use the Naïve Bayes method.

    2 PROPOSED APPROACHES

    2.1 Problem Definition

Construct a filter which can prevent spam from getting into the mailbox.

Input: a mail message
Output: spam or ham

    2.2 Proposed Solution

Figure 1. System Flowchart

We use machine learning methods to train a model and decide whether an input message is ham or spam. We implement the spam filter with three methods: Naïve Bayes, SVM, and KNN.

We first compare the three methods. After that, we modify each method independently to improve its accuracy and compare the versions of each method. Finally, we draw conclusions from all the improvements to these three methods.

The details of our methods follow.

Naïve Bayes

From the Bayesian theorem of total probability, given the feature vector $\vec{x} = (x_1, x_2, \ldots, x_n)$ of a mail $d$, the probability that $d$ belongs to category $C$ is:

$$P(C \mid \vec{x}) = \frac{P(C)\prod_{i=1}^{n} P(x_i \mid C)}{\sum_{C'} P(C')\prod_{i=1}^{n} P(x_i \mid C')} \qquad (1)$$

where $C \in \{\text{spam}, \text{ham}\}$ and $x_i$ is the $i$-th word of the mail.

We decide that the mail belongs to the class with the higher probability. A unigram language model is used to compute the class-conditional probabilities.
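To make the decision rule concrete, here is a minimal sketch of the unigram Naïve Bayes scoring step, assuming the per-class word probabilities have already been estimated from the training mails. The toy probability tables and the `classify` helper are illustrative only, not the authors' actual SRILM-based code.

```python
import math

# Hypothetical unigram probabilities P(word | class), e.g. estimated
# from word counts in the training mails (with some smoothing).
unigram = {
    "spam": {"free": 0.02, "money": 0.015, "meeting": 0.001},
    "ham":  {"free": 0.003, "money": 0.002, "meeting": 0.01},
}
prior = {"spam": 2 / 3, "ham": 1 / 3}   # class priors, e.g. from the 2:1 corpus ratio
UNSEEN = 1e-6                           # fallback probability for unseen words

def classify(words):
    """Return the class with the higher posterior, as in Eq. (1).

    Log probabilities are summed instead of multiplying raw probabilities
    to avoid numerical underflow on long mails.
    """
    best_class, best_score = None, float("-inf")
    for c in ("spam", "ham"):
        score = math.log(prior[c])
        for w in words:
            score += math.log(unigram[c].get(w, UNSEEN))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["free", "money"]))   # likely "spam" with these toy numbers
```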

In order to get better features (words carrying more information), we preprocess the messages to remove noisy words and keep the more important ones. We remove words longer than 50 characters, and we also delete words that appear fewer than 20 times.
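A minimal sketch of this preprocessing step might look as follows; the length and frequency thresholds (50 characters, 20 occurrences) come from the text above, while the whitespace tokenization is an assumption.

```python
from collections import Counter

def preprocess(messages, max_len=50, min_count=20):
    """Keep only words that are not too long and not too rare.

    messages: list of raw mail bodies (strings).
    Returns the filtered token lists, one per message.
    """
    tokenized = [m.split() for m in messages]             # naive whitespace tokenization
    counts = Counter(w for toks in tokenized for w in toks)
    keep = lambda w: len(w) <= max_len and counts[w] >= min_count
    return [[w for w in toks if keep(w)] for toks in tokenized]
```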

A word which appears in more documents carries less information for classification.

We use the SRI Language Model Toolkit to generate the unigram language models and compute the Bayesian probability according to the models, from which we decide the classification of the test message.

SVM

The support vector machine (SVM) is easy to use and powerful for data classification. When generating the filter model, we create a vector for each mail in the training corpus; SVM maps these vectors into a high-dimensional space and learns which kinds of vectors are close to which class. SVM is a good approach for precisely finding the separating hyperplane that maximizes the margin, so that we can classify a new mail as spam or ham.
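As an illustration of this idea: the authors use LIBSVM, but the scikit-learn `SVC` below is just a stand-in to show the train/predict flow on toy feature vectors.

```python
from sklearn.svm import SVC

# Toy feature vectors for four mails (e.g. term weights) and their labels.
X_train = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y_train = ["spam", "spam", "ham", "ham"]

# Fit a maximum-margin separating hyperplane between the two classes.
clf = SVC(kernel="linear")
clf.fit(X_train, y_train)

# A new mail is classified by which side of the hyperplane its vector falls on.
print(clf.predict([[0.85, 0.15]]))   # -> ['spam']
```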



Figure 2.

Version 1

First we select the top 1000 terms by document frequency, as we do in KNN. Secondly we create a TF-IDF vector for each training mail. Thirdly we use LIBSVM (a tool from Chih-Jen Lin) to train a model. For each unclassified mail we also create a TF-IDF vector and finally predict its class with the SVM model trained before. The formula of TF-IDF is described below:

$$w_{t,d} = \left(0.5 + 0.5\,\frac{tf_{t,d}}{\max_{t'} tf_{t',d}}\right)\left(\log\frac{N}{df_t} + 1\right) \qquad (2)$$

where $N$ is the number of terms in the whole corpus, $tf_t$ is the frequency of term $t$, and $df_t$ is the data frequency of term $t$.
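A small sketch of this weighting, following Eq. (2) as reconstructed above; note that the exact form of the IDF factor in the original report is uncertain, so the `log(N / df) + 1` variant here is an assumption.

```python
import math

def tf_idf(tf, max_tf, df, N):
    """Augmented TF times smoothed IDF, as in Eq. (2).

    tf: frequency of the term in this mail
    max_tf: frequency of the most frequent term in this mail
    df: data (document) frequency of the term in the corpus
    N: corpus-wide normalization constant
    """
    return (0.5 + 0.5 * tf / max_tf) * (math.log(N / df) + 1)
```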

If a term appears in many documents, it does not carry much significance, so we use IDF to diminish the weight of such terms in order to select better features. We use data frequency to select features because of the comparison in the following figure (a reference from the Web Mining course).

Figure 3.

We can see that the performance of all the feature selection methods except mutual information and term strength differs very little. For coding convenience, we choose to implement data frequency.

Version 2

It is similar to version 1; we only turn uppercase into lowercase (case-insensitive), so that for example "free" will be the same term as "FREE" or "Free".

Version 3

It is similar to version 2; the only difference is that when training the SVM model, we set the parameters cost and gamma to their best values.
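The report does not say how the best cost and gamma are found; a common approach (and what LIBSVM's grid.py script automates) is a cross-validated grid search. Below is a sketch with scikit-learn as an illustrative stand-in for the LIBSVM workflow; `load_tfidf_vectors` is a hypothetical helper, not from the report.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical TF-IDF vectors and labels from the training corpus.
X_train, y_train = load_tfidf_vectors()   # assumed helper, not from the report

# Exponentially spaced candidates for cost (C) and gamma, as in LIBSVM's grid search.
param_grid = {
    "C":     [2 ** k for k in range(-5, 16, 2)],
    "gamma": [2 ** k for k in range(-15, 4, 2)],
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)   # best cost/gamma found by cross-validation
```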

KNN

Version 1

First we create a binary vector recording whether each feature (term) exists or not (1 or 0) for each training mail. Secondly, for each unclassified mail we also create a binary vector, then use the cosine similarity to find the top K closest training mails in the corpus and determine which class the unclassified mail belongs to:

$$\text{sim}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\|\,\|\vec{b}\|} \qquad (3)$$
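A minimal sketch of this version-1 KNN step over binary term vectors, using the cosine similarity of Eq. (3) and a simple majority vote among the K closest training mails. The brute-force loop mirrors the approach described above; the variable names are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors, as in Eq. (3)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def knn_classify(query, train_vectors, train_labels, k=5):
    """Label the query mail by majority vote of its k most similar training mails."""
    scored = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    top_labels = [label for _, label in scored[:k]]
    return Counter(top_labels).most_common(1)[0][0]
```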

Version 2

It is similar to version 1; the big difference is that in version 2 we do not use the binary vector as before but use TF-IDF vectors instead.

Version 3

It is similar to the previous version; we only turn uppercase into lowercase (case-insensitive), so that for example "free" will be seen as the same term as "FREE" or "Free".

    3 CONTRIBUTIONS

    3.1 Compare each method independently

We compare the three methods independently, and we also observe the differences between training-set sizes of 1200, 3000, 9000, and 21000.

Naïve Bayes

When we train the language models, the word probabilities are computed case-insensitively, and we also remove words with very small probability during message preprocessing.

When the size of the training data is small (1200), the result still performs well; the accuracy is 93.6206%.

But when the training set becomes a little larger (3000), the result is worse than with the smaller set. As the set grows further, the result improves only slightly: compared with the growth in size, the improvement in accuracy is relatively small, only about 3% (93.6206% -> 96.2669%).

We believe feature selection is important to Naïve Bayes, and better features could improve the result. But we must spend much time on testing to see whether the result actually improves; time and accuracy are a tradeoff.

SVM

When the training set is small (1200), the result of the SVM

model is much poorer than the others. When we create the TF-IDF vectors case-insensitively, the accuracy improves by nearly 20% (60.854% -> 72.6285%), which means that it is important to combine the information of uppercase and lowercase together to strengthen the concept of a specific term (e.g. free, Free, FREE). The other reason is that if we treat "free" and "Free" as the same term, the data frequency of "free" increases, so we will not throw away such an important feature (since we only select the top 1000 terms ranked by data frequency).


After we find the best gamma and cost values for each corpus, the performance also improves by about 20% (72.6285% -> 88.2511%). But the process of finding such parameters is very time consuming, so it is a tradeoff.

When using a large training set, the performance does not improve accordingly. We think it is because as the training set grows, the noise also increases, which means there will be more ham similar to spam.

KNN

The performance of KNN is really beyond our expectation. At the beginning, we thought KNN would not be better than SVM, but after our experiments, KNN seems to work well on spam classification.

The other interesting thing is that using binary vectors versus TF-IDF vectors makes little difference in performance (though TF-IDF is still a little better than the binary vector). The improvement from the TF-IDF vector is not as significant as we expected. We think it is because in spam classification some important features of spam are quite different from ham, so the weight (TF-IDF) of such a feature matters less than whether the feature exists at all.

We also found that case sensitivity makes little difference in KNN, unlike in SVM. We think the main reason is that we did not do feature selection in KNN; instead, we keep all the features, so we do not throw out important features (e.g. Money) the way the SVM feature selection does. This may also be the reason why KNN beats SVM.

3.2 Compare all three methods

We think the reason KNN is better than Naïve Bayes is that the features we select when implementing Naïve Bayes are not good enough to train an excellent model.

We think KNN can beat SVM because we throw away some information when doing feature selection for SVM. Another reason might be that spam classification is a binary decision problem (spam / ham), so KNN can easily fall close to one side; we think that if there were more classes, the performance of SVM would be better than KNN.

When the training corpus is small, the performance of KNN is still good, unlike SVM with only 60% accuracy. We think it is because SVM is a machine learning method, and we cannot expect it to learn well with little training data, whereas KNN can still find the top K similar mails even in a small corpus.

We also think that Naïve Bayes is a good method to filter spam when the training set is small: because not too many distinct words appear in ham, we can compute the word probabilities needed to decide a category using a small training set.

    4 EXPERIMENTAL RESULTS

    4.1 Corpus

We use the corpus provided by TREC06. There are 37822 messages (12910 ham, 24912 spam). We separate the whole set into training data and testing data.

The testing data are 2910 ham and 4912 spam messages randomly selected from the corpus. The remaining data (10000 ham, 20000 spam) are used as training data. These two sets (testing and training) are independent.

In our experiments, we create four training sets of different sizes, randomly selected from the training corpus. The ratio of spam to ham in all four training sets is 2:1, and their sizes are 1200, 3000, 9000, and 21000. (In the following we use 800:400, 2000:1000, 6000:3000, and 14000:7000 to denote the four training sets.)
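A sketch of how such 2:1 training subsets could be drawn (random sampling without replacement); the helper and its arguments are assumptions for illustration, not the authors' scripts.

```python
import random

def make_training_set(spam_pool, ham_pool, n_spam, n_ham, seed=0):
    """Randomly draw a training set with the given spam:ham counts (e.g. 800:400)."""
    rng = random.Random(seed)
    spam = rng.sample(spam_pool, n_spam)
    ham = rng.sample(ham_pool, n_ham)
    return spam, ham

# The four training-set sizes used in the experiments.
sizes = [(800, 400), (2000, 1000), (6000, 3000), (14000, 7000)]
```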

4.2 Results of Evaluation

The following are the accuracy tables of the three methods with different sizes of training data.

Naïve Bayes

Training set    Accuracy
800:400         93.6206 %
2000:1000       90.8719 %
6000:3000       95.2570 %
14000:7000      96.2669 %

Table 1

SVM

Training set    Version 1    Version 2    Version 3
800:400         60.8540 %    72.6285 %    88.2511 %
2000:1000       87.4585 %    90.5316 %    93.8794 %
6000:3000       88.8520 %    93.5983 %    96.4477 %
14000:7000      88.2255 %    91.0299 %    91.0299 %

Table 2

KNN

Training set    Version 1    Version 2    Version 3
800:400         94.0710 %    95.6172 %    95.6044 %
2000:1000       95.9877 %    97.2144 %    97.0611 %
6000:3000       97.5466 %    98.0833 %    97.7383 %
14000:7000      97.1122 %    97.8789 %    97.5978 %

Table 3

Fig. 4 is the accuracy plot of Naïve Bayes, the first version of SVM, and the first version of KNN with different sizes of training data.

Figure 4. Accuracy of Naïve Bayes

We use Naïve Bayes as the baseline. We can see that in the first version of SVM, all the accuracies are lower than Naïve Bayes, while in the first version of KNN the performance is already better than Naïve Bayes.

Fig. 5 is the accuracy plot after we improve the SVM accuracy.


Figure 5. Accuracy of SVM

We can see that there is a big improvement between version 1 and version 2; the difference between these two versions is that version 2 is case-insensitive. In version 2 we emphasize some important features such as Free, and we do not throw too much information away due to feature selection, as mentioned before. The improvement between version 2 and version 3 is also significant: by carefully selecting the SVM parameters gamma and cost, the performance really improves a lot.

Fig. 6 is the accuracy plot after we improve the KNN accuracy.

Figure 6. Accuracy of KNN

We can see that there is a small improvement from version 1 to version 2; the difference between them is that we use TF-IDF vectors instead of binary vectors in version 2. When we modified version 2 from case-sensitive to case-insensitive, the difference is not as significant as in SVM (no more than 0.4%). As mentioned before, this is because we did not do feature selection in KNN.

Fig. 7 is the accuracy plot of Naïve Bayes and the best versions of SVM and KNN with different sizes of training data.

Figure 7. Comparison of the 3 methods

As we can see, after we improve SVM, the two middle training-set sizes beat Naïve Bayes, and the other two are much closer to Naïve Bayes compared with the first version. After we improve KNN, the result is again better than Naïve Bayes.

    5 CONCLUSION

After experimenting with the three different methods, we found that KNN has higher accuracy than the other two approaches.

We think this is because KNN is more suitable than SVM for classification with fewer categories, so the accuracy of KNN is higher.

We think Naïve Bayes is a good method for spam filtering, and its training time is short (about 1-2 seconds). Testing an input message takes much time with Naïve Bayes, but the result is good enough.

The training time of KNN is very short, but it takes a lot of time on testing. We think it is because we did not implement any indexing structure such as a KD-tree, R-tree, or quad-tree for finding the nearest top K neighbors. In future work, we can implement these indexing methods to improve the efficiency of KNN.
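As a sketch of this future-work idea: if the vectors are L2-normalized first, nearest neighbors under Euclidean distance coincide with nearest neighbors under cosine similarity, so an off-the-shelf KD-tree can index the training mails. scipy's cKDTree here is just one possible choice, not something used in the report.

```python
import numpy as np
from scipy.spatial import cKDTree

def build_index(train_vectors):
    """Index L2-normalized training vectors so top-K search is sublinear on average."""
    X = np.asarray(train_vectors, dtype=float)
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # normalize -> Euclidean ranking matches cosine
    return cKDTree(X)

def top_k(tree, query, k=5):
    q = np.asarray(query, dtype=float)
    q /= np.linalg.norm(q)
    distances, indices = tree.query(q, k=k)          # k nearest training mails
    return indices
```

Note that KD-trees degrade toward a linear scan in very high-dimensional term spaces, so dimensionality reduction or a different index may still be needed for the full feature set.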

The training time of SVM is much longer than that of Naïve Bayes and KNN, especially when we want to find the best gamma and cost parameters for the training process. But the testing time of SVM is much shorter than the other two methods.

In future work, we can focus on different feature selection methods to improve the performance of Naïve Bayes and SVM; their results might then become better than KNN's.

    6 JOB RESPONSIBILITY

Yun-Nung Chen (B94902032): Naïve Bayes (training and testing), report writing.
Che-An Lu (B94902097): SVM (training and testing), KNN (training and testing), report writing.
Chau-Yu Huang (B94902052): message preprocessing, report writing.

    ACKNOWLEDGMENT

This project uses several toolkits, including MIME-tools, SRILM, and LIBSVM, and we thank their authors.

    REFERENCES

    [1] SRI Language Model Toolkit

    http://www.speech.sri.com/projects/srilm/

[2] Chih-Jen Lin's Home Page

    http://www.csie.ntu.edu.tw/~cjlin/