Azam: Comparative Study on Feature Space Reduction techniques for Spam Detection (Presentation)

Upload: jgrahamc

Post on 31-May-2018

  • 8/15/2019 Azam: Comparative Study on Feature Space Reduction techniques for Spam Detection (Presentation)

    Comparative Study on Feature Space Reduction Techniques for Spam Detection

    Researchers: Nouman Azam, Dr. Amir Hanif Dar, Samiullah Marwat

    MS Thesis (Presentation)

    The Problem

    Email is the most widely used medium for communication worldwide. It is cheap, reliable, fast and easily accessible.

    It is prone to spam. Why?

    Because of its wide usage and low cost. With a single click you can communicate with anyone, anywhere around the globe. It costs spammers hardly any more to send out 1 million emails than to send 10.

    Statistics of Spam

    Is spam really a problem? Some statistics will clarify.

    At the end of 2002, as much as 40% of all email traffic consisted of spam (http://zdnet.com.com/2100-1106-955842.html).

    In 2003 the percentage was estimated to be about 50% of all emails (http://zdnet.com.com/2100-1105_2-1019528.html).

    In 2006 BBC News reported 96% of all emails to be spam (http://news.bbc.co.uk/2/hi/technology/5219554.stm).

    Statistics of Spam

    Users who reply to spam email: 28%
    Annual spam in a 1,000-employee company: 2.1 million
    Email address changes due to spam: 16%
    Spam cost to all U.S. corporations in 2002: $8.9 billion
    Spam cost to all non-corporate Internet users: $255 million
    Annual spam received per person: 2,200
    Daily spam received per person: 6
    Daily spam emails sent: 12.4 billion

    Source: http://spam-filter-review.toptenreviews.com/spam-statistics.html

    Statistics of Spam

    http://www.junk-o-meter.com/stats/index.php

    Problems from Spam

    Wastage of network resources: bandwidth.

    Wastage of time: spam wastes the time of people working in organizations, resulting in reduced productivity.

    Damage to PCs: computer viruses can cause serious damage to PCs.

    Ethical issues: spam emails advertising pornographic sites can cause problems for children.

    Definition of Spam

    Unsolicited (unwanted) email for a recipient.

    OR

    Any email that the user does not want to have in his inbox.

    Existing Approaches

    Rule based: hand-made rules for detection of spam, written by experts. (Needs domain experts and constant updating of rules.)

    Customer revolt: forcing companies not to publicize personal email IDs given to them. (Hard to implement.)

    Domain filters: allowing mail from specific domains only. (It is hard to keep track of the domains that are valid for a user.)

    Blacklisting: blacklist filters use databases of known abusers, and also filter unknown addresses. (Constant updating of the databases is required.)

    http://www.templetons.com/brad/spam/spamsol.html

    Existing Approaches

    Whitelist filters: mailer programs learn all contacts of a user and let mail from those contacts through directly. (Everyone must first communicate his email ID to the user, and only then can he send email.)

    Hiding address: hiding one's original address from spammers by receiving all emails at a temporary email ID, from which they are forwarded to the original address if found valid by the user. (It is hard to maintain a couple of email IDs.)

    Checks on number of recipients: performed by the email agent programs.

    Government actions: laws implemented by governments against spammers. (The laws are hard to enforce.)

    Lastly, automated recognition of spam: uses machine learning algorithms that first learn from the past data available. (Seems to be the best at present.)

    http://www.templetons.com/brad/spam/spamsol.html

    Why Automated Spam Detection is Best

    Minimum user input: the filter will filter spam automatically with minimum user input.

    Adaptation to new kinds of spam: the filter can adapt itself to new, previously unknown kinds of spam, i.e. it will learn and update itself automatically.

    Nature of the Problem

    Instance of document classification: it can be considered a simple instance of the document classification problem, where we have two classes and our objective is to separate spam from legitimate emails. The features in our domain will be words.

    Representation of emails: any email can be represented in terms of features (taken to be words in this case) with discrete values based on some statistics of the presence or absence of words.

    Main Steps

    Preprocessing of Data

    Removal of words of length less than 3: all words shorter than 3 characters were removed, as they were found to be mostly non-informative.

    Removal of stop words: stop words are those which provide the structure of the language rather than the content, and are not informative towards the class of the document. Examples are pronouns and conjunctions.

    Stemming with the Porter stemming algorithm (Porter 1980): stemming reduces words having the same stem to a single word, thus reducing the vocabulary.
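
    The three preprocessing steps can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: STOP_WORDS is a tiny hypothetical list, and crude_stem is a simple suffix stripper standing in for the Porter stemmer, not the Porter algorithm itself.

```python
import re

# Hypothetical, tiny stop-word list for illustration; a real list is much longer.
STOP_WORDS = {"the", "and", "you", "this", "that", "with", "for", "are", "was"}

# Crude suffix stripping used as a stand-in for Porter stemming (NOT the Porter algorithm).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop words shorter than 3 and stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if len(t) >= 3 and t not in STOP_WORDS]
    return [crude_stem(t) for t in kept]
```

    A real pipeline would swap crude_stem for an actual Porter stemmer implementation; the order of the steps is the part taken from the slide.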

    Preprocessing

    Some stop words

    Examples of stemmed words

    Ling Spam corpus after the preprocessing

    http://d/presentation/ling_Spam_After_PP/PART1_After_PP.txt

    Representation of Data

    [Figure: data matrix with features #1 to #4 as rows and examples #1 to #4 as columns.]

    Representation of Data

    Term Frequency (TF)

    $w_{ij}$ = weight of term $i$ in email $j$

    $tf_{ij}$ = frequency of term $i$ in email $j$

    $$w_{ij} = tf_{ij}$$
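
    As a small illustration of the term-frequency representation, the sketch below builds the weight matrix from already tokenized emails. The function name tf_matrix and the sample emails are made up for the example; note that the code stores one row per email, so the weight of term i in email j is weights[j][i].

```python
from collections import Counter


def tf_matrix(emails):
    """Return (vocabulary, weights) where weights[j][i] is the tf of term i in email j."""
    vocabulary = sorted({term for email in emails for term in email})
    weights = []
    for email in emails:
        counts = Counter(email)  # term -> number of occurrences in this email
        weights.append([counts[term] for term in vocabulary])
    return vocabulary, weights


# Hypothetical toy emails, already preprocessed into token lists.
emails = [["free", "prize", "free"], ["meeting", "agenda"], ["free", "meeting"]]
vocabulary, weights = tf_matrix(emails)
```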

    The Corpus (Data Set)

    The corpus that we used in our experiments was the Ling Spam Corpus (Androutsopoulos et al. 00).

    The total number of legitimate emails in the corpus was 2412.

    The total number of spam emails was 481. The spam percentage was about 16%.

    Feature Reduction Methods

    Mutual Information (MI)
    Latent Semantic Indexing (LSI, PCA or KLT)
    Word Frequency Thresholding (TF)

    Mutual Information

    Supervised feature selection method.

    MI for feature t can be calculated as

    $$MI(t, c) = \sum_{t \in \{0,1\}} \sum_{c \in \{Spam, Leg\}} P(t,c) \log \frac{P(t,c)}{P(t)\,P(c)}$$

    where t = terms or features and c = class.

    MI scores for all of the terms (features) were calculated, the features were sorted in descending order of score, and the top-scoring features were selected (Sahami et al. 98).
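
    A minimal sketch of MI-based selection, assuming each email is given as a set of terms and that probabilities are estimated by simple counting over the emails (the slides do not say how they were estimated). The function names and the toy data are hypothetical.

```python
import math


def mutual_information(docs, labels, term):
    """MI between presence of `term` (t in {0, 1}) and the class label c."""
    n = len(docs)
    score = 0.0
    for t in (0, 1):
        present = bool(t)
        n_t = sum(1 for d in docs if (term in d) == present)
        for c in sorted(set(labels)):
            n_c = sum(1 for y in labels if y == c)
            n_tc = sum(1 for d, y in zip(docs, labels)
                       if (term in d) == present and y == c)
            if n_tc == 0:
                continue  # a zero joint probability contributes nothing
            p_tc, p_t, p_c = n_tc / n, n_t / n, n_c / n
            score += p_tc * math.log(p_tc / (p_t * p_c))
    return score


def select_top_features(docs, labels, k):
    """Sort terms by MI score in descending order and keep the top k."""
    vocab = sorted({term for d in docs for term in d})
    ranked = sorted(vocab, key=lambda term: mutual_information(docs, labels, term),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy data: each email is a set of terms.
docs = [{"free", "prize"}, {"free", "offer"}, {"meeting", "agenda"}, {"meeting", "offer"}]
labels = ["Spam", "Spam", "Leg", "Leg"]
top_two = select_top_features(docs, labels, 2)
```

    In the toy data "free" and "meeting" separate the classes perfectly and score highest, while "offer" appears equally in both classes and scores zero.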

    Term Frequency Thresholding

    Unsupervised feature selection.

    Term frequency: the TF of a feature in a document is the number of times it appears in that document.

    Term frequency score of a feature: the TF score of a feature is the sum of the individual term frequencies of that feature over the entire set of documents (emails).
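
    The TF score can be sketched as below. The example keeps the k highest-scoring terms, which matches how the experiments select a fixed number of features; cutting at a fixed score threshold would be an equally plausible reading. Function names and the tie-breaking rule (alphabetical) are assumptions made for the example.

```python
from collections import Counter


def tf_scores(emails):
    """Sum each term's frequency over all emails (the TF score of that term)."""
    totals = Counter()
    for email in emails:
        totals.update(email)  # adds this email's per-term counts
    return totals


def top_k_by_tf(emails, k):
    """Keep the k terms with the highest corpus-wide TF score (ties: alphabetical)."""
    totals = tf_scores(emails)
    ranked = sorted(totals, key=lambda term: (-totals[term], term))
    return ranked[:k]
```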

    Latent Semantic Indexing

    Unsupervised feature extraction.

    Also known as Principal Component Analysis and the Karhunen-Loève transform.

    It calculates the eigenvectors EV of the covariance matrix C, which is obtained by multiplying the mean-adjusted data with its transpose.

    The eigenvectors corresponding to the largest eigenvalues are selected.

    The transformed data TD is obtained by taking the transpose of the eigenvector matrix and multiplying it with the mean-adjusted data, i.e.

    TD = EV' * (mean-adjusted data) (Günal et al. 05)
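
    The sketch below walks through these steps in plain Python for a single component. It is illustrative only: it uses power iteration to approximate the eigenvector of the largest eigenvalue rather than a full eigendecomposition, divides the covariance by n - 1 (a common convention that does not change the eigenvectors), and all function names are hypothetical.

```python
import math


def mean_adjust(data):
    """Subtract each feature's mean; data is a list of rows, one row per email."""
    n, m = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(m)]
    return [[row[i] - means[i] for i in range(m)] for row in data]


def covariance(adjusted):
    """Covariance matrix of the mean-adjusted rows, scaled by 1 / (n - 1)."""
    n, m = len(adjusted), len(adjusted[0])
    return [[sum(row[a] * row[b] for row in adjusted) / (n - 1) for b in range(m)]
            for a in range(m)]


def top_eigenvector(matrix, iterations=200):
    """Power iteration; assumes the start vector is not orthogonal to the answer."""
    m = len(matrix)
    vec = [1.0 / math.sqrt(m)] * m
    for _ in range(iterations):
        nxt = [sum(matrix[a][b] * vec[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return vec


def project(adjusted, eigenvector):
    """Transformed data for one eigenvector: its dot product with each adjusted row."""
    return [sum(e * x for e, x in zip(eigenvector, row)) for row in adjusted]


# Hypothetical toy data lying exactly on the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
adjusted = mean_adjust(data)
component = top_eigenvector(covariance(adjusted))
transformed = project(adjusted, component)
```

    Selecting several components, as in the experiments, would need the remaining eigenvectors as well, for example from a numerical library's eigendecomposition.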

    The Classifier

    The classifier used was K-Nearest Neighbor.

    All the data were stored in memory.

    Classification of a new example is carried out by finding its Euclidean distance from all the stored data. The class of the nearest stored examples becomes the class of the new example (Androutsopoulos et al. 00).
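
    A minimal K-Nearest Neighbor sketch along these lines is shown below. The majority vote among the K nearest examples is an assumption for K > 1 (the slide only describes taking the nearest ones), and the function names are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(train_vectors, train_labels, query, k=1):
    """Label `query` by majority vote among its k nearest stored examples."""
    order = sorted(range(len(train_vectors)),
                   key=lambda idx: euclidean(train_vectors[idx], query))
    votes = Counter(train_labels[idx] for idx in order[:k])
    return votes.most_common(1)[0][0]
```

    Because every stored example is compared with the query, the cost of one classification grows with both the corpus size and the number of features, which is one practical reason to reduce the feature space first.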

    The Classifier

    Experimental Settings

    In the first set the data was represented using term frequency.

    Three algorithms were tested: MI, LSI and TF thresholding.

    All three algorithms were used to select the top 20, 50, 100 and 250 features.

    Evaluation Measures

    Accuracy: let $N_{Spam}$ and $N_{Leg}$ be the total number of spam and legitimate emails in our data set. Let $N_{Y \to Z}$ be the number of emails that are classified as Z but belong to class Y. Then

    $$Acc = \frac{N_{Spam \to Spam} + N_{Leg \to Leg}}{N_{Spam} + N_{Leg}}$$

    $$Err = \frac{N_{Leg \to Spam} + N_{Spam \to Leg}}{N_{Spam} + N_{Leg}}$$

    Identifying a legitimate email as spam is more costly than identifying spam as legitimate. To cope with this cost difference we redefine accuracy as weighted accuracy and error as weighted error, counting each legitimate email $\lambda$ times:

    $$WAcc = \frac{\lambda \cdot N_{Leg \to Leg} + N_{Spam \to Spam}}{\lambda \cdot N_{Leg} + N_{Spam}}$$

    $$WErr = \frac{\lambda \cdot N_{Leg \to Spam} + N_{Spam \to Leg}}{\lambda \cdot N_{Leg} + N_{Spam}}$$
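
    As a sketch of how these measures could be computed, the function below assumes the weighted form in which each legitimate email counts lam times, following Androutsopoulos et al. 00. The function name, argument names and the counts in the usage note are made up for illustration.

```python
def accuracy_measures(n_ss, n_ll, n_ls, n_sl, lam=1):
    """Counts n_yz = emails of class y classified as z (s = Spam, l = Leg)."""
    n_spam = n_ss + n_sl  # all spam emails, however they were classified
    n_leg = n_ll + n_ls   # all legitimate emails, however they were classified
    acc = (n_ss + n_ll) / (n_spam + n_leg)
    err = (n_ls + n_sl) / (n_spam + n_leg)
    # Weighted versions: every legitimate email counts lam times.
    wacc = (lam * n_ll + n_ss) / (lam * n_leg + n_spam)
    werr = (lam * n_ls + n_sl) / (lam * n_leg + n_spam)
    return {"Acc": acc, "Err": err, "WAcc": wacc, "WErr": werr}
```

    For example, with hypothetical counts of 8 spam emails caught, 2 spam emails missed, 45 legitimate emails passed and 5 legitimate emails blocked, accuracy is 53/60, while weighted accuracy with lam = 9 is 413/460. With lam = 1 the weighted measures reduce to the unweighted ones.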

    Evaluation Measures

    Suppose we consider identification of spam as a filtering process and filter out all of the identified spam from the legitimate emails. Then:

    Spam recall measures the percentage of spam messages that the filter manages to block.

    $$SR = \frac{N_{Spam \to Spam}}{N_{Spam}}$$

    Spam precision measures the degree to which the blocked messages are indeed spam.

    $$SP = \frac{N_{Spam \to Spam}}{N_{Spam \to Spam} + N_{Leg \to Spam}}$$

    (Androutsopoulos et al. 00, Sahami et al.)
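
    Spam recall and spam precision can be computed from true and predicted labels as in the sketch below. The label strings "Spam" and "Leg", the function name and the convention of returning 0.0 when a denominator is zero are assumptions made for the example.

```python
def spam_recall_precision(true_labels, predicted_labels):
    """Return (spam recall, spam precision) for paired true/predicted labels."""
    pairs = list(zip(true_labels, predicted_labels))
    n_ss = sum(1 for t, p in pairs if t == "Spam" and p == "Spam")  # spam blocked
    n_sl = sum(1 for t, p in pairs if t == "Spam" and p == "Leg")   # spam missed
    n_ls = sum(1 for t, p in pairs if t == "Leg" and p == "Spam")   # legit blocked
    recall = n_ss / (n_ss + n_sl) if (n_ss + n_sl) else 0.0
    precision = n_ss / (n_ss + n_ls) if (n_ss + n_ls) else 0.0
    return recall, precision
```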

    Experimental Results (1)

    [Figure: Spam Recall values for K = 1. X axis: Features (0 to 300); Y axis: Spam Recall (70 to 95). Series: LSI, Thresholding, MI (entire data), MI (individual file).]

    Experimental Results (1)

    [Figure: Spam Recall values for K = 3. X axis: Features (0 to 300); Y axis: Spam Recall (70 to 95). Series: LSI, Thresholding, MI (entire data), MI (individual file).]

    Experimental Results (1)

    [Figure: Spam Precision values for K = 3. X axis: Features (0 to 300); Y axis: Spam Precision (70 to 95). Series: LSI, Thresholding, MI (entire data), MI (individual file).]

    Summary of Results from Experiment

    MI performs well on accuracy.

    MI scores calculated over the entire data set perform better than MI scores calculated on the individual files.

    LSI and TF thresholding perform well on spam recall but are outperformed by MI on spam precision.

    LSI and TF thresholding give similar results.

    Observations

    Changing the value of K for the nearest neighbor classifier does not have a significant impact on the results. Values of K from 1 to 7 give approximately the same results.

    Feature set size and accuracy: there isn't any consistent relationship.

    Changing the value of λ from 9 to 999 (in the weighted accuracy equation) improves the accuracy by 0.5% to 1.5% on average.

    The best accuracy results were obtained with the smaller feature sets, which is a great improvement over the original feature space of over 40 thousand features.

    Future Work

    Minimum feature set size: I was unable to find the minimum feature set size below which the performance starts degrading.

    Other features of email: the corpus I used does not have other features of emails such as attachments, pictures, domain properties etc. Adding these as features would have a good impact on accuracy, as examined in (Sahami et al. 97).

    Spam rate of corpus: the spam rate of the corpus was about 16%, which should be higher. Increasing the spam rate to 70% or 80% might improve the performance in terms of spam recall and precision, and would better reflect the current spam rate.