azam: comparative study on feature space reduction techniques for spam detection (presentation)

8/15/2019 Azam: Comparative Study on Feature Space Reduction techniques for Spam Detection (Presentation)

1/36

Comparative Study on Feature Space Reduction Techniques for Spam Detection

Researchers: Nouman Azam, Dr. Amir Hanif Dar, Samiullah Marwat

MS Thesis (Presentation)


2/36

The Problem

Email is the most widely used medium of communication worldwide. It is cheap, reliable, fast and easily accessible.

It is prone to spam emails. Why?

Due to its wide usage and cheapness.

With a single click you can communicate with anyone, anywhere around the globe. It costs spammers hardly any more to send out 1 million emails than to send 10 emails.


3/36

Statistics of spam

Is spam really a problem? Some statistics will clarify.

At the end of 2002, as much as 40% of all email traffic consisted of spam. (http://zdnet.com.com/2100-1106-955842.html)

In 2003 the percentage was estimated to be about 50% of all emails. (http://zdnet.com.com/2100-1105_2-1019528.html)

In 2006 BBC News reported 96% of all emails to be spam. (http://news.bbc.co.uk/2/hi/technology/5219554.stm)


4/36

Statistics of Spam

Users who reply to spam email: 28%

Annual spam in a 1,000-employee company: 2.1 million

Email address changes due to spam: 16%

Spam cost to all U.S. corporations in 2002: $8.9 billion

Spam cost to all non-corporate Internet users: $255 million

Annual spam received per person: 2,200

Daily spam received per person: 6

Daily spam emails sent: 12.4 billion

http://spam-filter-review.toptenreviews.com/spam-statistics.html


5/36

Statistics of Spam

http://www.junk-o-meter.com/stats/index.php


6/36

Problems from Spam

Wastage of network resources
Bandwidth.

Wastage of time
Spam wastes the time of people working in organizations, resulting in reduced productivity.

Damage to PCs
Computer viruses can cause serious damage to PCs.

Ethical issues
Spam emails advertising pornographic sites can cause problems for children.


7/36

Definition of Spam

Unsolicited (unwanted) email for a recipient.

OR

Any email that the user does not want to have in his inbox.


8/36

Existing Approaches

Rule based
Hand-made rules for detection of spam, written by experts. (Needs domain experts and constant updating of rules.)

Customer revolt
Forcing companies not to publicize personal email ids given to them. (Hard to implement.)

Domain filters
Allowing mail from specific domains only. (Hard job of keeping track of the domains that are valid for a user.)

Blacklisting
Blacklist filters use databases of known abusers, and also filter unknown addresses. (Constant updating of the databases would be required.)

http://www.templetons.com/brad/spam/spamsol.html


9/36

Existing Approaches

Whitelist filters
Mailer programs learn all contacts of a user and let mail from those contacts through directly. (Everyone first needs to communicate his email id to the user, and only then can he send email.)

Hiding address
Hiding one's original address from spammers by having all emails received at a temporary email id, from which they are forwarded to the original address if found valid by the user. (Hard job of maintaining a couple of email ids.)

Checks on number of recipients
Performed by the email agent programs.

Government actions
Laws implemented by government against spammers. (Hard to implement laws.)

Lastly, automated recognition of spam
Uses machine learning algorithms that first learn from the past data available. (Seems to be the best at present.)

http://www.templetons.com/brad/spam/spamsol.html


10/36

Why Automated Spam Detection Is Best

Minimum user input
The filter will filter spam automatically with minimum user input.

Adaptation to new kinds of spam
The filter can adapt itself to new, previously unknown kinds of spam, i.e. it will learn and update itself automatically.


11/36

Nature of the Problem

Instance of document classification
It can be considered a simple instance of the document classification problem, where we have two classes and our objective is to separate spam from legitimate emails.

The features in our domain will be words.

Representation of emails
Any email can be represented in terms of features (taken to be words in this case) with discrete values based on some statistics of the presence or absence of words.


12/36

Main Steps


13/36

Preprocessing of Data

Removal of words that have length less than 3
All words shorter than 3 characters were removed, as they were found to be mostly non-informative.

Removal of stop words
Stop words are those which provide the structure of the language and do not provide the content. They are not informative towards the class of the document. Examples are pronouns and conjunctions.

Performing stemming with the Porter stemming algorithm (Porter 1980)
Stemming reduces the words having the same stem to a single word, thus reducing the vocabulary.
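The length filter and stop word removal described above can be sketched as follows. This is a minimal illustration, not the thesis code: the `STOP_WORDS` set is a tiny hypothetical list, the tokenizer is an assumed simple regular expression, and the `stem` parameter is a placeholder that defaults to doing nothing (a real Porter stemmer, for example from a library, could be passed in its place).

```python
import re

# Hypothetical, deliberately tiny stop list; a real one has a few hundred words.
STOP_WORDS = {"the", "and", "you", "your", "for", "that", "this", "with", "are", "but"}


def preprocess(text, min_length=3, stem=lambda word: word):
    """Lowercase, tokenize, drop short words and stop words, then stem.

    `stem` is an identity function by default (an assumption of this sketch);
    pass a Porter stemmer callable to reproduce the stemming step.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if len(t) >= min_length and t not in STOP_WORDS]
    return [stem(t) for t in kept]


print(preprocess("You are the winner of a FREE prize, so claim it now!"))
```

Words such as "of", "a", "so" and "it" fall to the length filter, while "you", "are" and "the" are removed by the stop list.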


14/36

Preprocessing

Some stop words

Examples of stemmed words

Ling Spam corpus after the preprocessing
http://d/presentation/ling_Spam_After_PP/PART1_After_PP.txt


15/36

Representation of Data

[Table: data matrix with one row per feature (feature #1 to feature #4) and one column per example email (example #1 to example #4).]


16/36

Representation of Data

Term Frequency (TF)

$$w_{ij} = tf_{ij}$$

where $w_{ij}$ = weight of term $i$ in email $j$, and $tf_{ij}$ = frequency of term $i$ in email $j$.
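As an illustration of the term frequency representation, the sketch below builds a small matrix whose entry for term i and email j is the number of times the term occurs in that email. The function name `term_frequency_matrix` and the toy emails are assumptions made for this example only.

```python
from collections import Counter


def term_frequency_matrix(emails):
    """emails: list of token lists (one per email).

    Returns (vocabulary, matrix) where matrix[i][j] is the number of times
    vocabulary[i] occurs in emails[j], i.e. the weight w_ij = tf_ij.
    """
    vocabulary = sorted({token for email in emails for token in email})
    counts = [Counter(email) for email in emails]
    matrix = [[c[term] for c in counts] for term in vocabulary]
    return vocabulary, matrix


# Toy, made-up emails after preprocessing.
emails = [["free", "prize", "free"], ["meeting", "paper"], ["free", "paper"]]
vocabulary, matrix = term_frequency_matrix(emails)
for term, row in zip(vocabulary, matrix):
    print(term, row)
```

Each row corresponds to a feature and each column to an email, matching the layout used in the data representation slide.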


17/36

The Corpus (Data Set)

The corpus that we used in our experimentation was the Ling Spam corpus (Androutsopoulos et al. 00).

The total number of legitimate emails in the corpus was 2412.

The total number of spam emails was 481. The spam percentage was about 16%.


18/36


19/36

Feature Reduction Methods

Mutual Information (MI)

Latent Semantic Indexing (LSI, PCA or KLT)

Word Frequency Thresholding (TF)


20/36

Mutual Information

Supervised feature selection method.

MI for feature t can be calculated as

$$MI(t, c) = \sum_{t \in \{0,1\}} \; \sum_{c \in \{Spam, Leg\}} P(t, c) \, \log \frac{P(t, c)}{P(t) \, P(c)}$$

where t = terms or features and c = class.

MI scores for all of the terms (features) were calculated, then the features were sorted in descending order and the top scoring features were selected. (Sahami et al. 98)
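A minimal sketch of MI-based feature selection follows, assuming each email is reduced to the set of terms it contains (presence or absence, t in {0, 1}) and that probabilities are estimated by simple document counts. The function names and the four toy emails are illustrative assumptions, and terms with a zero joint count contribute nothing to the sum, by the usual convention that 0 log 0 = 0.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """docs: list of term sets, labels: list of class labels.

    Returns {term: MI score}, summing over term presence/absence and classes.
    """
    n = len(docs)
    class_counts = Counter(labels)
    joint = Counter()  # joint[(term, cls)] = documents of class cls containing term
    for terms, cls in zip(docs, labels):
        for term in set(terms):
            joint[(term, cls)] += 1
    vocab = {term for term, _ in joint}
    scores = {}
    for term in vocab:
        n_present = sum(joint[(term, c)] for c in class_counts)
        score = 0.0
        for cls, n_cls in class_counts.items():
            n_tc_present = joint[(term, cls)]
            for n_tc, n_t in ((n_tc_present, n_present),
                              (n_cls - n_tc_present, n - n_present)):
                if n_tc == 0:
                    continue  # 0 * log(0) is taken as 0
                p_tc, p_t, p_c = n_tc / n, n_t / n, n_cls / n
                score += p_tc * math.log(p_tc / (p_t * p_c))
        scores[term] = score
    return scores


def top_k_terms(scores, k):
    """Sort terms by descending score and keep the k best."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy, made-up data: two spam and two legitimate emails.
docs = [{"free", "money"}, {"free", "offer"}, {"meeting", "paper"}, {"paper", "review"}]
labels = ["spam", "spam", "legit", "legit"]
scores = mutual_information(docs, labels)
print(top_k_terms(scores, 2))
```

In this toy data "free" and "paper" each appear in every email of one class and none of the other, so they receive the highest score.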


21/36

Term Frequency Thresholding

Unsupervised feature selection.

Term frequency
TF for a feature in a document is the number of times it appears in that document.

Term frequency score of a feature
The TF score for a feature is the sum of the individual term frequencies for that feature over the entire set of documents (emails).
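The TF score can be sketched as a running total over all emails. The example below keeps the k highest-scoring terms, which is one way to apply the threshold; the function name `top_terms_by_frequency` and the toy emails are assumptions for illustration.

```python
from collections import Counter


def top_terms_by_frequency(emails, k):
    """emails: list of token lists. Returns the k terms with the highest
    total term frequency summed over the whole collection."""
    totals = Counter()
    for email in emails:
        totals.update(email)  # adds each email's per-term counts to the totals
    return [term for term, _ in totals.most_common(k)]


# Toy, made-up emails after preprocessing.
emails = [["free", "prize", "free"], ["meeting", "paper"], ["free", "paper"]]
print(top_terms_by_frequency(emails, 2))
```

No class labels are consulted, which is what makes the method unsupervised.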


22/36

Latent Semantic Indexing

Unsupervised feature extraction.

Also known as Principal Component Analysis and the Karhunen-Loève transform.

It calculates the eigenvectors EV of the covariance matrix C, which is obtained by multiplying the mean-adjusted data with its transpose.

The eigenvectors corresponding to the largest eigenvalues are selected.

The transformed data TD is obtained by taking the transpose of the eigenvector matrix and multiplying it with the mean-adjusted data, i.e.

TD = EV' * (mean-adjusted data)

(Günal et al. 05)
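The projection step can be illustrated with a small pure-Python sketch. It is a simplification under stated assumptions: it finds only the single leading eigenvector, using power iteration rather than a full eigendecomposition, and it stores one email per row, so the data layout is transposed relative to the formula above. The function name and toy data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """rows: one list of feature values per email.

    Returns (component, projections): the leading eigenvector of the sample
    covariance matrix and the mean-adjusted data projected onto it.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance between every pair of features.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0] * d  # starting guess; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("starting vector is orthogonal to the data's variance")
        v = [x / norm for x in w]
    projections = [sum(r[j] * v[j] for j in range(d)) for r in centered]
    return v, projections


# Toy, made-up data lying exactly on the line feature2 = 2 * feature1.
component, projections = first_principal_component([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
print(component, projections)
```

Keeping further components would require repeating the iteration on the residual covariance, or using a linear algebra library for the full decomposition.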


23/36

The Classifier

The classifier used was K-Nearest Neighbor.

All the data were stored in memory.

Classification of a new example is carried out by finding its Euclidean distance from all the stored data. The class of the nearest stored examples is assigned to the new example. (Androutsopoulos et al. 00)
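A minimal sketch of this classifier is given below, assuming a majority vote among the K nearest stored examples (with K = 1 this is simply the class of the closest one). The function name `knn_classify` and the toy vectors are assumptions for illustration; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_classify(train_vectors, train_labels, query, k=1):
    """Return the majority class among the k stored examples closest to
    `query` in Euclidean distance."""
    distances = sorted(
        (math.dist(vector, query), label)
        for vector, label in zip(train_vectors, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy, made-up feature vectors.
train_vectors = [[0, 0], [0, 1], [5, 5], [6, 5]]
train_labels = ["legit", "legit", "spam", "spam"]
print(knn_classify(train_vectors, train_labels, [5, 6], k=3))
```

Every query is compared against the whole stored data set, so the cost of classification grows with both the corpus size and the number of features, which is one motivation for reducing the feature space.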


24/36

The Classifier


25/36

Experimental settings

In the first set the data was represented using term frequency.

Three algorithms were tested: MI, LSI and TF thresholding.

All three algorithms were used to select the top 20, 50, 100 and 250 features.


26/36

Evaluation Measures

Accuracy

Let $N_{Spam}$ and $N_{Leg}$ be the total number of spam and legitimate emails in our data set. Let $N_{Y \to Z}$ be the number of emails that are classified as Z but belong to class Y. Then

$$Acc = \frac{N_{Spam \to Spam} + N_{Leg \to Leg}}{N_{Spam} + N_{Leg}}$$

$$Err = \frac{N_{Leg \to Spam} + N_{Spam \to Leg}}{N_{Spam} + N_{Leg}}$$

Identifying a legitimate email as spam is more costly than identifying spam as legitimate. To cope with this cost difference, we redefine accuracy as weighted accuracy and error as weighted error, with each legitimate email weighted by $\lambda$:

$$WAcc = \frac{\lambda \cdot N_{Leg \to Leg} + N_{Spam \to Spam}}{N_{Spam} + \lambda \cdot N_{Leg}}$$

$$WErr = \frac{\lambda \cdot N_{Leg \to Spam} + N_{Spam \to Leg}}{N_{Spam} + \lambda \cdot N_{Leg}}$$
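The effect of the weighting can be checked numerically with the sketch below. It assumes the common form in which every legitimate email counts `lam` times in both numerator and denominator, so that `lam = 1` gives plain accuracy. The corpus sizes are those of Ling Spam as stated earlier; the counts of correctly classified emails are invented purely for illustration and are not results from the experiments.

```python
def weighted_accuracy(n_leg_leg, n_spam_spam, n_leg, n_spam, lam=1):
    """Weighted accuracy; each legitimate email is counted `lam` times."""
    return (lam * n_leg_leg + n_spam_spam) / (lam * n_leg + n_spam)


def weighted_error(n_leg_spam, n_spam_leg, n_leg, n_spam, lam=1):
    """Weighted error; the complement of weighted accuracy."""
    return (lam * n_leg_spam + n_spam_leg) / (lam * n_leg + n_spam)


# Corpus sizes from the Ling Spam corpus; classification counts are hypothetical.
n_leg, n_spam = 2412, 481
n_leg_leg, n_spam_spam = 2400, 400

for lam in (1, 9, 999):
    print(lam, round(weighted_accuracy(n_leg_leg, n_spam_spam, n_leg, n_spam, lam), 4))
```

With larger values of `lam` the score is dominated by how well legitimate emails are preserved, so a filter that rarely blocks legitimate mail scores higher even if it misses some spam.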


27/36

Evaluation Measures

Spam Recall

If we consider identification of spam as a filtering process and filter out all of the identified spam from the legitimate emails, then spam recall measures the percentage of spam messages that the filter manages to block:

$$SR = \frac{N_{Spam \to Spam}}{N_{Spam}}$$

Spam Precision

Measures the degree to which the blocked messages are indeed spam:

$$SP = \frac{N_{Spam \to Spam}}{N_{Spam \to Spam} + N_{Leg \to Spam}}$$

(Androutsopoulos et al. 00, Sahami et al.)
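Spam recall and spam precision can be computed directly from the classification counts, as in the sketch below. The function names and the example counts are assumptions for illustration, not figures from the experiments.

```python
def spam_recall(n_spam_spam, n_spam):
    """Fraction of all spam messages that the filter blocked."""
    return n_spam_spam / n_spam


def spam_precision(n_spam_spam, n_leg_spam):
    """Fraction of blocked messages that really were spam."""
    return n_spam_spam / (n_spam_spam + n_leg_spam)


# Hypothetical counts: 400 of 481 spam emails blocked, 12 legitimate emails blocked.
print(round(spam_recall(400, 481), 4))
print(round(spam_precision(400, 12), 4))
```

A filter can trade one measure for the other: blocking more aggressively raises recall but tends to lower precision, because more legitimate mail is caught.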


28/36


29/36


30/36

Experimental Results (1)

[Figure: Spam Recall values for K = 1. x-axis: Features (0 to 300); y-axis: Spam Recall (70 to 95). Series: LSI, Thresholding, MI (entire data), MI (individual file).]


31/36


[Figure: Spam Recall values for K = 3. x-axis: Features (0 to 300); y-axis: Spam Recall (70 to 95). Series: LSI, Thresholding, MI (entire data), MI (individual file).]


32/36


33/36


[Figure: Spam Precision values for K = 3. x-axis: Features (0 to 300); y-axis: Spam Precision (70 to 95). Series: LSI, Thresholding, MI (entire data), MI (individual file).]


34/36

Summary of Results from Experiment

MI performs well on accuracy.

MI scores calculated over the entire data set perform better than MI scores calculated on the individual files.

LSI and TF thresholding perform well on spam recall but are outperformed by MI on spam precision.

LSI and TF thresholding have similar results.


35/36

Observation

Changing the value of K for the nearest neighbor does not have a significant impact on the results. Values of K from 1 to 7 give approximately the same results.

Feature set size and accuracy
There isn't any consistent relationship.

Changing the value of the weight λ from 9 to 999 (in the weighted accuracy equation) improves the accuracy by 0.5% to 1.5% on average.

The best accuracy results were obtained with the smaller feature sets, which is a great improvement over the original feature space of over 40 thousand features.


36/36

Future work

Minimum feature set size
I was unable to find the minimum feature set size after which the performance starts degrading.

Other features of email
The corpus I used does not have other features of emails such as attachments, pictures, domain properties etc. Adding these as features should have a good impact on accuracy and has been examined in (Sahami et al. 97).

Spam rate of corpus
The spam rate of the corpus was about 16%, which should be higher. Increasing the spam rate to 70% or 80% might improve the performance in terms of spam recall and precision, and would better reflect the current spam rate.
