Azam: Comparative Study on Feature Space Reduction Techniques for Spam Detection (Presentation)
Post on 31-May-2018
-
8/15/2019 Azam: Comparative Study on Feature Space Reduction techniques for Spam Detection (Presentation)
Comparative Study on Feature Space Reduction Techniques for Spam Detection
Researchers: Nouman Azam, Dr. Amir Hanif Dar, Samiullah Marwat
MS Thesis (Presentation)
-
The Problem
Email is the most widely used medium for communication worldwide. It is cheap, reliable, fast and easily accessible.
It is prone to spam emails. Why?
Due to its wide usage and cheapness.
With a single click you can communicate with anyone anywhere around the globe.
It costs spammers hardly any more to send out 1 million emails than to send 10 emails.
-
Statistics of Spam
Is spam really a problem? Some statistics will clarify.
At the end of 2002, as much as 40% of all email traffic consisted of spam. (http://zdnet.com.com/2100-1106-955842.html)
In 2003 the percentage was estimated to be about 50% of all emails. (http://zdnet.com.com/2100-1105_2-1019528.html)
In 2006 BBC News reported 96% of all emails to be spam. (http://news.bbc.co.uk/2/hi/technology/5219554.stm)
-
Statistics of Spam
Users who reply to spam email: 28%
Annual spam in a 1,000-employee company: 2.1 million
Email address changes due to spam: 16%
Spam cost to all U.S. corporations in 2002: $8.9 billion
Spam cost to all non-corporate Internet users: $255 million
Annual spam received per person: 2,200
Daily spam received per person: 6
Daily spam emails sent: 12.4 billion
http://spam-filter-review.toptenreviews.com/spam-statistics.html
-
Statistics of Spam
http://www.junk-o-meter.com/stats/index.php
-
Problems from Spam
Wastage of network resources (bandwidth).
Wastage of time: spam wastes the time of people working in organizations, resulting in reduced productivity.
Damage to PCs: computer viruses can cause serious damage to PCs.
Ethical issues: spam emails advertising pornographic sites can cause problems for children.
-
Definition of Spam
Unsolicited (unwanted) email for a recipient.
OR
Any email that the user does not want to have in his inbox.
-
Existing Approaches
Rule based: hand-made rules for the detection of spam, written by experts. (Needs domain experts and constant updating of rules.)
Customer revolt: forcing companies not to publicize personal email IDs given to them. (Hard to implement.)
Domain filters: allowing mail from specific domains only. (It is hard to keep track of the domains that are valid for a user.)
Blacklisting: blacklist filters use databases of known abusers, and also filter unknown addresses. (Constant updating of the databases would be required.)
http://www.templetons.com/brad/spam/spamsol.html
-
Existing Approaches
Whitelist filters: mailer programs learn all contacts of a user and let mail from those contacts through directly. (Everyone must first communicate his email ID to the user, and only then can he send email.)
Hiding address: hiding one's original address from spammers by receiving all emails at a temporary email ID, from which they are forwarded to the original address if found valid by the user. (It is hard to maintain a couple of email IDs.)
Checks on the number of recipients by the email agent programs.
Government actions: laws implemented by governments against spammers. (Such laws are hard to enforce.)
Lastly, automated recognition of spam: uses machine learning algorithms that first learn from the past data available. (Seems to be the best approach at present.)
http://www.templetons.com/brad/spam/spamsol.html
-
Why Automated Spam Detection is Best
Minimum user input: the filter will filter spam automatically with minimum user input.
Adaptation to new kinds of spam: the filter can adapt itself to new, previously unknown kinds of spam, i.e. it will learn and update itself automatically.
-
Nature of the Problem
Instance of document classification: spam detection can be considered a simple instance of the document classification problem, where we have two classes and our objective is to separate spam from legitimate emails.
The features in our domain will be words.
Representation of emails: any email can be represented in terms of features (taken to be words in this case) with discrete values based on some statistics of the presence or absence of words.
-
Main Steps
-
Preprocessing of Data
Removal of words shorter than 3 characters: all words with a length of less than 3 were removed, as they were found to be mostly non-informative.
Removal of stop words: stop words are those that provide the structure of the language rather than its content. They are not informative about the class of the document. Examples are pronouns and conjunctions.
Stemming with the Porter stemming algorithm (Porter 1980): stemming reduces words that share the same stem to a single word, thus reducing the vocabulary.
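The three preprocessing steps can be sketched roughly as follows. This is only an illustration: the stop-word list is a tiny made-up sample, and `crude_stem` is a hypothetical suffix stripper standing in for the Porter algorithm, which has many more rules.

```python
import re

# Tiny illustrative stop-word list (an assumption); a real list is much longer.
STOP_WORDS = {"the", "and", "but", "you", "she", "they", "are", "this", "that", "with", "for"}

def crude_stem(word):
    """Rough suffix stripper used as a stand-in; this is NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Strip the suffix only when a stem of at least 3 letters remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop short words, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if len(t) >= 3]          # remove words shorter than 3 letters
    tokens = [t for t in tokens if t not in STOP_WORDS]  # remove stop words
    return [crude_stem(t) for t in tokens]

tokens = preprocess("They are offering free offers and we offered it to you")
print(tokens)
```

In this toy run, "offering", "offers" and "offered" all collapse to the same stem, which is the vocabulary reduction the slide describes.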
-
Preprocessing
Some stop words
Examples of stemmed words
Ling Spam corpus after the preprocessing
http://d/presentation/ling_Spam_After_PP/PART1_After_PP.txt
-
Representation of Data
[Figure: data matrix of features (feature #1 to feature #4) against examples (example #1 to example #4)]
-
Representation of Data
Term Frequency (TF)
$w_{ij}$ = weight of a term i in email j
$tf_{ij}$ = frequency of a term i in email j
$$w_{ij} = tf_{ij}$$
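A minimal sketch of the term frequency representation, assuming each email has already been reduced to a list of preprocessed tokens. The function name `term_frequency_matrix` and the sample emails are illustrative, not taken from the thesis.

```python
from collections import Counter

def term_frequency_matrix(emails):
    """Return (vocabulary, matrix) where matrix[j][i] is the count of term i in email j."""
    vocabulary = sorted({term for email in emails for term in email})
    matrix = []
    for email in emails:
        counts = Counter(email)  # missing terms count as 0
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

# Two made-up, already tokenized emails.
emails = [["free", "offer", "free"], ["meeting", "offer"]]
vocabulary, tf = term_frequency_matrix(emails)
print(vocabulary)
print(tf)
```

Each row of the matrix is one email and each column one term, so the weight of a term in an email is simply its raw count.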
-
The Corpus (Data Set)
The corpus used in our experiments was the Ling Spam Corpus (Androutsopoulos et al. 00).
The total number of legitimate emails in the corpus was 2412.
The total number of spam emails was 481.
The spam percentage was about 16%.
-
Feature Reduction Methods
Mutual Information (MI)
Latent Semantic Indexing (LSI, PCA or KLT)
Word Frequency Thresholding (TF)
-
Mutual Information
Supervised feature selection method.
MI for a feature t can be calculated as
$$MI(t, c) = \sum_{t \in \{0,1\}} \sum_{c \in \{Spam, Leg\}} P(t, c) \log \frac{P(t, c)}{P(t)\,P(c)}$$
where t = term (feature) and c = class.
MI scores were calculated for all of the terms (features); the features were then sorted in descending order of score and the top-scoring features were selected. (Sahami et al. 98)
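A small sketch of how the MI score of one term could be estimated from counts. It assumes probabilities are estimated as simple relative frequencies over the emails, and that a joint probability of zero contributes nothing to the sum; the function name and the toy data are illustrative.

```python
import math

def mutual_information(presence, labels):
    """presence[j] is 1 if the term occurs in email j, else 0; labels[j] is "spam" or "leg"."""
    n = len(labels)
    score = 0.0
    for t in (0, 1):
        for c in ("spam", "leg"):
            p_tc = sum(1 for p, l in zip(presence, labels) if p == t and l == c) / n
            p_t = sum(1 for p in presence if p == t) / n
            p_c = sum(1 for l in labels if l == c) / n
            if p_tc > 0:  # a zero joint probability adds nothing to the sum
                score += p_tc * math.log(p_tc / (p_t * p_c))
    return score

labels = ["spam", "spam", "leg", "leg"]
informative = mutual_information([1, 1, 0, 0], labels)    # term appears only in spam
uninformative = mutual_information([1, 0, 1, 0], labels)  # term is independent of the class
print(informative, uninformative)
```

A term that appears equally often in both classes scores zero, while a term that appears only in one class scores highest, so sorting by this score puts the most class-revealing terms first.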
-
Term Frequency Thresholding
Unsupervised feature selection.
Term frequency: the TF of a feature in a document is the number of times it appears in that document.
Term frequency score of a feature: the TF score of a feature is the sum of its individual term frequencies over the entire set of documents (emails).
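The TF score can be sketched as a column sum over the term frequency matrix. Keeping the `k` highest-scoring terms is an assumption about how the threshold is applied (the experiments speak of selecting the top features); the names and numbers below are illustrative.

```python
def top_terms_by_tf(vocabulary, tf_matrix, k):
    """tf_matrix[j][i] is the frequency of term i in email j; return the k terms with the highest total."""
    scores = [sum(row[i] for row in tf_matrix) for i in range(len(vocabulary))]
    ranked = sorted(range(len(vocabulary)), key=lambda i: scores[i], reverse=True)
    return [vocabulary[i] for i in ranked[:k]]

vocabulary = ["free", "meeting", "offer"]
tf_matrix = [[2, 0, 1],   # email 1
             [0, 1, 1],   # email 2
             [3, 0, 0]]   # email 3
selected = top_terms_by_tf(vocabulary, tf_matrix, 2)
print(selected)
```

No class labels are used anywhere in the score, which is what makes the method unsupervised.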
-
Latent Semantic Indexing
Unsupervised feature extraction.
Also known as Principal Component Analysis and the Karhunen-Loève transform.
It calculates the eigenvectors EV of the covariance matrix C, which is obtained by multiplying the mean-adjusted data with its transpose.
The eigenvectors corresponding to the largest eigenvalues are selected.
The transformed data TD is obtained by taking the transpose of the eigenvector matrix and multiplying it with the mean-adjusted data, i.e.
TD = EV' * (mean-adjusted data) (Günal et al. 05)
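The projection step can be sketched in plain Python for the simplest case of keeping a single eigenvector. This is a toy illustration under several assumptions: features are stored as rows and examples as columns, the covariance is scaled by 1/(n - 1), and power iteration is used to find only the leading eigenvector. A real implementation would normally use a linear algebra library to obtain several eigenvectors at once.

```python
import math

def mean_adjust(data):
    """data[i][j] is the value of feature i in example j; subtract each feature's mean."""
    adjusted = []
    for row in data:
        mean = sum(row) / len(row)
        adjusted.append([v - mean for v in row])
    return adjusted

def covariance(adjusted):
    """C = X X^T / (n - 1) for mean-adjusted X with features as rows."""
    n = len(adjusted[0])
    return [[sum(a * b for a, b in zip(r1, r2)) / (n - 1) for r2 in adjusted]
            for r1 in adjusted]

def top_eigenvector(matrix, iterations=200):
    """Power iteration: unit eigenvector for the largest eigenvalue of a symmetric matrix."""
    vec = [1.0] * len(matrix)  # arbitrary non-zero starting vector
    for _ in range(iterations):
        vec = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec

data = [[2.0, 4.0, 6.0, 8.0],   # feature 1 across four examples
        [1.0, 2.0, 3.0, 4.0]]   # feature 2 across four examples
adjusted = mean_adjust(data)
ev = top_eigenvector(covariance(adjusted))
# TD = EV' * mean-adjusted data, here with a single retained eigenvector.
td = [sum(e * row[j] for e, row in zip(ev, adjusted)) for j in range(len(adjusted[0]))]
print(ev, td)
```

Here two perfectly correlated features are replaced by one derived feature per example, which is the kind of reduction the slide describes.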
-
The Classifier
The classifier used was K-Nearest Neighbor.
All the data were stored in memory.
Classification of a new example is carried out by finding its Euclidean distance from all the stored data. The class of the nearest stored examples is assigned to the new example. (Androutsopoulos et al. 00)
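A minimal sketch of such a classifier. Combining the k nearest neighbours by majority vote is an assumption, as the slide does not say how their classes are combined; the training vectors are made up.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train_vectors, train_labels, query, k=1):
    """Return the majority class among the k stored examples closest to query."""
    order = sorted(range(len(train_vectors)),
                   key=lambda i: euclidean(train_vectors[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Made-up term frequency vectors over two features.
train_vectors = [[3, 0], [2, 1], [0, 2], [0, 3]]
train_labels = ["spam", "spam", "leg", "leg"]
prediction = knn_classify(train_vectors, train_labels, [2, 0], k=3)
print(prediction)
```

Because every query is compared against every stored example, the cost of classification grows with both the corpus size and the number of features, which is one reason feature reduction matters here.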
-
The Classifier
-
Experimental settings
In the first set of experiments the data was represented using term frequency.
Three algorithms were tested: MI, LSI and TF thresholding.
All three algorithms were used to select the top 20, 50, 100 and 250 features.
-
Evaluation Measures
Accuracy
Let $N_{Spam}$ and $N_{Leg}$ be the total number of spam and legitimate emails in our data set.
Let $N_{Y \to Z}$ be the number of emails that are classified as Z but belong to class Y. Then
$$Acc = \frac{N_{Spam \to Spam} + N_{Leg \to Leg}}{N_{Spam} + N_{Leg}}$$
$$Err = \frac{N_{Leg \to Spam} + N_{Spam \to Leg}}{N_{Spam} + N_{Leg}}$$
Identifying a legitimate email as spam is more costly than identifying spam as legitimate. To cope with this cost difference, we redefine accuracy as weighted accuracy and error as weighted error, with each legitimate email weighted by $\lambda$:
$$WAcc = \frac{\lambda \cdot N_{Leg \to Leg} + N_{Spam \to Spam}}{\lambda \cdot N_{Leg} + N_{Spam}}$$
$$WErr = \frac{\lambda \cdot N_{Leg \to Spam} + N_{Spam \to Leg}}{\lambda \cdot N_{Leg} + N_{Spam}}$$
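The weighted measures can be sketched as below, assuming the weight on legitimate emails is passed as `lam`. The class totals match the corpus slide (2412 legitimate, 481 spam), but the split into correct and incorrect classifications is invented purely for illustration.

```python
def weighted_accuracy(n_leg_leg, n_leg_spam, n_spam_spam, n_spam_leg, lam=1):
    """n_y_z counts emails of class y classified as z; lam weights each legitimate email."""
    n_leg = n_leg_leg + n_leg_spam
    n_spam = n_spam_spam + n_spam_leg
    return (lam * n_leg_leg + n_spam_spam) / (lam * n_leg + n_spam)

def weighted_error(n_leg_leg, n_leg_spam, n_spam_spam, n_spam_leg, lam=1):
    """Weighted share of misclassified emails; equals 1 - weighted accuracy."""
    n_leg = n_leg_leg + n_leg_spam
    n_spam = n_spam_spam + n_spam_leg
    return (lam * n_leg_spam + n_spam_leg) / (lam * n_leg + n_spam)

# Hypothetical confusion counts (not results from the thesis).
counts = dict(n_leg_leg=2400, n_leg_spam=12, n_spam_spam=431, n_spam_leg=50)
plain_acc = weighted_accuracy(**counts, lam=1)   # reduces to ordinary accuracy
wacc_9 = weighted_accuracy(**counts, lam=9)
werr_9 = weighted_error(**counts, lam=9)
print(plain_acc, wacc_9, werr_9)
```

With `lam=1` the weighted measures reduce to plain accuracy and error; larger values make each blocked legitimate email count as much as `lam` missed spam emails.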
-
Evaluation Measures
Consider the identification of spam as a filtering process that filters out all of the identified spam from the legitimate emails. Then:
Spam recall measures the percentage of spam messages that the filter manages to block.
$$SR = \frac{N_{Spam \to Spam}}{N_{Spam}}$$
Spam precision measures the degree to which the blocked messages are indeed spam.
$$SP = \frac{N_{Spam \to Spam}}{N_{Spam \to Spam} + N_{Leg \to Spam}}$$
(Androutsopoulos et al. 00, Sahami et al.)
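A short sketch of the two measures, taking the total number of spam emails to be the spam that was blocked plus the spam that got through. The counts are made up for illustration.

```python
def spam_recall(n_spam_spam, n_spam_leg):
    """Share of all spam emails that the filter blocked."""
    return n_spam_spam / (n_spam_spam + n_spam_leg)

def spam_precision(n_spam_spam, n_leg_spam):
    """Share of blocked emails that really were spam."""
    return n_spam_spam / (n_spam_spam + n_leg_spam)

# Hypothetical filter outcome: 90 spam blocked, 10 spam missed, 5 legitimate emails blocked.
recall = spam_recall(n_spam_spam=90, n_spam_leg=10)
precision = spam_precision(n_spam_spam=90, n_leg_spam=5)
print(recall, precision)
```

Recall falls when spam slips through, while precision falls when legitimate mail is blocked, so the two measures capture the two different kinds of mistake.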
-
Experimental Results (1)
[Figure: Spam Recall values for K = 1. X-axis: Features (0 to 300); Y-axis: Spam Recall (70 to 95). Series: LSI, Thresholding, MI (entire data), MI (individual file).]
-
Experimental Results (1)
[Figure: Spam Recall values for K = 3. X-axis: Features (0 to 300); Y-axis: Spam Recall (70 to 95). Series: LSI, Thresholding, MI (entire data), MI (individual file).]
-
Experimental Results (1)
[Figure: Spam Precision values for K = 3. X-axis: Features (0 to 300); Y-axis: Spam Precision (70 to 95). Series: LSI, Thresholding, MI (entire data), MI (individual file).]
-
Summary of Results from Experiment
MI performs well in terms of accuracy.
MI scores calculated over the entire data set perform better than MI scores calculated on the individual files.
LSI and TF thresholding perform well in spam recall but are outperformed by MI in spam precision.
LSI and TF thresholding give similar results.
-
Observation
Changing the value of K for the nearest neighbor classifier does not have a significant impact on the results. Values of K from 1 to 7 give approximately the same results.
Feature set size and accuracy: there is no consistent relationship.
Changing the value of λ from 9 to 999 (in the weighted accuracy equation) improves the accuracy by 0.5% to 1.5% on average.
The best accuracy results were obtained with the smaller feature sets, which is a great improvement over the original feature space of over 40 thousand features.
-
Future work
Minimum feature set size: I was unable to find the minimum feature set size below which the performance starts degrading.
Other features of email: the corpus I used does not have other features of emails such as attachments, pictures, domain properties etc. Adding these as features should have a good impact on accuracy and has been examined in (Sahami et al. 97).
Spam rate of corpus: the spam rate of the corpus was about 16%, which should be higher. Increasing the spam rate to 70% or 80% might improve the performance in terms of spam recall and precision, and would better reflect the current spam rate.