Improving Spam Detection Based on Structural Similarity
By Luiz H. Gomes, Fernando D. O. Castro, Rodrigo B. Almeida, Luis M. A. Bettencourt, Virgílio A. F. Almeida, Jussara M. Almeida
Presented at Steps to Reducing Unwanted Traffic on the Internet Workshop, 2005
Presented by Jared Bott
Outline
Overview
Concepts
Detecting Spam
Experimental Results
Analysis of Paper
Overview
New algorithm to detect spam messages
  Uses email information that is harder to change
Works in conjunction with another spam classifier
  E.g. SpamAssassin
Fewer false positives than the compared methods
Spam Detection Problem
Spam detection algorithms use some part of emails to determine if a message is spam
  Spammers change messages so that they do not meet detection criteria for spam
It is very easy to change spam messages, usernames, domains, subjects, etc.
Key Idea
The lists that spammers and legitimate users send messages to and from can be used as the identifiers of classes of email traffic.
  The lists of addresses spammers send to are unlikely to be similar to those of legitimate users.
Lists don’t change that often
Using Lists
A user is not just an email address. It can be a domain, etc.
Represent each email user as a vector in a multi-dimensional conceptual space created with all possible contacts
  Each sender and each recipient has their own vector
Model the relationship between senders and recipients
Constructing Vectors
If there is at least one email sent from sender si to recipient rn, then the value in si’s vector’s nth dimension is 1. Otherwise, that value is 0.
If there is at least one email received by recipient ri from sender sn, the value in ri’s vector’s nth dimension is 1. Otherwise it is 0.
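The two rules above can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `build_vectors`, the `(sender, recipient)` log format, and the user names are assumptions made for the example.

```python
def build_vectors(emails, users):
    """Build binary sender and recipient vectors.

    emails: iterable of (sender, recipient) pairs (hypothetical log format).
    users:  ordered list of all contacts; position n is dimension n.
    """
    index = {user: n for n, user in enumerate(users)}
    send = {user: [0] * len(users) for user in users}
    recv = {user: [0] * len(users) for user in users}
    for sender, recipient in emails:
        # At least one email from sender to recipient sets the entry to 1.
        send[sender][index[recipient]] = 1
        # At least one email received by recipient from sender sets the entry to 1.
        recv[recipient][index[sender]] = 1
    return send, recv


users = ["user1", "user2", "user3"]
emails = [
    ("user1", "user2"),
    ("user1", "user3"),
    ("user2", "user1"),
    ("user2", "user3"),
    ("user1", "user2"),  # repeated emails do not change the binary entry
]
send, recv = build_vectors(emails, users)
```

Repeated messages between the same pair leave the vectors unchanged, since each entry only records whether at least one email exists.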
Example Vectors
User 1: S[0,1,1] R[0,1,0]
User 2: S[1,0,1] R[1,0,0]
User 3: S[0,0,0] R[1,1,0]
Similarity Between Senders
Similarity between senders si and sk is the cosine of the angle between their vectors, cos(si, sk)
  0 means no shared contact
  1 means identical contact lists
In legitimate email, a 1 means that the senders operate in the same social group.
Among spammers, a 1 means that the senders use the same list or are the same person.
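A minimal sketch of the cosine similarity, assuming vectors are plain Python lists. The function name `cosine` is illustrative, and returning 0.0 for an all-zero vector (a user with no contacts) is a convention chosen for this example, not something the slides specify.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length contact vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: a user with no contacts is treated as similar to nobody.
    return dot / norm if norm else 0.0


shared_one_of_two = cosine([0, 1, 1], [1, 0, 1])  # one shared contact out of two each
identical_lists = cosine([1, 1, 0], [1, 1, 0])    # identical contact lists
no_overlap = cosine([1, 0, 0], [0, 1, 0])         # no shared contact
```

For binary vectors the result always lies between 0 and 1, which is why the two extremes can be read directly as "no shared contact" and "identical contact lists".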
Grouping Users Into Clusters
Group users with similar vectors
  Users with similar vectors are likely to have related roles, i.e. spammer or legitimate user
Each cluster is represented by a vector
  This vector is the sum of all its component users’ vectors
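As a small illustration of the cluster representation only (the slides do not say how users are assigned to clusters, so that step is left out), the cluster vector can be computed as an element-wise sum. The name `cluster_vector` is an assumption for this sketch.

```python
def cluster_vector(member_vectors):
    """Element-wise sum of the vectors of all users in a cluster."""
    return [sum(column) for column in zip(*member_vectors)]


# Two senders that share one contact end up in the same (hypothetical) cluster.
sc = cluster_vector([[0, 1, 1], [1, 0, 1]])
```

Unlike the user vectors, the cluster vector is no longer binary: an entry counts how many members have that contact.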
Similarity Between a User and a Cluster
Similarity is derived from the user-to-user similarity equation
  If sender si is a member of cluster sck, then the similarity is cos(sck - si, si).
  If sender si is not a member of cluster sck, then the similarity is cos(sck, si).
Similarity between a user and a cluster will change over time
  Remove the user’s vector from the cluster’s vector when computing similarity and reclassifying a user
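The membership rule can be sketched as follows. The helper names are illustrative, and the zero-vector convention in `cosine` (returning 0.0) is an assumption of this example.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros (assumed convention)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def user_cluster_similarity(user_vec, cluster_vec, is_member):
    """Similarity between a user and a cluster vector (sum of member vectors)."""
    if is_member:
        # Remove the user's own contribution so the user is not compared with itself.
        cluster_vec = [c - x for c, x in zip(cluster_vec, user_vec)]
    return cosine(cluster_vec, user_vec)


cluster = [1, 1, 2]  # sum of [0, 1, 1] and [1, 0, 1]
as_member = user_cluster_similarity([0, 1, 1], cluster, is_member=True)
as_outsider = user_cluster_similarity([0, 1, 1], cluster, is_member=False)
```

Without the subtraction, a member would always look more similar to its own cluster than it really is, because its own vector is part of the sum.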
Detecting Spam
Two probabilities to compute
  Ps(m) – Probability of an email m being sent by a spammer
  Pr(m) – Probability of an email m being addressed to users that receive spam
Detecting Spam
When an email arrives, classify it using some other method
Find the cluster (sc) the email’s sender belongs in
  If many users in the cluster send messages that are classified as spam by the auxiliary method, the probability of all the users in that cluster sending spam is high
Update the sc’s spam probability
  Ps(m) ← sc’s spam probability
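One possible way to realise the sender-side step is sketched below. The slides do not say how a cluster's spam probability is updated, so this example assumes it is the running fraction of the cluster's messages that the auxiliary classifier labelled as spam. `ClusterStats`, `sender_probability`, and the dictionary layout are all hypothetical.

```python
class ClusterStats:
    """Running spam statistics for one cluster (assumed representation)."""

    def __init__(self):
        self.spam = 0
        self.total = 0

    def update(self, is_spam):
        self.total += 1
        if is_spam:
            self.spam += 1

    @property
    def spam_probability(self):
        # Assumption: fraction of messages the auxiliary method labelled as spam.
        return self.spam / self.total if self.total else 0.0


def sender_probability(sender, aux_says_spam, cluster_of, stats):
    """Update the sender's cluster and return Ps(m) for this email."""
    sc = cluster_of[sender]
    stats[sc].update(aux_says_spam)
    return stats[sc].spam_probability


cluster_of = {"a@example.com": "c1", "b@example.com": "c1"}
stats = {"c1": ClusterStats()}
first = sender_probability("a@example.com", True, cluster_of, stats)
second = sender_probability("b@example.com", False, cluster_of, stats)
```

Because the statistics are kept per cluster, a message from one sender shifts the probability assigned to every other sender in the same cluster.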
Detecting Spam
For all recipients of the email, find the cluster (rc) each one belongs to
Update the spam probability for each cluster
Pr(m) ← Pr(m) + spam probability of each rc
Pr(m) ← Pr(m)/number of recipients
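The averaging in the last two steps can be sketched as below. The example covers only the accumulation and division; it assumes each recipient cluster's spam probability has already been updated and is available in a dictionary. All names are illustrative.

```python
def recipient_probability(recipients, cluster_of, cluster_spam_probability):
    """Pr(m): mean spam probability of the clusters of the email's recipients."""
    if not recipients:
        return 0.0  # assumed convention for an email with no recipients
    pr = 0.0
    for recipient in recipients:
        rc = cluster_of[recipient]
        pr += cluster_spam_probability[rc]  # Pr(m) <- Pr(m) + spam probability of rc
    return pr / len(recipients)             # Pr(m) <- Pr(m) / number of recipients


cluster_of = {"x@example.com": "rcA", "y@example.com": "rcB"}
cluster_spam_probability = {"rcA": 0.75, "rcB": 0.25}
pr_m = recipient_probability(
    ["x@example.com", "y@example.com"], cluster_of, cluster_spam_probability
)
```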
Detecting Spam
Compute a spam rank for the email based upon Pr(m) and Ps(m)
If the spam rank is above some threshold (ω), label it as spam
If the spam rank is below 1 - ω, label it as legitimate
Otherwise label the email as the auxiliary method’s classification
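The three-way decision can be sketched as follows. The slides do not give the formula that combines Pr(m) and Ps(m) into a spam rank, so the rank is taken as an input here. The function name and the string labels are assumptions, as is the requirement that ω lie between 0.5 and 1 so that the two thresholds do not cross.

```python
def label_email(spam_rank, omega, auxiliary_label):
    """Label an email from its spam rank, falling back to the auxiliary classifier.

    Assumes 0.5 <= omega <= 1, so that 1 - omega <= omega.
    """
    if spam_rank > omega:
        return "spam"
    if spam_rank < 1 - omega:
        return "legitimate"
    # Rank is inconclusive: keep whatever the auxiliary method decided.
    return auxiliary_label


high = label_email(0.9, 0.8, auxiliary_label="legitimate")
low = label_email(0.1, 0.8, auxiliary_label="spam")
undecided = label_email(0.5, 0.8, auxiliary_label="spam")
```

A larger ω widens the inconclusive band, so the structural evidence overrides the auxiliary classifier less often.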
Experimental Results
Tested on a log of eight days of email from a large Brazilian university
Tested on a 2.8 GHz Pentium 4 with 512 MB RAM
  Able to classify 20 messages per second
  Faster than the average message arrival peak rate
Results
| Measure | Non-Spam | Spam | Aggregate |
|---|---|---|---|
| # of emails | 191,417 | 173,584 | 365,001 |
| Size of emails | 11.3 GB | 1.2 GB | 12.5 GB |
| # of distinct senders | 12,338 | 19,567 | 27,734 |
| # of distinct recipients | 22,762 | 27,926 | 38,875 |
Results
Manually checked false positives to see if they were spam or not
  Auxiliary algorithm had more false positives

| Algorithm | % of Misclassifications |
|---|---|
| Original Classification | 60.33% |
| Their approach | 39.67% |
Strengths
Fewer false positives than SpamAssassin
Low-cost
Works with message information that doesn’t change that much
Weaknesses
Needs an additional message classifier, e.g. SpamAssassin
Manual tuning of algorithm
Improvements
Time correlation of similar addresses
Collaborative filtering based upon user feedback