Database Techniques for fighting SPAM
Telvis Calhoun
CSc 8710 – Advanced Databases
Dr. Yingshu Li
Everybody knows about SPAM
Spam is unsolicited bulk email sent for profit and general mayhem.
Botnets = distributed networks of hijacked IPs.
IPs are hard to track.
70 billion emails sent per day; 70% are spam.
How does Anti-SPAM use DBs?
Spam databases collect network layer and application layer data.
IP Blacklisting
Detect a malicious host during the SMTP dialog.
Difficult to detect: IP addresses change (DHCP), botnets are large, and good IPs are used to forward.
Content Analysis
Detect malicious mail content.
Requires that the MTA complete the SMTP connection.
Arms race between content filter designers and spammers.
Summary of DB Techniques
Grey Space Analysis
Trinity: Peer-to-Peer Database
Behavioral Blacklisting
Progressive Email Scanning
Content filtering using Bayesian Analysis
Grey Space Analysis
Characterize IP Space: Active vs. Grey Space
IP Flow Database
Detect malicious IPs by extracting dominant scanning ports (DSPs)
Find DSPs using relative uncertainty algorithm
Mining Technique: Relative Uncertainty
Determines entropy of IP ports in the flows database.
RU := entropy of the dstPrt distribution ÷ maximum entropy.
p[i] := number of flows with port[i] ÷ total flows
RU close to 1 shows a nearly even distribution; RU near 0 shows an uneven distribution.
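A minimal sketch of the relative uncertainty calculation described on this slide. The function name and the choice of maximum entropy (log2 of the number of distinct ports observed) are assumptions made for illustration; the paper may normalize differently.

```python
import math
from collections import Counter


def relative_uncertainty(dst_ports):
    """Entropy of the destination-port distribution divided by maximum entropy.

    Assumption: maximum entropy is log2(number of distinct ports observed).
    """
    counts = Counter(dst_ports)
    total = len(dst_ports)
    if len(counts) < 2:
        # A single port (or no flows) has no uncertainty at all.
        return 0.0
    # p[i] = flows with port i / total flows
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))
```

An even split over two ports gives an RU of 1, while a flow set dominated by one port gives an RU near 0.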
Grey Space Algorithm
Isolate flows toward grey space
Find dominant scanning ports (DSPs)
Find outside hosts with DSPs flows toward grey and active hosts.
Find inside host footprint for outside hosts.
Classify adversary as hitter or scanner.
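The first two steps can be sketched as follows. This is an illustration under stated assumptions, not the paper's exact procedure: flows are hypothetical `(src_ip, dst_ip, dst_port)` tuples, and ports are peeled off most-frequent-first until the relative uncertainty of the remaining ports reaches an assumed threshold `beta`.

```python
import math
from collections import Counter


def relative_uncertainty(counts):
    """Entropy of a port-count table divided by log2(distinct ports)."""
    total = sum(counts.values())
    if len(counts) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))


def dominant_scanning_ports(flows, grey_hosts, beta=0.9):
    """Return ports that dominate the flows aimed at grey space.

    beta is an assumed cut-off: once the remaining ports look close to
    evenly distributed (RU >= beta), nothing else counts as dominant.
    """
    # Step 1: isolate flows toward grey space.
    counts = Counter(port for _src, dst, port in flows if dst in grey_hosts)
    if len(counts) == 1:
        return list(counts)
    # Step 2: remove the most frequent port until the rest is near uniform.
    dominant = []
    while len(counts) >= 2 and relative_uncertainty(counts) < beta:
        port, _ = counts.most_common(1)[0]
        dominant.append(port)
        del counts[port]
    return dominant
```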
Focused Hitters vs Bad Scanners
Focused hitters tend to send tens or hundreds of flows to each grey host.
Bad scanners send one or a few flows to each grey host
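A rough sketch of the hitter/scanner distinction. The slide only says "tens or hundreds" versus "one or a few" flows per grey host, so the cut-off of 10 average flows per grey host (`hitter_threshold`) and the `(src_ip, dst_ip)` flow format are assumptions.

```python
from collections import defaultdict


def classify_outside_hosts(flows, grey_hosts, hitter_threshold=10.0):
    """Label each outside host by its average number of flows per grey host."""
    per_src = defaultdict(lambda: defaultdict(int))
    for src, dst in flows:
        if dst in grey_hosts:
            per_src[src][dst] += 1

    labels = {}
    for src, per_dst in per_src.items():
        avg_flows = sum(per_dst.values()) / len(per_dst)
        if avg_flows >= hitter_threshold:
            labels[src] = "focused hitter"
        else:
            labels[src] = "bad scanner"
    return labels
```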
Trinity: Distributed IP Reputation Database
Botnets send a large amount of data in a short amount of time.
Trinity uses a distributed in-memory hash table containing IP reputation entries.
Each peer has 10 to 50 megabytes of data (833K – 4.17M entries)
Chord Distributed Hash Table
Distribute data over a large P2P network
Quickly find any given item
Stores key/value pairs
The key controls which node(s) store the value
Each node is responsible for some section of the space
Basic operations
Store(key, val)
val = Retrieve(key)
Chord (cont)
Each node chooses an n-bit ID
IDs are arranged in a ring
Each lookup key is also an n-bit ID
i.e., the hash of the real lookup key
Node IDs and keys occupy the same space!
Each node is responsible for storing keys "near" its ID
Replication is usually between the current and previous node
Items can be replicated at multiple successors
No single host contains a large fraction of a particular space, to guard against DDoS.
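A simplified sketch of the ring idea: node names and keys are hashed into the same ID space, and a key is stored on the first node whose ID is at or after the key's ID (its successor). The `Ring` class, the use of SHA-1, and the 32-bit ID space are assumptions for illustration; real Chord also keeps finger tables for fast lookups and replicates entries, which this sketch omits.

```python
import hashlib
from bisect import bisect_left


class Ring:
    """Toy single-process model of a Chord-style ring (no finger tables)."""

    def __init__(self, node_names, bits=32):
        self.bits = bits
        self.nodes = sorted((self._hash(name), name) for name in node_names)
        self.ids = [node_id for node_id, _ in self.nodes]
        self.tables = {name: {} for name in node_names}

    def _hash(self, key):
        # Node names and lookup keys share one n-bit ID space.
        digest = hashlib.sha1(key.encode()).digest()
        return int.from_bytes(digest, "big") % (1 << self.bits)

    def owner(self, key):
        # Successor of the key's ID, wrapping around the ring.
        idx = bisect_left(self.ids, self._hash(key)) % len(self.ids)
        return self.nodes[idx][1]

    def store(self, key, val):
        self.tables[self.owner(key)][key] = val

    def retrieve(self, key):
        return self.tables[self.owner(key)].get(key)
```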
Database Updates
Compute the number of interval quarters since the last update.
Shift and update counters accordingly.
Determine the site responsible for the entry and send the update over UDP.
Once received by the owner site, forward the entry to k peers using TCP.
Updates are commutative, so order doesn't matter.
Consistency is not required.
Even if host goes down, database can be rebuilt in an hour.
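One way to picture the "shift and update counters" step. The entry layout is an assumption: four counters covering the last hour in 15-minute quarters, with `counters[0]` holding the current quarter. The slide does not give the actual field sizes or interval length.

```python
QUARTER_SECONDS = 15 * 60  # assumed interval length
NUM_COUNTERS = 4           # assumed number of quarters kept per entry


class ReputationEntry:
    """Per-IP email counters, one per recent interval quarter."""

    def __init__(self, now):
        self.counters = [0] * NUM_COUNTERS
        self.last_quarter = int(now // QUARTER_SECONDS)

    def record(self, now, emails=1):
        quarter = int(now // QUARTER_SECONDS)
        # Number of interval quarters since the last update.
        elapsed = max(0, quarter - self.last_quarter)
        shift = min(elapsed, NUM_COUNTERS)
        if shift:
            # Age the old counts and open fresh, empty quarters.
            self.counters = [0] * shift + self.counters[:NUM_COUNTERS - shift]
            self.last_quarter = quarter
        self.counters[0] += emails

    def total(self):
        return sum(self.counters)
```

Because each update only adds to a counter, applying two updates within the same quarter in either order gives the same result.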
Security
Secure communications for neighbors
Limit updates for nodes that have sent more than 100 emails in 10 minutes.
Falsified source IPs can cause false positives.
Clustering Technique for Behavioral Blacklisting
Identify spammers that attack many domains.
The sending pattern is the distribution and frequency of target domains.
Form clusters of sending patterns
Use clusters to identify new attacks
Spectral Clustering
Divide Phase – produces a tree whose leaves are elements of the set.
Merge Phase – Start with each leaf in its own cluster and merge going up the tree.
Vector Generation
Database contains: M(i,j,k)
Total times that IP 'i' sent email to domain 'j' in time slot 'k'.
Find total flows for IP/Domain across entire time axis (M').
Generate feature vector from M'
IP := <#flows to domain 1, #flows to domain 2, … #flows to domain j>
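The collapse from M to M' and the per-IP feature vector can be sketched like this. Storing M as a dictionary keyed by `(ip, domain, slot)` and the function names are assumptions for illustration.

```python
from collections import defaultdict


def collapse_time_axis(m):
    """Sum M(i, j, k) over all time slots k to get M'(i, j)."""
    m_prime = defaultdict(int)
    for (ip, domain, _slot), count in m.items():
        m_prime[(ip, domain)] += count
    return dict(m_prime)


def feature_vectors(m_prime, domains):
    """Build one vector per IP: <#flows to domain 1, ..., #flows to domain j>."""
    index = {domain: n for n, domain in enumerate(domains)}
    vectors = {}
    for (ip, domain), count in m_prime.items():
        vec = vectors.setdefault(ip, [0] * len(domains))
        vec[index[domain]] += count
    return vectors
```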
Clustering
Clusters contain IP addresses that send mail to similar sets of domains.
Define the traffic pattern for each cluster by averaging the rows (vector contents) for all IPs in the cluster.
IPxIP matrix of related spam senders
Classification
Input IP vector 'r' := 1 x d vector
Use similarity algorithm to find the closest cluster
Spam score is the maximum similarity of r with any cluster.
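A small sketch of cluster averaging and scoring. The slide does not name the similarity measure, so cosine similarity is used here as an assumption; the function names are also hypothetical.

```python
import math


def cluster_pattern(vectors):
    """Average the rows (feature vectors) of all IPs in one cluster."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def spam_score(r, cluster_patterns):
    """Maximum similarity of the 1 x d vector r with any cluster pattern."""
    return max(cosine_similarity(r, pattern) for pattern in cluster_patterns)
```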
Progressive Email Scanner
Maintains Feature Instance (FI) database
FI is any feature that can discriminate HAM from SPAM.
Dynamic Features - Use any feature that identifies mail (contents, network, etc.).
The paper only uses URL links as FIs.
PEC Architecture
FI States
Grey (Ambiguous FI)
Black (Spam FI)
White (HAM FI)
Blacklist Module – Extracts and hashes FIs
Scoreboard Module – Tracks FI occurrences and timestamp (age)
Competitive Aging and Scoring System (CASS)
Transition between states is governed by:
Score – number of occurrences of the FI
Age – time since the last score update.
Score (R) exceeding the score threshold (S) causes the Grey to Black transition.
Age (A) exceeding the age threshold (M) triggers the Grey to White transition (purge).
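The scoring and aging rules can be sketched as a small scoreboard. The class layout, the use of SHA-1 for hashing FIs, and the choice to check age lazily when an FI is next seen are assumptions; the paper's modules may be organized differently.

```python
import hashlib


class Scoreboard:
    """Tracks grey FIs by score (R) and age (A) against thresholds S and M."""

    def __init__(self, score_threshold, age_threshold):
        self.S = score_threshold
        self.M = age_threshold
        self.grey = {}          # FI hash -> [score, time of last score update]
        self.blacklist = set()  # hashes of FIs that turned black

    def observe(self, fi, now):
        key = hashlib.sha1(fi.encode()).hexdigest()
        if key in self.blacklist:
            return "black"
        entry = self.grey.get(key)
        if entry is not None and now - entry[1] > self.M:
            # Age exceeded M: grey to white, so the old entry is purged.
            del self.grey[key]
            entry = None
        if entry is None:
            entry = self.grey[key] = [0, now]
        entry[0] += 1
        entry[1] = now
        if entry[0] > self.S:
            # Score exceeded S: grey to black.
            del self.grey[key]
            self.blacklist.add(key)
            return "black"
        return "grey"
```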
Bayesian Content Filtering
Determine the probability that a message is spam based on contents
Use Bayesian combination of spam probabilities
Bayesian Training
Requires training corpus of HAM/SPAM
Find interesting tokens.
Create HAM/SPAM token tables
Classification
Tokenize new message
Calculate spam probability for each token
Derive overall spam probability using Bayes formula.
Sample message spam probability = 0.0
Non-spam tokens outweigh spam tokens to prevent false positives
Sample Message:
Hi,
Just a reminder: don't forget your allergy prescription when you visit New York City today.
Mom
[Figure: Spam Probability Table]
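A compact sketch of training and classification, loosely following Graham's "A Plan for Spam" (clamping token probabilities to 0.01..0.99, a default of 0.4 for unseen tokens, and combining the 15 most interesting tokens). The tokenizer and the simplified per-token formula are assumptions; Graham's essay also weights ham counts and ignores rare tokens, which is omitted here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9$'-]+", text.lower())


def train(ham_msgs, spam_msgs):
    """Build a token -> spam probability table from a labeled corpus."""
    ham = Counter(t for m in ham_msgs for t in set(tokenize(m)))
    spam = Counter(t for m in spam_msgs for t in set(tokenize(m)))
    probs = {}
    for tok in set(ham) | set(spam):
        g = ham[tok] / len(ham_msgs)    # fraction of ham containing tok
        b = spam[tok] / len(spam_msgs)  # fraction of spam containing tok
        probs[tok] = min(0.99, max(0.01, b / (g + b)))
    return probs


def spam_probability(text, probs, unknown=0.4, top_n=15):
    """Combine the most interesting token probabilities with Bayes' formula."""
    ps = [probs.get(t, unknown) for t in set(tokenize(text))]
    # "Interesting" means far from the neutral value 0.5.
    ps = sorted(ps, key=lambda p: abs(p - 0.5), reverse=True)[:top_n]
    prod_spam = math.prod(ps)
    prod_ham = math.prod(1 - p for p in ps)
    return prod_spam / (prod_spam + prod_ham)
```

With this combination rule, a few strongly non-spam tokens pull the overall probability toward 0 even when some spam-like tokens are present.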
Real World Applications
TrustedSource.org
Messaging Security Architecture
Summary
A variety of database techniques are used in Anti-Spam Technology
IP Blacklisting
Content filtering
Databases can contain:
Network traffic: IP Addresses, Domain, Ports
Message Content: Words, URLs, HTML Text
Challenges:
Scalability – Must handle many connections or messages
Minimize False Positive Rates – Cannot classify a HAM message as SPAM.
Finding useful SPAM features.
Using machine learning techniques.
References
Brodsky, et al, A Distributed Content Independent Method for Spam Detection, HotBots 2007
Jin, et al, Identifying and Tracking Suspicious Activities through IP Gray Space Analysis, MineNet 2007
Liu, et al, High-Speed Detection of Unsolicited Bulk Emails, ANCS 2007
Ramachandran, A., Filtering Spam with Behavioral Blacklisting, CCS 2007
Cheng, et al., A Divide-and-Merge Methodology for Clustering, ACM Transactions on Database Systems, 2006
Graham P., A Plan for Spam, www.paulgraham.com/spam.html, 2002
Secure Computing Corporation, http://trustedsource.org, 2008