Analyzing Social and Stylometric Features to Identify Spear Phishing Emails
DESCRIPTION
Targeted social engineering attacks in the form of spear phishing emails are often the main gimmick used by attackers to infiltrate organizational networks and implant state-of-the-art Advanced Persistent Threats (APTs). Spear phishing is a complex targeted attack in which an attacker harvests information about the victim prior to the attack. This information is then used to create sophisticated, genuine-looking attack vectors that draw the victim into compromising confidential information. What makes spear phishing different from, and more powerful than, normal phishing is this contextual information about the victim. Online social media services can be one such source for gathering vital information about an individual. In this paper, we characterize and examine a true positive dataset of spear phishing, spam, and normal phishing emails from Symantec’s enterprise email scanning service. We then present a model to detect spear phishing emails sent to employees of 14 international organizations, using social features extracted from LinkedIn. Our dataset consists of 4,742 targeted attack emails sent to 2,434 victims and 9,353 non-targeted attack emails sent to 5,912 non-victims, together with publicly available information from their LinkedIn profiles. We applied various machine learning algorithms to this labeled data and achieved an overall maximum accuracy of 97.76% in identifying spear phishing emails, using a combination of social features from LinkedIn profiles and stylometric features extracted from email subjects, bodies, and attachments. However, we achieved a slightly better accuracy of 98.28% without the social features. Our analysis revealed that social features extracted from LinkedIn do not help in identifying spear phishing emails.
To the best of our knowledge, this is one of the first attempts to make use of a combination of stylometric features extracted from emails, and social features extracted from an online social network to detect targeted spear phishing emails.
TRANSCRIPT
Unifying the Global Response to Cybercrime
Analyzing Social and Stylometric Features to Identify Spearphishing Emails
Prateek Dewan, Anand Kashyap, Ponnurangam Kumaraguru
Indraprastha Institute of Information Technology – Delhi (IIITD), India
Overview
• What is spearphishing?
• Spearphishing and Online Social Media
• Challenges and dataset
• Feature extraction
• Classification results
• Discussion
What is spearphishing?
• Targeted phishing attack
• Contains contextual content instead of random messages
• Harder to detect, since spearphishing emails look more genuine
• Victims are asked to
  • Download malicious attachments
  • Reply with sensitive information
  • Click on URLs
  • …
Why study spearphishing?
• Victims are 4.5 times more likely to fall for spear phishing than for normal phishing [1].
• One of the main entry points for Advanced Persistent Threats.
• Causes losses worth millions.
[1] M. Jakobsson. Modeling and preventing phishing attacks. In Financial Cryptography, volume 5. Citeseer, 2005.
Spearphishing and social media
• Social media profiles can be a good source for the “context” part of spear phishing emails
• FBI warning on July 04, 2013¹
  • “…emails typically contain accurate information about victims obtained from data posted on social networking sites…”
¹ http://www.computerweekly.com/news/2240187487/FBI-warns-of-increased-spear-phishing-attacks
Data
• Emails
  • Spear phishing emails (Symantec)
  • Spam / phishing emails (Symantec)
  • Benign emails (Enron)
• LinkedIn profiles
  • Recipients of emails in the three datasets mentioned above
  • LinkedIn People Search API
Challenges (social features)
• Limited information about victim to identify her on social media
  • Only first name, last name, organization available from victim’s email ID
• Hard to find victim on Facebook, Twitter, Google+
  • Too many profiles with same first name, last name
  • Work field not searchable.
Challenges (social features) contd.
• LinkedIn: the only network which provides searching using the work field
• People Search API access is restricted.
  • We requested access under LinkedIn's Vetted API access scheme.
• Rate limited
  • Only 100 requests per day per app
Dataset
• Emails sent to employees of 14 international organizations
• SPEAR (Targeted spear phishing emails from Symantec)
  • 4,742 emails → 2,434 victims / LinkedIn profiles
• SPAM (Spam / phishing emails from Symantec)
  • 9,353 emails → 5,912 victims / LinkedIn profiles
• BENIGN (Sample from Enron email corpus)
  • 6,601 emails → 1,240 victims / LinkedIn profiles
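To get a feel for how the limit of 100 requests per day per app interacts with these recipient counts, here is a rough back-of-the-envelope sketch. It assumes exactly one People Search request per recipient and a single app; the slides do not say how many requests or apps were actually used, so the result is only a lower bound under those assumptions. The function name `min_days_for_lookups` is made up for illustration.

```python
import math

# Recipient counts taken from the Dataset slide.
RECIPIENTS = {"SPEAR": 2434, "SPAM": 5912, "BENIGN": 1240}
REQUESTS_PER_DAY = 100  # per-app limit quoted on the Challenges slide


def min_days_for_lookups(recipients, requests_per_day, num_apps=1):
    """Lower bound on days needed, ASSUMING one request per recipient."""
    total = sum(recipients.values())
    return math.ceil(total / (requests_per_day * num_apps))


total_lookups = sum(RECIPIENTS.values())
days_needed = min_days_for_lookups(RECIPIENTS, REQUESTS_PER_DAY)
print(total_lookups, days_needed)
```

Under these assumptions the three datasets need 9,586 lookups, which is about 96 days of quota for a single app.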
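Feature set creation
[Figure: feature set creation pipeline. Emails from the SPAM, SPEAR, and BENIGN datasets yield stylometric features. Each recipient email address yields firstName, lastName, and organization, which are sent to http://api.linkedin.com/v1/people-search to retrieve LinkedIn profile(s) and extract social features. The stylometric and social features together form the final feature vector.]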
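The first step of this pipeline, going from a recipient email address to the three search fields, might look like the sketch below. It assumes addresses of the form first.last@organization.tld, which the slides do not specify. The query parameter names (`first-name`, `last-name`, `company-name`) are also an assumption and are not taken from the slides. No request is sent; the code only builds the URL.

```python
from urllib.parse import urlencode

# Endpoint shown on the slide; the query parameter names used below are assumed.
PEOPLE_SEARCH_URL = "http://api.linkedin.com/v1/people-search"


def parse_recipient(email):
    """Split 'first.last@org.tld' into firstName, lastName, organization.

    Assumes a 'first.last' (or 'first_last') local part and treats the first
    label of the domain as the organization. Returns None when the address
    does not carry enough information to search on.
    """
    local, _, domain = email.strip().lower().partition("@")
    parts = [p for p in local.replace("_", ".").split(".") if p]
    if not domain or len(parts) < 2:
        return None
    return {
        "firstName": parts[0],
        "lastName": parts[-1],
        "organization": domain.split(".")[0],
    }


def build_search_url(fields):
    """Build a people-search URL from the parsed fields (nothing is sent)."""
    query = urlencode({
        "first-name": fields["firstName"],
        "last-name": fields["lastName"],
        "company-name": fields["organization"],
    })
    return f"{PEOPLE_SEARCH_URL}?{query}"


fields = parse_recipient("Jane.Doe@example.com")
url = build_search_url(fields)
```

Addresses such as admin@example.com return None here, which mirrors the challenge noted earlier: without a usable first name, last name, and organization, there is little to search on.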
Stylometric Features
• Subject based (7)
  • Num. words, Num. characters, Richness
  • Has words: “bank”, “verify”
  • isReply, isForwarded
• Attachment based (2)
  • Length of attachment name
  • Attachment size
• Body based (9)
  • Num. words, Num. characters, Num. unique words
  • Has words: “attach”, “suspension”, “verify your account”
  • Num. newlines, Richness, function words
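As an illustration, the seven subject-based features could be computed roughly as follows. This is a minimal sketch, not the authors' implementation. The slide does not define Richness, so it is assumed here to be the ratio of word count to character count. Detecting replies and forwards by the "Re:" and "Fw:" / "Fwd:" prefixes is likewise an assumption.

```python
def richness(text):
    """ASSUMED definition: number of words divided by number of characters."""
    return len(text.split()) / len(text) if text else 0.0


def subject_features(subject):
    """Return the seven subject-based features listed on the slide."""
    lowered = subject.lower()
    prefix = lowered.lstrip()
    return {
        "num_words": len(subject.split()),
        "num_chars": len(subject),
        "richness": richness(subject),
        "has_bank": "bank" in lowered,
        "has_verify": "verify" in lowered,
        # Assumed heuristics: reply / forward status read from the subject prefix.
        "is_reply": prefix.startswith("re:"),
        "is_forwarded": prefix.startswith(("fw:", "fwd:")),
    }


features = subject_features("Re: Please verify your bank details")
```

The body-based features (word, character, unique-word, and newline counts, plus keyword flags) could follow the same pattern on the message body.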
Social Features
• Location
• Connections
• Summary based (5)
  • Num. words, Num. characters, Num. unique words
  • Length, Richness
• Profession based (2)
  • Job Level (0-7)
  • Job Type (0-9)
Results (SPEAR v/s SPAM)
Feature Set (num. features)   Metric         Random Forest   J48 Decision Tree   Naïve Bayes
Subject (7)                   Accuracy (%)   83.91           83.10               58.87
                              FP Rate        0.208           0.227               0.371
Attachment (2)                Accuracy (%)   97.86           96.69               69.15
                              FP Rate        0.035           0.046               0.218
All email (9)                 Accuracy (%)   98.28           97.32               68.69
                              FP Rate        0.024           0.035               0.221
Social (9)                    Accuracy (%)   81.73           76.63               65.85
                              FP Rate        0.229           0.356               0.445
Email + Social (18)           Accuracy (%)   96.47           95.90               69.35
                              FP Rate        0.052           0.054               0.232
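For readers less familiar with the two metrics in this table, the sketch below shows how accuracy and false positive rate are conventionally computed for a binary task. The slides do not state how FP Rate was computed (for example, per class or averaged over classes), so the standard binary definition FP / (FP + TN) is an assumption. The labels and predictions are toy values, not data from the study.

```python
def accuracy_and_fp_rate(y_true, y_pred, positive="spear"):
    """Return (accuracy in percent, false positive rate) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    total = tp + fp + tn + fn
    accuracy = 100.0 * (tp + tn) / total if total else 0.0
    # FP rate: fraction of truly negative emails flagged as positive.
    fp_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return accuracy, fp_rate


# Toy example (made-up labels, for illustration only).
y_true = ["spear", "spear", "spam", "spam", "spam"]
y_pred = ["spear", "spam", "spam", "spear", "spam"]
acc, fpr = accuracy_and_fp_rate(y_true, y_pred)
```

In the toy example, three of five emails are classified correctly and one of three spam emails is wrongly flagged as spear phishing.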
Results (SPEAR v/s SPAM) contd.
• Most informative features
  • Attachment size
  • Length of attachment name
  • Subject Richness
  • No. of characters in subject
  • Location (from LinkedIn profile)
  • No. of words in subject
  • LinkedIn connections
  • …
Results (SPEAR v/s SPAM) contd.
SPEAR v/s SPAM subjects
[Figure: subjects of spam / phishing emails (left) and spear phishing emails (right)]
Results (SPEAR v/s BENIGN)
Feature Set (num. features)   Metric         Random Forest   J48 Decision Tree   Naïve Bayes
Subject (7)                   Accuracy (%)   81.19           81.11               61.75
                              FP Rate        0.210           0.217               0.489
Body (9)                      Accuracy (%)   97.17           95.62               53.81
                              FP Rate        0.031           0.048               0.338
All email (16)                Accuracy (%)   97.39           95.84               54.14
                              FP Rate        0.029           0.044               0.334
Social (9)                    Accuracy (%)   94.48           91.79               69.76
                              FP Rate        0.067           0.103               0.278
Email + Social (25)           Accuracy (%)   97.04           95.28               57.27
                              FP Rate        0.032           0.052               0.316
Results (SPEAR v/s BENIGN) contd.
• Most informative features
  • Body richness
  • No. of characters in body
  • No. of words in body
  • No. of unique words in body
  • Location (from LinkedIn)
  • No. of newlines in body
  • Subject richness
  • …
Results (SPEAR v/s SPAM + BENIGN)
Feature Set (num. features)   Metric         Random Forest   J48 Decision Tree   Naïve Bayes
Subject (7)                   Accuracy (%)   86.48           86.35               77.99
                              FP Rate        0.333           0.352               0.681
Social (9)                    Accuracy (%)   88.04           84.69               74.46
                              FP Rate        0.241           0.371               0.454
Email + Social (16)           Accuracy (%)   89.86           88.38               73.97
                              FP Rate        0.202           0.248               0.381
Results (SPEAR v/s SPAM + BENIGN) contd.
• Most informative features
  • Subject richness
  • No. of characters in subject
  • Location (from LinkedIn)
  • LinkedIn connections
  • No. of words in subject
  • Email forwarded? (True / false)
  • Email is a reply? (True / false)
  • …
Discussion
• Social features (from LinkedIn) did not help in distinguishing spear phishing emails from non spear phishing emails.
  • Stylometric features from emails suffice to do so.
• Real world scenarios may be much different
  • Attackers may use information from other sources / social networks, such as Facebook and Twitter.
• Dataset limitation
  • It is possible that no spear phishing emails in our dataset were crafted using LinkedIn features
  • We cannot conclude that such behavior would not be found outside our dataset, or in the future.
Thanks!
Prateek Dewan E: [email protected]
W: http://precog.iiitd.edu.in/people/prateek
Backup slides…
Results (SPEAR v/s SPAM) contd.
Attachment names
Results (SPEAR v/s BENIGN) contd.
← Benign emails
Spear phishing →
Attachment types
Details of organizations