
Unifying the Global Response to Cybercrime

Analyzing Social and Stylometric Features to Identify Spearphishing Emails

Prateek Dewan, Anand Kashyap, Ponnurangam Kumaraguru

Indraprastha Institute of Information Technology – Delhi (IIITD), India


Overview

•  What is spearphishing?

•  Spearphishing and Online Social Media

•  Challenges and dataset

•  Feature extraction

•  Classification results

•  Discussion



What is spearphishing?

•  Targeted phishing attack

•  Contains contextual content instead of random messages

•  Harder to detect, since spearphishing emails look more genuine

•  Victims are asked to
   •  Download malicious attachments
   •  Reply with sensitive information
   •  Click on URLs
   •  …



Why study spearphishing?

•  Victims are 4.5 times more likely to fall for spear phishing than for normal phishing [1].

•  One of the main entry points for Advanced Persistent Threats.

•  Causes losses worth millions.

[1] M. Jakobsson. Modeling and preventing phishing attacks. In Financial Cryptography, volume 5. Citeseer, 2005.



Spearphishing and social media

•  Social media profiles can be a good source for the “context” part of spear phishing emails

•  FBI warning on July 04, 2013 ¹
   •  “…emails typically contain accurate information about victims obtained from data posted on social networking sites…”

¹ http://www.computerweekly.com/news/2240187487/FBI-warns-of-increased-spear-phishing-attacks



Data

•  Emails
   •  Spear phishing emails (Symantec)
   •  Spam / phishing emails (Symantec)
   •  Benign emails (Enron)

•  LinkedIn profiles
   •  Recipients of emails in the three datasets mentioned above
   •  LinkedIn People Search API



Challenges (social features)

•  Limited information about the victim to identify her on social media
   •  Only first name, last name, and organization are available from the victim’s email ID

•  Hard to find the victim on Facebook, Twitter, Google+
   •  Too many profiles with the same first name and last name
   •  Work field not searchable



Challenges (social features) contd.

•  LinkedIn: the only network that supports searching by work field

•  People Search API access is restricted
   •  We requested access under their Vetted API access scheme

•  Rate limited
   •  Only 100 requests per day per app



Dataset

•  Emails sent to employees of 14 international organizations

•  SPEAR (Targeted spear phishing emails from Symantec)
   •  4,742 emails → 2,434 victims / LinkedIn profiles

•  SPAM (Spam / phishing emails from Symantec)
   •  9,353 emails → 5,912 victims / LinkedIn profiles

•  BENIGN (Sample from Enron email corpus)
   •  6,601 emails → 1,240 victims / LinkedIn profiles
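
To give a sense of what the 100 requests per day per app limit means at this scale, the following is a rough back-of-the-envelope sketch. It assumes exactly one People Search request per retrieved profile and no overlap between the three recipient sets; neither assumption is stated in the slides, so the result is only a lower bound on collection time. The function name `min_days` and the `apps` parameter are illustrative.

```python
import math

# Victim / LinkedIn profile counts taken from the dataset slide.
PROFILES = {"SPEAR": 2434, "SPAM": 5912, "BENIGN": 1240}

# Rate limit from the challenges slide: 100 requests per day per app.
REQUESTS_PER_DAY = 100


def min_days(n_lookups, per_day=REQUESTS_PER_DAY, apps=1):
    """Lower bound on days needed, assuming one request per profile."""
    return math.ceil(n_lookups / (per_day * apps))


total_profiles = sum(PROFILES.values())
print(total_profiles, "profiles,", min_days(total_profiles), "days with one app")
```

Failed or ambiguous searches would only add to this figure.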



Feature set creation

[Figure: feature extraction pipeline]

•  Email datasets (SPAM, SPEAR, BENIGN) → stylometric features from emails

•  Recipient email address → (1) firstName, (2) lastName, (3) organization → http://api.linkedin.com/v1/people-search → LinkedIn profile(s) → social features from LinkedIn

•  Stylometric features + social features → final feature vector
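
The first step of this pipeline, deriving the three search fields from a recipient address, could look roughly like the sketch below. It assumes addresses of the form `first.last@organization.tld`, which the slides do not specify; the function name `name_and_org_from_email` is hypothetical, and no LinkedIn call is made.

```python
import re


def name_and_org_from_email(address):
    """Guess firstName, lastName and organization from an email address.

    Assumes a 'first.last@organization.tld' layout (an assumption, not
    something the slides state). Returns None when the address does not
    fit that layout.
    """
    local, _, domain = address.partition("@")
    if not local or not domain:
        return None
    # Split the local part on dots or underscores.
    parts = [p for p in re.split(r"[._]", local) if p]
    if len(parts) < 2:
        return None  # e.g. 'jdoe@...': name cannot be split reliably
    # Simplification: treat the first domain label as the organization.
    organization = domain.split(".")[0]
    return {
        "firstName": parts[0].capitalize(),
        "lastName": parts[-1].capitalize(),
        "organization": organization.capitalize(),
    }


print(name_and_org_from_email("jane.doe@example.com"))
```

Addresses that do not split cleanly return `None`, which mirrors the limited-information challenge noted earlier: such recipients cannot be searched for at all.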



Stylometric Features

•  Subject based (7)
   •  Num. words, Num. characters, Richness
   •  Has words: “bank”, “verify”
   •  isReply, isForwarded

•  Attachment based (2)
   •  Length of attachment name
   •  Attachment size

•  Body based (9)
   •  Num. words, Num. characters, Num. unique words
   •  Has words: “attach”, “suspension”, “verify your account”
   •  Num. newlines, Richness, function words
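
The subject and body features listed above could be computed along the following lines. This is a minimal sketch under several assumptions: richness is taken to be the ratio of words to characters (the slides do not define it), reply / forward detection relies on the usual `Re:` / `Fw:` / `Fwd:` subject prefixes, and `FUNCTION_WORDS` is a tiny illustrative list rather than the one used in the study.

```python
import string

# Illustrative only; the actual function-word list is not given in the slides.
FUNCTION_WORDS = {"a", "an", "the", "of", "to", "in", "and", "or", "for", "your"}


def richness(text):
    """Assumed definition: number of words divided by number of characters."""
    n_chars = len(text)
    return len(text.split()) / n_chars if n_chars else 0.0


def subject_features(subject):
    """The 7 subject-based features."""
    lowered = subject.lower().strip()
    return {
        "num_words": len(subject.split()),
        "num_chars": len(subject),
        "richness": richness(subject),
        "has_bank": "bank" in lowered,      # substring match, so 'banking' counts
        "has_verify": "verify" in lowered,
        "is_reply": lowered.startswith("re:"),
        "is_forwarded": lowered.startswith(("fw:", "fwd:")),
    }


def body_features(body):
    """The 9 body-based features."""
    lowered = body.lower()
    # Strip surrounding punctuation so 'file.' and 'file' count as one word.
    words = [w.strip(string.punctuation) for w in lowered.split()]
    words = [w for w in words if w]
    return {
        "num_words": len(words),
        "num_chars": len(body),
        "num_unique_words": len(set(words)),
        "has_attach": "attach" in lowered,
        "has_suspension": "suspension" in lowered,
        "has_verify_your_account": "verify your account" in lowered,
        "num_newlines": body.count("\n"),
        "richness": richness(body),
        "num_function_words": sum(w in FUNCTION_WORDS for w in words),
    }


print(subject_features("RE: Please verify your bank details"))
```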



Social Features

•  Location

•  Connections

•  Summary based (5)
   •  Num. words, Num. characters, Num. unique words
   •  Length, Richness

•  Profession based (2)
   •  Job Level (0-7)
   •  Job Type (0-9)



Results (SPEAR v/s SPAM)

Feature Set (num. features)   Metric         Random Forest   J48 Decision Tree   Naïve Bayes
Subject (7)                   Accuracy (%)   83.91           83.10               58.87
                              FP Rate        0.208           0.227               0.371
Attachment (2)                Accuracy (%)   97.86           96.69               69.15
                              FP Rate        0.035           0.046               0.218
All email (9)                 Accuracy (%)   98.28           97.32               68.69
                              FP Rate        0.024           0.035               0.221
Social (9)                    Accuracy (%)   81.73           76.63               65.85
                              FP Rate        0.229           0.356               0.445
Email + Social (18)           Accuracy (%)   96.47           95.90               69.35
                              FP Rate        0.052           0.054               0.232
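
For readers unfamiliar with the two metrics reported in these results, the sketch below shows how accuracy and false positive rate are conventionally derived from predictions. It assumes SPEAR is treated as the positive class and uses made-up labels; the slides do not say which class was positive or whether the reported FP rate is averaged across classes, so this illustrates the standard definitions only, not the exact evaluation setup.

```python
def confusion_counts(y_true, y_pred, positive="SPEAR"):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def accuracy_pct(tp, fp, tn, fn):
    """Percentage of all emails that were classified correctly."""
    total = tp + fp + tn + fn
    return 100.0 * (tp + tn) / total if total else 0.0


def fp_rate(fp, tn):
    """Fraction of actual negatives that were wrongly flagged as positive."""
    negatives = fp + tn
    return fp / negatives if negatives else 0.0


# Hypothetical labels, for illustration only.
y_true = ["SPEAR", "SPEAR", "SPAM", "SPAM", "SPAM"]
y_pred = ["SPEAR", "SPAM", "SPEAR", "SPAM", "SPAM"]
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(accuracy_pct(tp, fp, tn, fn), fp_rate(fp, tn))
```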



Results (SPEAR v/s SPAM) contd.

•  Most informative features
   •  Attachment size
   •  Length of attachment name
   •  Subject Richness
   •  No. of characters in subject
   •  Location (from LinkedIn profile)
   •  No. of words in subject
   •  LinkedIn connections
   •  …
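
The slides do not say how "most informative" was measured. One common way to rank features is information gain, sketched below purely as an illustration of that technique; treating it as the measure behind this ranking is an assumption. The example uses made-up labels and a discrete feature; continuous features such as attachment size would first need to be binned.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical data, for illustration only.
labels = ["SPEAR", "SPEAR", "SPAM", "SPAM"]
has_attachment = [1, 1, 0, 0]   # separates the classes perfectly
is_reply = [1, 0, 1, 0]         # carries no information about the class
print(information_gain(has_attachment, labels),
      information_gain(is_reply, labels))
```

Ranking every feature by this score and sorting in descending order would produce a list of the kind shown on this slide.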



Results (SPEAR v/s SPAM) contd.


SPEAR v/s SPAM subjects

[Figure: subjects of spam / phishing emails (left) and spear phishing emails (right)]



Results (SPEAR v/s BENIGN)

Feature Set (num. features)   Metric         Random Forest   J48 Decision Tree   Naïve Bayes
[missing]                     Accuracy (%)   [missing]       [missing]           [missing]
                              FP Rate        0.210           0.217               0.489
Body (9)                      Accuracy (%)   97.17           95.62               53.81
                              FP Rate        0.031           0.048               0.338
All email (16)                Accuracy (%)   97.39           95.84               54.14
                              FP Rate        0.029           0.044               0.334
Social (9)                    Accuracy (%)   94.48           91.79               69.76
                              FP Rate        0.067           0.103               0.278
[missing]                     Accuracy (%)   [missing]       [missing]           [missing]
                              FP Rate        0.032           0.052               0.316



Results (SPEAR v/s BENIGN) contd.

•  Most informative features
   •  Body richness
   •  No. of characters in body
   •  No. of words in body
   •  No. of unique words in body
   •  Location (from LinkedIn)
   •  No. of newlines in body
   •  Subject richness
   •  …



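Results (SPEAR v/s SPAM + BENIGN)

Feature Set (num. features)   Metric         Random Forest   J48 Decision Tree   Naïve Bayes
[missing]                     Accuracy (%)   [missing]       [missing]           [missing]
                              FP Rate        0.333           0.352               0.681
Social (9)                    Accuracy (%)   88.04           84.69               74.46
                              FP Rate        0.241           0.371               0.454
[missing]                     Accuracy (%)   [missing]       [missing]           [missing]
                              FP Rate        0.202           0.248               0.381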



Results (SPEAR v/s SPAM + BENIGN) contd.

•  Most informative features
   •  Subject richness
   •  No. of characters in subject
   •  Location (from LinkedIn)
   •  LinkedIn connections
   •  No. of words in subject
   •  Email forwarded? (True / false)
   •  Email is a reply? (True / false)
   •  …



Discussion

•  Social features (from LinkedIn) did not help in distinguishing spear phishing emails from non spear phishing emails.
   •  Stylometric features from emails suffice to do so.

•  Real world scenarios may be very different
   •  Attackers may use information from other sources / social networks, viz. Facebook, Twitter, etc.

•  Dataset limitation
   •  It is possible that no spear phishing mails in our dataset were crafted using LinkedIn features
   •  We cannot conclude that such behavior would not be found outside our dataset, or in the future.



Thanks!

Prateek Dewan E: [email protected]

W: http://precog.iiitd.edu.in/people/prateek



Backup slides…


Results (SPEAR v/s SPAM) contd.


Attachment names

Results (SPEAR v/s BENIGN) contd.

[Figure: benign emails (left) and spear phishing emails (right)]


Attachment types


Details of organizations
