ieee_isi_2016_presentation_short.pdf
PhishMonger: A Free and Open Source Public Archive of Real-World Phishing Websites

David G. Dobolyi & Ahmed Abbasi
Center for Business Analytics
McIntire School of Commerce
University of Virginia



PhishTank

Ephemerality

PhishTank API
• Indexes online, valid phishing sites
• Typically 25,000 to 50,000 sites per request
• Updated hourly
• Data provided in JSON, XML, CSV, and serialized PHP formats
• Phish details include:
    • phish_id and url
    • target
    • submission_time and verification_time
• http://www.phishtank.com/developer_info.php
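The record fields listed above can be illustrated with a small parsing sketch. It assumes a JSON feed shaped as a list of objects keyed by the field names on the slide. The sample records, the `PhishRecord` type, and the `parse_phish_feed` helper are hypothetical, and the ISO 8601 timestamp format is an assumption rather than something the slides state. It needs Python 3.7 or later for `datetime.fromisoformat`.

```python
import json
from collections import namedtuple
from datetime import datetime

# Hypothetical record type holding the fields named on the slide.
PhishRecord = namedtuple(
    "PhishRecord",
    "phish_id url target submission_time verification_time",
)

# Made-up sample data; a real feed would be downloaded, not embedded.
SAMPLE_FEED = """[
  {"phish_id": "1001", "url": "http://example.test/login", "target": "PayPal",
   "submission_time": "2016-09-23T10:15:00+00:00",
   "verification_time": "2016-09-23T10:45:00+00:00"},
  {"phish_id": "1002", "url": "http://example.test/verify", "target": "Apple",
   "submission_time": "2016-09-23T11:00:00+00:00",
   "verification_time": "2016-09-23T11:20:00+00:00"}
]"""


def parse_phish_feed(text):
    """Turn a JSON feed (assumed shape) into a list of PhishRecord tuples."""
    records = []
    for entry in json.loads(text):
        records.append(PhishRecord(
            phish_id=int(entry["phish_id"]),
            url=entry["url"],
            target=entry["target"],
            submission_time=datetime.fromisoformat(entry["submission_time"]),
            verification_time=datetime.fromisoformat(entry["verification_time"]),
        ))
    return records


records = parse_phish_feed(SAMPLE_FEED)
for record in records:
    # Time between submission and verification for each sample record.
    delay = record.verification_time - record.submission_time
    print(record.phish_id, record.target, delay)
```

Parsing the timestamps into `datetime` objects makes it straightforward to compare submission and verification times across records.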

PhishMonger
• Invokes the PhishTank API hourly
• Identifies newly added phish URLs
• Fetches the new phishing websites
• Adds the captured sites to a growing corpus
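A minimal sketch of the "identify new URLs, then fetch" steps, under stated assumptions: the slides do not show PhishMonger's code, so `select_new_phish` and `build_wget_command` are hypothetical helpers, the sample records are made up, and the particular GNU Wget options are one plausible choice rather than the authors' configuration. The command is only constructed here, not executed.

```python
def select_new_phish(records, seen_ids):
    """Return records whose phish_id has not been seen, marking them as seen."""
    new_records = []
    for record in records:
        if record["phish_id"] not in seen_ids:
            seen_ids.add(record["phish_id"])
            new_records.append(record)
    return new_records


def build_wget_command(url, dest_dir):
    """Build (but do not run) a wget call; the option choice is an assumption."""
    return [
        "wget",
        "--page-requisites",   # also fetch images, CSS, and scripts the page needs
        "--convert-links",     # rewrite links so the local copy is browsable
        "--timeout=30",        # give up on unresponsive hosts
        "--tries=2",           # phishing sites are short-lived; retry sparingly
        "--directory-prefix=" + dest_dir,
        url,
    ]


# Two simulated hourly runs over made-up feed contents.
seen_ids = set()
first_run = [
    {"phish_id": 1001, "url": "http://example.test/a"},
    {"phish_id": 1002, "url": "http://example.test/b"},
]
second_run = first_run + [{"phish_id": 1003, "url": "http://example.test/c"}]

new_first = select_new_phish(first_run, seen_ids)
new_second = select_new_phish(second_run, seen_ids)

commands = [
    build_wget_command(r["url"], "corpus/{}".format(r["phish_id"]))
    for r in new_second
]
```

Keeping the set of seen identifiers between runs means each hourly pass only fetches sites that were not in any earlier feed. The resulting argument lists could be handed to `subprocess.run` in a real collector.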

About the Platform
• Leverages exclusively open-source software:
    • Ubuntu Linux, GNU Wget, FileZilla Server FTP
• Coded in Python 3.5
• Harnesses the Twisted library for time-based scheduling
• Runs on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)
• Additional statistical scripts written in R
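The time-based scheduling mentioned above can be sketched with the standard library alone. This is a stand-in, not the authors' code: the slides name Twisted, where `twisted.internet.task.LoopingCall` provides this kind of repeated call, while the hypothetical `run_periodically` helper below uses a plain loop so the example runs without third-party packages. The clock and sleep functions are injectable so the timing can be checked without waiting.

```python
import time


def run_periodically(job, interval_seconds, max_runs,
                     sleep=time.sleep, clock=time.monotonic):
    """Call job(run_number) max_runs times, interval_seconds apart.

    Each run is scheduled relative to the previous due time rather than
    to when the job finished, so slow jobs do not push later runs back.
    """
    next_due = clock()
    for run_number in range(1, max_runs + 1):
        job(run_number)
        next_due += interval_seconds
        delay = next_due - clock()
        # No need to wait after the final run.
        if delay > 0 and run_number < max_runs:
            sleep(delay)


def collect_once(run_number):
    """Placeholder for one collection pass (feed download, diff, fetch)."""
    print("collection run", run_number)


if __name__ == "__main__":
    # Short interval for demonstration; the slides describe an hourly cycle,
    # which would correspond to interval_seconds=3600.
    run_periodically(collect_once, interval_seconds=0.01, max_runs=3)
```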

System Diagram
[Figure: system architecture diagram. Labeled components: Amazon Elastic Compute Cloud (EC2) instance; Ubuntu virtual machine; virtual machine swap disk; Elastic Block Store (EBS); phishing storage; archive storage; FTP server; PhishMonger application; Twisted looping call; PhishMonger live site collector; GNU Wget; developer API; Internet.]

PhishMonger Corpus
• As of September 23, 2016 and run 3649, the collection includes:
    • 171,360 sites
    • 129 targeted brands
    • 19,690,341 files and folders
    • 200 GB of compressed storage

Targeted Brands with 200+ Sites
[Figure: frequency (y-axis, ticks from 0 to 6,000) by targeted brand (x-axis): AOL, Apple, Dropbox, eBay, Facebook, Google, IRS, JPMorgan Chase, Microsoft, PayPal, USAA, Wells Fargo, Yahoo.]

Top 6 Targeted Brands Over Time
[Figure: frequency (y-axis, ticks from 0 to 900) by date (x-axis, Nov 2015 to Aug 2016) for AOL, Apple, Facebook, IRS, PayPal, and USAA.]

Top 10 File Types in Corpus

File Extension  Description                       Category          n
png             Portable Network Graphics         Graphics  3,169,772
html            HyperText Markup Language         Text      1,304,574
jpg             Joint Photographic Experts Group  Graphics  1,251,227
gif             Graphics Interchange Format       Graphics  1,208,424
js              JavaScript Code                   Text        947,420
css             Cascading Style Sheet             Text        776,404
ttf             True Type Font                    Font        210,197
svg             Scalable Vector Graphics          Graphics    176,856
ico             Icon                              Graphics    139,564
woff            Web Open Font Format              Font        134,308
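The counts in the table can be rolled up by category with a few lines of Python. This is an illustrative sketch only: the figures are copied from the slide, the `totals_by_category` helper is hypothetical, and the shares are relative to these ten file types, not to the whole corpus. The slides mention R for the statistical scripts; Python is used here for consistency with the collector.

```python
from collections import defaultdict

# (extension, category, count) rows copied from the slide's table.
FILE_TYPES = [
    ("png", "Graphics", 3169772),
    ("html", "Text", 1304574),
    ("jpg", "Graphics", 1251227),
    ("gif", "Graphics", 1208424),
    ("js", "Text", 947420),
    ("css", "Text", 776404),
    ("ttf", "Font", 210197),
    ("svg", "Graphics", 176856),
    ("ico", "Graphics", 139564),
    ("woff", "Font", 134308),
]


def totals_by_category(rows):
    """Sum file counts per category."""
    totals = defaultdict(int)
    for _extension, category, count in rows:
        totals[category] += count
    return dict(totals)


totals = totals_by_category(FILE_TYPES)
top10_total = sum(totals.values())

# Share of each category among the ten listed file types.
shares = {category: count / top10_total for category, count in totals.items()}

for category, count in sorted(totals.items(), key=lambda item: -item[1]):
    print("{:<9}{:>10,}  {:.1%}".format(category, count, shares[category]))
```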

Get the Data
• Available at the University of Arizona Artificial Intelligence Lab: http://www.azsecure-data.org/phishing-websites.html
• Contact:
    • David G. Dobolyi, [email protected]
    • Ahmed Abbasi, [email protected]
• Acknowledgements:
    • US National Science Foundation ACI-1443019 “DIBBs for Intelligence and Security Informatics Research Community”
    • Data Infrastructure Building Blocks (DIBBs) partner institutions: University of Arizona’s AI Lab and collaborators at Drexel University, the University of Texas-Dallas, and the University of Utah

Miscellaneous Slides

Number of Targeted Brands and Percentage of Verified Phish in PhishTank Database