PhishMonger: A Free and Open Source Public Archive of Real-World Phishing Websites
David G. Dobolyi & Ahmed Abbasi
Center for Business Analytics
McIntire School of Commerce
University of Virginia
PhishTank API
• Indexes online, valid phishing sites
• Typically 25,000 to 50,000 sites per request
• Updated hourly
• Data provided in JSON, XML, CSV, and serialized PHP formats
• Phish details include:
  • phish_id and url
  • target
  • submission_time and verification_time
• http://www.phishtank.com/developer_info.php
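A minimal sketch of reading the phish details listed above from a JSON download. The sample payload is invented for illustration. The sketch assumes each record is a flat JSON object using the field names on the slide and that timestamps are ISO 8601 strings; neither is a confirmed schema. It also uses `datetime.fromisoformat`, which needs Python 3.7 or later.

```python
import json
from datetime import datetime
from typing import NamedTuple


class Phish(NamedTuple):
    phish_id: int
    url: str
    target: str
    submission_time: datetime
    verification_time: datetime


# Hypothetical sample in the shape the slide describes; not real PhishTank data.
SAMPLE_FEED = """
[
  {"phish_id": "1001", "url": "http://example.test/login", "target": "PayPal",
   "submission_time": "2016-09-23T10:15:00+00:00",
   "verification_time": "2016-09-23T10:42:00+00:00"}
]
"""


def parse_feed(text):
    """Convert a JSON feed (a list of objects) into typed Phish records."""
    records = []
    for item in json.loads(text):
        records.append(Phish(
            phish_id=int(item["phish_id"]),
            url=item["url"],
            target=item["target"],
            submission_time=datetime.fromisoformat(item["submission_time"]),
            verification_time=datetime.fromisoformat(item["verification_time"]),
        ))
    return records


if __name__ == "__main__":
    for phish in parse_feed(SAMPLE_FEED):
        delay = phish.verification_time - phish.submission_time
        print(phish.phish_id, phish.target, "verified after", delay)
```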
PhishMonger
• Invokes the PhishTank API hourly
• Identifies newly added phish URLs
• Fetches the new phishing websites
• Adds the captured sites to a growing corpus
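The "identify newly added URLs" step can be sketched as a set-membership check on phish_id. This is one plausible approach, not the authors' confirmed implementation. The names `find_new_phish` and `run_once`, the in-memory `seen_ids` set, and the `fetch` callback are all hypothetical.

```python
def find_new_phish(feed, seen_ids):
    """Return feed entries whose phish_id is not in seen_ids (no mutation)."""
    new_entries = []
    queued = set()  # guards against duplicates inside a single feed
    for entry in feed:
        pid = entry["phish_id"]
        if pid not in seen_ids and pid not in queued:
            queued.add(pid)
            new_entries.append(entry)
    return new_entries


def run_once(feed, seen_ids, fetch):
    """Fetch each new entry; record its id only if the fetch succeeded."""
    captured = []
    for entry in find_new_phish(feed, seen_ids):
        if fetch(entry["url"]):
            seen_ids.add(entry["phish_id"])
            captured.append(entry["phish_id"])
    return captured


if __name__ == "__main__":
    seen = {1}
    hourly_feed = [
        {"phish_id": 1, "url": "http://example.test/a"},
        {"phish_id": 2, "url": "http://example.test/b"},
    ]
    # A stand-in fetcher that always succeeds.
    print(run_once(hourly_feed, seen, fetch=lambda url: True))
```

Recording an id only after a successful fetch means a site that fails once is retried on the next run, at the cost of repeated attempts against sites that have already gone offline.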
About the Platform
• Leverages exclusively open-source software:
  • Ubuntu Linux, GNU Wget, FileZilla Server FTP
• Coded in Python 3.5
  • Harnesses the Twisted library for time-based scheduling
• Runs on Amazon Web Services (AWS) Elastic Compute Cloud (EC2)
• Additional statistical scripts written in R
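A rough sketch of how Wget and Twisted might fit together for hourly collection. The slides do not state which Wget options PhishMonger uses, so the flags below are assumptions chosen to capture a page with its assets. The function names and default values are hypothetical.

```python
import subprocess


def build_wget_command(url, dest_dir, timeout=30, tries=2):
    """Build a GNU Wget argv list that saves a page and its assets."""
    return [
        "wget",
        "--page-requisites",    # also fetch images, CSS, and scripts
        "--adjust-extension",   # save HTML/CSS with matching file suffixes
        "--convert-links",      # rewrite links so the copy browses offline
        "--no-verbose",
        "--timeout={}".format(timeout),
        "--tries={}".format(tries),
        "--directory-prefix={}".format(dest_dir),
        url,
    ]


def fetch_site(url, dest_dir):
    """Run Wget for one URL; True only when Wget exits with status 0."""
    # Strict check: Wget returns non-zero if any requested asset failed.
    result = subprocess.run(build_wget_command(url, dest_dir))
    return result.returncode == 0


def run_hourly(job, interval_seconds=3600):
    """Call job() immediately, then every interval_seconds, using Twisted."""
    # Imported here so the helpers above work without Twisted installed.
    from twisted.internet import reactor, task

    loop = task.LoopingCall(job)
    deferred = loop.start(interval_seconds)  # now=True by default
    # The loop stops if job() raises; print the traceback when that happens.
    deferred.addErrback(lambda failure: failure.printTraceback())
    reactor.run()
```

Passing the arguments as a list rather than a shell string keeps untrusted phishing URLs from being interpreted by a shell.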
System Diagram
[Diagram labels: Amazon Elastic Compute (EC2); FTP Server; Instance; Virtual Machine; Virtual Machine Swapdisk; Phishing Storage; Archive Storage; Ubuntu Virtual Machine; Twisted Looping Call; PhishMonger Live Site Collector; PhishMonger Application; Elastic Block Store (EBS); GNU Wget; Internet; Developer API]
PhishMonger Corpus
• As of September 23, 2016 and run 3649, the collection includes:
  • 171,360 sites
  • 129 targeted brands
  • 19,690,341 files and folders
  • 200 GB of compressed storage
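For a sense of scale, the corpus totals above imply the per-site averages computed below. The figures are derived only from the numbers on the slide, and the storage average assumes 1 GB = 10^9 bytes.

```python
SITES = 171_360
FILES_AND_FOLDERS = 19_690_341
COMPRESSED_BYTES = 200 * 10**9  # assumes decimal gigabytes

files_per_site = FILES_AND_FOLDERS / SITES
megabytes_per_site = COMPRESSED_BYTES / SITES / 10**6

print(round(files_per_site, 1))      # 114.9 files and folders per site
print(round(megabytes_per_site, 2))  # 1.17 MB compressed per site
```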
Targeted Brands with 200+ Sites
[Figure: Frequency (axis 0 to 6000) by Targeted Brand: AOL, Apple, Dropbox, eBay, Facebook, Google, IRS, JPMorgan Chase, Microsoft, PayPal, USAA, Wells Fargo, Yahoo]
Top 6 Targeted Brands Over Time
[Figure: Frequency (axis 0 to 900) by Date, Nov 2015 to Aug 2016, for AOL, Apple, Facebook, IRS, PayPal, and USAA]
Top 10 File Types in Corpus
File Extension   Description                        Category   n
png              Portable Network Graphics          Graphics   3,169,772
html             HyperText Markup Language          Text       1,304,574
jpg              Joint Photographic Experts Group   Graphics   1,251,227
gif              Graphics Interchange Format        Graphics   1,208,424
js               JavaScript Code                    Text         947,420
css              Cascading Style Sheet              Text         776,404
ttf              True Type Font                     Font         210,197
svg              Scalable Vector Graphics           Graphics     176,856
ico              Icon                               Graphics     139,564
woff             Web Open Font Format               Font         134,308
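A tally like the one above could be produced by walking the corpus directory and counting file suffixes. The slides say the statistical scripts were written in R, so this Python version is only an illustrative substitute. The function name and the "(none)" label for files without an extension are hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Count files under root by lowercased extension, without the dot."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            ext = path.suffix.lower().lstrip(".")
            counts[ext or "(none)"] += 1
    return counts


if __name__ == "__main__":
    # Print the ten most common extensions under the current directory.
    for ext, n in count_extensions(".").most_common(10):
        print("{:<8}{:>12,}".format(ext, n))
```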
Get the Data
• Available at University of Arizona Artificial Intelligence Lab: http://www.azsecure-data.org/phishing-websites.html
• Contact:
  • David G. Dobolyi, [email protected]
  • Ahmed Abbasi, [email protected]
• Acknowledgements:
  • US National Science Foundation ACI-1443019 “DIBBs for Intelligence and Security Informatics Research Community”
  • Data Infrastructure Building Blocks (DIBBs) partner institutions: University of Arizona’s AI Lab and collaborators at Drexel University, the University of Texas-Dallas, and the University of Utah