Wrangle 2016: Malware Tracking at Scale

Post on 16-Apr-2017

© 2016 Cloudera, Inc. All rights reserved.

Malware Tracking at Scale

About me
• Michael Bentley
• Formerly Director of Research and Response @ Lookout
• Currently working on data mining projects
• KK6WCN
• michael@setnorth.com

Agenda
• What we are trying to accomplish
• How basic heuristics work
• Where basic heuristics don’t work
• Tracking with pairwise similarity and EMR
• Visualizations to help extract more information
• Mistakes and caveats

What are we trying to accomplish
• Searching for major versions of software (malware)
• Find ways to detect it with simple heuristics
• Find ways to track it
• Dataset discovery

Simple heuristics
• Detect on static data
• Detect on analysis stack created metadata

[Diagram labels: applications, acquisition, analysis; hashes, strings, who signed it / certificate]

Simple heuristics - hashes

[Diagram labels: APK file, hashes, icon, dex file]
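
As a rough illustration of hash-based detection, the sketch below hashes a whole APK and selected files inside it. It assumes only that an APK is a ZIP archive whose bytecode lives in classes.dex; the icon path res/icon.png, the helper names, and the choice of SHA-1 for the inner files are hypothetical and not taken from the talk.

```python
import hashlib
import io
import zipfile


def sha1_hex(data: bytes) -> str:
    """Return the SHA-1 digest of a byte string as a hex string."""
    return hashlib.sha1(data).hexdigest()


def apk_hashes(apk_bytes: bytes, member_names=("classes.dex",)) -> dict:
    """Hash the whole APK plus any of the named archive members that exist."""
    hashes = {"apk": sha1_hex(apk_bytes)}
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        present = set(archive.namelist())
        for name in member_names:
            if name in present:
                hashes[name] = sha1_hex(archive.read(name))
    return hashes


def build_demo_apk(dex_payload: bytes, icon_payload: bytes) -> bytes:
    """Build a tiny stand-in archive for the demo (not a real APK)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("classes.dex", dex_payload)
        archive.writestr("res/icon.png", icon_payload)  # hypothetical icon path
    return buffer.getvalue()


# Two repackaged samples: same code, different icon.
first = apk_hashes(build_demo_apk(b"same bytecode", b"icon A"))
second = apk_hashes(build_demo_apk(b"same bytecode", b"icon B"))
print(first["apk"] == second["apk"])                  # whole-file hashes differ
print(first["classes.dex"] == second["classes.dex"])  # dex hashes still match
```

Hashing the components separately is what lets a repackaged sample still match on its dex file even when the whole-file hash changes.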

Simple heuristics - string detection
• Nice ASCII string delimited by null bytes
• Malicious class path
• Byte code
• Exact match in one or both directions of string
• Ctrl + F

[Figure label: null byte]
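
A minimal sketch of the string heuristic: pull out printable ASCII runs that end in a null byte, then do an exact match against a known-bad class path. The minimum length of 4, the regular expression, and the class path Lcom/example/evil/Payload; are assumptions made for illustration.

```python
import re

# Runs of 4 or more printable ASCII bytes that are terminated by a null byte.
NULL_TERMINATED_ASCII = re.compile(rb"([\x20-\x7e]{4,})\x00")


def extract_strings(blob: bytes) -> list:
    """Return the printable ASCII strings in blob that end with a null byte."""
    return [match.decode("ascii") for match in NULL_TERMINATED_ASCII.findall(blob)]


def has_exact_string(blob: bytes, signature: str) -> bool:
    """Exact-match heuristic: the signature must equal one extracted string."""
    return signature in extract_strings(blob)


# Hypothetical sample bytes containing one class path between null bytes.
sample = b"\x00Lcom/example/evil/Payload;\x00\x01\x02abc\x00"
print(extract_strings(sample))
print(has_exact_string(sample, "Lcom/example/evil/Payload;"))
```

The "Ctrl + F" version of the same check is just a raw substring search, `signature.encode("ascii") in blob`, which is cheaper but also matches inside longer strings.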

Simple heuristics - certificates
• Same malware
• Different certificates

Where simple heuristics are good
• Good for things that don’t change
• Computationally cheap
• About the same scenario for network (IDS) or application inspection (malware detection)

Where it’s problematic
• Anything with funding/making money.
• Malware created in Eastern Europe, Asia, Italy (Hacking Team)
• Mass creation of certificates
• Code taken from Stack Overflow
• Anything with basic string obfuscation
• Hunting for new major versions
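
To make the string-obfuscation point concrete, here is a small sketch in which a single-byte XOR hides a class path from an exact-match search. Single-byte XOR is only an assumed stand-in for "basic string obfuscation", and the key 0x5A and the class path are made up for the example.

```python
def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte with a one-byte key; applying it twice restores the input."""
    return bytes(value ^ key for value in data)


signature = b"Lcom/example/evil/Payload;"  # hypothetical class path
plain_sample = b"\x00" + signature + b"\x00"
obfuscated_sample = b"\x00" + xor_bytes(signature, 0x5A) + b"\x00"

print(signature in plain_sample)       # the exact-match heuristic fires
print(signature in obfuscated_sample)  # same string, but the heuristic misses it
print(xor_bytes(xor_bytes(signature, 0x5A), 0x5A) == signature)
```

The malware only needs to undo the XOR at runtime, so one trivial transformation is enough to invalidate a string signature.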

Enter pairwise similarity
You’re about to see a spreadsheet at a big data conference

http://gunshowcomic.com/648

Application pairwise similarity
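
The slide itself shows a spreadsheet of scores. As a hedged sketch of how such a table could be produced, the code below compares every pair of applications once. The talk does not say which metric produced its scores, so Jaccard similarity over per-app feature sets is an assumption, as are the app names and features.

```python
from itertools import combinations


def jaccard(left: set, right: set) -> float:
    """Share of features two applications have in common (0.0 to 1.0)."""
    if not left and not right:
        return 1.0
    return len(left & right) / len(left | right)


def pairwise_similarity(features: dict) -> dict:
    """Score every unordered pair of app ids once, keyed as (id_a, id_b)."""
    scores = {}
    for left, right in combinations(sorted(features), 2):
        scores[(left, right)] = jaccard(features[left], features[right])
    return scores


def to_csv_rows(scores: dict) -> list:
    """Flatten the scores into spreadsheet-style rows: id_a,id_b,score."""
    return [f"{a},{b},{score:.2f}" for (a, b), score in sorted(scores.items())]


# Hypothetical feature sets (for example, class names seen in each app).
demo_features = {
    "app_a": {"Net", "Sms", "Boot", "Crypto"},
    "app_b": {"Net", "Sms", "Boot", "Ads"},
    "app_c": {"Game", "Audio"},
}
for row in to_csv_rows(pairwise_similarity(demo_features)):
    print(row)
```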

Going from "pick one app and rescan the corpus"

Pick one application – Rescan corpus
• Examine one app
• Find heuristic
• Rescan corpus
• Rinse repeat ad infinitum
• Throw people at the problem

http://bit.ly/2a0zcZR

Decoding what you already have
• Pairwise similarity defines the relationships for us
• Dots represent unique (SHA1) applications
• Colors represent major versions of malware
• Each color is within ~85% match of code distance
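
One way to turn pairwise scores into the colored groups described above is to link every pair at or above the threshold and treat each connected group as one "color". This is a sketch under assumptions: the union-find approach, the function name, and the sample scores are illustrative, and apps that match nothing are labeled cluster 0, following the talk's later clustering slide.

```python
def cluster_by_threshold(app_ids, scores, threshold=0.85):
    """Group apps linked by any chain of pairs scoring >= threshold."""
    parent = {app: app for app in app_ids}

    def find(app):
        # Walk up to the group's representative, shortening the path as we go.
        while parent[app] != app:
            parent[app] = parent[parent[app]]
            app = parent[app]
        return app

    for (left, right), score in scores.items():
        if score >= threshold:
            parent[find(left)] = find(right)

    groups = {}
    for app in app_ids:
        groups.setdefault(find(app), []).append(app)

    labels = {}
    next_label = 1
    for members in groups.values():
        if len(members) == 1:
            labels[members[0]] = 0  # no match at the threshold
        else:
            for app in members:
                labels[app] = next_label
            next_label += 1
    return labels


# Hypothetical scores; note that a and c only reach each other through b.
demo_scores = {("a", "b"): 0.90, ("b", "c"): 0.86, ("a", "c"): 0.70, ("d", "e"): 0.50}
print(cluster_by_threshold(["a", "b", "c", "d", "e"], demo_scores))
```

Because the grouping is transitive, two apps can share a cluster without being 85% similar to each other directly. The `connected_components` function in NetworkX produces the same grouping from a thresholded graph.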

Clustering and intelligence

[Diagram: APK nodes. Nearest neighbor: 95% similar. Cluster 1: 85% similar. Cluster 2: 85% similar. Cluster 0: < 85% similar.]

• APKs are nodes and edges
• Clusters are neighborhoods
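
The "nearest neighbor" in the diagram can be read straight off the pairwise scores: for each APK, keep the other APK it scores highest against. A minimal sketch, with hypothetical app ids and scores:

```python
def nearest_neighbors(scores: dict) -> dict:
    """Map each app to (most similar other app, score) using unordered pair scores."""
    best = {}
    for (left, right), score in scores.items():
        # Each pair is a candidate neighbor for both of its members.
        for app, other in ((left, right), (right, left)):
            if app not in best or score > best[app][1]:
                best[app] = (other, score)
    return best


demo_pairs = {("a", "b"): 0.95, ("a", "c"): 0.86, ("b", "c"): 0.85}
for app, (neighbor, score) in sorted(nearest_neighbors(demo_pairs).items()):
    print(app, "->", neighbor, score)
```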

Clustering and intelligence

Clustering versus heuristics

Evolution of malware over time
• By taking the clustering data and then overlaying it with the "packaged at" data, we can watch malware evolve over time.
• Color represents major version
• Time is a 4-month sliding window
• Shows iterations from malware writers
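
A sketch of the time overlay: count how many apps from each cluster fall inside each 4-month window as the window slides forward. The one-month step, the function names, and the sample dates are assumptions; the talk only states the 4-month window.

```python
from collections import Counter
from datetime import date


def add_months(day: date, months: int) -> date:
    """Return the first day of the month that is `months` after day's month."""
    index = day.year * 12 + (day.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def sliding_window_counts(samples, start: date, end: date, window_months: int = 4):
    """samples: (packaged_at, cluster) pairs. Returns (window_start, Counter) per step."""
    results = []
    window_start = date(start.year, start.month, 1)
    while window_start <= end:
        window_end = add_months(window_start, window_months)
        counts = Counter(
            cluster
            for packaged_at, cluster in samples
            if window_start <= packaged_at < window_end
        )
        results.append((window_start, counts))
        window_start = add_months(window_start, 1)  # assumed one-month step
    return results


# Hypothetical samples: cluster 1 fades out while cluster 2 takes over.
demo_samples = [
    (date(2015, 1, 10), 1),
    (date(2015, 2, 3), 1),
    (date(2015, 4, 20), 2),
    (date(2015, 6, 1), 2),
]
for window_start, counts in sliding_window_counts(demo_samples, date(2015, 1, 1), date(2015, 3, 31)):
    print(window_start, dict(counts))
```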

Pairwise problems and options
• Comparing 3,500 applications is 12,250,000 operations
• As you bring more applications in, expect to scale the EMR cluster or reduce n.
• You can overmatch on similarity – outlier issue
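
The 12,250,000 figure is n squared for n = 3,500, meaning every app is compared against every app, including itself and in both orders. The sketch below reproduces that number and, as an aside that is not from the talk, shows how many comparisons remain if each unordered pair is scored only once.

```python
from itertools import combinations


def all_ordered_comparisons(app_count: int) -> int:
    """Every app against every app, both orders and self-comparisons included."""
    return app_count * app_count


def unique_pairs(app_count: int) -> int:
    """Each unordered pair once, no self-comparisons: n * (n - 1) / 2."""
    return app_count * (app_count - 1) // 2


print(all_ordered_comparisons(3500))  # 12250000, the figure on the slide
print(unique_pairs(3500))             # 6123250

# Cross-check the closed form against an explicit enumeration on a small n.
print(sum(1 for _ in combinations(range(10), 2)) == unique_pairs(10))
```

Either way the growth is quadratic, so doubling the corpus roughly quadruples the work, which is why the slide suggests scaling the cluster or reducing n.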

Tripping over the bar
• Pairwise similarity for 7k apps is about 5 GB.
• So is S3
• Things go bad when you don’t respect the bucket size
• Troubleshooting CSV sizes is a thing
• Doesn’t work well on small applications
• Temporary files on your local machine that are 70 GB cause problems
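
A back-of-the-envelope way to sanity-check output sizes before a run. The row layout here is an assumption, not something the talk states: one CSV row per ordered pair, holding two 40-character SHA-1 hex digests and a 6-character score.

```python
SHA1_HEX_CHARS = 40


def estimated_csv_bytes(app_count: int, score_chars: int = 6) -> int:
    """Estimate CSV size for one row per ordered pair: sha1,sha1,score plus newline."""
    row_bytes = SHA1_HEX_CHARS + 1 + SHA1_HEX_CHARS + 1 + score_chars + 1
    return app_count * app_count * row_bytes


size = estimated_csv_bytes(7000)
print(size)                  # 4361000000 bytes under these assumptions
print(round(size / 1e9, 2))  # 4.36 (decimal GB)
```

Under these assumed row sizes the estimate lands in the same ballpark as the roughly 5 GB quoted for 7k apps, and the same function shows how quickly the file grows as the corpus does.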

Knowledge
• I had never used NetworkX before ~2014
• I had no idea how to go from what we had into a decent format for visualizing this (GraphML).
• Almost no experience in graph theory before ~2014
• Gilad Lotan had a great PyCon talk which got me started. I still reference his talks.
• Gephi is a great shortcut for visualizing in 2D if you aren’t familiar with D3
• Seth Hardy, who gave tons of amazing feedback while I was learning
• Jack Urban, who proved that it was possible to track applications as a network
• Gensim library is a great way to get started in doing comparisons of applications
• Lots of inspiration from the Defcon 22 OpenDNS talk (theirs is better)
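
For readers facing the same "how do I get this into GraphML" question, here is a minimal standard-library sketch that writes cluster labels and similarity edges as GraphML, which Gephi can open. The attribute names cluster and weight and the sample data are assumptions; NetworkX, mentioned above, offers `write_graphml` for the same step.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def to_graphml(labels: dict, edges: dict) -> str:
    """Serialize nodes (app id -> cluster) and edges ((a, b) -> score) as GraphML."""
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    # Declare the node and edge attributes before they are used.
    ET.SubElement(root, "key", {"id": "cluster", "for": "node",
                                "attr.name": "cluster", "attr.type": "int"})
    ET.SubElement(root, "key", {"id": "weight", "for": "edge",
                                "attr.name": "weight", "attr.type": "double"})
    graph = ET.SubElement(root, "graph", {"edgedefault": "undirected"})
    for app, cluster in labels.items():
        node = ET.SubElement(graph, "node", {"id": app})
        ET.SubElement(node, "data", {"key": "cluster"}).text = str(cluster)
    for (left, right), score in edges.items():
        edge = ET.SubElement(graph, "edge", {"source": left, "target": right})
        ET.SubElement(edge, "data", {"key": "weight"}).text = f"{score:.2f}"
    return ET.tostring(root, encoding="unicode")


# Hypothetical labels and edges.
demo_labels = {"a": 1, "b": 1, "c": 0}
demo_edges = {("a", "b"): 0.90}
print(to_graphml(demo_labels, demo_edges))
```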

Thank you.
