Wrangle 2016: Malware Tracking at Scale

© 2016 Cloudera, Inc. All rights reserved. Malware Tracking at Scale

Transcript of Wrangle 2016: Malware Tracking at Scale

Page 1: Wrangle 2016: Malware Tracking at Scale

Malware Tracking at Scale

Page 2: Wrangle 2016: Malware Tracking at Scale

About me
• Michael Bentley
• Formerly Director of Research and Response @ Lookout
• Currently working on data mining projects
• KK6WCN
• [email protected]

Page 3: Wrangle 2016: Malware Tracking at Scale

Agenda
• What we are trying to accomplish
• How basic heuristics work
• Where basic heuristics don’t work
• Tracking with pairwise similarity and EMR
• Visualizations to help extract more information
• Mistakes and caveats

Page 4: Wrangle 2016: Malware Tracking at Scale

What are we trying to accomplish
• Searching for major versions of software (malware)
• Find ways to detect it with simple heuristics
• Find ways to track it
• Dataset discovery

Page 5: Wrangle 2016: Malware Tracking at Scale

Simple heuristics
• Detect on static data
• Detect on metadata created by the analysis stack
[Diagram: applications, acquisition, analysis; detection inputs: hashes, strings, who signed it / certificate]

Page 6: Wrangle 2016: Malware Tracking at Scale

Simple heuristics - hashes
[Diagram: APK file, hashes, icon, dex file]
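
One reading of this slide is that hashes are taken of the APK as a whole and of pieces inside it, such as the icon and the dex file. A minimal Python sketch of that idea follows. It relies only on the fact that an APK is a ZIP archive; the function names and the `__apk__` key are illustrative assumptions, not part of the talk.

```python
import hashlib
import io
import zipfile


def component_hashes(apk_bytes):
    """Return SHA-1 digests for the whole APK and for every file inside it."""
    # "__apk__" is a made-up key for the digest of the archive itself.
    hashes = {"__apk__": hashlib.sha1(apk_bytes).hexdigest()}
    # An APK is a ZIP archive, so the standard zipfile module can open it.
    with zipfile.ZipFile(io.BytesIO(apk_bytes)) as archive:
        for name in archive.namelist():
            hashes[name] = hashlib.sha1(archive.read(name)).hexdigest()
    return hashes


def shared_components(hashes_a, hashes_b):
    """Names of inner files whose digests are identical in both APKs."""
    return sorted(
        name
        for name in hashes_a.keys() & hashes_b.keys()
        if name != "__apk__" and hashes_a[name] == hashes_b[name]
    )
```

Two repackaged samples can differ in their whole-file hash yet still share a `classes.dex` or icon digest, which is what makes per-component hashes useful as a cheap heuristic.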

Page 7: Wrangle 2016: Malware Tracking at Scale

Simple heuristics - string detection
• Nice ASCII string delimited by null bytes
• Malicious class path
• Byte code
• Exact match in one or both directions of string
• Ctrl + F
[Figure label: Null byte]
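
The string heuristic above can be sketched in a few lines of Python. This is an illustration under assumptions: "one or both directions" is read here as requiring a null byte on one or both sides of the string, and the function names, the `anchor` parameter, and the minimum length of 4 are all made up for the example.

```python
def null_delimited_strings(blob, min_len=4):
    """Return printable-ASCII strings that sit between null bytes in blob."""
    found = []
    for chunk in blob.split(b"\x00"):
        # Keep only chunks made entirely of printable ASCII characters.
        if len(chunk) >= min_len and all(0x20 <= byte <= 0x7E for byte in chunk):
            found.append(chunk.decode("ascii"))
    return found


def find_string(blob, needle, anchor="both"):
    """Search blob for needle with different levels of strictness.

    anchor="both":   null byte required on both sides (exact string).
    anchor="either": null byte required on at least one side.
    anything else:   plain substring search, the Ctrl+F case.
    """
    pattern = needle.encode("ascii")
    if anchor == "both":
        return b"\x00" + pattern + b"\x00" in blob
    if anchor == "either":
        return b"\x00" + pattern in blob or pattern + b"\x00" in blob
    return pattern in blob
```

The stricter the anchoring, the fewer false positives, but the easier it is for a small change to the string to break the match.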

Page 8: Wrangle 2016: Malware Tracking at Scale

Simple heuristics - certificates
• Same malware
• Different certificates

Page 9: Wrangle 2016: Malware Tracking at Scale

Where simple heuristics are good
• Good for things that don’t change
• Computationally cheap
• About the same scenario for network (IDS) or application inspection (malware detection)

Page 10: Wrangle 2016: Malware Tracking at Scale

Where it’s problematic
• Anything with funding/making money.
• Malware created in Eastern Europe, Asia, Italy (Hacking Team)
• Mass creation of certificates
• Code taken from Stack Overflow
• Anything with basic string obfuscation
• Hunting for new major versions

Page 11: Wrangle 2016: Malware Tracking at Scale

Enter pairwise similarity
You’re about to see a spreadsheet at a big data conference

http://gunshowcomic.com/648

Page 12: Wrangle 2016: Malware Tracking at Scale

Application pairwise similarity
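
As a rough illustration of what a pairwise similarity table contains, the Python sketch below scores every unique pair of applications. The talk does not state which similarity measure was used; Jaccard similarity over sets of features (for example, strings or class names) is a stand-in chosen for brevity, and the function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets: size of overlap over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def pairwise_similarity(features):
    """Score every unique pair of apps.

    features maps an app identifier (e.g. its SHA1) to a set of features.
    Returns a dict mapping (id_a, id_b) to a score between 0.0 and 1.0.
    """
    return {
        (a, b): jaccard(features[a], features[b])
        for a, b in combinations(sorted(features), 2)
    }
```

Each entry corresponds to one cell of the spreadsheet: a pair of applications and how alike they are.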

Page 13: Wrangle 2016: Malware Tracking at Scale

Go from pick one app and rescan corpus

Page 14: Wrangle 2016: Malware Tracking at Scale

Pick one application – Rescan corpus
• Examine one app
• Find heuristic
• Rescan corpus
• Rinse, repeat ad infinitum
• Throw people at the problem

http://bit.ly/2a0zcZR

Page 15: Wrangle 2016: Malware Tracking at Scale

Decoding what you already have
• Pairwise similarity defines the relationships for us
• Dots represent unique (SHA1) applications
• Colors represent major versions of malware
• Each color is within ~85% match of code distance

Page 16: Wrangle 2016: Malware Tracking at Scale

Clustering and intelligence
[Diagram: APK nodes. Nearest neighbor: 95% similar. Cluster 1: 85% similar. Cluster 2: 85% similar. Cluster 0: < 85% similar.]
• APKs are nodes and edges
• Clusters are neighborhoods
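
A minimal Python sketch of turning pairwise scores into clusters follows. It is an assumption about the mechanics, not the talk's stated method: apps are linked when their score is at least 0.85, connected groups become numbered clusters, and any app with no link at that threshold lands in cluster 0. The function name and the input layout are hypothetical.

```python
def cluster_by_threshold(apps, scores, threshold=0.85):
    """Label apps by connected component of the 'similar enough' graph.

    apps:   iterable of app identifiers.
    scores: dict mapping (id_a, id_b) to a similarity between 0.0 and 1.0.
    Returns a dict mapping each app to a cluster number; 0 means unclustered.
    """
    apps = sorted(apps)
    parent = {app: app for app in apps}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Draw an edge (merge two groups) whenever a pair clears the threshold.
    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for app in apps:
        groups.setdefault(find(app), []).append(app)

    labels = {}
    next_id = 1
    # Largest groups get the lowest cluster numbers; singletons go to 0.
    for members in sorted(groups.values(), key=lambda m: (-len(m), m)):
        if len(members) == 1:
            labels[members[0]] = 0
        else:
            for app in members:
                labels[app] = next_id
            next_id += 1
    return labels
```

Linking by connected component means two apps can share a cluster through a chain of neighbors even when their own score is below the threshold, so the threshold needs to be chosen with that in mind.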

Page 17: Wrangle 2016: Malware Tracking at Scale

Clustering and intelligence

Page 18: Wrangle 2016: Malware Tracking at Scale

Clustering versus heuristics

Page 19: Wrangle 2016: Malware Tracking at Scale

Evolution of malware over time
• By taking the clustering data and then overlaying it with the packaged-at data, we can watch malware evolve over time.
• Color represents major version
• Time is a 4-month sliding window
• Shows iterations from malware writers
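
The overlay described above can be sketched as a count of samples per cluster inside each 4-month window. This Python example is illustrative only: the input layout (a packaged-at date plus a cluster number per sample), the one-month step between windows, and the function names are assumptions.

```python
from collections import Counter
from datetime import date


def month_index(d):
    """Number of months since year 0, so months can be compared as integers."""
    return d.year * 12 + (d.month - 1)


def sliding_window_counts(samples, window_months=4):
    """Count samples per cluster in each sliding window.

    samples: iterable of (packaged_at, cluster_id) where packaged_at is a date.
    Returns a list of (window_start as "YYYY-MM", Counter of cluster ids),
    stepping the window forward one month at a time.
    """
    samples = list(samples)
    if not samples:
        return []
    months = [month_index(d) for d, _ in samples]
    first, last = min(months), max(months)
    # The final window is the one that ends on the last observed month.
    final_start = max(first, last - window_months + 1)
    windows = []
    for start in range(first, final_start + 1):
        counts = Counter(
            cluster
            for (_, cluster), month in zip(samples, months)
            if start <= month < start + window_months
        )
        label = f"{start // 12:04d}-{start % 12 + 1:02d}"
        windows.append((label, counts))
    return windows
```

Plotting one frame per window, with one color per cluster, gives the kind of animation in which new major versions appear and old ones fade out.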

Page 20: Wrangle 2016: Malware Tracking at Scale

Pairwise problems and options
• Comparing 3500 applications is 12,250,000 operations
• As you bring more applications in, expect to scale the EMR cluster or reduce n.
• You can overmatch on similarity – outlier issue
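
The 12,250,000 figure is n squared for n = 3500. The short Python sketch below shows that arithmetic and, as an aside that assumes the similarity measure is symmetric, how many unique pairs there are if each pair is compared only once. The function name is hypothetical.

```python
def comparison_counts(n):
    """Return (all ordered comparisons, unique unordered pairs) for n apps."""
    all_ordered = n * n              # every app against every app
    unique_pairs = n * (n - 1) // 2  # each pair once, no self-comparison
    return all_ordered, unique_pairs


print(comparison_counts(3500))  # (12250000, 6123250)
```

Either way the cost grows quadratically, so doubling the corpus roughly quadruples the work.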

Page 21: Wrangle 2016: Malware Tracking at Scale

Tripping over the bar
• Pairwise similarity for 7k apps is about 5 GB.
• So is S3 (the size limit for a single upload)
• Things go bad when you don’t respect the bucket size
• Troubleshooting CSV sizes is a thing
• Doesn’t work well on small applications
• Temporary files on your local machine that are 70 GB cause problems
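
To see why output size becomes a problem, here is a back-of-the-envelope Python estimate of a pairwise CSV. The row layout is an assumption made for illustration (two 40-character SHA1 hex digests and a score such as `0.8512`, comma separated); the talk does not describe the actual file format.

```python
def estimated_csv_bytes(n_apps, score_digits=4):
    """Estimate the size of an n-by-n pairwise CSV under an assumed row layout.

    Assumed row: "<sha1_a>,<sha1_b>,0.dddd" followed by a newline.
    """
    sha1_hex = 40                   # length of a SHA1 digest in hex
    score_chars = 2 + score_digits  # "0." plus the digits
    row_bytes = sha1_hex + 1 + sha1_hex + 1 + score_chars + 1
    return n_apps * n_apps * row_bytes


size = estimated_csv_bytes(7000)
print(f"{size / 1e9:.2f} GB")
```

Under these assumptions the estimate for 7,000 apps comes out in the same range as the roughly 5 GB quoted on the slide, and it grows with the square of the corpus size.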

Page 22: Wrangle 2016: Malware Tracking at Scale

Knowledge
• I had never used NetworkX before ~2014
• I had no idea how to go from what we had into a decent format for visualizing this (GraphML).
• Almost no experience in graph theory before ~2014
• Gilad Lotan had a great PyCon talk which got me started. I still reference his talks.
• Gephi is a great shortcut for visualizing in 2D if you aren’t familiar with D3
• Seth Hardy who gave tons of amazing feedback while I was learning
• Jack Urban who proved that it was possible to track applications as a network
• Gensim library is a great way to get started in doing comparisons of applications
• Lots of inspiration from the Defcon 22 OpenDNS talk (theirs is better)

Page 23: Wrangle 2016: Malware Tracking at Scale

Thank you.