Case Study: Using Mongo and MapReduce to Analyze a Difficult Research Problem


Upload: dataversity

Post on 25-May-2015


Category:

Technology



DESCRIPTION

This presentation demonstrates how NoSQL technologies were used to solve a difficult analytical problem that traditional SQL databases could not. PDF malware has been on the rise for the past few years and has become one of the most successful methods for attackers to gain unauthorized access to a network. Malware analysis of PDF documents has typically been done in isolation. Commercial entities do not share their data, so researchers must fend for themselves, and more often than not they analyze a PDF file independently of other malicious PDF files.

I found this one-file-at-a-time approach to be highly inefficient, but storing multiple PDF documents in a database was a problem in itself. Traditional SQL databases didn't seem like the right fit given their rigid schemas and strictly relational models. PDF files also contain a lot of dynamic data, which makes them a tough fit for a traditional SQL model. PDF A could contain 40 objects, whereas PDF B could contain 3,000. Scaling this out becomes difficult and messy.

When looking at this problem, I wanted to solve a number of issues at once. I needed a good way to share PDF samples, an easy way to query a corpus of documents, and the ability to efficiently get my data back out so I could display it elsewhere. With all my samples in JSON format, MongoDB just made sense, as it could take in these objects and allow me to query them as a whole or independently. MongoDB also provided me with a rich tool set for answering questions that had never been posed before. By using single and multi-step map/reduce jobs, I was able to aggregate PDF characteristics and apply simple averaging to identify commonalities between malicious documents.

Though I have had great outcomes and successes with MongoDB, there have also been annoyances and unexplained quirks. These issues had me pinned against a wall for days and sometimes had me wondering whether I had picked the wrong model for tackling this problem. In the end, though, I was able to overcome these issues and account for them without hassle.

This talk will cover new research methods and tools, created using NoSQL technologies, for analyzing PDF documents more efficiently and in a way that promotes collaboration among the community. It will also serve as a step forward in detecting malicious PDFs by looking at them from a statistical standpoint.

When I look back on the choice to use MongoDB, I think I made a great decision. I can't begin to think how I would have handled processing thousands of uniquely named function calls with multiple attributes for each PDF, or running multi-step map/reduce jobs, against a blob in a SQL database instead of a Mongo document. MongoDB provided me with rich functionality that easily led me to success on my project.
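The multi-step map/reduce with simple averaging described above can be sketched in plain Python. This is not the code from the talk: MongoDB's own map/reduce runs JavaScript functions on the server, and the sample documents, the field names (`hash`, `malicious`, `names`), and the `map_reduce` helper below are all assumptions made for illustration. Step one counts how often each name occurs inside each document; step two averages those per-document counts for each class.

```python
from collections import defaultdict

# Hypothetical parsed-PDF documents; every field name here is an assumption.
SAMPLES = [
    {"hash": "a1", "malicious": True, "names": ["/JS", "/OpenAction", "/JS"]},
    {"hash": "b2", "malicious": True, "names": ["/JS", "/AA"]},
    {"hash": "c3", "malicious": False, "names": ["/Font", "/Page", "/Page", "/Page"]},
    {"hash": "d4", "malicious": False, "names": ["/Font", "/JS"]},
]


def map_reduce(docs, mapper, reducer):
    """Group every (key, value) pair the mapper yields, then reduce each group once."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            groups[key].append(value)
    return [{"_id": key, "value": reducer(key, values)} for key, values in groups.items()]


# Step 1: count occurrences of each name inside each document.
def map_names(doc):
    for name in doc["names"]:
        yield (doc["hash"], doc["malicious"], name), 1


def reduce_sum(key, values):
    return sum(values)


# Step 2: average the per-document counts for each (class, name) pair.
def map_counts(row):
    _hash, malicious, name = row["_id"]
    yield (malicious, name), row["value"]


def reduce_mean(key, values):
    # The mean is taken over documents that contain the name at least once.
    return sum(values) / len(values)


def average_name_counts(docs):
    per_document = map_reduce(docs, map_names, reduce_sum)
    per_class = map_reduce(per_document, map_counts, reduce_mean)
    return {row["_id"]: row["value"] for row in per_class}


if __name__ == "__main__":
    for key, mean in sorted(average_name_counts(SAMPLES).items()):
        print(key, mean)
```

One design caveat: in MongoDB the reduce function may be called more than once on partial results for the same key, so a mean is normally carried as a running sum and count and divided in a finalize step. The sketch reduces each group exactly once, which is why the direct division is safe here.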

TRANSCRIPT

1. Malware, Mongo and Map Reduce
   Brandon Dixon, 9b+

2. Who I Am
   Security Researcher: GWU CERT, 9b+
   Past: Security Consultant @ G2, Inc.; SMT/Network Engineer @ Windermere
   Focus: PDF Malware Analysis; Messaging Technologies

3. Agenda
   PDF Woes
   Overview of PDF X-RAY
   Why Mongo?
   Challenges
   Old Questions with New Answers
   Conclusions

4. The PDF Problem
   Extremely diverse and flexible
   Heavily used by attackers
   Difficult to parse and identify malicious content
   Widely distributed
   http://www.zdnet.com/blog/security/study-6-out-of-every-10-users-run-vulnerable-adobe-reader/9014

5. Oh, but there's more!
   PDF A (200KB)              PDF B (3MB)
   11 Objects                 300 Objects
   291 Names/Dicts            158 Names/Dicts
   Metadata                   Partial metadata
   Filters for compression    No filters applied
   Embedded documents         References to outside sites
   Multiple updates           Single update

6. PDF X-RAY
   Process:
   1. Upload PDF
   2. Convert to JSON
   3. Store in Mongo
   4. Create report
   5. User is informed
   Stats: 10 collections; ~80K documents; ~3GB of data; PyMongo; Mongo Django

7. Why Mongo?
   JSON in, JSON out
   Documents treated independently
   No fixed layout or schema
   MapReduce power
   Heard it was web-scale

8. Why Mongo? (same bullets as slide 7)

9. = { one document to rule them all. }

10. Why Mongo? (same bullets as slide 7)

11. No Thanks

12. Why Mongo? (same bullets as slide 7)

13. Why Mongo? (same bullets as slide 7)

14. http://www.youtube.com/watch?v=b2F-DItXtZs

15. MapReduce Fun
   How many unique named functions occur across all malicious documents compared to good ones?
   Despite not being required, is metadata ever useful as an indicator when classifying a PDF?
   What can we glean about Anti-Virus software based on stored reports (assuming they are available)?

16. Question #1: Results (panels: Good, Bad)

17. MapReduce Fun (same questions as slide 15)

18. Question #2: Results (panels: Good, Bad)

19. MapReduce Fun (same questions as slide 15)

20. Question #3: Results

21. Gripes
   Size limits
   MapReduce troubleshooting
   Restoring collections into each other (no upsert)
   Getting the entire object back on specific queries
   Lack of triggers

22. Kudos, 10gen
   Size limits keep getting bumped up
   Output shaping is in testing
   Simple aggregation through queries is in testing
   Google group responses are quick

23. Questions and Contact
   Brandon Dixon
   brandon@9bplus.com
   www.9bplus.com
   blog.9bplus.com
   www.pdfxray.com
   @9bplus
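
The first "MapReduce Fun" question, how many unique named functions occur in malicious documents compared to good ones, can be approximated with set operations once each PDF has been converted to a JSON-like document. The sketch below is an illustration only, not the job used for the results slides. The sample data, the `malicious` flag, the `names` field, and the `unique_names_by_class` helper are assumptions.

```python
# Hypothetical parsed-PDF documents; field names and values are assumptions.
CORPUS = [
    {"hash": "a1", "malicious": True, "names": ["/JS", "/OpenAction", "/JS"]},
    {"hash": "b2", "malicious": True, "names": ["/JS", "/AA"]},
    {"hash": "c3", "malicious": False, "names": ["/Font", "/Page", "/Page"]},
    {"hash": "d4", "malicious": False, "names": ["/Font", "/JS"]},
]


def unique_names_by_class(docs):
    """Collect the distinct names seen in malicious and in good documents."""
    malicious, good = set(), set()
    for doc in docs:
        target = malicious if doc["malicious"] else good
        target.update(doc["names"])
    return {
        "malicious_only": sorted(malicious - good),
        "good_only": sorted(good - malicious),
        "shared": sorted(malicious & good),
    }


if __name__ == "__main__":
    summary = unique_names_by_class(CORPUS)
    for label, names in summary.items():
        print(label, len(names), names)
```

Names that land in `malicious_only` are candidates for indicators, while `shared` names need the frequency comparison that a counting or averaging job provides.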