Case Study: Using Mongo and MapReduce to Analyze a Difficult Research Problem

1. Malware, Mongo and Map Reduce
   Brandon Dixon, 9b+

2. Who I Am
   Security Researcher: GWU CERT; 9b+
   Past: Security Consultant @ G2, Inc.; SMT/Network Engineer @ Windermere
   Focus: PDF Malware Analysis; Messaging Technologies

3. Agenda
   PDF Woes
   Overview of PDF X-RAY
   Why Mongo?
   Challenges
   Old Questions with New Answers
   Conclusions

4. The PDF Problem
   Extremely diverse and flexible
   Heavily used by attackers
   Difficult to parse and identify malicious content
   Widely distributed
   http://www.zdnet.com/blog/security/study-6-out-of-every-10-users-run-vulnerable-adobe-reader/9014

5. Oh, but there's more!
   PDF A (200KB)              PDF B (3MB)
   11 Objects                 300 Objects
   291 Names/Dicts            158 Names/Dicts
   Metadata                   Partial metadata
   Filters for compression    No filters applied
   Embedded documents         References to outside sites
   Multiple updates           Single update

6. PDF X-RAY
   Process:
     1. Upload PDF
     2. Convert to JSON
     3. Store in Mongo
     4. Create report
     5. User is informed
   Stats:
     10 collections
     ~80K documents
     ~3GB of data
     PyMongo
     Mongo Django

7. Why Mongo?
   JSON in, JSON out
   Documents treated independently
   No fixed layout or schema
   MapReduce power
   Heard it was web-scale

9. = { one document to rule them all. }

11. No Thanks

14. http://www.youtube.com/watch?v=b2F-DItXtZs

15. MapReduce Fun
   How many unique named functions occur across all malicious documents compared to good ones?
   Despite not being required, is metadata ever useful as an indicator when classifying a PDF?
   What can we glean about Anti-Virus software based on stored reports (assuming they are available)?

16. Question #1: Results (Good vs. Bad)

18. Question #2: Results (Good vs. Bad)

20. Question #3: Results

21. Gripes
   Size limits
   MapReduce troubleshooting
   Restoring collections into each other (no upsert)
   Getting the entire object back on specific queries
   Lack of triggers

22. Kudos 10gen
   Size limits keep getting bumped up
   Output shaping is in testing
   Simple aggregation through queries is in testing
   Google group responses are quick

23. Questions and Contact
   Brandon Dixon
   brandon@9bplus.com
   www.9bplus.com
   blog.9bplus.com
   www.pdfxray.com
   @9bplus
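
The first MapReduce question in the slides (how many unique named functions occur across malicious documents compared to good ones) can be illustrated with a small sketch. This is not the code used by PDF X-RAY, which the slides do not show. It runs the map and reduce phases in plain Python so it works without a Mongo server, and the document shape (`label`, `names`), the sample values, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical, simplified stand-ins for the per-PDF JSON documents.
# The field names "label" and "names" are assumptions, not the real schema.
SAMPLE_DOCS = [
    {"hash": "a1", "label": "malicious", "names": ["/JavaScript", "/OpenAction", "/JS"]},
    {"hash": "b2", "label": "malicious", "names": ["/JS", "/AcroForm"]},
    {"hash": "c3", "label": "good", "names": ["/Font", "/Metadata"]},
    {"hash": "d4", "label": "good", "names": ["/Font", "/Pages", "/Metadata"]},
]


def map_doc(doc):
    """Map phase: emit one ((label, name), 1) pair per name in a document."""
    for name in doc["names"]:
        yield (doc["label"], name), 1


def reduce_counts(key, values):
    """Reduce phase: total the emitted values for a single key."""
    return sum(values)


def map_reduce(docs):
    """Group mapped pairs by key, then reduce each group to one value."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    return {key: reduce_counts(key, values) for key, values in grouped.items()}


def unique_names_per_label(counts):
    """Post-process the reduced output: count distinct names per label."""
    unique = defaultdict(set)
    for label, name in counts:
        unique[label].add(name)
    return {label: len(names) for label, names in unique.items()}


counts = map_reduce(SAMPLE_DOCS)
summary = unique_names_per_label(counts)
print(summary)
```

Because each document is mapped independently, the same map and reduce logic could in principle be pushed to the database instead of run client-side; the sketch only shows the shape of the computation.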
