CISC 879 - Machine Learning for Solving Systems Problems
Presented by: Satyajeet
Dept. of Computer & Information Sciences
University of Delaware
Automatic Analysis of Malware Behavior using Machine Learning
Authors: Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten Holz
Abstract & Introduction
• Malware
  • Poses a major threat to the security of computer systems.
  • Very diverse: viruses, Internet worms, trojan horses.
  • Amount of malware: millions of hosts infected.
• Obfuscation and polymorphism impede detection at the file level.
• Dynamic analysis helps in characterizing malware and defending against it.
Abstract & Introduction Contd..
• Framework for automatic analysis of malware behavior using machine learning.
• Clustering: the framework automatically identifies novel classes of malware with similar behavior.
• Classification: unknown malware is assigned to these discovered classes.
• An incremental approach based on both is used for behavior-based analysis.
Automatic analysis of Malware Behavior
• Framework steps and procedure
  • Malware binaries are executed and monitored in a sandbox environment. A report is generated on the system calls and their arguments.
  • The sequential reports are embedded in a vector space where each dimension is associated with a behavioral pattern.
  • ML techniques are then applied to the embedded reports to identify and classify malware.
  • Incremental analysis progresses by alternating between clustering and classification.
Report representation
• Can be textual or XML.
• Human readable and suitable for computing general statistics.
• But not efficient for automatic analysis.
• Hence MIST (Malware Instruction Set)
  • Inspired by instruction sets used in processor design.
MIST
• Category: the category of system calls.
• Operation: reflects a particular system call.
• Arguments: given as argblocks.
Sandbox and MIST representation
Representation
• These sequential reports identify typical malware behavior, such as changing registry keys or modifying system files.
• But they are still not suitable for efficient analysis techniques. Hence the need to embed behavior reports in a vector space using instruction q-grams.
• This embedding expresses the similarity of behavior geometrically, as a distance between vectors.
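A minimal sketch of the q-gram embedding, to make the idea concrete. The slides do not give the exact weighting, so this assumes a binary embedding (a q-gram is either present or not) scaled to unit length, with Euclidean distance. The instruction names and the function names `qgrams`, `embed`, and `distance` are illustrative only.

```python
from math import sqrt


def qgrams(report, q=2):
    """Return every contiguous run of q instructions in a report."""
    return [tuple(report[i:i + q]) for i in range(len(report) - q + 1)]


def embed(report, q=2):
    """Map a report to a sparse vector with one dimension per q-gram.

    Assumption: binary weights normalized to unit length, so that long
    and short reports are comparable.
    """
    present = set(qgrams(report, q))
    if not present:
        return {}
    weight = 1.0 / sqrt(len(present))
    return {gram: weight for gram in present}


def distance(x, y):
    """Euclidean distance between two sparse vectors stored as dicts."""
    dims = set(x) | set(y)
    return sqrt(sum((x.get(d, 0.0) - y.get(d, 0.0)) ** 2 for d in dims))


# Hypothetical instruction sequences, not real sandbox output.
report_a = ["load_dll", "open_file", "write_registry", "create_process"]
report_b = ["connect", "send", "recv", "close"]
```

Under these assumptions, `distance(embed(report_a), embed(report_a))` is 0, and two reports that share no q-gram, such as `report_a` and `report_b`, are at the maximum distance of sqrt(2).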
Clustering and Classification
• Reports are embedded in a vector space, ready for applying ML techniques.
• Clustering of behavior: classes of malware with similar behavior are identified.
• Classification of behavior: malware is assigned to known classes of behavior.
• What allows us to do this?
  • Malware binaries come in families of similar variants with similar behavior patterns!
Contd..
Algorithms
• Prototype extraction
  • Iterative algorithm.
  • Extracts a small set of prototypes from the set of reports. The first one is chosen at random.
• Clustering using prototypes
  • At the beginning, each prototype is an individual cluster.
  • The algorithm repeatedly determines and merges the nearest pair of clusters.
• Classification using prototypes
  • Learns to discriminate between classes of malware.
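A rough sketch of prototype extraction and prototype-based clustering. The slides only say that extraction is iterative and starts from a random report, so the greedy farthest-first rule, the complete-linkage merge criterion, and the thresholds `max_dist` and `merge_dist` are assumptions for illustration. Embedded reports are stood in for by 2-D points.

```python
import random
from math import dist  # Euclidean distance between points (Python 3.8+)


def extract_prototypes(points, max_dist, rng=random):
    """Greedy farthest-first selection (assumed rule, see lead-in).

    Starts from a random point, then keeps promoting the point that is
    farthest from its nearest prototype until every point lies within
    max_dist of some prototype.
    """
    prototypes = [rng.choice(points)]
    while True:
        # Gap between each point and its nearest prototype so far.
        gaps = [min(dist(p, z) for z in prototypes) for p in points]
        worst = max(range(len(points)), key=gaps.__getitem__)
        if gaps[worst] <= max_dist:
            return prototypes
        prototypes.append(points[worst])


def cluster_prototypes(prototypes, merge_dist):
    """Start with one cluster per prototype and merge the nearest pairs."""
    clusters = [[z] for z in prototypes]

    def linkage(a, b):
        # Complete linkage (assumption): distance of the farthest members.
        return max(dist(x, y) for x in a for y in b)

    while len(clusters) > 1:
        gap, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if gap > merge_dist:
            break  # the nearest pair is already too far apart
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


# Two tight, well-separated groups of hypothetical embedded reports.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
prototypes = extract_prototypes(points, max_dist=0.5, rng=random.Random(0))
clusters = cluster_prototypes(prototypes, merge_dist=1.0)
```

With this toy data, one prototype per group is enough, so only two points out of six need to be clustered. That reduction is the point of working on prototypes instead of all reports.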
Algorithms Contd..
• For each report, the algorithm determines the nearest prototype among the clusters in the training data. If the report lies within the radius, it is assigned to that cluster.
• Otherwise the report is rejected and held back for later incremental analysis.
• Incremental analysis
  • Reports to be analyzed are received from a source.
  • They are initially classified using the prototypes of known clusters.
  • Variants of known malware are thereby identified for further analysis.
  • Prototypes are extracted from the remaining reports and clustered again.
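The classification-with-rejection step and one round of incremental analysis can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: embedded reports are 2-D points, `radius` is a hypothetical threshold, and `cluster_fn` is a placeholder for whatever prototype extraction and clustering routine is used.

```python
from math import dist  # Euclidean distance between points (Python 3.8+)


def classify(vector, labeled_prototypes, radius):
    """Return the label of the nearest prototype, or None to reject."""
    if not labeled_prototypes:
        return None
    proto, label = min(labeled_prototypes,
                       key=lambda pair: dist(vector, pair[0]))
    return label if dist(vector, proto) <= radius else None


def incremental_step(batch, labeled_prototypes, radius, cluster_fn):
    """Process one batch: classify known variants, cluster the rest.

    cluster_fn takes the rejected vectors and returns a list of
    (prototype, label) pairs for the newly discovered clusters.
    """
    assigned, rejected = {}, []
    for vector in batch:
        label = classify(vector, labeled_prototypes, radius)
        if label is None:
            rejected.append(vector)  # held back for clustering
        else:
            assigned.setdefault(label, []).append(vector)
    new_prototypes = cluster_fn(rejected)
    return assigned, labeled_prototypes + new_prototypes


def one_cluster_per_report(vectors):
    """Trivial stand-in for real clustering: every report is its own class."""
    return [(v, f"new-{i}") for i, v in enumerate(vectors)]


known = [((0.0, 0.0), "family_a")]  # hypothetical known cluster
batch = [(0.1, 0.0), (4.0, 4.0)]    # one known variant, one novel report
assigned, known = incremental_step(batch, known, 0.5, one_cluster_per_report)
```

After this step the novel report has become a prototype of its own, so a later report near `(4.0, 4.0)` is classified directly instead of being rejected again.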
Experiments and Results
Evaluating components
• Prototype extraction
  • Evaluated using precision, recall, and compression.
  • Precision: 0.99 at a corpus compression of 2.9% and 7%.
• Clustering
  • Evaluated using F-measure.
  • F-measure in the experiments: MIST 1 = 0.93 and MIST 2 = 0.95, better than the 0.881 of previous related work.
• Classification
  • F-measure in the experiments: MIST 1 = 0.96 and MIST 2 = 0.99.
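For reference, a small sketch of how precision, recall, and F-measure can be computed for a clustering. The slides do not spell out the definitions, so the majority-based precision and recall below are an assumption. Only the F-measure as the harmonic mean of the two is standard. The function name `cluster_scores` and the toy labels are illustrative.

```python
from collections import Counter, defaultdict


def cluster_scores(cluster_ids, true_labels):
    """Return (precision, recall, f_measure) for a non-empty clustering.

    Precision: share of reports that carry the majority label of their
    cluster. Recall: share of reports that sit in the cluster holding
    most of their class. Both definitions are assumptions, see lead-in.
    """
    n = len(true_labels)
    by_cluster = defaultdict(Counter)  # cluster id -> label counts
    by_label = defaultdict(Counter)    # label -> cluster id counts
    for cluster, label in zip(cluster_ids, true_labels):
        by_cluster[cluster][label] += 1
        by_label[label][cluster] += 1
    precision = sum(max(c.values()) for c in by_cluster.values()) / n
    recall = sum(max(c.values()) for c in by_label.values()) / n
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Six hypothetical reports: found cluster ids and true family labels.
precision, recall, f_measure = cluster_scores(
    [0, 0, 0, 1, 1, 2],
    ["a", "a", "b", "b", "b", "b"],
)
```

Here five of the six reports match the majority label of their cluster (precision 5/6), but class "b" is split over three clusters, which pulls recall down to 4/6.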
Experiments and Results Contd..
Conclusion
• A new framework is introduced that overcomes several deficiencies of previous approaches.
• The framework is learning-based.
• The framework can be implemented in practice.
• Steps: collect malware, study it in a sandbox environment, embed the observed behavior in a vector space, and apply learning algorithms (clustering and classification).
• The process is efficient and learns automatically after the initial setup and run.
Thank you !