CISC 879 - Machine Learning for Solving Systems Problems

Behavioural Classification
Tony Lee and Jigar J Mody

Presented by: Rag Mayur Chevuri
Dept of Computer & Information Sciences
University of Delaware

Post on 09-Jan-2016

Automatic malware classification

• Human analysis is inefficient and inadequate.
• Large number of new virus/spyware families.
• Our focus: the classification problem.
• Effective classification leads to:
  - Better detection
  - Better cleaning
  - Better analysis solutions

Classification Process

Objectives of classification methodologies

• Classify objects efficiently and automatically.
• Keep information loss minimal.
• Structure the results so they can be stored, analyzed and referenced efficiently.

Objectives of classification methodologies (contd.)

• Applies learned knowledge to identify familiar patterns and similarity relations in a given target automatically.

• Adaptable and has innate learning abilities.

Approach

• Automated classification method based on:
  - runtime behavioral data
  - machine learning
• Represent a file by its runtime behavior
• Structure the event information
• Store it in a database
• Construct classifiers
• Apply the classifiers to new objects

A “good” knowledge representation

• Effectively captures knowledge of the object it represents.
• Can persist in permanent storage.
• Enables classifiers to efficiently and effectively correlate data across a large number of objects.

Representing behavior

• The meaning of a particular action is its resulting state.
• Construct the representation in a consistent canonical format.

Vector Approach

• Process data in vector format using statistical and probabilistic algorithms.
• Problem: vector size, scalability, and factorability.

The Opaque Object Approach

• Objects represent data in rich syntax

• Rich semantic representation of the actual object

• Precise distance between objects used for Clustering

Events Representation

• Sequence of events
• Ordered according to:
  - time of occurrence of program actions
  - environment state transitions

[Figure: timeline from 00:00 to 00:04 of an example event sequence. Events shown: Registry Query, File Write, Open Process, Network Listen, Registry Write, Allocate VM, Write VM, Terminate Process, Open Mutant, Create Mutant.]

Event Properties

• Event ID
• Event object (e.g. registry, file, process, socket, etc.)
• Event subject if applicable (i.e. the process that takes the action)
• Kernel function called if applicable
• Action parameters (e.g. registry value, file path, IP address)
• Status of the action (e.g. file handle created, registry removed, etc.)
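
The properties listed above could be held in a small record type. The following is a minimal sketch, not the authors' actual schema: the class name `Event`, its field names, the extra `action` field, the `key()` helper, and all sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One observed runtime event (hypothetical schema, for illustration only)."""
    event_id: int                          # Event ID
    obj: str                               # event object: "registry", "file", "process", "socket", ...
    action: str                            # assumed field: what was done to the object, e.g. "write"
    subject: Optional[str] = None          # process that takes the action, if applicable
    kernel_function: Optional[str] = None  # kernel function called, if applicable
    parameters: tuple = ()                 # action parameters: registry value, file path, IP address
    status: Optional[str] = None           # status of the action, e.g. "file handle created"

    def key(self) -> tuple:
        """Assumed canonical form used when comparing events across samples."""
        return (self.obj, self.action)


# A behavior is a time-ordered sequence of events (all values are made up).
trace = [
    Event(1, "registry", "query", subject="sample.exe",
          parameters=("HKLM\\Software\\Example",)),
    Event(2, "file", "write", subject="sample.exe",
          parameters=("C:\\temp\\a.tmp",), status="file handle created"),
]
print([e.key() for e in trace])
```

Freezing the record keeps each event immutable and hashable, which is convenient if events are later stored or used as dictionary keys.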

An example (Register Event)

Generate Classifier for Classification

Which classifier?

• Case-based classifier: treat the existing malware collection as a database of solutions.
• Learn by case-based reasoning (CBR).
• Nearest neighbor algorithms.
• To make the CBR approach scalable, apply clustering.

Clustering

• Unsupervised learning
• Organize objects into clusters
• A cluster is a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.

Distance Measure

• Levenshtein Distance: “minimum cost required to transform one sequence of objects to another sequence by applying a set of operations.”
• Operation = Op(Event)
• $\mathrm{Cost}(\mathrm{Transformation}) = \sum_i \mathrm{Cost}(\mathrm{Operation}_i)$
• The cost of an operation depends on the operator as well as the operand.
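
A cost-weighted edit distance of this kind can be sketched with the standard dynamic-programming recurrence. This is an illustration under stated assumptions, not the authors' implementation: events are simplified to `(object, action)` pairs, and every value in `op_cost` is hypothetical, since the actual operation cost matrix is not reproduced in this transcript.

```python
from typing import Optional, Sequence, Tuple

Event = Tuple[str, str]  # (object, action): a simplified stand-in for a full event


def op_cost(op: str, a: Optional[Event] = None, b: Optional[Event] = None) -> float:
    """Hypothetical cost table: the cost depends on the operator and the operand."""
    if op == "substitute":
        if a == b:
            return 0.0
        # Assumed: replacing an event with one on the same object type is cheaper.
        return 0.5 if a[0] == b[0] else 1.0
    # Assumed: inserting or deleting a network event costs more than other events.
    operand = a if op == "delete" else b
    return 1.5 if operand[0] == "network" else 1.0


def sequence_distance(s: Sequence[Event], t: Sequence[Event]) -> float:
    """Minimum total cost of the operations that transform s into t."""
    # dist[i][j] = cheapest way to turn the first i events of s into the first j of t
    dist = [[0.0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        dist[i][0] = dist[i - 1][0] + op_cost("delete", a=s[i - 1])
    for j in range(1, len(t) + 1):
        dist[0][j] = dist[0][j - 1] + op_cost("insert", b=t[j - 1])
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            dist[i][j] = min(
                dist[i - 1][j] + op_cost("delete", a=s[i - 1]),
                dist[i][j - 1] + op_cost("insert", b=t[j - 1]),
                dist[i - 1][j - 1] + op_cost("substitute", a=s[i - 1], b=t[j - 1]),
            )
    return dist[len(s)][len(t)]


a = [("registry", "query"), ("file", "write"), ("network", "listen")]
b = [("registry", "write"), ("file", "write")]
print(sequence_distance(a, b))
```

With unit costs for every operation this reduces to the classic Levenshtein distance; the per-operand costs are what let some behavioral differences count more than others.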

Operation Cost Matrix for Similarity Measure

k-medoid partitioning clustering algorithm

1. Place K points into the space. These points represent the initial group medoids.
2. Assign each object to the group that has the closest medoid.
3. Recalculate the positions of the K medoids.
4. Repeat steps 2 and 3 until the medoids no longer move.
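
The loop described on this slide can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `k_medoids`, the random choice of initial medoids, the `seed` and `max_iter` parameters, and the numeric toy data are all assumptions. In the setting of this talk the objects would be event sequences and `distance` would be the cost-weighted edit distance.

```python
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def k_medoids(objects: Sequence[T], k: int, distance: Callable[[T, T], float],
              seed: int = 0, max_iter: int = 100) -> Tuple[List[int], List[int]]:
    """Return (indices of the medoids, cluster label of each object)."""
    rng = random.Random(seed)
    # Pick K of the objects as the initial medoids (assumed: chosen at random).
    medoids = rng.sample(range(len(objects)), k)
    labels: List[int] = []
    for _ in range(max_iter):
        # Assign each object to the group that has the closest medoid.
        labels = [min(range(k), key=lambda c: distance(obj, objects[medoids[c]]))
                  for obj in objects]
        # Recalculate each medoid: the member with the smallest total
        # distance to the other members of its group.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a group ends up empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(min(
                members,
                key=lambda i: sum(distance(objects[i], objects[j]) for j in members)))
        # Stop once the medoids no longer move.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, labels


# Toy data: plain numbers with absolute difference as the distance.
points = [1.0, 1.2, 0.8, 10.0, 10.5, 9.5]
medoids, labels = k_medoids(points, k=2, distance=lambda x, y: abs(x - y))
print(medoids, labels)
```

Unlike k-means, a medoid is always one of the actual objects, so the method only needs pairwise distances and works for sequences that cannot be averaged.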

Classifying a new object

Nearest Neighbor Classification

Compare the new object to all the medoids.

Assign the new object the family name of the closest medoid.
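
The two steps above amount to a nearest-neighbor lookup over the medoids. The sketch below assumes each medoid already carries a family name; the function names, the family names `FamilyA` and `FamilyB`, the sample sequences, and `toy_distance` (a simple positional mismatch count standing in for the edit distance) are hypothetical.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def classify(new_object: T, labeled_medoids: Sequence[Tuple[str, T]],
             distance: Callable[[T, T], float]) -> str:
    """Return the family name of the medoid closest to new_object."""
    family, _ = min(labeled_medoids, key=lambda fm: distance(new_object, fm[1]))
    return family


def toy_distance(s: Sequence[str], t: Sequence[str]) -> float:
    """Stand-in distance: mismatched positions plus the difference in length."""
    return sum(x != y for x, y in zip(s, t)) + abs(len(s) - len(t))


# Several medoids may share a family name when K exceeds the number of families.
labeled_medoids = [
    ("FamilyA", ("registry write", "file write", "network listen")),
    ("FamilyB", ("open process", "allocate vm", "write vm")),
    ("FamilyA", ("registry write", "create mutant")),
]

sample = ("registry query", "file write", "network listen")
print(classify(sample, labeled_medoids, toy_distance))
```

Comparing against K medoids instead of every stored sample is what keeps the case-based approach scalable as the collection grows.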

Experiment

• An automated distributed replication system

Data Analysis

• Test data:
  - Experiment 1: 461 samples of 3 families
  - Experiment 2: 760 samples of 11 families
• 10-fold cross validation
• We vary and contrast experiments by adjusting two parameters:
  - number of clusters (K)
  - maximum number of events (E)
• Measure error rate and accuracy gain

• Error rate: $ER = \dfrac{\text{number of incorrectly classified samples}}{\text{total number of samples}}$
• Accuracy: $AC = 1 - ER$
• Accuracy gain of x over y: $G(x, y) = \left| \dfrac{ER(y) - ER(x)}{ER(x)} \right|$
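
As a worked example of these three definitions, the snippet below evaluates them for two made-up runs. The function names and the sample counts are hypothetical and are not results from the experiments.

```python
def error_rate(incorrect: int, total: int) -> float:
    """ER = number of incorrectly classified samples / total number of samples."""
    return incorrect / total


def accuracy(er: float) -> float:
    """AC = 1 - ER."""
    return 1.0 - er


def accuracy_gain(er_x: float, er_y: float) -> float:
    """G(x, y) = |(ER(y) - ER(x)) / ER(x)|; undefined when ER(x) is 0."""
    return abs((er_y - er_x) / er_x)


# Hypothetical runs: x misclassifies 20 of 400 samples, y misclassifies 40 of 400.
er_x = error_rate(20, 400)
er_y = error_rate(40, 400)
print(er_x, accuracy(er_x), accuracy_gain(er_x, er_y))
```

Because the gain is normalized by ER(x), halving the error rate from 10% to 5% counts as a gain of 1, however small the absolute change in accuracy.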

Experiment A

Observations

• Accuracy vs. number of clusters: the error rate reduces as the number of clusters increases.
• Accuracy vs. maximum number of events: the error rate reduces as the event cap increases. The more events we observe, the more accurately we capture the behavior, and the more likely the clustering discovers the semantic similarity among variants of a family.

• Accuracy gain vs. number of events: the gain in accuracy is more substantial at lower event caps (100 vs. 500) than at higher event caps (500 vs. 1000).
• Accuracy vs. number of families: the 11-family experiment outperforms the 3-family experiment in accuracy in high event cap tests (1000), but the result is the opposite in lower event cap tests (100).

Conclusion

• Runtime behavior + machine learning allows us to focus on pattern/similarity recognition in behavior semantics.
• Lack of code structural information.
• Combine static analysis to improve classification accuracy.
• “Developing automated classification process that applies classifiers with innate learning ability on near lossless knowledge representation is the key to the future of malware classification and defense.”
