
Intro
The issues in general
Motivation
Solution
Experiments
Tools
eof()

Machine Learning-based Malicious Adversaries Detection in an Enterprise Environment by Using Open Source Tools

Muhammad Najmi Ahmad Zabidi
International Islamic University Malaysia

MOSC 2012
Berjaya Times Square, Kuala Lumpur

9th July 2012



About

• I am a research graduate student at Universiti Teknologi Malaysia, Skudai, Johor Bahru, Malaysia

• My current employer is International Islamic University Malaysia, Kuala Lumpur

• Research area: malware detection, narrowing to Windows executables

• Since 2003 I have been a Subversion (SVN) committer for the KDE localization project for the Malay language (but I rarely commit now.. need a new intern to replace me :) )





Computing world as we knew it

• Interconnected machines

• Previously less connected, now "socialized" machines

• Brought real problems to the cyberworld





Risks

• Financial loss

• Company/government level espionage

• Privacy breach





Types of adversaries

• Spam

• Scam

• Phishing

• Malware, botnets, rootkits, etc.

• Anything else?





Spam

• Annoying

• Productivity wasted on unnecessary file deletion

• In extreme cases, important email becomes difficult to find





Scam

• Preying on naive victims

• Sounds too good to be true, but some people still believe it

• Organized crime/syndicates, with mules cooperating





Phishing

• Similar to a scam, but with a different tactic

• More sophisticated, but does not need a mule or a physical meetup

• Main purpose is to obtain important details (online banking login name, password) and hence access to the victim's account

• Safer for the criminal





Malware

• Safe to say it covers trojans, viruses, dialers, rabbits, worms and rootkits (bundled nowadays)

• Has been infecting computers since the 1980s; the threat became more obvious once the Internet arrived

• Attacks any operating system: Linux, Windows, Mac... even Android phones





Problems with adversary detection

• Some are manually crafted, some automated

• They react relatively fast and are difficult to trace

• Too many (for example, spam), hence too time-consuming for manual work





In-house analysis

• Given enough expertise, in-house analysis could be useful

• Maintains reputation by having your own group of analysts to handle incidents

• Try to minimize costs; use open source tools whenever possible





Categories

Machine Learning

• Associated with Artificial Intelligence

• Mimics human (brain) learning

• Learns through experience

• Deals with known and unknown patterns

• Overlaps with (or somehow originated from) Data Mining and Pattern Recognition





Categories

Table 1: Differences between clustering and classification

Classification                     Clustering
Deals with known data              Deals with unknown data
Supervised learning                Unsupervised learning
Popular algorithms include:        Popular algorithms include:
• Random Forest                    • K-means
• Neural Networks                  • Fuzzy C
• k-Nearest Neighbor               • Gaussian
• Decision Trees
Predictive [Tan et al., 2005]      Descriptive [Tan et al., 2005]
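To make the contrast in Table 1 more concrete, here is a minimal sketch in plain Python, not a production implementation. It pairs a k-Nearest Neighbor classifier (supervised: it needs labels) with K-means (unsupervised: it only sees points). The 2-D points, the "ham"/"spam" labels and the function names are made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classification (supervised): majority vote among the k nearest labeled points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def kmeans(points, centroids, iterations=10):
    """Clustering (unsupervised): group unlabeled points around moving centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            # Assign each point to its closest centroid
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Move each centroid to the mean of its cluster (unchanged if the cluster is empty)
        centroids = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Made-up 2-D feature vectors, purely for illustration
labeled = [
    ((1.0, 1.0), "ham"), ((1.2, 0.8), "ham"), ((0.9, 1.1), "ham"),
    ((5.0, 5.0), "spam"), ((5.2, 4.9), "spam"), ((4.8, 5.1), "spam"),
]
points = [p for p, _ in labeled]  # clustering never looks at the labels

print(knn_predict(labeled, (1.1, 1.0)))
centroids, clusters = kmeans(points, [points[0], points[3]])
print(clusters)
```

The classifier answers "which known class is this?", while the clustering step only reports which points belong together; naming the groups is left to the analyst.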



Categories

What to look for?

• We look for patterns

• In some cases, have a corpus of spam and phishing mails ready

• We call these patterns "features"
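One common way to turn mail text into "features" is a bag-of-words count vector. The sketch below assumes that representation; the slides do not prescribe it, and the helper names and two-message corpus are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_vocabulary(corpus):
    """Collect every distinct token seen in the corpus, in sorted order."""
    return sorted({tok for text in corpus for tok in tokenize(text)})


def to_feature_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Tiny made-up corpus, for illustration only
corpus = ["You have won a prize", "Meeting at noon"]
vocabulary = build_vocabulary(corpus)
print(vocabulary)
print(to_feature_vector("won won prize", vocabulary))
```

Each position in the resulting list is one feature; a learning algorithm then works on these numbers rather than on the raw text.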





Categories

Spam/scam

• The language being used

• Perhaps words like "You have won GBP100,000,000" in email notifications

• Spam bombards inboxes; some may come from genuine businesses, but the volume is too much to handle

• Scams ask people to bank in money under false pretenses
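As a rough illustration of learning from the language used, here is a small multinomial Naive Bayes sketch with add-one smoothing. Naive Bayes is swapped in here as a well-known text classifier; it is not named in the slides. The training messages are invented, and a real experiment would use one of the public corpora.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of messages
        self.vocabulary = set()

    def train(self, text, label):
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in self.doc_counts:
            # Start from the log prior of the class
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                # Add-one (Laplace) smoothing so unseen words do not zero the score
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            scores[label] = score
        return max(scores, key=scores.get)


# Invented training messages, for illustration only
model = NaiveBayes()
model.train("You have won GBP100,000,000 claim your prize now", "spam")
model.train("Congratulations you have won the lottery", "spam")
model.train("Meeting agenda for Monday attached", "ham")
model.train("Please review the project report", "ham")

print(model.predict("You have won a prize"))
print(model.predict("Project meeting on Monday"))
```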





Categories

Phishing mails

• Look at the URLs

• Current efforts, for example by PhishTank, rely on public submission and (I believe) manual verification
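A few URL properties can be extracted automatically and fed to a classifier. The particular heuristics below (IP address as host, "@" in the URL, number of dots, HTTPS) are assumptions chosen for illustration, not a list taken from the slides or from PhishTank, and the example URLs are made up.

```python
import re
from urllib.parse import urlparse


def url_features(url):
    """Extract a few simple, hand-picked features from a URL."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    return {
        "length": len(url),
        # A raw IPv4 address instead of a domain name
        "has_ip_host": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        # Many dots can indicate deeply nested subdomains
        "dots_in_host": host.count("."),
        # "user@host" syntax can hide the real destination
        "has_at_sign": "@" in url,
        "uses_https": parsed.scheme == "https",
    }


# Made-up URLs, for illustration only
for example in (
    "https://www.example.com/",
    "http://192.168.10.5/bank/login.php",
    "http://bank.example.com@evil.example.net/login",
):
    print(example, url_features(example))
```

None of these features is decisive on its own; the point is only that they can be computed without a human reading each mail.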





Categories

Malware

• Researchers tend to look at the Application Programming Interface (API) calls, and some at the opcodes

• Analysis is done using either static or dynamic analysis
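One way API calls can become features is to count short runs (n-grams) of consecutive calls in a trace. This is a sketch under that assumption; the trace below is invented, and in practice it would come from a static or dynamic analysis tool.

```python
from collections import Counter


def ngrams(sequence, n=2):
    """Return every run of n consecutive items in the sequence."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def ngram_features(api_calls, n=2):
    """Count how often each n-gram of API calls occurs in a trace."""
    return Counter(ngrams(api_calls, n))


# Invented trace of Windows API calls, for illustration only
trace = ["CreateFileW", "WriteFile", "CloseHandle", "CreateFileW", "WriteFile"]
features = ngram_features(trace, n=2)
for gram, count in features.items():
    print(gram, count)
```

The same function works on opcode sequences, since it only assumes an ordered list of tokens.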





Categories

An example

Figure 1: Automated classification proposed by [Rieck et al., 2009]





The datasets

• Spam email research has been around for quite some time compared to the others (phishing)

• Sample datasets:
  • http://csmining.org/index.php/spam-email-datasets-.html
  • http://archive.ics.uci.edu/ml/datasets/Spambase

• Scam email is closely associated with spam, since it is unwanted email. It might as well be categorized as "sub-spam"

• Phishing email samples:
  • Sample dataset:
    • http://phishtank.com






Feature Selection/Extraction

• When analyzing, we're interested in features

• What kind of features?
  • Important keywords, strong features
  • Unimportant features will be phased out as unnecessary
  • Some features might be redundant





• There are algorithms meant for this:
  • Information Gain
  • Support Vector Machine (SVM)
  • Others... some may be hybrid algorithms (combining several algorithms together), also known as ensembles
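Information Gain can be written as the class entropy minus the entropy that remains once the data is split on a feature. The following is a minimal sketch of that textbook definition; the four labels and the two binary features are made up.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        subset = [lab for fv, lab in zip(feature_values, labels) if fv == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Made-up data: does a message contain a given word (1) or not (0)?
labels = ["spam", "spam", "ham", "ham"]
strong_feature = [1, 1, 0, 0]  # separates the classes perfectly
weak_feature = [1, 0, 1, 0]    # tells us nothing about the class

print(information_gain(strong_feature, labels))
print(information_gain(weak_feature, labels))
```

Ranking features by this score and dropping the lowest ones is one simple way to phase out unimportant features.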




List of tools

• Weka

• R language

• Octave (as replacement for Matlab)

• Python SciPy with Matplotlib






Figure 2: Weka






Weka

• Results are presented as numbers and visualizations

• Need to do some reading on how to interpret them

• Test with different algorithms to get the best results
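Among the numbers a classifier evaluation typically reports are accuracy, precision and recall, all derived from the confusion matrix. The sketch below shows how they relate, using hand-written label lists; it is not Weka's own code, and the function name is hypothetical.

```python
def evaluate(true_labels, predicted_labels, positive="spam"):
    """Compute confusion-matrix counts and three common metrics for one positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        # Share of all messages classified correctly
        "accuracy": (tp + tn) / len(pairs),
        # Of the messages flagged as spam, how many really were spam
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of the real spam messages, how many were caught
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hand-written labels, for illustration only
truth = ["spam", "spam", "ham", "ham", "spam"]
guess = ["spam", "ham", "ham", "spam", "spam"]
report = evaluate(truth, guess)
print(report)
```

Comparing these figures across algorithms on the same data is what "test with different algorithms" amounts to in practice.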






R language

• Not merely a tool, but a language in itself

• Commonly used by data analysts






Figure 3: These books use the R language for their analyses






Octave

• Octave is an open source alternative to Matlab (MATrix LABoratory)

• Works much like Matlab does






Figure 4: Octave also has a GUI, QtOctave (discontinued)






Python Scipy

#!/usr/bin/env python
"""
Example: simple line plot.
Show how to make and save a simple line plot with labels, title and grid.
"""
import numpy
import pylab

# Time axis from 0 to 1 second in 10 ms steps
t = numpy.arange(0.0, 1.0 + 0.01, 0.01)
# Cosine wave with a frequency of 2 Hz
s = numpy.cos(2 * 2 * numpy.pi * t)
pylab.plot(t, s)

pylab.xlabel('time (s)')
pylab.ylabel('voltage (mV)')
pylab.title('About as simple as it gets, folks')
pylab.grid(True)
# Write the figure to disk before opening the interactive window
pylab.savefig('simple_plot')

pylab.show()




The flow

[Flowchart stages: Feature Selection; Feature Categorization (Clustering, Classification); Visualization. Tools labeled on the chart: Weka, Octave, R; SciPy, Octave, R.]




Conclusion

• Dealing with malicious/unwanted threats from spam, scams, phishing and malware is not easy

• Perhaps one sample could be analyzed by hand, but handling thousands per day is tedious

• Machine learning assists in automation

• Open source provides an alternative (free as in minimal cost) for the analysis

• In-house analysis helps protect the reputation of an organization/enterprise




Get in touch!

najmi.zabidi @ gmail.com
http://mypacketstream.blogspot.com

These slides were created with LaTeX Beamer





Bibliography

Rieck, K., Trinius, P., Willems, C., and Holz, T. (2009). Automatic analysis of malware behavior using machine learning. TU, Professoren der Fak. IV.

Tan, P.-N., Steinbach, M., and Kumar, V. (2005). Introduction to Data Mining (First Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.

