EFFECTS OF FEATURE TRANSFORMATION AND SELECTION ON CLASSIFICATION OF
NETWORK TRAFFIC ACTIVITIES
LIM WEN YING
FACULTY OF COMPUTING AND INFORMATICS
UNIVERSITI MALAYSIA SABAH
2015
EFFECTS OF FEATURE TRANSFORMATION
AND SELECTION ON CLASSIFICATION OF
NETWORK TRAFFIC ACTIVITIES
LIM WEN YING
THESIS SUBMITTED IN PARTIAL FULFILMENT FOR THE BACHELOR OF COMPUTER SCIENCE
(NETWORK ENGINEERING)
FACULTY OF COMPUTING AND INFORMATICS
UNIVERSITI MALAYSIA SABAH
2015
DECLARATION
I hereby declare that this thesis, submitted to Universiti Malaysia Sabah as
partial fulfilment of the requirements for the degree of Bachelor of Computer
Science (Network Engineering), has not been submitted to any other
university for any degree. I also certify that the work described herein is
entirely my own, except for quotations and summaries, the sources of which have
been duly acknowledged.
This thesis may be made available within the university library and may
be photocopied or loaned to other libraries for the purposes of consultation.
22 JUNE 2015 …………………………………
LIM WEN YING
BK 1111 0156
CERTIFIED BY
_________________________________
Dr. Mohd Hanafi Ahmad Hijazi
SUPERVISOR
ACKNOWLEDGEMENT
First and foremost, I am grateful to God for the good health and well-being
that were necessary to complete this research paper. I must thank my parents and
family for their understanding and support. They have always given their kindness,
patience and tolerance when I had rough times.
I wish to express my utmost appreciation and deepest gratitude to my
supervisor, Dr Mohd Hanafi Ahmad Hijazi. He constantly provided me with
constructive comments for improvement on this project. Week after week,
consultation after consultation, he continuously enlightened me when I was in
doubt and when there were areas where I lacked knowledge. In addition, he always
gave words of wisdom that encouraged me to continue working on this project.
Without his continuous dedication in guiding me, I would never have completed this
research paper.
Last but not least, I am thankful to Associate Professor Dr Rayner Alfred, who
provided advice on improving the quality of this research paper. I also place on
record my gratitude to one and all who, directly or indirectly, have lent their hand
in this venture.
LIM WEN YING
22 JUNE 2015
ABSTRACT
As new technologies emerge day by day, networks, whether the Internet or a corporate intranet, play a crucial role in connecting people from all around the world. From military use to business goals and household needs, data security draws constant attention from computer scientists. Traditional security measures, such as the installation of firewalls and antivirus software, are commonly used to prevent intrusion. However, such defences alone are insufficient to secure a network and the data travelling across it. Thus, second lines of defence such as the Intrusion Detection System (IDS) and Intrusion Prevention System (IPS) were introduced to overcome the inadequacy of traditional security measures. Generally, an IDS uses two approaches, Anomaly Detection (A-IDS) and Misuse Detection, to identify patterns of intrusion. A-IDS compares observed behaviour against models of normal and anomalous activity; depending on its ability to measure the similarity or distance between a target and a known type, the comparison determines whether a new target is anomalous. This research aims to investigate the effects of feature transformation on the classification of network activities; the focus is to represent the data in point series form to permit the application of Time Series Classification (TSC). The TSC technique used is k-Nearest Neighbour (KNN) coupled with Dynamic Time Warping (DTW). The effects of using different similarity measures, Euclidean Distance (ED) and Cosine similarity, are also investigated. The experiments conducted involve conversion of the categorical data by three different techniques to generate point series data: simple, probability and entropy conversion. A comparison between different classifiers is also conducted. Classifier performance is best using 1NN with Euclidean distance and entropy conversion for categorical data, with a recorded accuracy of 99.19%.
ABSTRAK
Pembaharuan teknologi berlaku setiap hari; rangkaian, tidak kira Internet mahupun Intranet dalam sebuah korporasi, sering memainkan peranan penting dalam menghubungkan orang ramai dari seluruh dunia. Daripada penggunaan oleh pihak tentera kepada pencapaian matlamat perniagaan dan keperluan isi rumah, keselamatan data yang mengalir di seluruh rangkaian sering mendapat perhatian daripada ahli-ahli sains komputer. Langkah keselamatan tradisional termasuk pemasangan “firewall” dan perisian antivirus biasanya digunakan untuk mencegah pencerobohan. Walau bagaimanapun, jenis pertahanan tersebut semata-mata tidak mencukupi untuk memastikan keselamatan rangkaian dan data yang merentasinya. Oleh itu, pertahanan peringkat kedua seperti “Intrusion Detection System (IDS)” dan “Intrusion Prevention System (IPS)” diperkenalkan untuk mengatasi kekurangan langkah-langkah keselamatan tradisional. Secara umumnya, IDS menggunakan dua pendekatan, Pengesanan Anomali (A-IDS) dan Pengesanan Penyalahgunaan, untuk mengenal pasti corak pencerobohan. A-IDS mengenal pasti pencerobohan dengan membuat perbandingan sasaran dengan model biasa. Bergantung kepada keupayaan untuk mengukur persamaan atau jarak antara sasaran dan jenis yang dikenali, perbandingan dibuat untuk menentukan sama ada sasaran baru itu anomali atau tidak. Kajian ini bertujuan untuk menyiasat kesan perubahan ciri terhadap klasifikasi aktiviti rangkaian; tumpuan adalah untuk mewakili data sebagai siri titik bagi membolehkan aplikasi “Time Series Classification” (TSC). Teknik TSC yang digunakan adalah “k-Nearest Neighbour” (KNN) berserta dengan “Dynamic Time Warping” (DTW). Kesan penggunaan pengukuran persamaan yang berbeza, algoritma “Euclidean Distance” (ED) dan “Cosine similarity”, juga disiasat.
Eksperimen yang dijalankan melibatkan penukaran data berkategori dengan menggunakan tiga teknik penukaran yang berbeza untuk menghasilkan data siri titik - mudah, kebarangkalian dan entropy. Perbandingan antara klasifikasi berbeza juga dijalankan. Prestasi klasifikasi itu adalah yang terbaik apabila menggunakan 1NN dengan pengukuran jarak Euclidean dan penukaran entropy untuk data berkategori, di mana ketepatan yang direkodkan adalah 99.16%.
TABLE OF CONTENTS
DECLARATION ii
ACKNOWLEDGEMENT iii
ABSTRACT iv
ABSTRAK v
TABLE OF CONTENTS vi
LIST OF TABLES ix
LIST OF FIGURES xi
CHAPTER 1 1
INTRODUCTION 1
1.1 Chapter Overview 1
1.2 Problem Background 1
1.3 Problem Statement 4
1.4 Objective 4
1.5 Research Scope 5
1.5.1 Dataset 5
1.5.2 Time Series Classification (TSC) using K-Nearest Neighbour Algorithm with
Dynamic Time Warping (DTW) as similarity measure 7
1.6 Research Methodology 8
1.7 Organisation of Report 9
CHAPTER 2 11
LITERATURE REVIEW 11
2.1 Chapter Overview 11
2.2 Intrusion Detection System (IDS) 11
2.2.1 Introduction of IDS 11
2.2.2 Anomaly-based Intrusion Detection System (IDS) 13
2.2.3 Challenges of Current IDS 14
2.3 Data Pre-processing 14
2.3.1 Conversion of symbolic features 14
2.3.2 Feature Selection 16
2.4 Time Series Analysis (TSA) 20
2.4.1 Time Series Classification (TSC) 20
2.4.2 Distance Similarity Measure 21
2.5 Classification Techniques 26
2.5.1 Classification of Data 26
2.5.2 k-Nearest Neighbour (k-NN) 27
2.5.3 Review of Network Traffic Classification 28
2.6 Summary 29
CHAPTER 3 31
METHODOLOGY 31
3.1 Chapter Overview 31
3.2 The Research Program of Work 31
3.3 Experimental Setting 37
3.4 Experiment Requirement 37
3.4.1 Hardware Requirement 37
3.4.2 Software Requirement 37
3.5 Performance Measure for Classification 38
3.6 Summary 39
CHAPTER 4 40
IMPLEMENTATION OF THE PROPOSED APPROACH 40
4.1 Chapter Overview 40
4.2 Data Pre-processing 40
4.2.1 Conversion of data 40
4.2.2 Data Normalisation 52
4.2.3 Feature Selection 54
4.3 Experimental Setting 61
4.3.1 Experiment I: No Categorical Data 62
4.3.2 Experiment II: Simple Conversion 62
4.3.3 Experiment III: Probability and Entropy Conversion 62
4.3.4 Experiment IV: Feature Selection using Information Gain and Correlation
Feature Selection 63
4.4 Summary 63
CHAPTER 5 64
RESULT AND ANALYSIS 64
5.1 Chapter Overview 64
5.2 Experiment I: No Categorical Data 64
5.3 Experiment II: Simple Conversion 67
5.4 Experiment III: Probability and Entropy Conversion 69
5.5 Experiment IV: Feature Selection using Information Gain and Correlation
Feature Selection 70
5.6 Comparison of Performance of Network Traffic Classifier with other
Machine Learning Approach 73
5.7 Chapter Summary 74
Chapter 6 75
CONCLUSION 75
6.1 Chapter Overview 75
6.2 Summary of Research Paper 75
6.3 Future Works 77
LIST OF TABLES
Table 1.1 Name of Features for NSL-KDD Data Set 7
Table 2.1 Summary of Reviewed Papers and Data Pre-processing Method on KDD
Cup 99 18
Table 2.2 Instances with Known Label 26
Table 2.3 Results for Various Algorithms 29
Table 2.4 Result for Application of SOM-ANN Algorithms 29
Table 3.1 Possible Outcomes 38
Table 4.1 Features Name and Type 41
Table 4.2 Alphabetically Simple Conversion of "protocol_type" 42
Table 4.3 Alphabetically Simple Conversion of "service" 43
Table 4.4 Alphabetically Simple Conversion of "flag" 43
Table 4.5 Statistic and Value for “protocol_type” After Conversion 45
Table 4.6 Statistic and Value for “service” After Conversion 46
Table 4.7 Statistic and Value for “flag” After Conversion 48
Table 4.8 Entropy of “protocol_type” Data and Corresponding Converted Value 50
Table 4.9 Entropy of “service” Data and Corresponding Converted Value 50
Table 4.10 Entropy of “flag” Data and Corresponding Converted Value 52
Table 4.11 Features with Minimum and Maximum Value for Simple Conversion 53
Table 4.12 Output of Information Gain Feature Selection 55
Table 4.13 Features Removed Corresponding to Feature Percentage 57
Table 4.14 Selected Features and Their Respective Columns for Each Data Conversion
Techniques 61
Table 5.1 Result of K-NN with ED on Dataset with No Categorical Features 65
Table 5.2 Result of K-NN with Cosine on Dataset with No Categorical Features 66
Table 5.3 Result of K-NN with DTW on Dataset with No Categorical Features 66
Table 5.4 Result of K-NN with ED on Dataset with Simple Conversion on Categorical
Features 67
Table 5.5 Result of K-NN with Cosine on Dataset with Simple Conversion on
Categorical Features 68
Table 5.6 Result of K-NN with DTW on Dataset with Simple Conversion on Categorical
Features 68
Table 5.7 Result of KNN-ED on Dataset with Probability Conversion on Categorical
Features 69
Table 5.8 Result of KNN-ED on Dataset with Entropy Conversion on Categorical
Features 69
Table 5.9 Result of KNN-Cosine on Dataset with Probability Conversion on Categorical
Features 69
Table 5.10 Result of KNN-Cosine on Dataset with Entropy Conversion on Categorical
Features 70
Table 5.11 Result of KNN-ED on Dataset with Reduced Features using Information
Gain =70% Feature Selection and Entropy Conversion on Categorical
Features 71
Table 5.12 Result of KNN-ED on Dataset with Reduced Features using Information
Gain =60% Feature Selection and Entropy Conversion on Categorical
Features 71
Table 5.13 Result of KNN-ED on Dataset with Reduced Features using Information
Gain =50% Feature Selection and Entropy Conversion on Categorical
Features 71
Table 5.14 Result of KNN-ED on Dataset with Reduced Features using Information
Gain =40% Feature Selection and Entropy Conversion on Categorical
Features 72
Table 5.15 Result of KNN-ED on Dataset with Reduced Features using Information
Gain =30% Feature Selection and Entropy Conversion on Categorical
Features 72
Table 5.16 Result of KNN-ED on Dataset with Reduced Features using Correlation
Feature Selection and Entropy Conversion on Categorical Features 72
Table 5.17 Results for Various Algorithms 73
Table 5.18 Results for Application of SOM-ANN Algorithms 73
Table 5.19 Comparison of the Performance of Proposed Method and Other Machine
Learning 74
Table 6.1 Work Done to Achieve the Objectives 77
LIST OF FIGURES
Figure 1.1 Snapshot of NSL-KDD Original Dataset ............................................... 6
Figure 2.1 Stages in anomaly-based Intrusion Detection System ........................ 13
Figure 2.2 Matrix Representation of Two Sequence A and B ............................... 23
Figure 2.3 Algorithm to Perform DTW .............................................................. 24
Figure 2.4 The k-nearest neighbour classification algorithm ............................... 28
Figure 3.1 Overall Framework used in this research ........................................... 31
Figure 3.2 Phase I of the research ................................................................... 32
Figure 3.3 Sub-phase I of the research ............................................................. 33
Figure 3.4 Sub-phase II of the research ........................................................... 34
Figure 3.5 Sub-phase III of the research .......................................................... 35
Figure 3.6 Phase II of the research .................................................................. 36
Figure 4.1 Point Series Data with Simple Conversion ......................................... 44
Figure 4.2 Point Series Data with Probability Conversion .................................... 49
Figure 4.3 Snapshot on WEKA - Information Gain Feature Selection ................... 55
Figure 4.4 Script Written to Remove Features’ Column from Dataset ................... 58
Figure 4.5 Snapshot on WEKA - Correlation Feature Selection ............................ 59
Figure 4.6 Features Selected using Correlation Feature Selection ........................ 60
CHAPTER 1
INTRODUCTION
1.1 Chapter Overview
This chapter presents a brief background and introduction to aid readers in better
understanding this research paper. Section 1.2 presents the problem background.
Sections 1.3 and 1.4 describe the problem statement and objectives. Section 1.5
presents the research scope. The methodology used in this research paper is
briefly outlined in Section 1.6, whereas the organization of this report is described
in Section 1.7.
1.2 Problem Background
In this 21st century, dominated by social networking, the Internet has surged to
reveal itself as one of the most promising technologies affecting humans in
numerous ways; it has become increasingly critical to daily life. Private and
confidential data propagated through the network are exposed and made
vulnerable to attacks. Recent attacks, such as the cyber-attack on a U.S. public
utility and its control system network and the leakage of private celebrity photos
believed to have been obtained from the Apple iCloud backup service (“Apple
confirms accounts compromised but denies security breach,” 2014), again give
prominence to the importance of network security.
Traditional network traffic monitoring observes regular network performance and
recognizes an application’s identity by assuming that most applications constantly
use ‘well known’ or common TCP/UDP port numbers (visible in the TCP or UDP
headers). While this convention worked in the early days of the Internet, it is no
longer sufficient today. Port-based estimates are now significantly less reliable:
unpredictable (or at least obscure) port numbers are increasingly being used by
various applications, and with the continuous emergence of new protocols, it has
become increasingly difficult to determine the composition of network traffic. For
this reason, researchers have proposed new methods to identify the sophisticated
traffic generated by various newly emerging network-based applications.
An Intrusion Detection System (IDS) is a network-monitoring system that is
passive in nature. It is configured mainly to monitor, identify and raise alerts for
attacks or compromises on the network. Unlike an Intrusion Prevention System
(IPS), it does not take any direct action against the potential breach. Signature-based
and anomaly detection are the two general approaches to computer IDS. An IDS that is
signature-based (also known as knowledge-based) uses a pre-defined set of rules to
identify intrusion. By comparing current traffic against the patterns of known and
documented attacks, a signature-based IDS declares an attack when there is a match
to a signature in its attack database. Signature-based detection is currently the most
widely used type of IDS (Chowdhary et al., 2014). However, systems employing
signature-based detection cannot detect an intrusion when the signature of the attack
is not recorded in the database. Furthermore, these systems are incapable of
integrating information that comes from heterogeneous sources, even though such
sources can provide informative details on the on-going network activities of the
system (More et al., 2012). In anomaly detection, the IDS captures the network
traffic activity and, based on it, creates a profile representing its stochastic behaviour.
During the anomaly detection process, two data sets of network activity are involved:
the real-time profile recorded over time and the previously trained profile. The IDS
estimates whether the behaviour of the network traffic activity is normal or abnormal
and triggers anomaly alarms whenever a predefined threshold of abnormality is
exceeded (García-Teodoro et al., 2009). In general, two phases, the learning phase
and the detection phase, make up the algorithm performed within an Anomaly-based
Intrusion Detection System (A-IDS). In the training phase, the detector learns the
normal behaviour of a network system by recording data representing normal or
“non-malicious” system activity. Meanwhile, in the detection phase, the detector
compares the input data to its learnt model of nominal behaviour and reports any
deviations as anomalies or attacks. García-Teodoro et al. (2009) highlighted some
of the most significant challenges and issues in anomaly-based intrusion detection:
(i) Low detection efficiency
This aspect is generally explained as arising from the lack of good studies on the
nature of the intrusion events. The problem calls for the exploration and
development of new, accurate processing schemes, as well as better structured
approaches to modelling network systems.
(ii) Low throughput and high cost
These arise mainly from the higher data rates (Gbps) that characterize current
wideband transmission technologies. Some proposals intended to optimize
intrusion detection draw on grid techniques and distributed detection
paradigms.
As mentioned earlier, A-IDS compares observed behaviour against models of
normal and anomalous activity. Depending on the ability to measure similarity or
distance between a target and a known type, the comparison determines whether
a new target is anomalous. Thus the distance or similarity measure employed
greatly affects the effectiveness of an A-IDS.
Data pre-processing is required in all knowledge discovery tasks, including
network-based intrusion detection, which attempts to classify network traffic as
normal or anomalous. Pre-processing converts network traffic into a series of
observations, where each observation is represented as a feature vector.
Observations are optionally labelled with their class, such as “normal” or “anomalous”.
These feature vectors are then suitable as input to data mining or machine learning
algorithms (Davis and Clark, 2011). Feature construction aims to create additional
features with a better discriminative ability than the initial feature set. This can bring
significant improvement to machine learning algorithms. A well-defined feature
extraction algorithm makes the classification process more effective and efficient
(Datti and Verma, 2010). To decrease the time needed for an IDS to detect an
intrusion, the data dimensionality of the network traffic needs to be reduced and
insignificant features removed or omitted, subsequently improving the
performance of the IDS. The goal of feature extraction is to shrink a relatively
huge data dimension to a smaller size and to increase classifier accuracy by
preserving the features that have the most significance for the class label and
omitting features that contribute less.
1.3 Problem Statement
From the previous section, the main question of this research paper is: “How do
feature transformation and selection affect the performance of the classifier?”
This question gives rise to two sub-questions:
i. How can network traffic data that contains numerical and categorical
features be represented in point series form?
ii. How do different similarity measures affect the performance of the classifier?
1.4 Objective
Four objectives have been identified to answer the questions identified in the
foregoing sub-section, which are:
a) To investigate and identify feature transformation technique that can
generate point series data for network activities classification.
b) To investigate the feasibility of Time Series Classification techniques by using
k-NN coupled with DTW to classify network traffic activities.
c) To investigate the effects of using different similarity measures,
Euclidean Distance (ED) and Cosine similarity.
d) To compare the performance of the network traffic classifier produced in (b)
and (c) with other machine learning techniques: the Self-Organizing Map (SOM)
Artificial Neural Network (ANN) by Ibrahim, Basheer, and Mahmod (2013)
and the Discriminative Multinomial Naive Bayes (NB) proposed by Panda,
Abraham, and Patra (2010).
1.5 Research Scope
The scope of this research consists of examining the feasibility of representing
network traffic data in point series form so that it can be classified using Time
Series Classification (TSC). The conversion of categorical data using three
different approaches, namely simple conversion, probability conversion and
entropy conversion, is also explored in this research paper. Two feature selection
approaches, Information Gain (IG) Feature Selection and Correlation Feature
Selection (CFS), are also used to reduce the dimension of the dataset.
1.5.1 Dataset
This research paper uses a set of secondary data acquired from the Internet. The
chosen dataset is the NSL-KDD dataset, an improved version of the KDD’99 data
set. Figure 1.1 illustrates a snapshot of the NSL-KDD original data set; features
with different types and values are also shown in the figure. Note that the data
shown in Figure 1.1 is the original dataset, which has not been pre-processed for
the experiment. Data pre-processing of the selected dataset will be further
discussed in Chapter 4, which focuses on the experimental settings.
NSL-KDD is a data set suggested to solve some of the inherent problems of
the KDD'99 data set. The NSL-KDD data set has the following advantages over the
original KDD data set (Tavallaee, Bagheri, Lu, and Ghorbani, 2009):
i. It does not include redundant records in the train set, so the classifiers will
not be biased towards more frequent records.
ii. There are no duplicate records in the proposed test sets; therefore, the
performance of the learners is not biased by the methods which have better
detection rates on the frequent records.
iii. The number of selected records from each difficulty level group is inversely
proportional to the percentage of records in the original KDD data set. As a
result, the classification rates of distinct machine learning methods vary in a
wider range, which makes it more efficient to have an accurate evaluation of
different learning techniques.
iv. The number of records in the train and test sets is reasonable, which makes
it affordable to run the experiments on the complete set without the need to
randomly select a small portion. Consequently, evaluation results of different
research works will be consistent and comparable.
Figure 1.1 Snapshot of NSL-KDD Original Dataset
Table 1.1 Name of Features for NSL-KDD Data Set
1 duration 22 is_guest_login
2 protocol_type 23 count
3 service 24 srv_count
4 flag 25 serror_rate
5 src_bytes 26 srv_serror_rate
6 dst_bytes 27 rerror_rate
7 land 28 srv_rerror_rate
8 wrong_fragment 29 same_srv_rate
9 urgent 30 diff_srv_rate
10 hot 31 srv_diff_host_rate
11 num_failed_logins 32 dst_host_count
12 logged_in 33 dst_host_srv_count
13 num_compromised 34 dst_host_same_srv_rate
14 root_shell 35 dst_host_diff_srv_rate
15 su_attempted 36 dst_host_same_src_port_rate
16 num_root 37 dst_host_srv_diff_host_rate
17 num_file_creations 38 dst_host_serror_rate
18 num_shells 39 dst_host_srv_serror_rate
19 num_access_files 40 dst_host_rerror_rate
20 num_outbound_cmds 41 dst_host_srv_rerror_rate
21 is_host_login
Table 1.1 contains a more detailed list of the features for the NSL-KDD data. There
are a total of 41 features for each data entry.
1.5.2 Time Series Classification (TSC) using K-Nearest Neighbour
Algorithm with Dynamic Time Warping (DTW) as similarity measure
The Time Series Classification technique that will be used in this research paper is
the Dynamic Time Warping (DTW) technique incorporated in the K-Nearest
Neighbour Algorithm (k-NN).
To perform classification, the k-NN algorithm takes an unlabelled observation and
compares it to a population of labelled observations to obtain a class label. The
unlabelled observation, x, is classified by a majority vote of its neighbours, with x
assigned to the class most common amongst its k nearest neighbours, as
measured by a similarity or distance measure. In this research paper, the DTW
algorithm is used to compute the similarity between two sequences, and the test
data are then classified and labelled using the k-NN algorithm. Based on the
related work reviewed, DTW is believed to have better accuracy compared to
other distance metrics such as Euclidean Distance. However, to the best of my
knowledge, no one has implemented KNN-DTW in the context of network traffic,
and hence of IDS. In this research paper, one of the challenges highlighted by
García-Teodoro et al. (2009) in Anomaly-based Intrusion Detection Systems,
namely low detection efficiency, is hoped to be tackled by implementing
KNN-DTW in the context of network traffic activities.
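The classifier described above can be sketched in a few lines of Python. This is an illustrative sketch, not the thesis implementation: the function names are my own, and the use of the absolute difference as the local cost in DTW is an assumption.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two point series (1-D sequences)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # local cost (assumed: absolute difference)
            # Extend the cheapest of the three admissible warping steps.
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def knn_classify(x, train_series, train_labels, k=1):
    """Label x by majority vote of its k nearest training series under DTW."""
    dists = [dtw_distance(x, s) for s in train_series]
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

With k=1 this reduces to the 1NN classifier that later gives the best reported accuracy; the quadratic cost of the DTW table is one reason the low-throughput challenge above matters.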
1.6 Research Methodology
This section briefly discusses the research methodology used in this research
paper. A more detailed explanation is provided in Chapter 4, Implementation of
the Proposed Approach. The experiments are divided into four stages to achieve
the objectives stated in Section 1.4.
The first stage of experiments is the extraction of the numerically represented
features into point series format. In this first experiment, the categorical data are
left out. Data pre-processing is performed by normalising the data using the
min-max normalisation method. The dataset is then prepared in ten sets for
ten-fold cross validation using the Time Series Classification (TSC) K-Nearest
Neighbour classifier with three different similarity measures: Euclidean Distance,
Cosine Similarity and Dynamic Time Warping (DTW).
The second experiment involves the conversion of categorical data using a simple
conversion technique, that is, establishing a correspondence between each
category and an integer.
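A minimal sketch of this simple conversion follows. It assigns codes in alphabetical order, matching the "alphabetically simple conversion" tables in Chapter 4, though starting the codes at 1 is an assumption on my part.

```python
def simple_conversion(values):
    """Map each distinct category (sorted alphabetically) to an integer code."""
    mapping = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping
```

For example, the NSL-KDD "protocol_type" values icmp, tcp and udp would receive codes 1, 2 and 3 respectively.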
The third stage of experiments performs TSC on datasets that have undergone
two further approaches to categorical data conversion, namely probability
conversion and entropy conversion.
Feature selection techniques, Information Gain (IG) and Correlation Feature
Selection (CFS), are implemented in the last stage of the experiments to reduce
the dimensionality of the dataset.
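Information gain ranks each feature by how much knowing its value reduces the uncertainty of the class label, IG(class, feature) = H(class) − H(class | feature); this is the standard definition behind WEKA's information gain evaluator used in Chapter 4. A sketch for discrete features (continuous features would first need discretisation, which is not shown here):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(class, feature) = H(class) - H(class | feature) for a discrete feature."""
    n = len(labels)
    conditional = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        conditional += (len(subset) / n) * entropy(subset)
    return entropy(labels) - conditional
```

Features are then ranked by their gain and the lowest-scoring fraction removed, as in the percentage thresholds used in Experiment IV.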
After all stages of the experiments are carried out, the results produced are
compiled and further discussed in Chapter 5, Result and Analysis. Comparisons of
performance in terms of accuracy, sensitivity and specificity (where applicable)
are made between the different similarity measures and also with the other
machine learning approaches stated in Chapter 2, Literature Review.
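These three measures follow directly from the confusion-matrix counts of Table 3.1 (true/false positives and negatives); the function name below is illustrative.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    return accuracy, sensitivity, specificity
```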
1.7 Organization of Report
The remainder of this paper is organized as follows. Chapter 2, Literature Review,
discusses Intrusion Detection Systems (IDS), Time Series Classification (TSC) and
the classification of network traffic data.
Chapter 3, Methodology, discusses the methodology used to achieve the research
objectives; the procedure for carrying out this research is laid out with the aid of
flow charts.
Chapter 4, Implementation of the Proposed Approach, covers in detail the steps
involved in running the staged experiments for this research paper. Data
pre-processing, including categorical data conversion, and the experimental setup
are discussed there.
All the results of the experiments carried out in this research paper are stated in
Chapter 5, Result and Analysis, followed by a detailed explanation and analysis of
the results.
The final chapter, Chapter 6, Conclusion, summarizes all the work in this research
paper. Future works are also discussed there. All the references that aided this
paper are listed in the last section of this research paper.
CHAPTER 2
LITERATURE REVIEW
2.1 Chapter Overview
This chapter reviews past work on the classification of network traffic and the
application of time series analysis to different data sets, not confined to network
traffic only. The reviewed findings and works serve as the framework used as the
main reference in this paper. Besides, the discussion in this chapter also focuses
on the extraction of features that affect the performance and accuracy of an
Intrusion Detection System (IDS). Section 2.2 presents a fundamental
understanding of IDS, whereas Section 2.3 discusses data pre-processing. Section
2.4 covers Time Series Analysis (TSA), Time Series Classification (TSC) and, more
specifically, Dynamic Time Warping (DTW). Section 2.5 covers the classification
techniques, including the k-NN algorithm used in this research paper.
2.2 Intrusion Detection System (IDS)
“An Intrusion Detection System (IDS) is a device or a software application that
monitors network or system activities for malicious activities or policy violations and
produces reports to a management station.” (Chowdhary, Suri, and Bhutani, 2014).
2.2.1 Introduction of IDS
The IDS concept was first introduced by Anderson (1980) in an effort to improve
computer security auditing and surveillance capability. He proposed that user,
data set and program profiles can provide security personnel with information
regarding abnormal usage of a system.
According to Robbins (2002), intrusion detection is the process of identifying
computing or network activity that is malicious or unauthorized. Generally, IDSs
share a common structure and components: he mentioned that an IDS comprises
an agent (sensor) that observes one or more network traffic activities and applies
various types of detection algorithms, whereupon zero or more reactions are
activated.
In a research paper by Deepa and Kavitha (2012), the authors define intrusion
detection as the field of trying to detect intrusions such as computer break-ins,
misuse and unauthorized access to system resources and data. The activities of a
given network are monitored by an IDS, which determines whether these
activities are malicious (intrusive) or legitimate (normal) based on system
integrity, confidentiality and the availability of the information resources. An IDS
is mainly categorized by its processing method: detecting intrusion by misuse
detection or by anomaly detection. Deepa and Kavitha (2012) state that, in
misuse detection, the IDS searches for specific patterns or sequences of program
and user behaviour that match well-known intrusion scenarios. In anomaly
detection, by contrast, models of normal network patterns are developed, and
new intrusions are detected by evaluating significant deviations from normal
behaviour.
Sabahi and Movaghar (2008) further elaborate misuse detection into three
sub-categories: signature based, rule based and state transition. In signature-based
misuse detection, intrusions are detected by matching observed data from network
activities against available signatures in a database. In the rule-based method,
intrusions are characterised by a set of "if-then" implication rules. In the state
transition approach, a finite state machine is deduced from the network and
intrusions are identified using its states; the finite state machine contains the
various states of the network, and an event marks a transition. Stateful protocol
analysis is also defined as an additional method used in IDS (Sabahi and Movaghar,
2008): commonly recognized definitions of good or normal protocol activity in each
protocol state are stored in predefined profiles, and an intrusion is identified if
there is any deviation.
2.2.2 Anomaly-based Intrusion Detection System (IDS)
According to García-Teodoro et al. (2009), anomaly-based intrusion detection
generally contains three basic stages. The first stage is parameterization, where the
monitored network traffic of a system is represented in a pre-established form. In
the training stage, a corresponding model is built based on the characterised normal
and abnormal behaviour of the system. In the last stage, the detection stage, the
model is compared with the parameterized (pre-established) network traffic. Figure
2.1 below illustrates these stages.
Figure 2.1 Stages in anomaly-based Intrusion Detection System
IDS is currently one of the most widely discussed areas, with the aim of detecting
intrusions as fast as possible. The target of an IDS is to minimize false positives
(false alerts) and maximize true positives (accurate alerts); that is, it triggers an
alarm and alerts the administrator when a potential attack is detected, and the alert
is valid. Anomaly-based IDSs show advantages in that they do not require prior
knowledge about the normal activity of the target system; instead, they can learn
the expected behaviour of the system from observations. Secondly, statistical
methods can provide accurate notification of malicious activities occurring over long
periods of time (García-Teodoro et al., 2009). To deploy an IDS, one must
understand network traffic activities. Classifying network traffic allows one to
observe what kind of traffic is present, organize the traffic into classes, and detect
anomalies. Later sections briefly discuss classification as a whole and the necessity
for accurate classification of network traffic.
2.2.3 Challenges of Current IDS
Keeping false positives low in any system that sets aggressive policies to detect
anomalies is considered extremely difficult (Kumar, 2007). It may be difficult to
distinguish a flash crowd from a Distributed Denial of Service (DDoS) attack; thus a
system may raise a false alarm during a flash crowd event, assuming it is a DDoS
attack. Similarly, network reconfigurations and transient failures may abruptly
change the traffic profile, falsely raising the alarm. Challenges in IDS also include
the assumption that attacks are anomalous in nature, as the attacker may try to
attack in a way that causes minimal disruption to the traffic. Furthermore, an
attack-free dataset that represents normal traffic is impractical or nearly impossible
to obtain.
2.3 Data Pre-processing
Data pre-processing converts raw data and signals into data representations suitable
for application through a sequence of operations (Li, Chen, & Huang, 2000). The
main aims of data pre-processing include size reduction of the input space, smoother
relationships, data normalization, noise reduction, and feature extraction.
2.3.1 Conversion of symbolic features
According to Hernández-Pereira, Suárez-Romero, Fontenla-Romero, and
Alonso-Betanzos (2009), the selection of a set of significant features, whether
quantitative or qualitative, determines the success of an IDS. Most machine learning
methods are unable to handle symbolic features directly; thus data pre-processing
techniques are needed to convert symbolic features into a form compatible with
machine learning. In their paper, the authors demonstrate three conversion
techniques applicable to symbolic features: indicator variables, conditional
probabilities and the Separability Split Value (SSV) method. Some approaches are
as simple as establishing a correspondence between each category and a sequence
of integer values, or changing the symbolic category value to a decimal number by
adding the ASCII codes of its characters. These approximations were criticized for
their simplicity, as different category orders would generate different numerical
values for each category. Moreover, even with categories measured on ordinal
scales, it is not normally reasonable to assume equal or linear distances.
Furthermore, an arbitrary assignment may lead to a very difficult classification
problem, while a proper assignment may greatly reduce the complexity of the
problem (Duch, Grudzinski, & Stawski, 2000). With indicator variables, a binary
coding scheme is used to encode the occurrence of a category: a value of 1 states
the presence of a category, and the absence of that particular category is
represented by 0. Consequently, a symbolic feature containing n categories will
create n indicator variables. For conditional probabilities, each symbolic value xi of
a feature a may be replaced by the following N-dimensional vector of conditional
probabilities:
(P(1|a = xi), P(2|a = xi), … , P(N|a = xi)) ∀i = 1, 2, … , C
where N is the number of classes of the training set and C is the number of
categories of the feature a. The last approach stated in the paper is the SSV
criterion method, which is based on a split value (or cut-off point) that produces a
subset of the set of alternative values of one feature.
SSV(s) = 2 · Σ_{c ∈ C} |LS(s, f, D) ∩ D_c| · |RS(s, f, D) ∩ (D − D_c)|
         − Σ_{c ∈ C} min(|LS(s, f, D) ∩ D_c|, |RS(s, f, D) ∩ D_c|)
where C is the set of classes, D_c is the set of data vectors from the dataset D which
belong to class c ∈ C, f is a symbolic feature, and the left side (LS) and right side
(RS) of the split value s of the feature f for D are defined as:
LS(s, f, D) = {x ∈ D : f(x) ∈ s},
RS(s, f, D) = D − LS(s, f, D).
Bouzida and Cuppens (2004) pre-processed the dataset in such a way that discrete
and categorical attributes are converted into continuous values. The authors then
performed principal component analysis in order to reduce the number of attributes.
Each attribute has a number ni of corresponding values: for every possible value of
the attribute, one coordinate takes the value 1 and the remaining coordinates take
the value 0. For example, the protocol type attribute can take one of the discrete
values tcp, udp or icmp, so there are three coordinates for this attribute; if a
connection record has tcp (resp. udp or icmp) as its protocol type, the
corresponding coordinates will be (1 0 0) (resp. (0 1 0) or (0 0 1)). With this
transformation, each connection record in the different KDD 99 datasets is
represented by 125 coordinates instead of 41 (3 different values for the protocol
type, 11 different values for the flag attribute, 67 possible values for the service
attribute, and 0 or 1 for the remaining 6 discrete attributes).
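The indicator-variable ("one-hot") conversion described above can be sketched in a few lines. The helper name and the minimal record layout below are illustrative, not taken from Bouzida and Cuppens (2004):

```python
def one_hot(value, categories):
    """Return an indicator vector with 1 at the position of `value`, 0 elsewhere."""
    return [1 if value == c else 0 for c in categories]

# The protocol_type attribute of KDD 99 takes one of three values.
PROTOCOLS = ["tcp", "udp", "icmp"]

# Illustrative connection record (field names are assumptions for this sketch).
record = {"protocol_type": "udp", "duration": 0, "src_bytes": 181}
encoded = one_hot(record["protocol_type"], PROTOCOLS)
# "udp" maps to the second coordinate: (0 1 0)
```

Applying the same scheme to every categorical attribute (3 + 11 + 67 coordinates for protocol, flag and service) is what expands the feature vector as described above.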
2.3.2 Feature Selection
A well-defined feature extraction algorithm makes the classification process more
effective and efficient (Datti & Verma, 2010). To lower the time needed for an IDS
to detect an intrusion, the data dimension of the network traffic needs to be reduced
and insignificant features removed or omitted, subsequently improving the
performance of the IDS. The goal of feature extraction lies in shrinking a relatively
large data dimension to a smaller size and increasing the accuracy of the classifier
by preserving the features that are most significant to the class label and omitting
features that contribute less.
a) Linear Discriminant Analysis
Datti and Verma (2010) use Linear Discriminant Analysis (LDA) as a feature
reduction tool and a feed-forward neural network as a learning tool. The four
procedures listed below were carried out to achieve the proposed algorithm:
i. Data pre-processing, using z-score normalization
ii. Application of an intermediate dimensionality reduction stage, information
gain, to deal with the singularity problem
iii. Dimensionality reduction using LDA
iv. Classification using a feed-forward back-propagation neural network algorithm
LDA provides a linear transformation of n-dimensional feature vectors (or
samples) into an m-dimensional space (m < n), so that samples belonging to the
same class are close together while samples from different classes are far apart.
The goals of LDA are to perform dimensionality reduction "while preserving as
much of the class discriminatory information as possible", to find the directions
along which the classes are best separated, and to take both the within-class
scatter and the between-class scatter into consideration.
b) Principal Component Analysis
Shyu, Chen, Sarinnapakorn, and Chang (2003) proposed an anomaly detection
scheme based on principal components and outlier detection. With the assumption
that attacks will appear as outliers in the normal data, the authors highlighted
two main advantages of the principal-component-based approach. First, unlike
most statistical intrusion detection systems, which assume a normal distribution,
the principal-component-based approach makes no distributional assumption.
Secondly, since network traffic data often exhibit high dimensionality, principal
component analysis is used to reduce the dimensionality of the data.
A framework for adaptive intrusion detection using machine learning
techniques, including feature extraction, classifier construction and sequential
pattern prediction, is presented by Xu (2006). The proposed framework is
carried out in three stages: (i) data acquisition and feature extraction, (ii)
classifier construction and (iii) sequential pattern prediction. The author applied
PCA in stage (i) to reduce the dimensionality of the network data from 41 to
12 features, reducing the computational time without significantly affecting the
detection accuracy. Similarly, Wang and Battiti (2006) also performed PCA to
reduce the data dimension. Note that of the 41 features, 34 are numeric and 7
are categorical; the categorical features were not used in their work. Principal
components are required to form the subspace, and the detection scheme is
straightforward and easy to handle. Their PCA method used only 2 principal
components and achieved better detection results, and no data distributional
assumption is used in the proposed model.
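As a rough illustration of how PCA reduces dimensionality in the works reviewed above (e.g. from 41 to 12 features), the sketch below projects data onto its top principal directions via an eigendecomposition of the feature covariance matrix. The function name and interface are assumptions for this sketch, not the implementation used in any of the cited papers:

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top n_components principal directions."""
    Xc = X - X.mean(axis=0)                  # centre each feature
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    # Keep the eigenvectors with the largest eigenvalues (most variance).
    top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
    return Xc @ top
```

For example, `pca_reduce(X, 12)` would map 41-dimensional connection records to 12-dimensional ones while retaining the directions of greatest variance.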
c) Information Gain (IG)
Information gain (IG) measures the amount of information, in bits, that a feature
provides about the class prediction when the only information available is the
presence of the feature and the corresponding class distribution. Given S_X the set
of training examples, x_i the i-th variable in this set, and |S_{x_i = v}| / |S_X| the
fraction of examples whose i-th variable has value v (Roobaert et al., 2006):
IG(S_X, x_i) = H(S_X) − Σ_{v ∈ values(x_i)} (|S_{x_i = v}| / |S_X|) · H(S_{x_i = v})
where the entropy is:
H(S) = −p₊(S) log₂ p₊(S) − p₋(S) log₂ p₋(S)
and p±(S) is the probability of a training example in the set S being of the
positive/negative class.
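The entropy and information gain definitions above can be computed directly. The following is a minimal sketch on toy data (the function names and example values are illustrative, not from the cited work):

```python
import math

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(values, labels):
    """IG of one feature: H(S) minus the weighted entropy of each value's subset."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [lab for x, lab in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy example: a feature that perfectly predicts the class yields
# IG equal to the full entropy of the class distribution (1 bit here).
labels = ["normal", "normal", "attack", "attack"]
feature = ["tcp", "tcp", "icmp", "icmp"]
```

Features with the highest IG are retained during feature selection, while features whose IG is near zero contribute little to the class prediction.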
d) Correlation Feature Selection (CFS)
Correlation Feature Selection (CFS) evaluates the worth of a subset of attributes
by considering the individual predictive ability of each feature along with the
degree of redundancy between them. Subsets of features that are highly
correlated with the class while having low intercorrelation are preferred.
Table 2.1 Summary of Reviewed Papers and Data Pre-processing Methods on KDD Cup 99

| Title | Author | Data Pre-processing Method |
| A novel anomaly detection scheme based on principal component classifier | Shyu M, Chen S, Sarinnapakorn K, Chang L (2003) | PCA to reduce dimensionality |
| Adaptive intrusion detection based on machine learning: feature extraction, classifier construction and sequential pattern prediction | Xu X (2006) | Principal component analysis (PCA) for feature selection |
| An active learning based TCM-KNN algorithm for supervised network intrusion detection | Li Y, Guo L (2007) | Normalization: z-score for continuous features; discrete features converted to continuous values based on their frequency |
| Conversion methods for symbolic features: a comparison applied to an intrusion detection problem | Hernández-Pereira E, Suárez-Romero JA, Fontenla-Romero O, Alonso-Betanzos A (2009) | Symbolic features converted to numeric using various algorithms: indicator variables, conditional probabilities and Separability Split Value |
| Efficient intrusion detection using principal component analysis | Bouzida Y, Cuppens F, Cuppens-Boulahia N, Gombault S (2004) | 7 categorical attributes converted to continuous ones for a total of 125 features; PCA to reduce dimensionality |
| Feature deduction and ensemble design of intrusion detection systems | Chebrolu S, Abraham A, Thomas J (2005) | Feature selection using Markov blanket reduces 41 features to 17 |
| Feature deduction and ensemble design of parallel neural networks for intrusion detection systems | Syed Muhammad Aqil Burney, M. Sadiq Ali Khan, Tahseen A. Jilani (2010) | Categorical features converted to integers; PCA used to reduce dimensionality |
| Identifying intrusions in computer networks with principal component analysis | Wang W, Battiti R (2006) | Used a subset of features (the 34 numeric features); PCA used to reduce dimensionality |
| Learning intrusion detection: supervised or unsupervised? | Laskov P, Dussel P, Schafer C, Rieck K (2005) | Normalization; transformation of categorical features |
| Network anomaly detection based on TCM-KNN algorithm | Li Y, Fang B, Guo L, Chen Y (2007) | Normalization: z-score for continuous features; discrete features converted to continuous values based on their frequency |
| Parzen-window network intrusion detectors | Yeung D, Chow C (2002) | 41 features expanded to 119, since symbolic ones are converted to binary-valued features |
2.4 Time Series Analysis (TSA)
Time Series Analysis consists of methods for analysing time series data so that
meaningful statistics and data characteristics can be extracted. Before inferences
can be drawn from the data, a hypothetical probability model must be set up to
represent the data (Brockwell & Davis, 2002). Elsayed et al. (2011) in their research
highlighted that the data dimensions of TSA do not necessarily include time; TSA
can be applied to any type of data that can be represented as a sequence or curve.
Time series data often arise when monitoring industrial processes or tracking
corporate business metrics. Time Series Analysis is also used to forecast future
patterns based on past events. The main motivation of Time Series Analysis is
forecasting, which is widely used in the fields of statistics, econometrics,
quantitative finance and seismology.
In the context of signal processing, communication and control engineering,
TSA is used for signal detection and estimation, while in the context of data mining,
pattern recognition and machine learning, TSA can be used for classification, which
is known as Time Series Classification (TSC).
2.4.1 Time Series Classification (TSC)
Amr (2012) introduces Time-Series Classification (TSC) techniques, a form of
classification in which new algorithms are devised or existing machine learning
methods adapted to suit time series data. Chaovalitwongse et al. (2007) in their
research use the k-Nearest Neighbour (k-NN) algorithm with the Dynamic Time
Warping (DTW) distance as TSC techniques to classify abnormal brain activities.
Amr (2012) categorises TSC in three ways depending on the metric used for
classification. Distance-based classification comprises classification algorithms
based on the distances between the data; one of the best-known distance-based
algorithms is the k-nearest neighbour algorithm. Feature-based time-series
classification requires the data to be transformed from time series into feature sets
before being passed to the classification algorithm. In addition to the above two
techniques, model-based classification requires modelling the data within a class,
and new data are classified according to the best-fit model.
2.4.2 Distance Similarity Measure
The definition of a distance measure includes three requirements. To define these
requirements, let the function dist() take as input two sequences 𝑋 = {𝑥1, 𝑥2, … , 𝑥𝑁}
and 𝑌 = {𝑦1, 𝑦2, … , 𝑦𝑀} and return the value of the distance. The requirements for
a distance measure are then as follows:
1) Non-negativity: The distance between X and Y must be non-negative, that is,
always greater than or equal to zero.
𝑑𝑖𝑠𝑡(𝑋, 𝑌) ≥ 0
2) Identity of indiscernibles: The distance between X and Y is equal to zero if and
only if X is equal to Y.
𝑑𝑖𝑠𝑡(𝑋, 𝑌) = 0 𝑖𝑓𝑓 𝑋 = 𝑌
3) Symmetry: The distance between X and Y is equal to the distance between Y
and X.
𝑑𝑖𝑠𝑡(𝑋, 𝑌) = 𝑑𝑖𝑠𝑡(𝑌, 𝑋)
Distances which conform to at least these three requirements are known as distance
measures (Weller-Fahy et al., 2014).
a) Euclidean Distance
The Euclidean distance is commonly accepted as the simplest distance between
sequences. For two sequences X and Y of equal length N, it is defined by
𝑑𝑖𝑠𝑡(𝑋, 𝑌) = √((𝑥1 − 𝑦1)² + (𝑥2 − 𝑦2)² + ⋯ + (𝑥𝑁 − 𝑦𝑁)²)
b) Cosine Similarity
Cosine similarity measures the angle between two vectors. The similarity
between X and Y is defined by
𝑠𝑖𝑚(𝑋, 𝑌) = cos(𝜃) = (𝑋 · 𝑌) / (|𝑋||𝑌|)
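Both measures are straightforward to implement for equal-length sequences. The sketch below uses only the standard library; the function names are illustrative:

```python
import math

def euclidean(x, y):
    """Point-wise Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors: (X . Y) / (|X| |Y|)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)
```

Note that Euclidean distance satisfies all three distance-measure requirements above, whereas cosine similarity is a similarity (higher means more alike) rather than a distance.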
c) Dynamic Time Warping (DTW)
Dynamic Time Warping (DTW) is an algorithm used to find an optimal alignment
between two given time series under certain restrictions. The time series, which
may differ in length or phase, are aligned and warped. When DTW was first
introduced in the 1960s, it was applied to comparing and recognizing speech
patterns. In the domain of financial market forecasting, DTW has been applied
to improve fluctuation prediction in the Yen-Dollar market (Kia et al., 2013).
Gillian et al. (2011) implement DTW-based classification in the context of
biology and medicine, where DTW is used to distinguish which of the 8 items of
the Wolf Motor Function Test is being performed, using the time series
generated by an accelerometer placed on the arm.
The Dynamic Time Warping (DTW) method depends on the similarity of
shape between time series. Unlike Euclidean distance, the temporal relationship
between corresponding points in the two time series is maintained by a time
axis that is nonlinearly "warped". To find the similarity between time series, the
DTW algorithm finds the alignment that ensures the minimum distance between
the aligned points (the so-called warping path); by doing so, it "warps" the axis,
chooses the best alignment path, and then generates a distance measure
between the two sequences (Muscillo et al., 2011).
Figure 2.2 shows two sequences A and B arranged on the sides of a grid, with
sequence A on top and sequence B on the left. The minimum value of the sum
of the distances between the individual elements is then divided by the sum of
the weighting function, where the weighting function is used to normalise the
path length between sequence A and sequence B.
Figure 2.2 Matrix Representation of Two Sequence A and B
Source: http://www.psb.ugent.be/cbd/papers/gentxwarper/DTWalgorithm.htm
Let X and Y be two discrete time series sampled at the same rate but with different
lengths, with 𝑋 = {𝑥1, 𝑥2, … , 𝑥𝑁} the input signal and 𝑌 = {𝑦1, 𝑦2, … , 𝑦𝑀} the
reference signal. In the first step of the algorithm, a matrix 𝑑(𝑖, 𝑗) ∈ 𝑅(𝑁 × 𝑀) is
constructed, in which each element represents the distance between the 𝑖th element
of the sequence 𝑋 and the 𝑗th element of 𝑌. The dynamic programming algorithm
then looks for a DTW distance between the two time series by minimizing a cost
function calculated on the matrix 𝑑. This cost function is created through the
generation of an alignment path (warping path, W) between the time series, which
defines the correspondence of each element 𝑥𝑖 to some 𝑦𝑗, with the condition that
both the first and the last elements of X and Y are aligned. Intuitively, 𝑑 is composed
of small values when the sequences are similar, and large values when they are
different. Figure 2.3 illustrates the algorithm used to perform DTW to compute the
similarity measure between two sequences.
Figure 2.3 Algorithm to Perform DTW
Source: A global averaging method for dynamic time warping, with applications to
clustering
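The matrix-filling procedure described above can be sketched as a classic dynamic programming routine. The absolute difference is used here as the local cost, and the function name is illustrative:

```python
def dtw_distance(x, y, dist=lambda a, b: abs(a - b)):
    """Fill the cumulative cost matrix d, where d[i][j] holds the minimum
    cumulative cost of aligning x[:i] with y[:j]; the DTW distance is d[N][M]."""
    n, m = len(x), len(y)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = dist(x[i - 1], y[j - 1])
            # The warping path may advance in x, in y, or in both at once.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Because the path may repeat elements of either sequence, sequences of different lengths or out of phase (e.g. a stretched copy of the same shape) can still be aligned at zero or low cost, which is exactly the elasticity that Euclidean distance lacks.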
2.4.3 Review of TSC for classification
He et al. (2008), targeting the traffic features of a large-scale communication
network, treat every traffic feature as a simple time series. The authors then take
multiple traffic features as a whole to analyse and study through multiple time
series data mining. In their paper, the approach of applying multiple time series
data mining to large-scale network traffic analysis is done in 5 steps:
i. Compute the entropy of several flow-level traffic features collected over each
time bin
ii. Apply Principal Component Analysis and the subspace method to the entropy
time series
iii. Apply time-frequency analysis, Piecewise Aggregate Approximation and
Symbolic Aggregate approXimation to the anomaly time series
iv. Apply association rule mining to the symbolic sequence
v. Perform real-time monitoring with valid motif patterns
Gillian, Knapp, and Modhrain (2011) presented a novel algorithm based on
Dynamic Time Warping (DTW) and extended it to classify any N-dimensional signal.
Musical gestures exhibited by a musician are still considered difficult for a computer
to recognise, because musical gestures are more often than not a cohesive
sequence of movements rather than simple single static gestures. To improve the
performance of DTW, the authors adopted warping path constraint methods: the
time needed to compute DTW is greatly reduced because a large proportion of the
matrix need not be constructed when the warping window is small. Gillian et al.
(2011) in their research also highlighted an advantage of the DTW algorithm with
respect to the template (i.e., the musical gesture in their research): each template
can be computed independently. This trait suits machines with multi-threading
capabilities well, i.e., training can be done in parallel. The authors also compare the
DTW classification implementation to other machine learning algorithms such as
ANNs, where adding or removing a gesture is inconvenient, as the entire system
would need to be retrained.
Chaovalitwongse et al. (2007) in their research aim to develop a classification
technique to classify normal and abnormal (epileptic) brain activities. The authors
use the k-Nearest Neighbour (k-NN) algorithm, Dynamic Time Warping (DTW) and
chaos theory in developing the novel classification technique. In the first step, a
measure of chaos known as the short-term maximum Lyapunov exponent is
estimated to quantify the chaoticity of the attractor. The EEG data then undergo
k-NN classification using three similarity measures: DTW, the T-Statistical (Index)
Distance and the Euclidean Distance (ED). The authors state that k-NN classification
with DTW as the similarity measure achieves its best performance, 84% sensitivity
and 75% specificity, when k = 3.
In this research, distance-based time-series classification is selected, with the k-NN
algorithm as the classification technique and DTW as the similarity measure between
two sequences. Conventionally, the Euclidean Distance (ED) serves as the similarity
measure; however, DTW, which provides a more elastic similarity measure, is
trusted to overcome the shortcomings of the conventional ED (Amr, 2012).
2.5 Classification Techniques
An algorithm that implements classification, especially in a concrete implementation,
is known as a classifier. The term "classifier" sometimes also refers to the
mathematical function, implemented by a classification algorithm, that maps input
data to a category.
2.5.1 Classification of Data
Data classification involved a two-step process. In the first step, by describing
a predetermined set of data classes, a classifier is built. Classification algorithms then
build the classifier by analysing a learning (training) set and their features (Stolfo et
al., 1999). The features may be continuous, categorical or binary.
Supervised learning is the machine learning task of inferring a function from
labelled training data. The training data consists of a set of training examples. A
labelled data set with a huge number of instances with n features are illustrated in
table 2.2.
Table 2.2 Instances with Known Label

| Instance | Feature 1 | Feature 2 | … | Feature n | Class |
| 1 | xxx | x | … | xx | Normal |
| 2 | xxx | x | … | xx | Abnormal |
| 3 | xxx | x | … | xx | Normal |
| … | xxx | x | … | xx | Abnormal |
| y | | | | | |
Precise network traffic classification is vital, as it aids numerous network activities
(Karagiannis, Papagiannaki, and Faloutsos, 2005). A few common goals of accurate
network classification are stated below:
a) Identification of applications, user usage and trends:
The network administrator is able to inspect usage and trends. This helps to
ensure a steady quality of service, as suitable bandwidth can be allocated
according to demand for users or applications with higher usage.
b) Identification of emerging applications
Accurate identification of new network applications can highlight frequent
emergence of disruptive applications that often rapidly alter the dynamics of
network traffic, and sometimes bring down valuable Internet services.
c) Accounting
Knowing the applications their subscribers are using may be of vital interest for
application-based accounting and charging or even for offering new products.
d) Anomaly Detection
Anomalies in network traffic often indicate the propagation of unusual and
abnormal behaviour. Diagnosing anomalies is crucial for both the network
administrator and the user to ensure data confidentiality, integrity and availability.
2.5.2 k-Nearest Neighbour (k-NN)
Wu et al. (2008) provide a thorough explanation of how the k-NN algorithm is
carried out. The three key elements involved in k-NN are:
i. A set of labelled data, e.g., a set of stored records
ii. A distance or similarity metric to calculate the distance between the data
iii. The number of nearest neighbours, k
Given a labelled sequence data set D, a positive integer k, and a new
sequence z to be classified, the k-NN classifier finds the k nearest neighbours of z
in D, denoted k-NN(z), by calculating the similarity (distance) between z and the
sequences in D, and returns the dominating class label in the neighbourhood as the
label of z. k-NN is a lazy learning method and does not pre-compute a classification
model (Zhengzheng Xing, 2010). Figure 2.4 illustrates the process involved during
the execution of a k-NN algorithm.
Figure 2.4 The k-nearest neighbor classification algorithm
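The k-NN decision rule described above can be sketched in a few lines. The distance function is pluggable, so a DTW or Euclidean measure could be supplied; the names and the toy data below are illustrative:

```python
from collections import Counter

def knn_classify(z, train, k, distance):
    """Find the k training sequences nearest to z under `distance`
    and return the dominating class label in that neighbourhood."""
    neighbours = sorted(train, key=lambda pair: distance(z, pair[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy labelled data: (sequence, class label) pairs.
train = [([0.0], "normal"), ([0.1], "normal"),
         ([5.0], "abnormal"), ([5.1], "abnormal")]
scalar_dist = lambda a, b: abs(a[0] - b[0])
```

As a lazy learner, the function does no work at "training" time; all distance computations happen at classification time, which is why the choice of similarity measure (e.g. DTW versus ED) dominates both accuracy and cost.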
2.5.3 Review of k-Nearest Neighbour (k-NN) with Dynamic Time Warping
(DTW) as similarity measure
Although not much attention was given to DTW as the similarity measure for k-NN
in the past, a fair number of studies have researched the possibility of the pair. Kia,
Haratizadeh, and Zare (2013) use k-NN and DTW to improve fluctuation prediction
and to obtain better evaluation parameters than other researchers in the literature
of financial market forecasting. 500 sequences of 30-element exchange rates are
built from a data set of 15,331 USD/JPY exchange rate records. The authors found
a promising improvement in directional prediction compared to another researcher's
method, which is one of the most cited works in the field of financial prediction
using newer artificial intelligence and data mining methodologies.
2.5.4 Review of Network Traffic Classification
Panda et al. (2010) apply a discriminative multinomial Naïve Bayes classifier with
various filtering analyses to build a network intrusion detection system. Using
Principal Component Analysis (PCA) as a filtering approach, the authors combine
PCA with discriminative parameter learning for the Naïve Bayes (DMNB) classifier.
Before the data are classified, they undergo supervised and unsupervised
data filtering such as PCA, Random Projection (RP) and Nominal to Binary (N2B).
The discriminative parameter learning method learns parameters by discriminatively
computing frequencies from intrusion data. Table 2.3 below shows the results of
the proposed algorithms.
Table 2.3 Results for Various Algorithms

| Classifier | Detection Accuracy (%) | False Alarm Rate (%) |
| Discriminative Multinomial Naïve Bayes + PCA | 94.84 | 4.4 |
| Discriminative Multinomial Naïve Bayes + RP | 81.47 | 12.85 |
| Discriminative Multinomial Naïve Bayes + N2B | 96.5 | 3.0 |
In the research paper by Ibrahim et al. (2013), a Self-Organizing Map (SOM)
Artificial Neural Network (ANN) is applied to the intrusion databases (KDD99 and
NSL-KDD). The goal of SOM is to transform an input data set of arbitrary dimension
into a one- or two-dimensional topological map; by building a topology-preserving
map, it aims to reveal the underlying structure of the input data set. The authors
believed that, even with incomplete or distorted data, the neural network would be
capable of analysing data from the network. Table 2.4 below shows the result
obtained by the authors using the SOM-ANN algorithm on the NSL-KDD dataset.
Table 2.4 Result for Application of SOM-ANN Algorithms

| Classifier | Successful Detection Rate (%) |
| SOM | 68.88 |
2.6 Summary
In this chapter, the IDS was covered to provide readers with basic knowledge of
the use of anomaly detection. IDS has evolved over time to suit changes in
technology; however, its flaws are yet to be overcome. Data classification was also
briefly discussed to show its role in an IDS.
After reviewing related works from the field, Time Series Classification (TSC)
is believed to have potential for network traffic analysis. A combined model of the
k-NN algorithm and DTW is chosen in this research to classify network traffic
activities. Because of its temporal dimension warping, DTW is good for classifying
sequences that have different frequencies or that are out of phase.
CHAPTER 3
METHODOLOGY
3.1 Chapter Overview
This chapter covers the methodology used to perform this research. It is arranged
in the order of 3.2 Procedures, 3.3 Experimental Setting, 3.4 Validation of Findings
and, finally, the Summary. Section 3.2 discusses the procedures involved in carrying
out this research. Section 3.3 describes the experimental setting used to validate
the proposed techniques. After the experiment is carried out, it is vital to validate
the findings; the methods of validation are discussed in Section 3.4.
3.2 The Research Program of Work
Figure 3.1 Overall Framework used in this Research
Figure 3.1 illustrates the overall framework used within this research paper. Apart
from the first experiment, where all categorical features are removed from the train
and test sets, every train and test set undergoes a data preprocessing step in which
categorical data are converted into numerical values based on three different
approaches: simple, probability and entropy conversion. These data are then
normalized using the min-max normalization method. The converted and normalized
dataset also undergoes feature selection by Information Gain (IG) and Correlation
Feature Selection (CFS), generating the train and test sets.
To achieve the objective stated in Section 1.4, the methodology is carried
out in two phases. Phase I (Figure 3.2) involves identifying the best approach to
the problem of how to represent network traffic data in point series form.
Figure 3.2 Phase I of the research
Phase I is further divided into three sub-phases:
Sub-phase I involves representing the extracted features (as provided in the
dataset) in point series form by investigating and implementing various feature
transformation techniques. Throughout this research paper, only a secondary data
set will be used; no data generation is involved. The data set selected is the
improved version of the famous KDD’99 data set, the NSL-KDD Data Set, which can
be obtained from <http://nsl.cs.unb.ca/NSL-KDD/>. In this phase, the original
dataset, with both numeric and nominal attributes, is transformed into a point
series representation. For the first attempt, categorical data will be left out.
Experiments to evaluate the performance of the selected feature transformation
techniques will be performed using a TSC technique, Dynamic Time Warping (DTW), as
a similarity measure for the classification of network traffic activities with the
k-NN algorithm. The distance similarity measures Euclidean Distance and Cosine
Similarity will also be incorporated into the k-NN classification (Figure 3.3).
Figure 3.3 Sub-phase I of the research
Sub-phase II involves the simple conversion of categorical data. In this phase,
categorical data will be converted using the alphabetical simple conversion method:
the distinct values within each feature are arranged alphabetically and a
corresponding sequence of integers is assigned to them. Classification using k-NN
with three different similarity measures will then be applied to the data. The
framework of Sub-phase II is illustrated in Figure 3.4.
Figure 3.4 Sub-phase II of the research
Sub-phase III involves the entropy conversion and the probability conversion
of categorical data (Figure 3.5).
Figure 3.5 Sub-phase III of the research
Phase II selects features from the converted and normalized dataset produced
in Phase I using two feature selection methods, Information Gain (IG) and
Correlation Feature Selection (CFS) (Figure 3.6).
3.3 Experimental Setting
To achieve the objectives stated in Section 1.4, two sets of experiments will be
carried out. Experiment I is to determine the best feature transformation technique.
In this research paper, a 10-fold cross validation approach will be used to
assess how accurate the performance of the predicted model is. A test set
containing known labelled (classified) data is used; the classifier is trained on
the train set and then evaluated on the test set. The NSL-KDD data set is subdivided
into ten subsets: one subset is used as the test set while the remaining nine
function as the training set. The process is iterated ten times, with a different
subset serving as the test set in each iteration, without repetition.
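The splitting scheme described above can be sketched as follows. The experiments in this thesis were run in Matlab 2014a, so this Python fragment is only an illustrative sketch of the partitioning logic; the helper name `ten_fold_indices` is an assumption, not the thesis code.

```python
# Sketch of 10-fold cross validation: each fold serves as the test set
# exactly once while the remaining nine folds form the training set.
def ten_fold_indices(n_rows, k=10):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    fold_size = n_rows // k
    indices = list(range(n_rows))
    for i in range(k):
        start = i * fold_size
        # the last fold absorbs any remainder rows
        end = (i + 1) * fold_size if i < k - 1 else n_rows
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, test_idx

# 148517 is the number of NSL-KDD entries reported in Chapter 4
folds = list(ten_fold_indices(148517))
```

In practice the rows would be shuffled or stratified before splitting; this sketch only shows how each fold is held out in turn.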
A mixed model of the k-NN and DTW algorithms will be used to measure the
similarity of the data sequences and classify them. As proposed, the DTW algorithm
will be used as a similarity measure between two sequences, and k-NN is used to
train and classify the data to determine the class label.
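As a sketch of how DTW measures the similarity of two sequences, the following Python fragment implements the classic dynamic-programming recurrence. The function name and the absolute-difference local cost are illustrative assumptions, not the thesis implementation (which was written in Matlab).

```python
import math

def dtw_distance(a, b):
    """Classic dynamic-programming DTW distance between two sequences."""
    n, m = len(a), len(b)
    # cumulative cost matrix initialised to infinity, with a zero entry point
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible warping steps
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

# two sequences that are out of phase but identical in shape align at zero cost
same_shape = dtw_distance([0, 1, 2, 1, 0], [0, 0, 1, 2, 1, 0])
```

The zero distance for the shifted sequence illustrates why DTW suits sequences that are out of phase, where Euclidean distance would penalise the shift.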
Experiment II is designed to compare the performance of the proposed TSC-based
approach against other approaches found in the literature.
3.4 Experiment Requirement
To implement the proposed KNN-DTW research methodology, a few hardware and
software requirements need to be met.
3.4.1 Hardware Requirement
Listed below are the hardware used within a laptop to carry out the experiment:
Intel Core i7-4510U @2.3GHz
3.4.2 Software Requirement
Matlab 2014a is used in the research paper to carry out all the experiments.
3.5 Performance Measure for Classification
To measure the quality of the proposed approach, the results of the proposed
model will be compared to the ground truth (labelled data). In the NSL-KDD data
set, all the data are labelled, that is, the class of each instance is known:
each instance is labelled as normal or anomaly. Table 3.1 shows the possible
outcomes of the proposed model.
Table 3.1 Possible Outcomes
                   Predicted Positive    Predicted Negative
Truth Positive     True Positive, tp     False Negative, fn
Truth Negative     False Positive, fp    True Negative, tn
Detection rate (DR) is calculated as the ratio between the number of correctly
detected intrusions and the total number of intrusions. Note: Detection Rate is also
known as the measure of sensitivity in some of the papers in Literature Review.
Detection Rate, DR = tp / (fn + tp)
The false positive rate (FP) is calculated as the ratio between the number of
normal traffic instances that are incorrectly classified as intrusions and the
total number of normal traffic instances. Note: the false positive rate is the
complement of specificity (FP = 1 − specificity), although some of the papers in
the Literature Review report specificity instead.
False Positive, FP = fp / (tn + fp)
Accuracy indicates how correct the detection technique is. It is measured as a
percentage and is the ratio between correct detections and the total number of
detections obtained.
Accuracy, a = (tp + tn) / (fp + fn + tp + tn)
The value achieved for accuracy cannot exceed 100%. Any shortfall in accuracy
corresponds to misclassified instances, such as false positives.
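The three measures above can be expressed directly in code. The following Python sketch uses illustrative counts, not results from this research:

```python
def detection_rate(tp, fn):
    """DR (sensitivity): correctly detected intrusions over all intrusions."""
    return tp / (tp + fn)

def false_positive_rate(fp, tn):
    """FP rate: normal traffic flagged as intrusion over all normal traffic."""
    return fp / (fp + tn)

def accuracy(tp, tn, fp, fn):
    """Overall fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# illustrative confusion-matrix counts, not experimental results
dr = detection_rate(tp=90, fn=10)
fpr = false_positive_rate(fp=5, tn=95)
acc = accuracy(tp=90, tn=95, fp=5, fn=10)
```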
3.6 Summary
To summarize, the procedures mentioned above are carried out to ensure that the
objective of the research can be achieved. The processes involved in each phase are
laid out in detail, and illustration is provided to aid understanding.
CHAPTER 4
IMPLEMENTATION OF THE PROPOSED APPROACH
4.1 Chapter Overview
In this chapter, the experimental setup for the research is presented. Section 4.2
explains in detail how the data set is pre-processed, whereas Section 4.3 discusses
how the experiments are set up. Lastly, the summary of the current studies and
experiments is presented in Section 4.4.
4.2 Data Pre-processing
Phase I involved representing the features of the dataset in the form of point
series data. The selected NSL-KDD data set contains a total of 148517 entries, each
with 41 features and one class label which labels the data as normal or anomaly.
This phase involved two tasks: data conversion and data normalization.
4.2.1 Conversion of data
Before proceeding to the experimental setting, the NSL-KDD dataset is first
pre-processed. The first step of data pre-processing in this research paper
involved the conversion of feature types; first, the type of each feature must be
identified.
Table 4.1 Features Name and Type
Features Type Features Type
duration Numeric count Numeric
protocol_type Nominal srv_count Numeric
service Nominal serror_rate Numeric
flag Nominal srv_serror_rate Numeric
src_bytes Numeric rerror_rate Numeric
dst_bytes Numeric srv_rerror_rate Numeric
land Nominal same_srv_rate Numeric
wrong_fragment Numeric diff_srv_rate Numeric
urgent Numeric srv_diff_host_rate Numeric
hot Numeric dst_host_count Numeric
num_failed_logins Numeric dst_host_srv_count Numeric
logged_in Numeric dst_host_same_srv_rate Numeric
num_compromised Numeric dst_host_diff_srv_rate Numeric
root_shell Numeric dst_host_same_src_port_rate Numeric
su_attempted Numeric dst_host_srv_diff_host_rate Numeric
num_root Numeric dst_host_serror_rate Numeric
num_file_creations Numeric dst_host_srv_serror_rate Numeric
num_shells Numeric dst_host_rerror_rate Numeric
num_access_files Numeric dst_host_srv_rerror_rate Numeric
num_outbound_cmds Numeric class Nominal
is_host_login Nominal
is_guest_login Nominal
Table 4.1 shows the features of the NSL-KDD data and each feature’s type,
categorized as numeric or nominal. For nominal features with no numeric
representation, conversion to a numeric value is made so that the features can be
represented as point series data. Out of the 41 features (plus one class label),
seven are nominal, and three of these, namely “protocol_type”, “service” and
“flag”, have non-numeric data values. In this section, three conversion methods
were applied to these three categorical features: simple conversion, probability
conversion and entropy conversion.
a) Alphabetical simple conversion
The first method, simple conversion, assigns numbers to the categorical data
values alphabetically. The first categorical feature, “protocol_type”, contains
three distinct values: “tcp”, “icmp” and “udp”. These values are first arranged
in alphabetical order.
Table 4.2 Alphabetical Simple Conversion of "protocol_type"
Existing Data    After Conversion
icmp             0
tcp              1
udp              2
Table 4.2 shows the available existing data for the feature “protocol_type”
and its numerical representation after conversion is done. The data “icmp” is
converted to “0”, whereas “tcp” is converted to “1” and “udp” is converted to “2”.
The next categorical data is the “service” feature which contains a total
of 70 distinct categorical values. After arranging the data in alphabetical order,
the data “aol” is converted to “0”, whereas “auth” is converted to “1” and “bgp”
is converted to “2” and so on. Table 4.3 below shows the existing data for the
feature “service” and its numerical value after conversion.
Table 4.3 Alphabetical Simple Conversion of "service"
Existing Data  After Conversion | Existing Data  After Conversion | Existing Data  After Conversion
aol 0          http_443 23      printer 46
auth 1         http_8001 24     private 47
bgp 2          imap4 25         red_i 48
courier 3      IRC 26           remote_job 49
csnet_ns 4     iso_tsap 27      rje 50
ctf 5          klogin 28        shell 51
daytime 6      kshell 29        smtp 52
discard 7      ldap 30          sql_net 53
domain 8       link 31          ssh 54
domain_u 9     login 32         sunrpc 55
echo 10        mtp 33           supdup 56
eco_i 11       name 34          systat 57
ecr_i 12       netbios_dgm 35   telnet 58
efs 13         netbios_ns 36    tftp_u 59
exec 14        netbios_ssn 37   tim_i 60
finger 15      netstat 38       time 61
ftp 16         nnsp 39          urh_i 62
ftp_data 17    nntp 40          urp_i 63
gopher 18      ntp_u 41         uucp 64
harvest 19     other 42         uucp_path 65
hostnames 20   pm_dump 43       vmnet 66
http 21        pop_2 44         whois 67
http_2784 22   pop_3 45         X11 68
                                Z39_50 69
Table 4.4 Alphabetical Simple Conversion of "flag"
Existing Data  After Conversion | Existing Data  After Conversion
OTH 0 S1 6
REJ 1 S2 7
RSTO 2 S3 8
RSTOS0 3 SF 9
RSTR 4 SH 10
S0 5
In Table 4.4, the feature “flag” contains 11 distinct data values; these existing
data are converted to a numeric representation. The data “OTH” is converted to
“0”, whereas “REJ” is converted to “1”, “RSTO” is converted to “2” and so on.
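The alphabetical simple conversion described above amounts to sorting the distinct values and numbering them in order. A minimal Python sketch follows; the function name is illustrative, and the thesis experiments themselves used Matlab.

```python
def simple_conversion(values):
    """Map each distinct categorical value to its alphabetical rank."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

# reproduces the mapping in Table 4.2: icmp -> 0, tcp -> 1, udp -> 2
protocol_map = simple_conversion(["tcp", "udp", "icmp", "tcp"])
# a subset of the "flag" values from Table 4.4
flag_map = simple_conversion(["SF", "REJ", "S0", "OTH", "RSTO"])
```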
After the conversion of the categorical data using simple conversion, the
data can be represented in a point series format. Figure 4.1 below illustrates
10 point series generated using the conversion method discussed; each point
series contains a total of 41 features.
Figure 4.1 Point Series Data with Simple Conversion
b) Probability Conversion
The second conversion method, proposed by this research paper to convert
categorical data into numerical values, is probability conversion. It works by
computing, for each distinct value of a feature, the probability that an entry
with that value is an anomaly, and then grouping the distinct values into
categories according to this probability. By applying discretization using Weka,
the feature “protocol_type” is divided into three categories.
Table 4.5 Statistic and Value for “protocol_type” After Conversion
protocol_type  anomaly no.  anomaly %  normal no.  normal %  Total   After Conversion
icmp           7932         84.97964   1402        15.02036  9334    0
tcp            60127        49.45915   61442       50.54085  121569  1
udp            3404         19.32554   14210       80.67446  17614   2
Table 4.5 shows the statistics for the three distinct values of
“protocol_type”: “icmp”, “tcp” and “udp”. A total of 9334 entries belong to
“icmp”; of these, 7932 are classified as anomaly entries and 1402 as normal
entries. Thus 84.98% of the “icmp” data are classified as anomalous, and “icmp”
is assigned the value zero. The “tcp” data, with a 49.46% chance of being
anomalous, is assigned the numeric value one, and “udp”, with a probability of
19.33%, is assigned the numeric value two.
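The per-value anomaly probabilities in Table 4.5 can be computed as sketched below. The toy counts are illustrative, and the subsequent grouping into categories was done with Weka's discretization, which this fragment does not reproduce.

```python
from collections import Counter

def anomaly_probability(records):
    """records: (value, label) pairs; returns value -> P(anomaly) in percent."""
    total, anomalies = Counter(), Counter()
    for value, label in records:
        total[value] += 1
        if label == "anomaly":
            anomalies[value] += 1
    return {v: 100.0 * anomalies[v] / total[v] for v in total}

# toy records for protocol_type; the real counts appear in Table 4.5
records = ([("icmp", "anomaly")] * 85 + [("icmp", "normal")] * 15 +
           [("udp", "anomaly")] * 19 + [("udp", "normal")] * 81)
probs = anomaly_probability(records)
```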
Following the “protocol_type” feature is the “service” feature, which has a
total of 70 distinct values. Table 4.6 shows the data present in “service” and
their converted values. As with “protocol_type”, discretization using Weka is
performed on the probability range of anomalous incidents for the “service”
feature. This results in the 70 distinct values of “service” being grouped into
three categories.
Table 4.6 Statistic and Value for “service” After Conversion
Service  anomaly no.  anomaly %  normal no.  normal %  Total  No. / After Conversion
aol 2 100 0 0 2
0 63758
bgp 756 100 0 0 756
courier 774 100 0 0 774
csnet_ns 579 100 0 0 579
ctf 604 100 0 0 604
daytime 549 100 0 0 549
discard 564 100 0 0 564
echo 471 100 0 0 471
efs 518 100 0 0 518
exec 501 100 0 0 501
gopher 552 100 0 0 552
harvest 2 100 0 0 2
hostnames 483 100 0 0 483
http_2784 1 100 0 0 1
http_443 566 100 0 0 566
http_8001 2 100 0 0 2
iso_tsap 735 100 0 0 735
klogin 454 100 0 0 454
kshell 323 100 0 0 323
ldap 429 100 0 0 429
login 458 100 0 0 458
mtp 471 100 0 0 471
name 488 100 0 0 488
netbios_dgm 430 100 0 0 430
netbios_ns 383 100 0 0 383
netbios_ssn 377 100 0 0 377
netstat 386 100 0 0 386
nnsp 672 100 0 0 672
nntp 317 100 0 0 317
pm_dump 21 100 0 0 21
pop_2 91 100 0 0 91
printer 80 100 0 0 80
rje 94 100 0 0 94
sql_net 263 100 0 0 263
supdup 571 100 0 0 571
systat 509 100 0 0 509
uucp 830 100 0 0 830
uucp_path 735 100 0 0 735
vmnet 660 100 0 0 660
whois 733 100 0 0 733
Z39_50 907 100 0 0 907
sunrpc 539 99.81481 1 0.185185 540
link 515 99.8062 1 0.193798 516
imap4 950 99.6852 3 0.314795 953
remote_job 91 98.91304 1 1.086957 92
ssh 332 98.51632 5 1.48368 337
shell 77 95.06173 4 4.938272 81
domain 582 93.87097 38 6.129032 620
ecr_i 3591 93.78428 238 6.215722 3829
private 24793 93.11225 1834 6.887746 26627
eco_i 4325 89.21205 523 10.78795 4848
time 609 88.26087 81 11.73913 690
pop_3 1089 84.87919 194 15.12081 1283
telnet 3018 75.8482 961 24.1518 3979
auth 767 75.04892 255 24.95108 1022
finger 1311 68.89122 592 31.10878 1903
1 17271
ftp 1480 60.50695 966 39.49305 2446
tim_i 8 57.14286 6 42.85714 14
other 2528 48.64345 2669 51.35655 5197
ftp_data 2407 31.21515 5304 68.78485 7711
X11 17 19.31818 71 80.68182 88
2 67488
smtp 600 7.275373 7647 92.72463 8247
http 3469 7.19844 44722 92.80156 48191
IRC 4 2 196 98 200
urp_i 8 1.28 617 98.72 625
domain_u 11 0.110697 9926 99.8893 9937
ntp_u 0 0 178 100 178
red_i 0 0 8 100 8
tftp_u 0 0 4 100 4
urh_i 0 0 10 100 10
Table 4.6 presents the statistics of the data in the “service” feature. A total
of 55 values have a probability of more than 75% of belonging to the anomaly
class; thus, 63758 entries are converted to the numeric value zero. Five values
whose probability of being classified as anomalous ranges from 31.22% to 68.89%
are converted to the numeric value one. The remaining values, with probabilities
from 0.00% to 19.32%, are converted to the numeric value two.
Table 4.7 Statistic and Value for “flag” After Conversion
Flag    anomaly no.  anomaly %  normal no.  normal %  Total  After Conversion (group total)
SH      342          99.4186    2           0.581395  344    0 (58169)
RSTOS0  104          99.04762   1           0.952381  105    0
S0      36510        99.03971   354         0.960286  36864  0
RSTR    2900         93.85113   190         6.148867  3090   0
RSTO    2114         90.53533   221         9.464668  2335   0
S3      249          83.55705   49          16.44295  298    0
REJ     12355        81.91341   2728        18.08659  15083  0
OTH     39           78         11          22        50     0
SF      16820        18.72634   73000       81.27366  89820  1 (90348)
S2      15           10.56338   127         89.43662  142    1
S1      14           3.626943   372         96.37306  386    1
After performing discretization, the probability range for the occurrence of
anomalous incidents is divided into two categories: values with a probability of
more than 78% of belonging to the anomaly class are grouped into one category,
which is assigned the numeric value zero, while the other group, with
probabilities ranging from 3.63% to 18.73%, is assigned the numeric value one
(Table 4.7). Figure 4.2 below shows 10 point series generated using the
probability conversion technique for categorical data.
Figure 4.2 Point Series Data with Probability Conversion
c) Entropy Conversion
The third conversion method used to convert the categorical features is entropy
conversion, which also converts the distinct values of a categorical feature into
a sequence of integers. However, unlike the simple conversion, entropy conversion
first calculates the entropy of each value within a feature. The values are then
sorted by entropy, where the lowest entropy receives the highest rank of zero,
and so forth. The entropy value is computed according to the formula:
Entropy, Ent = −[(x/N) × log2(x/N) + (y/N) × log2(y/N)]
where x is the total number of anomalous incidents for that particular value
within a feature, y is the total number of normal incidents for that value, and
N is the total number of entries for that value within the feature.
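The formula above can be sketched in Python as follows. The counts are those reported in Table 4.8 for "protocol_type", and the ranking step mirrors the sorting described above; function and variable names are illustrative.

```python
import math

def category_entropy(x, y):
    """Binary entropy of a category with x anomalous and y normal entries."""
    n = x + y
    ent = 0.0
    for count in (x, y):
        if count:  # 0 * log2(0) is taken as 0
            p = count / n
            ent -= p * math.log2(p)
    return ent

# anomaly/normal counts for protocol_type, taken from Table 4.8
ents = {"icmp": category_entropy(7932, 1402),
        "udp": category_entropy(3404, 14210),
        "tcp": category_entropy(60127, 61442)}
# rank values by ascending entropy: the lowest entropy is converted to 0
ranks = {v: r for r, (v, _) in
         enumerate(sorted(ents.items(), key=lambda kv: kv[1]))}
```

The resulting ranks (icmp 0, udp 1, tcp 2) match the converted values in Table 4.8.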
Table 4.8 Entropy of “protocol_type” Data and Corresponding Converted Value
protocol_type  anomaly  normal  total  Entropy  After Conversion
icmp 7932 1402 9334 0.61 0
udp 3404 14210 17614 0.71 1
tcp 60127 61442 121569 1.00 2
Table 4.8 shows that icmp has the highest ranking, with the lowest entropy value
of 0.61, and is therefore converted to zero; udp is converted to the numeric
value one, and tcp, with the lowest ranking, is converted to two.
Table 4.9 Entropy of “service” Data and Corresponding Converted Value
Service  Anomaly  Normal  Total  Entropy  After Conversion
aol 2 0 2 0 0
bgp 756 0 756 0 0
courier 774 0 774 0 0
csnet_ns 579 0 579 0 0
ctf 604 0 604 0 0
daytime 549 0 549 0 0
discard 564 0 564 0 0
echo 471 0 471 0 0
efs 518 0 518 0 0
exec 501 0 501 0 0
gopher 552 0 552 0 0
harvest 2 0 2 0 0
hostnames 483 0 483 0 0
http_2784 1 0 1 0 0
http_443 566 0 566 0 0
http_8001 2 0 2 0 0
iso_tsap 735 0 735 0 0
klogin 454 0 454 0 0
kshell 323 0 323 0 0
ldap 429 0 429 0 0
login 458 0 458 0 0
mtp 471 0 471 0 0
name 488 0 488 0 0
netbios_dgm 430 0 430 0 0
netbios_ns 383 0 383 0 0
netbios_ssn 377 0 377 0 0
netstat 386 0 386 0 0
nnsp 672 0 672 0 0
nntp 317 0 317 0 0
pm_dump 21 0 21 0 0
pop_2 91 0 91 0 0
printer 80 0 80 0 0
rje 94 0 94 0 0
sql_net 263 0 263 0 0
supdup 571 0 571 0 0
systat 509 0 509 0 0
uucp 830 0 830 0 0
uucp_path 735 0 735 0 0
vmnet 660 0 660 0 0
whois 733 0 733 0 0
Z39_50 907 0 907 0 0
ntp_u 0 178 178 0 0
red_i 0 8 8 0 0
tftp_u 0 4 4 0 0
urh_i 0 10 10 0 0
domain_u 11 9926 9937 0.012466 1
sunrpc 539 1 540 0.019478 2
link 515 1 516 0.020257 3
imap4 950 3 953 0.030698 4
remote_job 91 1 92 0.086504 5
urp_i 8 617 625 0.098831 6
ssh 332 5 337 0.111374 7
IRC 4 196 200 0.141441 8
shell 77 4 81 0.283769 9
domain 582 38 620 0.332546 10
ecr_i 3591 238 3829 0.335949 11
private 24793 1834 26627 0.361721 12
http 3469 44722 48191 0.373286 13
smtp 600 7647 8247 0.376117 14
eco_i 4325 523 4848 0.493487 15
time 609 81 690 0.521815 16
pop_3 1089 194 1283 0.612855 17
X11 17 71 88 0.70808 18
telnet 3018 961 3979 0.797556 19
auth 767 255 1022 0.810502 20
finger 1311 592 1903 0.894425 21
ftp_data 2407 5304 7711 0.895641 22
ftp 1480 966 2446 0.967908 23
tim_i 8 6 14 0.985228 24
other 2528 2669 5197 0.999469 25
Table 4.9 shows the distinct data for the feature “service”. A total of 45 values
have an entropy of zero and are converted to the numeric value zero. The rest are
converted according to their ranking, where the highest entropy has the lowest
ranking. After the entropy calculation, the distinct values within the “flag”
feature are arranged from SH to OTH, where SH has the lowest entropy and is
ranked zero, while OTH has the lowest ranking of ten (Table 4.10).
Table 4.10 Entropy of “flag” Data and Corresponding Converted Value
Flag Anomaly Normal Total Entropy After
Conversion
SH 342 2 344 0.051539 0
RSTOS0 104 1 105 0.07762 1
S0 36510 354 36864 0.078149 2
S1 14 372 386 0.224918 3
RSTR 2900 190 3090 0.333326 4
RSTO 2114 221 2335 0.451793 5
S2 15 127 142 0.486604 6
S3 249 49 298 0.644802 7
REJ 12355 2728 15083 0.681968 8
SF 16820 73000 89820 0.695712 9
OTH 39 11 50 0.760168 10
4.2.2 Data Normalization
Having a larger range for a feature tend to give greater weight, thus the data should
be normalized. For distance-based methods, normalization help prevent attributes
with initially, large ranges (e.g., income) from outweighing attributes with initially,
smaller ranges (e.g., binary attributes) (Kumar & Nancy, 2014). There exists many
ways for data normalization, one of which is the min-max normalization method
which preserves the relationships among the original data values.
Table 4.11 Features with Minimum and Maximum Value for Simple Conversion
Features            Min  Max      | Features                      Min  Max
duration            0    57715    | is_guest_login                0    1
protocol_type       0    2        | count                         0    511
service             0    69       | srv_count                     0    511
flag                0    10       | serror_rate                   0    1
src_bytes           0    1.38E+09 | srv_serror_rate               0    1
dst_bytes           0    1.31E+09 | rerror_rate                   0    1
land                0    1        | srv_rerror_rate               0    1
wrong_fragment      0    3        | same_srv_rate                 0    1
urgent              0    3        | diff_srv_rate                 0    1
hot                 0    101      | srv_diff_host_rate            0    1
num_failed_logins   0    5        | dst_host_count                0    255
logged_in           0    1        | dst_host_srv_count            0    255
num_compromised     0    7479     | dst_host_same_srv_rate        0    1
root_shell          0    1        | dst_host_diff_srv_rate        0    1
su_attempted        0    2        | dst_host_same_src_port_rate   0    1
num_root            0    7468     | dst_host_srv_diff_host_rate   0    1
num_file_creations  0    100      | dst_host_serror_rate          0    1
num_shells          0    5        | dst_host_srv_serror_rate      0    1
num_access_files    0    9        | dst_host_rerror_rate          0    1
num_outbound_cmds   0    0        | dst_host_srv_rerror_rate      0    1
is_host_login       0    1        | class                         0    1
Table 4.11 shows the features of the NSL-KDD dataset and their corresponding
minimum and maximum values. From the table, at least 13 of the features have a
wide range of values. After identifying the minimum and maximum value of each
feature, the data are scaled to fit the range of zero to one using the formula,
norm_data = (data(feature) − min(feature)) / (max(feature) − min(feature))
where norm_data is the data after normalization, data(feature) is the data that
undergoes normalization, min(feature) is the minimum value for the respective
feature and max(feature) is the maximum value for the respective feature.
Subsequently, the data are ready to be represented in curve format.
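The min-max formula can be sketched as below. The constant-column guard is an added assumption for features such as num_outbound_cmds, whose minimum and maximum in Table 4.11 are both zero and would otherwise cause division by zero.

```python
def min_max_normalize(column):
    """Scale a feature column into [0, 1] using min-max normalization."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column, e.g. num_outbound_cmds (min = max = 0)
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# e.g. dst_host_count, whose range in Table 4.11 is 0 to 255
normalized = min_max_normalize([0, 255, 51])
```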
4.2.3 Feature Selection
The data above then underwent feature selection. Two techniques, Information
Gain (IG) and Correlation Feature Selection (CFS), were used in this section. The
original data, containing categorical and numeric values, are first imported into
WEKA for feature selection.
a) Information Gain (IG)
The original dataset is uploaded to WEKA for feature selection using the
Information Gain method. The Information Gain attribute evaluator is selected;
by default, it must use the Ranker search method. Figure 4.3 below shows a
snapshot captured during the Information Gain feature selection process in WEKA.
After the feature selection process is completed, a table of the average merit
and average rank for each attribute is listed in descending order.
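Although the selection itself was done in WEKA, the quantity being ranked can be sketched as follows. This is a generic information-gain computation over a single categorical feature, not WEKA's implementation; names and the toy data are illustrative.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / n
        ent -= p * math.log2(p)
    return ent

def information_gain(feature, labels):
    """Entropy reduction obtained by splitting the labels on a feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# a perfectly predictive toy feature has IG equal to the class entropy (1 bit)
ig = information_gain(["a", "a", "b", "b"],
                      ["anomaly", "anomaly", "normal", "normal"])
```

Ranking all 41 features by this quantity yields an ordering like the one in Table 4.12.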
Figure 4.3 Snapshot on WEKA - Information Gain Feature Selection
Table 4.12 Output of Information Gain Feature Selection
average merit average rank attribute
0.781 +- 0 1 +- 0 5 src_bytes
0.625 +- 0.001 2 +- 0 6 dst_bytes
0.62 +- 0.001 3 +- 0 3 service
0.473 +- 0.001 4 +- 0 4 flag
0.46 +- 0.001 5 +- 0 30 diff_srv_rate
0.45 +- 0.001 6 +- 0 29 same_srv_rate
0.427 +- 0.001 7 +- 0 33 dst_host_srv_count
0.391 +- 0.001 8 +- 0 34 dst_host_same_srv_rate
0.372 +- 0.001 9 +- 0 35 dst_host_diff_srv_rate
0.36 +- 0.001 10 +- 0 12 logged_in
0.339 +- 0.001 11 +- 0 23 count
0.337 +- 0.001 12 +- 0 38 dst_host_serror_rate
0.331 +- 0.001 13 +- 0 25 serror_rate
0.329 +- 0 14 +- 0 39 dst_host_srv_serror_rate
0.312 +- 0.001 15 +- 0 26 srv_serror_rate
0.248 +- 0.001 16 +- 0 37 dst_host_srv_diff_host_rate
0.185 +- 0.001 17.1 +-0.3 32 dst_host_count
0.184 +- 0.001 17.9 +-0.3 36 dst_host_same_src_port_rate
0.14 +- 0 19 +- 0 31 srv_diff_host_rate
0.091 +- 0.001 20 +- 0 41 dst_host_srv_rerror_rate
0.079 +- 0 21 +- 0 27 rerror_rate
0.077 +- 0 22 +- 0 40 dst_host_rerror_rate
0.072 +- 0 23 +- 0 28 srv_rerror_rate
0.069 +- 0 24 +- 0 24 srv_count
0.058 +- 0 25 +- 0 2 protocol_type
0.034 +- 0.001 26 +- 0 1 duration
0.015 +- 0 27 +- 0 10 hot
0.007 +- 0 28 +- 0 13 num_compromised
0.006 +- 0 29 +- 0 8 wrong_fragment
0.003 +- 0 30 +- 0 16 num_root
0.002 +- 0 31 +- 0 11 num_failed_logins
0.002 +- 0 32 +- 0 19 num_access_files
0.001 +- 0 33 +- 0 17 num_file_creations
0 +- 0 34 +- 0 15 su_attempted
0 +- 0 35 +- 0 14 root_shell
0 +- 0 36.4 +-0.49 7 land
0 +- 0 36.7 +-0.64 21 is_host_login
0 +- 0 37.9 +-0.3 18 num_shells
0 +- 0 39 +- 0 20 num_outbound_cmds
0 +- 0 40 +- 0 22 is_guest_login
0 +- 0 41 +- 0 9 urgent
Table 4.13 Features Removed Corresponding to Each Selected Feature Percentage
Features 80% 70% 60% 50% 40% 30%
Number of features to remove -8 -13 -17 -21 -25 -29
1 duration X X X X
2 protocol_type X X X X
3 service
4 flag
5 src_bytes
6 dst_bytes
7 land X X X X X X
8 wrong_fragment X X X X X
9 urgent X X X X X X
10 hot X X X X
11 num_failed_logins X X X X X
12 logged_in
13 num_compromised X X X X
14 root_shell X X X X X X
15 su_attempted X X X X X X
16 num_root X X X X X
17 num_file_creations X X X X X
18 num_shells X X X X X X
19 num_access_files X X X X X
20 num_outbound_cmds X X X X X X
21 is_host_login X X X X X X
22 is_guest_login X X X X X X
23 count
24 srv_count X X X
25 serror_rate X
26 srv_serror_rate X
27 rerror_rate X X X
28 srv_rerror_rate X X X
29 same_srv_rate
30 diff_srv_rate
31 srv_diff_host_rate X X
32 dst_host_count X X
33 dst_host_srv_count
34 dst_host_same_srv_rate
35 dst_host_diff_srv_rate
36 dst_host_same_src_port_rate X X
37 dst_host_srv_diff_host_rate X
38 dst_host_serror_rate
39 dst_host_srv_serror_rate X
40 dst_host_rerror_rate X X X
41 dst_host_srv_rerror_rate X X
Table 4.12 shows the output of the IG feature selection process done in WEKA.
After the ranking is identified, the data are prepared accordingly. Table 4.13
above shows the attributes removed for each selected feature percentage.
Figure 4.4 Script Written to Remove Features’ Column from Dataset
Figure 4.4 above shows the script written to aid the removal of unwanted
features from the dataset. The example shown removes 8 features for the
experiment that uses 80% of the total features; the remaining cases were
prepared in a similar fashion.
When 80% of the original attributes are used, 8 features are removed, leaving a
total of 33 features. 13 features are removed when 70% of the original attributes
are used, and 17 features are removed when 60% of the attributes (24 features)
are preserved. Using 50% of the original attributes removes 21 features;
selecting 40% removes 25 features, and finally selecting 30% removes 29 features.
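The script in Figure 4.4 is not reproduced here; the column-removal step it performs can be sketched in Python as follows, with the 1-based column numbers taken from Table 4.13 for the 80% experiment. Function and variable names are illustrative.

```python
def remove_features(rows, drop_columns):
    """Delete the 1-based feature columns listed for a given percentage."""
    drop = {c - 1 for c in drop_columns}  # thesis column numbers are 1-based
    return [[v for j, v in enumerate(row) if j not in drop] for row in rows]

# toy 2 x 41 matrix standing in for the 41-feature data set
rows = [list(range(41)), list(range(41, 82))]
# the eight features removed in the 80% case per Table 4.13:
# land, urgent, root_shell, su_attempted, num_shells,
# num_outbound_cmds, is_host_login, is_guest_login
kept = remove_features(rows, [7, 9, 14, 15, 18, 20, 21, 22])
```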
b) Correlation Feature Selection (CFS)
The original data set is again uploaded to WEKA, this time for feature selection
using the Correlation Feature Selection (CFS) method. The Correlation Feature
Selection attribute evaluator is selected; by default, it uses the Best First
search method. Figure 4.5 below shows a snapshot captured during the Correlation
Feature Selection process in WEKA.
Figure 4.5 Snapshot on WEKA - Correlation Feature Selection
After the feature selection process is completed, a list of the selected features
is shown. Figure 4.6 shows the output of the selected features after the CFS
process in WEKA.
Figure 4.6 Features Selected using Correlation Feature Selection
After the features to be selected were identified, a table was drawn up to
identify the locations of the selected features. Table 4.14 shows the selected
features’ column numbers for each categorical data conversion. Datasets produced
by the simple conversion of categorical data retain 8 features; the same holds
for the probability and entropy conversion methods.
Table 4.14 Selected Features and Their Respective Column for Each Data Conversion Technique
Column  Features  Selected Features | Column  Features  Selected Features
1 duration 22 is_guest_login
2 protocol_type 23 count
3 service 24 srv_count
4 flag 4 25 serror_rate
5 src_bytes 5 26 srv_serror_rate 26
6 dst_bytes 6 27 rerror_rate
7 land 28 srv_rerror_rate
8 wrong_fragment 29 same_srv_rate 29
9 urgent 30 diff_srv_rate 30
10 hot 31 srv_diff_host_rate
11 num_failed_logins 32 dst_host_count
12 logged_in 12 33 dst_host_srv_count
13 num_compromised 34 dst_host_same_srv_rate
14 root_shell 35 dst_host_diff_srv_rate
15 su_attempted 36 dst_host_same_src_port_rate
16 num_root 37 dst_host_srv_diff_host_rate 37
17 num_file_creations 38 dst_host_serror_rate
18 num_shells 39 dst_host_srv_serror_rate
19 num_access_files 40 dst_host_rerror_rate
20 num_outbound_cmds 41 dst_host_srv_rerror_rate
21 is_host_login 42 class
4.3 Experimental Setting
Previous section discusses how all the data being pre-processed for experiments,
whereas in this is section, details about experiments carried in this research are
stated clearly. After the data is successfully represented in a point series format, the
data is then separated into ten smaller sets to perform 10-fold cross validation. Each
set contains 14851 of data entries except for last test set which contains a total of
14857 data entries. A total of four set experiments was carried out in realizing the
objectives of this research paper.
4.3.1 Experiment I: No Categorical Data
The main objective of Experiment I is to provide a benchmark for comparison in
the following experiments. To first establish how categorical data affects
classification, the categorical features “protocol_type”, “service” and “flag”
were removed from the dataset. K-Nearest Neighbour with three different
similarity measures, Euclidean Distance, Cosine Similarity and Dynamic Time
Warping, is performed to classify the network activities. The K-Nearest
Neighbour classification is also repeated using different values of k, where
k = 1, 3, 5, …, 19. The results obtained from this experiment serve as a
baseline for comparison.
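The k-NN classification with interchangeable similarity measures can be sketched as follows. The training points, labels and distance functions are illustrative, and the actual experiments were run in Matlab; DTW could be passed in the same way as the two measures shown.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity, so that smaller means more alike."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def knn_classify(train, labels, query, k=3, dist=euclidean):
    """Majority vote over the k training points closest to the query."""
    ranked = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# toy normalized 2-feature points labelled normal/anomaly
train = [[0.0, 0.1], [0.1, 0.0], [0.9, 1.0], [1.0, 0.9]]
labels = ["normal", "normal", "anomaly", "anomaly"]
pred = knn_classify(train, labels, [0.95, 0.95], k=3)
```

Swapping `dist=cosine_distance` (or a DTW function) changes the similarity measure without altering the voting logic, which mirrors how the three measures are compared across experiments.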
4.3.2 Experiment II: Simple Conversion
The first method used to convert categorical data is the simple conversion technique discussed in Section 4.2.1(a). This method was chosen for Experiment II because it is the most common categorical data conversion technique found in the literature. Again, K-Nearest Neighbour (K-NN) with three different similarity measures was performed to classify the network activities. The K-NN classification was repeated using the three values of k (k = 1, 3 and 5) that yielded the best performance in terms of average accuracy, sensitivity and specificity in Experiment I. The results from this experiment serve as the second baseline.
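The simple conversion can be sketched as below, assuming (as Section 4.2.1(a) describes) that the distinct values of a categorical feature are sorted alphabetically and mapped to consecutive integers; whether the thesis starts the sequence at 0 or 1 is our assumption:

```python
def simple_conversion(values):
    """Map each distinct categorical value to an integer by alphabetical rank.

    E.g. the 'protocol_type' values {'icmp', 'tcp', 'udp'} become
    icmp -> 0, tcp -> 1, udp -> 2 (0-based here; the starting index
    is an assumption on our part)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping
```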
4.3.3 Experiment III: Probability and Entropy Conversion
In Experiment III, classification of network traffic activities is performed on datasets that have undergone probability and entropy conversion [discussed previously in Sections 4.2.1(b) and (c)]. The two best similarity measures from Section 4.3.2 (Euclidean Distance and Cosine Similarity) were selected for K-Nearest Neighbour (K-NN) classification in Experiment III. As in the previous experiments, the K-NN classification was repeated for the best three values of k (k = 1, 3 and 5).
4.3.4 Experiment IV: Feature Selection using Information Gain and
Correlation Feature Selection
The last set of experiments identifies feature selection techniques that are able to reduce the dimensionality of the dataset. The feature selection methods used in this research are Information Gain and Correlation Feature Selection (Section 4.2.3). K-Nearest Neighbour is then applied to the reduced data to classify the network traffic activities.
The results from the above experiments are recorded for analysis in the next chapter. They are also compared with the performance of other machine learning techniques reported in the literature.
4.4 Summary
This chapter described the steps taken before and during the experiments. The data must be prepared, or rather pre-processed, before they can be loaded for the experiments. Three important steps were carried out in the data pre-processing phase: conversion of categorical data, data normalization and, lastly, feature selection. Conversion of categorical data into numeric values is accomplished using three conversion techniques, namely alphabetical simple conversion, probability conversion and entropy conversion. The converted data were then combined with the remaining numerical data and normalized using the min-max normalization method. K-NN classification with three similarity measures – Euclidean Distance, Cosine Similarity and Dynamic Time Warping – was performed on these data. The data also later underwent feature selection to reduce their dimensionality. As the dataset involved is very large, the process is computationally demanding.
CHAPTER 5
RESULTS AND ANALYSIS
5.1 Chapter Overview
In this chapter, the results and analysis of the experiments performed in the previous chapter are presented. The performance of each experiment is compared and explained in detail. Section 5.2 shows the results and analysis for Experiment I, whereas Section 5.3 discusses the results obtained from Experiment II and the corresponding analysis. Section 5.4 analyses the results obtained from Experiment III, and Section 5.5 discusses the results of Experiment IV. In Section 5.6, the obtained results are compared with related works identified in the literature review.
5.2 Experiment I: No Categorical Data
In Experiment I, the categorical data were removed from both the test and training sets. K-Nearest Neighbour (K-NN) classification with Euclidean Distance, Cosine Similarity and Dynamic Time Warping was performed. For the first experiment, Euclidean Distance is used as the distance similarity measure in K-NN. The experiments were repeated for different values of k, where k = 1, 3, 5, 7, 9, 11, …, 19. Table 5.1 below shows the average accuracy, sensitivity and specificity, with the corresponding standard deviations, when using Euclidean Distance (ED) as the distance similarity measure with no categorical data present in the dataset.
Table 5.1 Result of K-NN with ED on Dataset with No Categorical
Features
Euclidean No Categorical Conversion
K accuracy sensitivity specificity
Average Std dev Average Std dev Average Std dev
1 98.66 0.12 98.61 0.14 98.72 0.14
3 98.58 0.28 98.50 0.46 98.67 0.46
5 98.60 0.13 98.50 0.12 98.71 0.12
7 98.51 0.16 98.42 0.14 98.61 0.14
9 98.46 0.15 98.41 0.12 98.50 0.12
11 98.36 0.16 98.32 0.11 98.41 0.11
13 98.27 0.18 98.23 0.14 98.33 0.14
15 98.22 0.18 98.20 0.16 98.25 0.16
17 98.17 0.20 98.16 0.17 98.18 0.17
19 98.11 0.23 98.12 0.14 98.11 0.14
From Table 5.1, it is clearly seen that the three best values of k are 1, 3 and 5 for K-NN with ED. With k = 1, K-NN achieved an average accuracy of 98.66%, an average sensitivity of 98.61% and an average specificity of 98.72%. With k = 5, the average accuracy is 98.60%, the average sensitivity 98.50% and the average specificity 98.71%, making its performance slightly worse than that of k = 1. With k = 3, K-NN with ED performed third best, with an average accuracy of 98.58%, an average sensitivity of 98.50% and an average specificity of 98.67%.
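The three reported measures follow the usual confusion-matrix definitions; a minimal sketch, under our assumption that the attack class is treated as positive:

```python
def evaluate(predicted, actual, positive="attack"):
    """Return (accuracy, sensitivity, specificity) in percent.

    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    Treating `positive` as the attack class is our assumption."""
    tp = sum(p == positive and a == positive for p, a in zip(predicted, actual))
    tn = sum(p != positive and a != positive for p, a in zip(predicted, actual))
    fp = sum(p == positive and a != positive for p, a in zip(predicted, actual))
    fn = sum(p != positive and a == positive for p, a in zip(predicted, actual))
    accuracy = 100 * (tp + tn) / len(actual)
    sensitivity = 100 * tp / (tp + fn)
    specificity = 100 * tn / (tn + fp)
    return accuracy, sensitivity, specificity
```

The averages and standard deviations in the tables would then be taken over the ten cross-validation folds.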
For the second experiment, Cosine Similarity is used as the distance similarity measure in K-NN. The experiments were again repeated for different values of k, where k = 1, 3, 5, 7, 9, 11, …, 19. Table 5.2 below shows the average accuracy, sensitivity and specificity, with the corresponding standard deviations, when using Cosine Similarity as the distance similarity measure with no categorical data present in the dataset.
Table 5.2 Result of K-NN with Cosine on Dataset with No Categorical
Features
Cosine No Categorical Conversion
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 98.67 0.08 98.63 0.13 98.71 0.12
3 98.66 0.06 98.56 0.13 98.77 0.09
5 98.63 0.08 98.52 0.13 98.75 0.12
7 98.53 0.10 98.43 0.15 98.63 0.16
9 98.47 0.10 98.43 0.13 98.52 0.13
11 98.38 0.09 98.34 0.15 98.43 0.13
13 98.31 0.11 98.25 0.17 98.38 0.15
15 98.26 0.12 98.21 0.20 98.31 0.15
17 98.18 0.12 98.14 0.19 98.23 0.16
19 98.14 0.14 98.10 0.20 98.18 0.16
From Table 5.2, the best three values of k are 1, 3 and 5 for K-NN with Cosine Similarity. With k = 1, K-NN achieved an average accuracy of 98.67%, an average sensitivity of 98.63% and an average specificity of 98.71%. With k = 3, the average accuracy is 98.66%, the average sensitivity 98.56% and the average specificity 98.77%, making its performance slightly worse than that of k = 1. With k = 5, K-NN with Cosine Similarity performed third best, with an average accuracy of 98.63%, an average sensitivity of 98.52% and an average specificity of 98.75%.
Table 5.3 Result of K-NN with DTW on Dataset with No Categorical
Features
DTW - No Categorical Features
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 98.96 0.36 98.90 0.30 99.01 0.50
3 98.58 0.36 98.50 0.28 98.67 0.46
Table 5.3 shows the results obtained from the experiment performing K-NN with DTW. From Table 5.3, KNN-DTW classification has the best performance, achieving an accuracy of 98.96%, a sensitivity of 98.90% and a specificity of 99.01% when k = 1. When k = 3, an accuracy of 98.58%, a sensitivity of 98.50% and a specificity of 98.67% are recorded.
Note that in this experiment only k = 1 and 3 were used. This is due to the time constraints of performing classification using Dynamic Time Warping for K-Nearest Neighbour. Furthermore, the previous experiments indicate the trend of k: performance is best when k equals one. Thus, the KNN-DTW classification stops at k = 3. Subsequent experiments involving KNN-DTW were performed in a similar fashion.
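DTW's cost explains the constraint: unlike Euclidean distance, which is a single pass over the features, DTW fills an n×m cost table for every pair of records compared. A textbook, unconstrained sketch (not the code used in this research):

```python
import math

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two numeric sequences.

    Runs in O(len(a) * len(b)) time per pair -- the computational
    burden that limited the KNN-DTW runs to k = 1 and 3."""
    n, m = len(a), len(b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]
```

Because every test record must be compared against every training record, this quadratic-per-pair cost dominates the running time of a KNN-DTW experiment.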
5.3 Experiment II: Simple Conversion
In Experiment II, the categorical data were converted to numerical data using the simple conversion method. K-Nearest Neighbour (K-NN) classification with Euclidean Distance, Cosine Similarity and Dynamic Time Warping was performed. The experiments were repeated using different values of k, where k = 1, 3 and 5 (the three values with the best performance in the previous section). Tables 5.4, 5.5 and 5.6 below show the average accuracy, sensitivity and specificity, with the corresponding standard deviations, when using Euclidean Distance (ED), Cosine Similarity and Dynamic Time Warping as the distance similarity measure with simply converted categorical data present in the dataset.
Table 5.4 Result of K-NN with ED on Dataset with Simple Conversion on
Categorical Features
Euclidean Simple Categorical Conversion - Full Features
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 97.97 0.21 97.99 0.13 97.94 0.13
3 97.87 0.17 97.90 0.16 97.84 0.16
5 97.78 0.22 97.90 0.19 97.66 0.19
Table 5.5 Result of K-NN with Cosine on Dataset with Simple Conversion
on Categorical Features
Cosine Simple Categorical Conversion - Full Features
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 97.98 0.11 98.00 0.22 97.95 0.15
3 97.87 0.12 97.90 0.20 97.85 0.16
5 97.79 0.16 97.88 0.24 97.69 0.19
Table 5.6 Result of K-NN with DTW on Dataset with Simple Conversion on
Categorical Features
DTW - Simple Conversion Full Features
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 88.92 0.45 91.59 0.37 86.08 0.44
3 83.53 0.33 85.42 0.30 78.62 0.31
From the three tables above, ED and Cosine Similarity have about the same performance in terms of accuracy (97.97% and 97.98%), sensitivity (97.99% and 98.00%) and specificity (97.94% and 97.95%). DTW, however, does not perform as well as ED and Cosine Similarity, achieving an average accuracy of 88.92%, a sensitivity of 91.59% and a specificity of 86.08%.
The KNN-DTW experiments were stopped at this point because (i) the objective in Section 1.4(b) had already been achieved, and (ii) of time constraints – KNN-DTW experiments take too long to run.
5.4 Experiment III: Probability and Entropy Conversion
In Experiment III, K-Nearest Neighbour classification is performed on datasets converted using the two proposed categorical data conversion techniques: probability conversion and entropy conversion (Tables 5.7 – 5.10).
Table 5.7 Result of KNN-ED on Dataset with Probability Conversion on
Categorical Features
KNN-ED Probability Conversion - Full Features
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 92.25 9.29 92.40 6.74 90.37 10.78
3 91.82 8.99 89.92 14.61 88.58 11.21
5 92.14 8.87 89.94 15.44 88.30 11.00
Table 5.8 Result of KNN-ED on Dataset with Entropy Conversion on
Categorical Features
KNN-ED Entropy Conversion – Full Features
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 99.19 0.04 99.16 0.08 99.22 0.08
3 99.12 0.05 99.13 0.08 99.12 0.10
5 99.03 0.06 99.09 0.09 98.96 0.09
Table 5.9 Result of KNN-Cosine on Dataset with Probability Conversion
on Categorical Features
KNN-Cosine Probability Conversion - Full Features
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 92.28 9.43 92.49 6.63 90.26 11.10
3 91.94 9.10 90.20 14.42 88.55 11.38
5 92.12 9.05 90.19 15.06 88.21 11.21
Table 5.10 Result of KNN-Cosine on Dataset with Entropy Conversion on
Categorical Features
KNN-Cosine Entropy Conversion – Full Features
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 99.21 0.05 99.18 0.09 99.24 0.09
3 99.12 0.04 99.12 0.09 99.12 0.09
5 99.02 0.04 99.07 0.09 98.97 0.08
By using the proposed entropy conversion technique, the performance of classification using both KNN-ED and KNN-Cosine increased in all aspects. The accuracy, sensitivity and specificity using the simple conversion technique are around 97.9%, whereas entropy conversion achieves around 99.2%. This is because calculating the entropy of the values within a feature conserves the ranking between them, whereas the simple conversion technique, which arranges the distinct values alphabetically and assigns them to a sequence of integers, does not. Conversion using the probability conversion technique shows a weaker performance of around 91%.
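One way to realise the entropy conversion – our reading of Section 4.2.1(c), not a verbatim reproduction of the formula defined there – is to replace each category value with the Shannon entropy of the class labels observed alongside it, so that values are ordered by how "mixed" their traffic is:

```python
import math
from collections import Counter, defaultdict

def entropy_conversion(values, labels):
    """Replace each categorical value with the Shannon entropy (in bits)
    of the class labels co-occurring with it.

    This is an interpretation of the technique in Section 4.2.1(c);
    the thesis defines the exact formula."""
    by_value = defaultdict(list)
    for v, y in zip(values, labels):
        by_value[v].append(y)
    entropy = {}
    for v, ys in by_value.items():
        total = len(ys)
        entropy[v] = -sum((c / total) * math.log2(c / total)
                          for c in Counter(ys).values())
    return [entropy[v] for v in values], entropy
```

A value seen only with one class maps to 0, while a value split evenly across classes maps to 1 bit, which gives the converted feature a meaningful numeric ordering.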
5.5 Experiment IV: Feature Selection using Information Gain and
Correlation Feature Selection
In this section, only the results of K-NN classification with Euclidean Distance on the entropy-converted dataset are shown, as it has the best performance among the approaches mentioned in the previous section.
Table 5.11 Result of KNN-ED on Dataset with Reduced Features using
Information Gain =70% Feature Selection and Entropy Conversion on
Categorical Features
Ent-IG- 70%
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 99.16 0.04 99.11 0.09 99.21 0.08
3 99.08 0.05 99.06 0.11 99.10 0.09
5 98.99 0.06 99.02 0.11 98.95 0.09
Table 5.12 Result of KNN-ED on Dataset with Reduced Features using
Information Gain =60% Feature Selection and Entropy Conversion on
Categorical Features
Ent-IG- 60%
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 99.02 0.06 98.96 0.11 99.08 0.07
3 98.91 0.07 98.88 0.12 98.94 0.07
5 98.81 0.07 98.82 0.12 98.79 0.10
Table 5.13 Result of KNN-ED on Dataset with Reduced Features using
Information Gain =50% Feature Selection and Entropy Conversion on
Categorical Features
Ent-IG- 50%
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 98.94 0.05 98.87 0.07 99.00 0.07
3 98.84 0.08 98.81 0.12 98.87 0.11
5 98.72 0.09 98.75 0.14 98.69 0.12
Table 5.14 Result of KNN-ED on Dataset with Reduced Features using
Information Gain =40% Feature Selection and Entropy Conversion on
Categorical Features
Ent-IG- 40%
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 98.61 0.12 98.61 0.16 98.61 0.15
3 98.57 0.11 98.50 0.16 98.66 0.14
5 98.52 0.12 98.46 0.18 98.58 0.16
Table 5.15 Result of KNN-ED on Dataset with Reduced Features using
Information Gain =30% Feature Selection and Entropy Conversion on
Categorical Features
Ent-IG- 30%
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 98.21 0.10 98.22 0.16 98.19 0.09
3 98.21 0.10 98.26 0.15 98.16 0.12
5 98.19 0.11 98.32 0.17 98.06 0.16
Table 5.16 Result of KNN-ED on Dataset with Reduced Features using
Correlation Feature Selection and Entropy Conversion on Categorical
Features
Ent-CFS: 8 features
K Accuracy Sensitivity Specificity
Average Std dev Average Std dev Average Std dev
1 94.08 0.84 97.09 2.08 90.83 0.56
3 91.94 0.78 91.89 2.09 92.00 0.72
5 91.94 0.78 91.92 2.10 91.96 0.72
Using the Information Gain feature selection technique, five sets of experiments were derived, with 70%, 60%, …, 30% of the total features selected. The results are shown in Tables 5.11 to 5.15. The performance of KNN-ED is best when 70% of the total features are selected. A total of 8 features is selected using the Correlation Feature Selection technique (Table 5.16), yielding an accuracy of 94.08%, a sensitivity of 97.09% and a specificity of 90.83%.
Evaluating the feature selection methods after Experiment IV, it is clearly seen that the performance of the KNN-ED classifier degrades when feature selection is performed on the data.
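The Information Gain selection in Tables 5.11–5.15 amounts to ranking features by IG(class; feature) and keeping the top p%. A sketch for discrete-valued features (continuous features would first need discretisation; the function names are ours):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(column, labels):
    """IG(class; feature) = H(class) - H(class | feature) for one column."""
    gain = entropy(labels)
    total = len(labels)
    for value in set(column):
        subset = [y for x, y in zip(column, labels) if x == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

def select_top_percent(columns, labels, percent):
    """Keep the `percent`% of feature columns with the highest gain.

    `columns` maps feature name -> list of values; as in Tables
    5.11-5.15, `percent` would be 70, 60, ..., 30."""
    ranked = sorted(columns,
                    key=lambda name: information_gain(columns[name], labels),
                    reverse=True)
    keep = max(1, round(len(ranked) * percent / 100))
    return ranked[:keep]
```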
5.6 Comparison of Performance of Network Traffic Classifier with other
Machine Learning Approach
Panda et al. (2010) applied discriminative multinomial Naïve Bayes with various filtering analyses to build a network intrusion detection system. Table 5.17 below shows the results of their proposed algorithms.
Table 5.17 Results for Various Algorithms
Classifier Detection Accuracy (%) False Alarm Rate in %
Discriminative Multinomial Naïve Bayes + PCA
94.84 4.4
Discriminative Multinomial Naïve Bayes + RP
81.47 12.85
Discriminative Multinomial Naïve Bayes + N2B
96.5 3.0
Table 5.18 Result for Application of SOM-ANN Algorithms
Classifier Successful Detection Rate (%)
SOM 68.88
In the research paper by Ibrahim et al. (2013), a Self-Organization Map (SOM) Artificial Neural Network (ANN) is applied to the intrusion databases (KDD99 and NSL-KDD). Table 5.18 above shows the results for the application of the SOM-ANN algorithm.
The machine learning methods compared are Discriminative Multinomial Naïve Bayes (DMNB), the Self-Organization Map (SOM) Artificial Neural Network (ANN) and the K-Nearest Neighbour algorithm with Euclidean Distance as the distance similarity measure (Table 5.19). Using the same data set, 1NN-ED with Entropy Conversion has the best accuracy (~99%), compared with DMNB + PCA (94.84%) and DMNB + N2B (96.5%). The other proposed conversion technique, probability conversion, has a weaker performance of 92.24% but still outperforms DMNB + RP (81.47%). In terms of false alarm rate, 1NN-ED with Entropy Conversion also outperforms the other machine learning approaches, with the lowest value of 0.84%. As far as sensitivity is concerned, 1NN-ED with Entropy Conversion again has the best performance (99.22%). Categorical data converted into numeric values using the entropy conversion method conserve the ranking within them, hence the good performance.
Table 5.19 Comparison of the Performance of Proposed Method and
Other Machine Learning
Classifier Accuracy / Detection Accuracy (%) False Alarm Rate (%) Sensitivity / Detection Rate (%)
Discriminative Multinomial Naïve Bayes + PCA 94.84 4.4
Discriminative Multinomial Naïve Bayes + RP 81.47 12.85
Discriminative Multinomial Naïve Bayes + N2B 96.5 3.0
SOM-ANN 68.88
1NN-ED Entropy Conversion (Proposed) 99.19 0.84 99.22
1NN-ED Probability Conversion (Proposed) 92.24 7.6 90.37
5.7 Chapter Summary
This chapter presented the tabulated results of each experiment. We can conclude that using different techniques to convert categorical data yields different results. From the experiments, using the full feature set with entropy conversion of the dataset provides the best classifier performance. Euclidean Distance as the distance similarity measure for the K-Nearest Neighbour algorithm is the best in terms of accuracy, sensitivity and specificity, as well as the time taken to classify network traffic data.
CHAPTER 6
CONCLUSION
6.1 Chapter Overview
This chapter summaries the whole research paper by revising the problem
background, the research’ objectives. Discussion is also on the summary of the
literature review. An outline will also be given to the proposed methodology.
6.2 Summary of Research Paper
The evolution of new types of attacks and viruses has urged the need for strong Intrusion Detection Systems (IDS). Anomaly-based IDS require strong algorithms that are fast and computationally inexpensive. This raises the problem: “Is TSC feasible for classifying network traffic activities, in particular for IDS?” This research was carried out to answer the question by identifying (i) the features that can be extracted from network traffic to generate point series data, and (ii) how the TSC technique can best be tuned to detect intrusions in network data.
This research paper focuses on the application of Time Series Classification (TSC) techniques to classify network activities so as to detect anomalies. The K-Nearest Neighbour algorithm (k-NN) is used to classify the network traffic activities and, subsequently, the Dynamic Time Warping algorithm, one of the TSC techniques, is used as the similarity measure of the k-NN algorithm.
In the literature review of Chapter 2, an introduction to IDS is given and the application of DTW as one of the TSC techniques is reviewed. The importance of data pre-processing is discussed, including the conversion of symbolic features and feature selection. The k-NN algorithm is also discussed there.
Chapter 3 discusses the methodology used in this research paper. The research programme of work outlines two phases: Phase 1, the identification of the best approach to the problem “How to represent network traffic data in point series form”, and Phase 2, the comparison of the TSC approach identified in Phase 1 with other approaches found in the literature review. Three measurements – detection rate, false positive rate and accuracy – are used to measure the performance of the output.
Chapter 4 gives a detailed explanation of how the experiments were carried out to achieve the objectives of this research paper. Before the experiments start, the data must undergo pre-processing: conversion of the data from nominal to numeric, followed by data normalization. The normalized and converted data then undergo feature selection.
Chapter 5 discusses the results obtained from the experiments, together with a discussion of their performance.
Chapter 6 concludes everything that has been done in this research paper. A table summarising the work done to achieve the objectives is also provided for the reader’s easier understanding.
Table 6.1 Work Done to Achieve the Objectives
Objectives Work done
1 To investigate and identify feature transformation technique that can generate point series data for network activities classification
Three different techniques for converting categorical features into numeric data were applied: simple conversion, probability conversion and entropy conversion; refer to Chapters 4.2.1, 5.3 and 5.4 for more details. Feature selection was performed using Information Gain and Correlation Feature Selection; refer to Chapters 4.2.3 and 5.5 for more details.
2 To investigate the feasibility of Time Series Classification techniques by using k-NN coupled with DTW to classify network traffic activities
Experiment by using DTW as a distance similarity measure for KNN is carried out. Refer Chapter 4.2.1, 5.2 and 5.3 for more details.
3 To investigate the effects of using different similarities measurement, Euclidean Distance (ED) and Cosine similarity algorithm
Various set of experiments is carried out to determine the results of using different distance similarity measures. Refer Chapter 5.2, 5.3 and 5.4 for more details.
4 To compare the performance of network traffic classifier produced in (b) and (c) with other machine learning techniques, Self-Organization Map (SOM) Artificial Neural Network (ANN) by (Ibrahim, Basheer, and Mahmod, 2013) and Discriminative Multinomial Naive Bayes (NB) proposed by (Panda, Abraham, and Patra, 2010)
After all the experiments had been carried out, the results were tabulated and compared with other machine learning methods from the literature review. Refer to Chapter 5.6 for more details.
6.3 Future Works
In this research paper, three experiments were performed with the Dynamic Time Warping distance similarity measure. Due to time and computational constraints, DTW was not explored further. Future work might include tuning the DTW algorithm to enable faster computation. Besides, the proposed probability conversion technique could be further explored for converting categorical data.
REFERENCES
Amr, T. (2012). Survey on Time-Series Data Classification, 1–10.
Anderson, J. P. (1980). Computer Security Threat Monitoring and Surveillance.
Apple confirms accounts compromised but denies security breach. (2014, September 2). BBC. Retrieved from http://www.bbc.com/news/technology-29011850
Bouzida, Y., & Cuppens, F. (2004). Efficient intrusion detection using principal component analysis. Proceedings of the. Retrieved from http://yacine.bouzida.free.fr/Articles/2004SAR.pdf
Brockwell, P. J., & Davis, R. A. (2002). Introduction to Time Series and Forecasting , Second Edition Springer Texts in Statistics.
Chaovalitwongse, W. A., Fan, Y., & Sachdeo, R. C. (2007). On the Time Series K-Nearest Neighbor Classification of Abnormal Brain Activity. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 37(6), 1005–1016. doi:10.1109/TSMCA.2007.897589
Chowdhary, M., Suri, S., & Bhutani, M. (2014). Comparative Study of Intrusion Detection System. International Journal of Computer Sciences and Engineering, 2(4), 197–200.
Datti, R., & Verma, B. (2010). Feature Reduction for Intrusion Detection Using Linear Discriminant Analysis. International Journal on Computer Science and Engineering (IJCSE), 02(04), 1072–1078.
Davis, J. J., & Clark, A. J. (2011). Data preprocessing for anomaly based network intrusion detection: A review. Computers & Security, 30(6-7), 353–375. doi:10.1016/j.cose.2011.05.008
Deepa, A. J., & Kavitha, V. (2012). A Comprehensive Survey on Approaches to Intrusion Detection System. Procedia Engineering, 38, 2063–2069. doi:10.1016/j.proeng.2012.06.248
Elsayed, A., Hijazi, M. H. A., Coenen, F., García-Fiñana, M., Sluming, V., & Zheng, Y. (2011). Time Series Case Based Reasoning for Image Categorisation. In Case-Based Reasoning Research and Development (pp. 423–436). doi:10.1007/978-3-642-23291-6_31
García-Teodoro, P., Díaz-Verdejo, J., Maciá-Fernández, G., & Vázquez, E. (2009). Anomaly-based network intrusion detection: Techniques, systems and challenges. Computers & Security, 28(1-2), 18–28. doi:10.1016/j.cose.2008.08.003
Gillian, N., Knapp, R. B., & Modhrain, S. O. (2011). Recognition Of Multivariate Temporal Musical Gestures Using N-Dimensional Dynamic Time Warping, (June), 337–342.
He, W., Hu, G., Yao, X., Gangyuan, K., Wang, H., & Hongmei, X. (2008). Applying multiple time series data mining to large-scale network traffic analysis. 2008 IEEE Conference on Cybernetics and Intelligent Systems, 394–399. doi:10.1109/ICCIS.2008.4670844
Hernández-Pereira, E., Suárez-Romero, J. A., Fontenla-Romero, O., & Alonso-Betanzos, A. (2009). Conversion methods for symbolic features: A comparison applied to an intrusion detection problem. Expert Systems with Applications, 36(7), 10612–10617. doi:10.1016/j.eswa.2009.02.054
Ibrahim, L. M., Basheer, D. T., & Mahmod, M. S. (2013). A Comparison Study for Intrusion Database (KDD99, NSL-KDD) Based on Self Organization Map (SOM) Artificial Neural Network. Journal of Engineering Science and Technology, 8(1), 107–119.
Karagiannis, T., Papagiannaki, K., & Faloutsos, M. (2005). BLINC : Multilevel Traffic Classification in the Dark. In ACM SIGCOMM Conference 2005 (pp. 229–240). ACM.
Kia, A., Haratizadeh, S., & Zare, H. (2013). Prediction of USD/JPY Exchange Rate Time Series Directional Status by KNN with Dynamic Time Warping. Bonfring International Journal of Data Mining, 3(2), 12–16. doi:10.9756/BIJDM.4658
Kumar, S. (2007). Survey of Current Network Intrusion Detection Techniques. Citeseer, 1–18. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.129.7105&rep=rep1&type=pdf
Li, H., Chen, C. L. P., & Huang, H.-P. (2000). Fuzzy Neural Intelligent Systems: Mathematical Foundation and the Applications in Engineering. Taylor & Francis. Retrieved from https://books.google.com/books?id=IzvqngEACAAJ&pgis=1
More, S., Matthews, M., Joshi, A., & Finin, T. (2012). A knowledge-based approach to intrusion detection modeling. Proceedings - IEEE CS Security and Privacy Workshops, SPW 2012, 75–81. doi:10.1109/SPW.2012.26
Muscillo, R., Schmid, M., Conforto, S., & D’Alessio, T. (2011). Early recognition of upper limb motor tasks through accelerometers: real-time implementation of a DTW-based algorithm. Computers in Biology and Medicine, 41(3), 164–72. doi:10.1016/j.compbiomed.2011.01.007
Panda, M., Abraham, A., & Patra, M. (2010). Discriminative multinomial naive bayes for network intrusion detection. In 2010 Sixth International Conference on Information Assurance and Security (IAS) (pp. 5–10). Retrieved from http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5604193
Robbins, R. (2002). Distributed Intrusion Detection Systems: An Introduction and Review.
Roobaert, D., Karakoulas, G., & Chawla, N. V. (2006). Information gain, correlation and support vector machines. Retrieved June 21, 2015, from http://www.springerlink.com/index/KJ45153333192803.pdf
Sabahi, F., & Movaghar, A. (2008). Intrusion Detection: A Survey. 2008 Third International Conference on Systems and Networks Communications, 23–26. doi:10.1109/ICSNC.2008.44
Shyu, M.-L., Chen, S.-C., Sarinnapakorn, K., & Chang, L. (2003). A Novel Anomaly Detection Scheme Based on Principal Component Classifier. Retrieved from http://www.dtic.mil/cgi-bin/GetTRDoc?Location=U2&doc=GetTRDoc.pdf&AD=ADA465712
Stolfo, S. J., Fan, W., Lee, W., Prodromidis, A., Street, W., & Chan, P. K. (1999). Cost-based Modeling and Evaluation for Data Mining with Application to Fraud and Intrusion Detection: Results from the JAM Project.
Tavallaee, M., Bagheri, E., Lu, W., & Ghorbani, A. A. (2009). A detailed analysis of the KDD CUP 99 data set. 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications (CISDA), 1–6. doi:10.1109/CISDA.2009.5356528
Wang, W., & Battiti, R. (2006). Identifying intrusions in computer networks with principal component analysis. Proceedings - First International Conference on Availability, Reliability and Security, ARES 2006, 2006, 270–277. doi:10.1109/ARES.2006.73
Weller-Fahy, D., Borghetti, B., & Sodemann, A. (2014). A Survey of Distance and Similarity Measures used within Network Intrusion Anomaly Detection. IEEE Communications Surveys & Tutorials, PP(99), 1–1. doi:10.1109/COMST.2014.2336610
Wu, X., Kumar, V., Quinlan, J. R., Ghosh, J., Yang, Q., Motoda, H., … Dan, J. H. (2008). Top 10 algorithms in data mining. doi:10.1007/s10115-007-0114-2
Xu, X. (2006). Adaptive intrusion detection based on machine learning: Feature extraction, classifier construction and sequential pattern prediction. International Journal of Web Services Practices, 2(1), 49–58. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.107.9575&rep=rep1&type=pdf