Analyzing Radial Basis Function Neural Networks for predicting anomalies in Intrusion Detection Systems

SAI SHYAMSUNDER KAMAT

Degree Project in Information and Communication Technology, Second Cycle, 30 Credits
Stockholm, Sweden 2019

KTH Royal Institute of Technology
School of Electrical Engineering and Computer Science



EIT Digital Master's in Embedded Systems
Date: March 30, 2019
Supervisor: Yu Yang, Saeid Rashidi
Examiner: Dr. Ahmed Hemani
Swedish title: Utvärdera prestanda av radiella basfunktionsnätverk för intrångsdetekteringssystem
School of Electrical Engineering and Computer Science


Abstract

In the 21st century, information is the new currency. With the omnipresence of devices connected to the internet, humanity can access any information instantly. However, certain cybercrime groups use illegitimate methods to steal information for personal benefit. An Intrusion Detection System (IDS) monitors a network for suspicious activities and alerts its owner about an undesired intrusion. Commercial IDSes react after detecting intrusion attempts. With cyber attacks becoming increasingly complex, it is expensive to wait for attacks to happen and respond afterwards. It is crucial for network owners to employ IDSes that preemptively differentiate a harmless data request from a malicious one. Machine Learning (ML) can solve this problem by recognizing patterns in internet traffic to predict the behaviour of network users. This project studies how effectively a Radial Basis Function Neural Network (RBFN) with a deep learning architecture can impact intrusion detection. On the basis of the existing framework, it asks how well an RBFN can predict malicious intrusion attempts, especially when compared to contemporary detection practices.

Here, an RBFN is a multi-layered neural network model that uses a radial basis function to transform the input traffic data. Once transformed, the various traffic data points can be separated by a single straight line in an extra-dimensional space. The outcome of the project indicates that the proposed method is severely affected by limitations; e.g. the model needs to be fine-tuned over several trials to achieve the desired accuracy. The results of the implementation show that the RBFN is accurate at predicting various cyber attacks such as web attacks, infiltrations, brute force, SSH etc., as well as normal internet behaviour, on average 80% of the time. Other algorithms on an identical testbed are more than 90% accurate. Despite the lower accuracy, the RBFN model is more than 94% accurate at recognizing specific kinds of attacks such as Port Scans and BotNet malware. One possible solution is to restrict this model to predict only malware attacks and to use a different machine learning algorithm for other attacks.

Keywords: anomaly, cyber security, evaluation, machine learning, radial basis function, random forest classifier, supervised learning


Sammanfattning

In the 21st century, information is the new currency. With the omnipresence of devices connected to the internet, humanity has access to information within an instant. However, there are certain groups that use methods to steal information for personal gain via the internet. An intrusion detection system (IDS) monitors a network for suspicious activities and alerts its owner if an undesired intrusion has occurred. Commercial IDSes react after the detection of an intrusion attempt. The attacks are becoming increasingly complex, and it can be expensive to wait for the attacks to happen in order to react afterwards. It is crucial for network owners to use IDSes that can preemptively distinguish harmless data usage from malicious usage. Machine learning can solve this problem. It can analyze all existing internet traffic data, recognize patterns and predict the behaviour of users. This project aims to study how effectively Radial Basis Function Neural Networks (RBFN) with a deep learning architecture can impact intrusion detection. From this perspective, the question is asked how well an RBFN can predict malicious intrusion attempts, especially in comparison with existing detection methods.

Here, the RBFN is defined as a multi-layer neural network model that uses a radial basis function to transform the data into a linearly separable form. After a survey of modern literature and the selection of a named dataset, a quantitative research methodology with performance indicators was used to evaluate the performance of the RBFN. A Random Forest Classifier algorithm was also used for comparison. The results were obtained after a series of fine-tunings of the models' parameters. The results show that the RBFN correctly predicts anomalous internet behaviour on average 80% of the time. Other algorithms in the literature are described as more than 90% accurate. The proposed RBFN model is, however, highly accurate when recording specific types of attacks such as Port Scans and BotNet malware. The outcome of the project shows that the proposed method is severely affected by limitations; for example, the model needs to be fine-tuned over several trials to achieve the desired accuracy. One possible solution is to restrict this model to predicting only malware attacks and to use other machine learning algorithms for other attacks.

Keywords: anomaly, cyber security, evaluation, machine learning, radial basis function, random forest classifier, supervised learning


Contents

1 Introduction
    1.1 Purpose
        1.1.1 Motivation
    1.2 Research Question
    1.3 Methodology
    1.4 Ethics and Sustainability
    1.5 Delimitations
    1.6 Thesis Structure

2 Background
    2.1 Network Security
    2.2 Intrusion Detection Systems
        2.2.1 Signature Based Monitoring
        2.2.2 Anomaly Detection
    2.3 Attack Profiles
        2.3.1 Types of Attacks
    2.4 Commercial IDS
        2.4.1 Cognito Detect
        2.4.2 Antigena
    2.5 Dataset
        2.5.1 CICIDS2017
        2.5.2 Previous Work on the dataset
    2.6 Machine Learning
        2.6.1 Supervised Learning
    2.7 Neuron
    2.8 Single Layer Perceptron
    2.9 Multi-Layer Perceptron Neural Network
        2.9.1 Learning
        2.9.2 Activation Function
        2.9.3 Bias
    2.10 Radial Basis Function Neural Network
        2.10.1 Definition of Radial Basis Function
        2.10.2 Types of Radial Basis Functions
        2.10.3 Working with an Example
    2.11 Neural Network
        2.11.1 Architecture
        2.11.2 Training at the First Hidden Layer
        2.11.3 Training at Remaining Layers

3 Approach
    3.1 Selection Procedure of Algorithm for IDS
        3.1.1 CM1K
        3.1.2 Setting up the Environment
    3.2 Preprocessing
        3.2.1 Feature Selection
        3.2.2 Data Vectorization
        3.2.3 Sampling
        3.2.4 Splitting Data
    3.3 Model Training
        3.3.1 Normalization
        3.3.2 K-Means Clustering
        3.3.3 Data Transformation using RBF
        3.3.4 Generate a Keras Model
        3.3.5 Fit the Model
        3.3.6 Plot the Accuracy and Validation Loss Curves
        3.3.7 Make Predictions
    3.4 Hyperparameter Optimization

4 Results and Analyses
    4.1 Results
        4.1.1 Analysis of Model Performance during Optimization
        4.1.2 Model Performance after Gradient Descent
        4.1.3 Performance Metrics
    4.2 Result Comparison and Validation
        4.2.1 Comparative Analysis with Random Forest Algorithm
        4.2.2 Comparative Analysis with other Algorithms in Literature

5 Conclusion and Future Work

Bibliography


List of Figures

2.1 Internet and Local Network Architecture
2.2 Network Architecture with IDS
2.3 Types of IDS
2.4 Supervised Learning Flowchart
2.5 Working of SMOTE
2.6 Neuron
2.7 Single Layer Perceptron
2.8 Linear Separability of Data Points
2.9 Learning Process
2.10 Activation Functions
2.11 Activation Function without Bias Value
2.12 Activation Function with Bias Value
2.13 Non Linear Transformation
2.14 Effect of extra dimension on Separability
2.15 Radial Basis Function distance from Receptor t
2.16 Radial Basis Function Plots
2.17 Transformed Feature Space
2.18 Deep Neural Network
2.19 Clustering
2.20 Training Process in Neural Network
3.1 Block Diagram of Preprocessing Tasks
3.2 Block Diagram of Modeling Tasks
3.3 Radial Basis Function Feed Forward Neural Network
4.1 Trials behind Hyperparameter Optimization
4.2 Training vs Validation Performance post Gradient Descent
4.3 Confusion Matrix for RBFN IDS with Accuracy per Class Distribution
4.4 Performance Metrics of RBFN in IDS
4.5 Precision Score Comparison of RF vs RBFN
4.6 Recall Score Comparison of RBFN in IDS
4.7 F1 Score Comparison of RBFN in IDS
4.8 Evaluation Metrics Score comparison of RBFN vs Other Surveyed Models


List of Tables

2.1 Network Attacks and Description
2.2 Network Features
2.3 Confusion Matrix
2.4 Plot of Feature Space post Transformation
3.1 Computing Resource Features
3.2 Feature Selection


Acronyms

AE          Auto Encoder
AI          Artificial Intelligence
AMF         Attention Multiflow
API         Application Programming Interface
ASIC        Application Specific Integrated Circuit
AUC         Area Under the Curve
AWID        Aegean Wi-Fi Intrusion Dataset
CICIDS2017  Canadian Institute for Cybersecurity Intrusion Detection Evaluation Dataset 2017
CSV         Comma Separated Values
DDoS        Distributed Denial of Service
DNN         Deep Neural Network
DNS         Domain Name Server
DoS         Denial of Service
FN          False Negative
FP          False Positive
FTP         File Transfer Protocol
GPS         Global Positioning System
IDS         Intrusion Detection System
IP          Internet Protocol
ISP         Internet Service Provider
KNN         K-Nearest Neighbour
LSTM        Long Short Term Memory
ML          Machine Learning
NATO        North Atlantic Treaty Organization
NTA         Network Traffic Analysis
PCA         Principal Component Analysis
PCAP        Packet Capture
PPV         Positive Predictive Value


RBF         Radial Basis Function
RBFN        Radial Basis Function Neural Network
RF          Random Forest
ROC         Receiver Operating Characteristic
SLP         Single Layer Perceptron
SMOTE       Synthetic Minority Oversampling Technique
SSH         Secure Shell
SVM         Support Vector Machine
TN          True Negative
TP          True Positive
U2R         User to Root
XSS         Cross Site Scripting


Chapter 1

Introduction

Usage of the internet is growing rapidly, accompanied by an increase of threats on the network. Everything from a small-scale enterprise to state systems is equally at risk of a cyber attack. E.g. a variant of the WannaCry ransomware infected 70 Windows systems of a Swedish local authority[1]. More recently, Russia allegedly disrupted Global Positioning System (GPS) signals during a North Atlantic Treaty Organization (NATO) exercise in Norway[2].

A Symantec report states that in 2016 alone, the number of malware instances rose by 36% from the previous year, to about 430 million[3]. Europe is the second largest consumer of network intrusion detection systems[4]. With the market expected to grow to 7.1 billion EUR by 2023[5], it is only prudent that industries invest heavily in the advancement of this technology, to stay a step ahead of the aforementioned attacks.

An IDS is a software application that monitors an enterprise network of computers, connected over the internet or the intranet, to detect any malicious activities and violations of policies.

1.1 Purpose

This thesis was conceptualized as an interdisciplinary application of two major computer science domains, embedded computer architecture and machine learning.

Lastly, it also demonstrates the author's capability to work independently as a Master of Science student in ICT Innovation.


1.1.1 Motivation

The risks caused by cyber attacks involving information breaches and theft, such as intellectual property and trade secrets, have skyrocketed. A 2018 BDO report states that ransomware attacks have risen by 350%, and spoofing and business email compromise attacks by 250%[6]. These attacks have also grown immensely complex by combining various attack sources; e.g. many attacks are a joint effort ranging from a lone-wolf hacker to small hacktivist groups to nation-state sponsored cyber attack groups.

Therefore, cybersecurity with a focus on threat-based analytics is at the forefront for many enterprise organizations. These organizations no longer focus only on their most critical data assets: about 73% of board directors prefer to rely on third-party vendors for cyber risk assessment[7].

The third-party vendors, meanwhile, are interested in using advancements in neuromorphic computing as a USP for their products. As of 2019, neural networks hold a lot of promise due to the availability of large amounts of data accumulated over the years, the development of more precise predictive algorithms and greater computational power[8].

Therefore, there is a keen interest in the application of neural networks in industry. This project investigates a specific instance of using an RBFN to predict anomalies in an IDS.

1.2 Research Question

This paper explores one of the many possible ways in which an RBFN, as a deep machine learning algorithm, can contribute to the network security domain. It has been found that such algorithms can be used to solve known problems using familiar datasets.

Problem Statement

1. Is an RBFN applicable to detecting anomalies in an IDS? Consequently, what is the effect of using an RBFN for detecting intrusions on an enterprise network?

2. How does it perform in detecting anomalies when compared with industry standard models?


1.3 Methodology

Quantitative research methods have been applied for anomaly detection on the dataset used in this project. The dataset contains logs of network traffic data extracted from Packet Capture (PCAP) files. It contains information such as Internet Protocol (IP) addresses, duration of flow etc. More on this is discussed in section 2.5.

Sufficient research has been put into exploring possible alternatives to the dataset and into the selection of the machine learning model. Additionally, these models were further fine-tuned to provide maximum performance with respect to training and validation errors.

Lastly, the comparison of this model with industry standard algorithms has been carried out in isolation, using an identical dataset and input parameters. This provides an unbiased report on the efficacy of this model.

1.4 Ethics and Sustainability

Anomaly detection is inherently an issue of sustainability. Anomalies are designed to undermine existing systems and make them unsustainable.

Automatic tools such as machine learning, which can predict anomalies even before they strike a system, can mitigate such an occurrence. They also do not threaten to substitute any human capital in this process, since no number of humans can practically monitor traffic data in real time.

However, such systems need skilled personnel, such as network administrators and statisticians, to supervise them from time to time.

It is essential that the ethics of using private information be held in high regard. The data analyzed by the model is real-world data. Since this project is used for monitoring purposes, it is inevitable that it will use data that users may not be comfortable providing otherwise, e.g. the IP addresses of traffic. Hence, under a condition of anonymity, the information about the identity of the users has been kept private.

1.5 Delimitations

Delimitations are characteristics that limit the scope and define the boundaries of this study.

The research is restricted only to the development of the machine learning model algorithm and its evaluation.


More specifically, the possible machine learning model techniques that could be applied have been restricted to supervised learning approaches, since this project required that the model later be run on hardware capable of running an RBFN.

A major limitation of this project is the set of constraints imposed by the selection of the Canadian Institute for Cybersecurity Intrusion Detection Evaluation Dataset 2017 (CICIDS2017)[9]. One of these is the rather limited information about the performance of predictive models on the dataset, as explained in section 2.5. The developers behind this dataset never published the ground truths; if they had, it would have made the labeling process for every packet more reliable. Furthermore, Koroniotis et al. state that profiling this data is difficult due to its inherent complexity and vastness[10]. To summarize, due to the lack of definite statistical characteristics, a 100% perfect dataset may never be realized[11][12].

A second delimitation is that, due to the ever-changing nature of cyber attacks, it is possible that some of the most recent studies may not reflect the findings of this study. E.g. Point of Sale malware was among the most prevalent cyber attacks in 2018[13], but it is not captured in the dataset used here. Lastly, Keras has been used as the machine learning modeling framework because it is one of the more widely used frameworks among peers[14].

1.6 Thesis Structure

This thesis is organized as follows:

• Chapter 1 - Introduction:

Gives an overview of the problem in network-based intrusion detection, its impact, and how this project addresses it.

• Chapter 2 - Background:

Provides the prerequisites for understanding the topic, such as aspects of network security, machine learning etc. It is supplemented by a literature review on IDS and existing ML techniques on an active network. It also details the state of the art in intrusion detection systems.

• Chapter 3 - Approach:

This chapter illustrates the steps taken to answer the problem at hand. It describes the planning steps that were taken, and adds details about the used dataset, the test setup and the training process.

• Chapter 4 - Result and Analysis:


Reports the results and their validation. The results mainly comprise the performance metrics of the machine learning model. These metrics are accuracy, precision, recall and F1 score.

• Chapter 5 - Conclusion and Future Work:

Presents the conclusions drawn from the work and observations of this project. Future scope is also included.


Chapter 2

Background

This chapter contextualizes the concepts that are needed to understand the approach and the results discussed later on. It presents detailed information on terminology and technical details such as ML, RBF and its evaluation methodologies, among other details. This chapter serves a first-time reader as well as a person familiar with machine learning who would like to refresh the concepts. Additionally, it also details the contemporary research and data used in the IDS field, especially involving ML.

2.1 Network Security

As of 2018, 3.9 billion people are connected to the internet[15]. 188 million emails were sent per minute in March 2019 alone. In the same duration, Google was queried 3.8 million times and 996k USD were spent in various transactions[16]. Given such a sheer omnipresence of devices connected to the internet, it has been of paramount necessity that the network they are connected to be secure. Such a network is demonstrated in figure 2.1.

A host of devices in a local area network is connected via a server of an Internet Service Provider (ISP). Whenever a user connects to a website such as www.google.com, his/her machine connects to an ISP such as Vodafone or Telia. This ISP then contacts a Domain Name Server (DNS), which fetches a list of IP addresses corresponding to the website; e.g. Google operates the public DNS servers 8.8.8.8 and 8.8.4.4[17]. The DNS then redirects the ISP to the server holding that website[18].
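This resolution step can also be observed programmatically. The following minimal sketch is an illustration only, using Python's standard socket module and an example hostname; it asks the operating system's configured resolver for the IP addresses behind a name:

```python
import socket

# Ask the configured DNS resolver which IP addresses serve this hostname.
# "www.google.com" is only an illustrative example; any reachable hostname works.
addresses = {info[4][0] for info in socket.getaddrinfo("www.google.com", 443)}
print(addresses)  # a set of IPv4/IPv6 addresses returned by the resolver
```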

A traditional internet architecture also employs a host of security measures that are necessary to protect information from data theft, intrusion and misuse. Some of the commonly used network security measures are[19]:-


Figure 2.1: Internet and Local Network Architecture

1. Password Authentication: This is the first line of defence against intrusions. Using encryption, access to a particular resource can be restricted. However, with the advent of intrusion techniques such as brute force and phishing, this authentication technique is not foolproof.

2. Firewall: A firewall is a software subsystem that enforces a set of rules that restrict or provide access to certain resources depending upon who has access. E.g. a company restricting its employees from using social media websites during work hours.

3. Intrusion Detection Systems: This is an umbrella term for a security system comprising users, traffic monitoring software and anti-virus software. More on this is covered in the following section.

2.2 Intrusion Detection Systems

Technology for detecting data breaches in local data networks is mainly based on software that tests incoming data in a von Neumann type "sandbox". This means that it takes a relatively long time for available (commercial) programs to detect intrusion attempts.


The author implements an RBFN [20] as a Deep Neural Network (DNN) to predict anomalies in network intrusion attempts.

Figure 2.2: Network Architecture with IDS

An IDS is a strategically deployed network software system that regularly monitors network traffic and alerts the user or the network administrator of any anomalies in the traffic[21]. Figure 2.2 depicts the installation of an IDS on the server side, so that it can detect intrusions from the internet and also within the intranet.

Figure 2.3, which illustrates the types of IDS, has been adapted from [22].

2.2.1 Signature Based Monitoring

A commonly used method of monitoring networks is signature based. In this method, whenever an attack is attempted, it is cross-compared and analyzed against an existing database of known attacks. E.g. an IDS monitoring the web browser can be conditioned to detect attacks from scripts containing the format phf, as an indicator of a Common Gateway Interface (CGI) attack.

2.2.2 Anomaly Detection

Another, more recent, monitoring technique is anomaly detection. The software here can differentiate between traffic behaviour that is normal and behaviour that is anomalous.


For instance, a sudden increase in the volume of web traffic or IP packet density can be regarded as anomalous traffic behaviour stemming from an external Distributed Denial of Service (DDoS) attack.

Figure 2.3: Types of IDS

Once it detects an anomaly, the IDS alerts the system administrator shown in figure 2.2 of the behaviour and also presents the type of the behaviour based on its attack profile.

2.3 Attack Profiles

All intrusive attacks can be categorized into the following four classes[23]:-

1. Denial of Service (DoS): The hacker makes the end point computer's memory or resources too busy or too full to serve requests, thereby denying any service to the users. Hence the name Denial of Service.

2. Remote to User: A user sends packets to a machine over the internet to which the user does not have access. When the machine receives such packets, it exposes itself and lets the hacker exploit the privileges of the victim.


3. User to Root (U2R): A hacker with guest privileges attempts to gain root or super-user privileges by exploiting the machine's vulnerabilities. E.g. xterm and perl.

4. Probing: A hacker scans or probes a victim user's machine for weaknesses, which may later compromise the system. E.g. saint, portsweep, mscan, nmap.

2.3.1 Types of Attacks

According to a December 2018 study by McAfee[24], DDoS, Port Scan, Secure Shell attacks, Infiltration and Web Based attacks were among the top network attacks in Q3 of 2018.

DDoS

Routers are vital to network communications, since they act as junction points for all traffic packets. Without a router, it is very difficult for two network sub-systems to communicate.

Router based attacks such as Brute Force attacks and DDoS take advantage of router communication protocols[25], weak authentication services and obsolescence of router firmware.

A DDoS attack does not exploit the vulnerabilities of a router or a network. Instead, the hackers flood the network with traffic and exhaust its resources[26]. It is challenging to ascertain whether a traffic flow is legitimate or not, since none of the protocols or norms are flouted in this attack. It is easy to confuse a DDoS attack with simply a large number of users on a particular sub-system[27]. Hence there is a very high incidence of false positive alarms in detecting this attack[10][28].

Additionally, [24] also notes the prevalence of DDoS attack methods as the most commonly discussed topic among cybercriminals.

Infiltration

Web browsers such as Mozilla Firefox, Google Chrome etc. are not only used for mere search. They also come with loads of extension (web-based) software that enhances productivity and convenience for the layperson user. E.g. browsers store passwords and other personal data such as addresses, phone numbers, email IDs etc. in the form of cookies, which can later be retrieved to auto-fill forms.


There are JavaScripts that enforce conditional formatting of such data, which is critical for automating many data entry use cases.

Cybercriminals exploit such conditional caching procedures by masking malicious payloads in popular file formats such as PDF and Flash. JavaScript provides broad support for such formats and makes them portable across various platforms. This increases the potency of a malicious script to be reusable across various kinds of web application architectures[29].

Brute Force

The hacker attempts to gain access to a website and infect its code with malicious scripts. Successive, repeated attempts are made to guess various security authentication combinations in order to break in and gain access to data from a web site. Simply speaking, it is an attempt to guess the password by trying various possible combinations. When the attack attempt gets the password right, it automatically gains authorized access to the site, so he or she can steal data without leaving a trace[30].

BotNet

A botnet is a network of computers which have been compromised by hackers and are being controlled remotely. Malware code from the botmaster is used to hijack this network. The hackers use this BotNet to launch DDoS attacks and attempt phishing from data servers. This attack is dangerous because a bot (a malicious computer of the BotNet) can also infect other computers and incorporate them to do its bidding[31].

PortScan

Port Scan is an intrusion attempt where hackers scan the ports of computers connected to the internet, searching for 'open back doors' through which to access them. It takes advantage of vulnerabilities of communication protocols: the hacker sends a request to a port and waits for an acknowledgment. The received response from the port can indicate the make of the operating system or other information about the potential victim network[32].

Cross Site Scripting (XSS)

XSS is JavaScript code written and injected into a victim's web browser to gain access to sensitive data like cookies, passwords etc.


When the script is injected into a website, it appears to be another innocuous part of the website and piggybacks on the security certificates of the host website[33].

SQL Injection

SQL Injection is one of the most commonly used code-injection intrusion attempts, where the vulnerabilities of internet protocols are used to undermine the victim's computer. Whenever the victim's input is not validated sufficiently, the data provided in the form of an SQL query can be crafted such that a part of the user's input is treated as SQL code[34].
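As an illustration of the mechanism, the sketch below is a hypothetical example using Python's built-in sqlite3 module and a made-up users table, not code from this project; it contrasts a vulnerable query built by string interpolation with a parameterised query that keeps the input as data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'secret')")

user_input = "' OR '1'='1"  # a classic injection payload

# Vulnerable: the input becomes part of the SQL code itself.
vulnerable = f"SELECT * FROM users WHERE name = '{user_input}'"
print(conn.execute(vulnerable).fetchall())  # returns every row in the table

# Safer: the placeholder keeps the input as a literal value, never as SQL code.
print(conn.execute("SELECT * FROM users WHERE name = ?", (user_input,)).fetchall())  # []
```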

Table 2.1 later illustrates the aforementioned attacks and their cumulative frequency of occurrence in the dataset used in this project.

2.4 Commercial IDS

Commercial IDSes fall under the purview of the Network Traffic Analysis (NTA) market. This potential 7.1 billion EUR market[5] is only expected to grow, with various multinationals and startup companies foraying into this domain.

A 2019 Gartner report[35] classifies the market into three segments as follows:-

1. Pure-Play Companies: Comprising startups, these companies specialize only in predicting the anomalies. However, there are also some startups that are enhancing their abilities to counter them. E.g. Vectra AI, DarkTrace Ltd.

2. Network Centric Companies: These are the next stage of pure-play companies. They constitute companies that monitor network performance and diagnostics. Using statistical analysis and machine learning techniques, these companies have also come up with solutions to counter the threats. E.g. McAfee.

3. Others: These constitute companies that do not fit either of the aforementioned categories. These are large network security providers, e.g. Cisco, which began as network sand-boxing companies but gradually diversified into network security and diagnostics.

We cover a couple of Commercial IDS instances here.


2.4.1 Cognito Detect

Produced by Vectra Artificial Intelligence (AI), a San Jose based[35] company, the flagship Cognito Detect uses a combination of several supervised and unsupervised machine learning algorithms to predict anomalous attacks. For supervised data, Cognito uses the Random Forest (RF) algorithm[36].

A comprehensive comparison of this algorithm with the algorithm proposed by the author is covered in section 4.2.

2.4.2 Antigena

This IDS is built by the Cambridge, UK based company Darktrace. It is known to use over 50 unsupervised machine learning techniques[35] in its system, and a white paper[37] by the group corroborates this by particularly naming Clustering[38], combined with Recursive Bayesian Estimation[39], as its go-to algorithm.

2.5 Dataset

Scientific research heavily relies on the availability and validity of relevant data[40]. This project uses the CICIDS2017 dataset to connect network data with machine learning prediction models.

2.5.1 CICIDS2017

The CICIDS2017 is a publicly available dataset created by the University of New Brunswick, Canada. It contains benign traffic and the seven most common network attack flows, which simulate the real world[9].

It is available in two formats:-

1. PCAP: comprising full packet payloads.

2. Comma Separated Values (CSV): comprising corresponding profiles with attack labels.

Since the latter format comprises labels, it is better suited for application in the supervised form of machine learning models.

This dataset contains realistic background traffic, produced by the network activities of 25 users. The features of the network traffic are exclusive and unparalleled, especially when compared with contemporary datasets[41].


These include other datasets such as the Aegean Wi-Fi Intrusion Dataset (AWID)[42], GPRS[43], CIDD-001[44] and UNSW-NB15[45][46].

The developers of this dataset encapsulated network events into a set of certain features. These features include:-

1. Distributions of packet sizes of a protocol

2. Number of packets per flow,

3. Certain patterns in the payload,

4. Size of the payload

5. Request time distribution

It contains real traces of both benign and malicious network activities. There are over 2.8 million total instances of network activities in the CSV format. Of these, benign and malicious network traffic comprise 83.3% and 16.7% respectively[47].

The developers of this dataset encapsulated the raw data of the network traffic in the aforementioned PCAP instances. These were then fed to CICFlowMeter[49] to extract traffic features. Each network traffic instance, benign or malicious, has up to 83 features that quantify the traffic characteristics. The FlowMeter provided the extracted features in the form of CSV files, described in table 2.2.

2.5.2 Previous Work on the dataset

The people behind the development of this dataset, Sharafaldin et al., implemented an RF Regressor algorithm to prioritize the best set of features (from table 2.2) for each attack. The authors cross-validated the performance with classical machine learning algorithms such as K-Nearest Neighbour (KNN), Adaboost, RF etc. RF gave the highest precision value of 98%[9].

However, Abdulhammed et al. argue that this can be improved further if the features are more discriminative, representative and reduced in number. Their study uses two dimensionality reduction techniques, Auto Encoder (AE) and Principal Component Analysis (PCA), to reduce the number of features from 83 to just 10. Additionally, they achieve an accuracy of 99.6% in multiclass classification[41].
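A feature reduction of that kind can be sketched roughly as follows. This is an illustration with scikit-learn only: X stands in for the numeric CICIDS2017 feature matrix, and the choice of 10 components simply mirrors the number reported by Abdulhammed et al., not this project's pipeline:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: placeholder for the numeric CICIDS2017 feature matrix (rows = flows, columns = features).
X = np.random.rand(1000, 78)

# PCA is scale sensitive, so standardize the features before projecting.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=10).fit_transform(X_scaled)
print(X_reduced.shape)  # (1000, 10)
```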

Meanwhile, Zhu and associates tried a radical new approach of using statistical features of multi-flows instead of single flows.


Attack Type | Count | Description
Benign | 1,869,700 | Normal traffic.
Port Scan | 158,930 | The hacker runs a service which sends packets to various destination ports of the victim machine. Using this, s/he can access information about the operating system of the victim.
DDoS | 41,835 | Occurs when multiple compromised systems simultaneously attack a victim network, thereby flooding its bandwidth or resources.
FTP Patator | 7,938 | FTP Patator is a Python script for brute force attacks to guess the victim machine's FTP login credentials.
SSH Patator | 5,034 | SSH Patator is a Python script for brute force attacks to guess the victim machine's SSH login credentials.
BotNet | 1,966 | The hacker uses a large number of IoT devices to steal data, spam the victim, and access the victim's device.
Web Attack: Brute Force | 1,507 | The hacker uses brute force trial and error techniques to break the victim's passwords and other protected data.
Web Attack: XSS | 625 | The hacker develops malicious scripts that exploit improper validation of user input and weaknesses in web app source code[33].
Infiltration | 36 | The hacker creates a backdoor entry into the victim's machine by exploiting a vulnerability in a software on the machine. S/he can then conduct a multitude of different attacks such as IP sweep, port scan etc.[9].
Web Attack: SQL Injection | 21 | An attack where malicious code is inserted into SQL strings. These are then used to create instances of SQL server for parsing and execution[48].

Table 2.1: Network Attacks and Description


No. Feature | No. Feature | No. Feature
1 Flow ID | 29 Fwd IAT Std | 57 ECE Flag Count
2 Source IP | 30 Fwd IAT Max | 58 Down/Up Ratio
3 Source Port | 31 Fwd IAT Min | 59 Average Packet Size
4 Destination IP | 32 Bwd IAT Total | 60 Avg Fwd Segment Size
5 Destination Port | 33 Bwd IAT Mean | 61 Avg Bwd Segment Size
6 Protocol | 34 Bwd IAT Std | 62 Fwd Avg Bytes/Bulk
7 Time Stamp | 35 Bwd IAT Max | 63 Fwd Avg Packets/Bulk
8 Flow Duration | 36 Bwd IAT Min | 64 Fwd Avg Bulk Rate
9 Total Fwd Packets | 37 Fwd PSH Flags | 65 Bwd Avg Bytes/Bulk
10 Total Backward Packets | 38 Bwd PSH Flags | 66 Bwd Avg Packets/Bulk
11 Total Length of Fwd Pck | 39 Fwd URG Flags | 67 Bwd Avg Bulk Rate
12 Total Length of Bwd Pck | 40 Bwd URG Flags | 68 Subflow Fwd Packets
13 Fwd Packet Length Max | 41 Fwd Header Length | 69 Subflow Fwd Bytes
14 Fwd Packet Length Min | 42 Bwd Header Length | 70 Subflow Bwd Packets
15 Fwd Pck Length Mean | 43 Fwd Packets/s | 71 Subflow Bwd Bytes
16 Fwd Packet Length Std | 44 Bwd Packets/s | 72 Init_Win_bytes_fwd
18 Bwd Packet Length Min | 46 Max Packet Length | 74 Min_seg_size_fwd
19 Bwd Packet Length Mean | 47 Packet Length Mean | 75 Active Mean
20 Bwd Packet Length Std | 48 Packet Length Std | 76 Active Std
21 Flow Bytes/s | 49 Packet Len. Variance | 77 Active Max
22 Flow Packets/s | 50 FIN Flag Count | 78 Active Min
23 Flow IAT Mean | 51 SYN Flag Count | 79 Idle Mean
24 Flow IAT Std | 52 RST Flag Count | 80 Idle Packet
25 Flow IAT Max | 53 PSH Flag Count | 81 Idle Std
26 Flow IAT Min | 54 ACK Flag Count | 82 Idle Max
27 Fwd IAT Total | 55 URG Flag Count | 83 Idle Min
28 Fwd IAT Mean | 56 CWE Flag Count | 84 Label

Table 2.2: Network Features


They added an attention mechanism (Attention Multiflow (AMF)) to the original Long Short Term Memory (LSTM) to help their model learn which of the traffic flows is more influential for anomaly detection. They achieve 10% more accuracy compared to classical machine learning models[50].

Preuveneers et al. went for a holistic approach by chaining incremental updates of the anomaly detection model on a distributed ledger. They integrated machine learning with blockchain to circumvent the necessity of centralizing the training data. They managed to keep the performance impact limited, varying between 5% and 15%[51].

2.6 Machine Learning

An ML system learns from observations (gets trained) rather than being explicitly programmed to do specific tasks. When presented with several thousands of samples of data relevant to a task, the machine first recognizes patterns in the samples and finds statistical rules to automate the task. Therefore, an ML system essentially converts sample data into a representative form that drives it to an expected output.

Since machines require large amounts of data for observing and recognizing patterns, the quality and quantity of the data influence the accuracy of the algorithm that performs machine learning in the system. Consequently, on the basis of the data provided to the algorithm, there are two major categories of machine learning: supervised learning and unsupervised learning. The RBFN is a supervised learning algorithm[52].

2.6.1 Supervised Learning

In supervised learning, the dataset used to train the model also comes with labels, or classes. These classes state the expected category that a sample from the dataset belongs to. E.g. in the binary classification of emails into spam or ham, the dataset used to train the model also has categories signifying whether an email instance is spam or not.

The following flowchart, figure 2.4, provides an overview of how supervised machine learning works. It has been adapted from [53].

Raw Data Collection

The dataset, usually acquired from open sources or research institutes, is often cleaned and made ready for analysis directly.


Figure 2.4: Supervised Learning Flowchart


However, there are other instances when it is not clean, but available as-is. For instance, to train a self-driving car, the data may contain samples where the image capturing camera is obscured or jammed, or the images are damaged by overexposure. Sometimes, due to poor external lighting conditions, the road may not even be visible. A machine will not comprehend such raw data to the ability of a human. In this project the data is provided by the University of New Brunswick, Canada. The details about the dataset can be found in section 2.5.

Preprocessing

The data required for training the algorithm to make predictions is most often very noisy and contains extraneous information. Therefore it should be transformed, e.g. through geometrical transformation techniques like shifting the origin, rotating, transposing etc. It should also be reduced such that only the most useful information is retained, discarding unnecessary features. For a network dataset these could be IP addresses or ports, whereas for a self-driving dataset it could be data about the sky above the horizon or the area outside the road lanes.

The data also needs to be transformed such that it is understandable to the ML algorithm. E.g. network data derived from PCAP files needs to be converted into floating point values.
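A minimal sketch of such a conversion with pandas is shown below; the file name, the dropped columns and the label column are placeholders chosen for illustration, not the exact preprocessing code of this project:

```python
import numpy as np
import pandas as pd

# Load one of the CICIDS2017 CSV exports (file name is a placeholder).
df = pd.read_csv("traffic_flows.csv")

# Drop identifier-like columns that carry little generalizable signal (illustrative choice).
df = df.drop(columns=["Flow ID", "Source IP", "Destination IP", "Timestamp"], errors="ignore")

# Coerce the remaining feature columns to floating point; unparsable cells become NaN.
labels = df["Label"]
features = df.drop(columns=["Label"]).apply(pd.to_numeric, errors="coerce")
features = features.replace([np.inf, -np.inf], np.nan).dropna(axis=0).astype("float32")
```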

Sampling

In addition to the format and quality, the data also needs to be appropriately balanced. This implies that the dataset has an equitable distribution of classes. However, data provided by the real world often does not reflect this. More often than not, a class or two dominate the data. For instance, to predict the occurrence of cancer in a sample population, the cases of healthy or benign individuals heavily outnumber the individuals having a malignant form of cancer. Hence any machine learning algorithm will likely predict a malignant cancer case as a healthy one, i.e. a false negative prediction.

Therefore it is necessary to balance the number of instances belonging to the majority class with the number of instances of the minority classes. This can be done in the following two ways:-

1. Undersampling: The majority class samples are reduced in quantity to be equal to the minority classes. Hence the overall dataset size is reduced. While this reduces the computing time to train a model, it also makes the model underfit the training data.


2. Oversampling: The minority class samples are amplified in quantity to be equal to the majority classes. This increases the size of the dataset, but also makes the model fit better.

A statistically balanced sample gives better performance than underfit data. This is why this project uses one of the oversampling techniques, the Synthetic Minority Oversampling Technique (SMOTE).

Figure 2.5: Working of SMOTE. (a) Plot of Data Points; (b) Minority Data Points Vector; (c) Euclidean Distance between closest Minority Data Points; (d) Randomly place a new Data Point on the Distance Vector; (e) Repeat 2.5d for all other Minority Data Points; (f) New Dataset with Synthesized Minority Classes.

Synthetically increasing the instances of the minority classes has been proven to improve the performance of classifiers[54][55]. SMOTE is a class provided by the imblearn package of Python to perform oversampling[56].

Figure 2.5 provides a brief overview of how this technique works. In the first figure, 2.5a, there are two classes, denoted by the blue dots and red boxes. It is evident that the blue class is the majority class. SMOTE dictates that the minority class be oversampled to obtain a balanced dataset. Next, the feature vectors belonging to the minority class data points are selected, in figure 2.5b. Then Euclidean distances are calculated between the minority data points, as shown in figure 2.5c. A random number generator provides a random value between 0 and 1, which is then multiplied with the Euclidean distance. The result corresponds to the location of a new synthetically created feature vector lying on the connecting vector, as shown in red in figure 2.5d. Repeating this process several times over, the dataset is populated with as many minority class data points as the majority, thereby giving a more balanced dataset, as shown in figure 2.5f.
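Since the imblearn SMOTE class is named above, a minimal usage sketch follows; the toy data generated here only stands in for the preprocessed traffic feature matrix and labels:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Toy imbalanced data standing in for the preprocessed traffic features and labels.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
print(Counter(y))            # heavily skewed towards class 0

# SMOTE synthesizes new minority-class points along lines between existing neighbours.
X_balanced, y_balanced = SMOTE(random_state=0).fit_resample(X, y)
print(Counter(y_balanced))   # both classes now have the same count
```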

Hyperparameter Tuning

An ML model does not work at its best from the get-go. It needs to be finely tuned via the tens of parameters that influence its working.

Therefore the algorithm exposes so-called hyperparameters: model values defined before running a training procedure on a dataset. Each model has its own set of hyperparameters. E.g. an RF classifier has hyperparameters such as the number of branches and the number of nodes[57]. A neural network, meanwhile, has parameters such as the number of hidden units and the learning rate for logistic regression[58].

Optimizing the hyperparameters is a long and tedious process where the ML developer finds the optimum set of parameters that makes the model give the best possible performance. There are multiple ways of finding the best combination of hyperparameters, like Grid Search Optimization, Random Search Optimization, Hand Tuning, Bayesian Optimization etc.

Some of the commonly listed hyperparameters in neural network models are listed below (a short illustrative Keras sketch follows the list):

Dropout: This is a popular regularization technique where, during the forward pass of a neural network, a fraction of neurons defined by the dropout value are set to zero.

Learning Rate: The learning rate defines the rate at which a model updates its connection weights. A large learning rate speeds up the learning, but the learning may never converge.


A low learning rate increases the time taken to learn, but the process converges smoothly.

Number of Epochs: This is the number of times the neural network makes a pass in the forward and backward direction covering all examples. Training should continue for as long as the validation loss keeps decreasing. E.g. in this project, the validation loss in figure 4.2b stops decreasing after 80 epochs.

Number of Hidden Units: This is a complex quantity to pinpoint, since it does not bear a direct correlation with model accuracy. Though a complex model will require a larger number of hidden layer units, too many such units can make the model overfit.

Size of Batch: The training examples are divided into batches and fed to the neural network model during training. This size is referred to as the batch size.
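To make these knobs concrete, the sketch below shows where such hyperparameters typically appear when defining and training a small Keras classifier. All values, layer sizes and variable names are illustrative assumptions, not the tuned settings of this project:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative hyperparameter values; the tuned values differ per experiment.
learning_rate = 1e-3
dropout_rate = 0.2
hidden_units = 64
epochs = 80
batch_size = 256

model = keras.Sequential([
    layers.Input(shape=(20,)),              # 20 input features (placeholder)
    layers.Dense(hidden_units, activation="relu"),
    layers.Dropout(dropout_rate),           # zeroes a fraction of activations each forward pass
    layers.Dense(1, activation="sigmoid"),  # binary benign/anomalous output
])
model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
              loss="binary_crossentropy", metrics=["accuracy"])

# X_train, y_train, X_val, y_val are placeholders for the prepared data splits.
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=epochs, batch_size=batch_size)
```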

Model Performance Metrics

It is crucial to decide which evaluation metrics should be used for a specific model. There are plenty of instances that legitimize the selection of a hold-out validation set as an evaluation protocol[59].

Secondly, if the dataset in question were a balanced one, it would be appropriate to use the Receiver Operating Characteristic (ROC) and Area Under the Curve (AUC) as common metrics. However, for imbalanced data, as illustrated in table 2.1, sensitivity and precision are more suitable, particularly average precision for multi-class single-label instances[59].

As such, the model is evaluated after a series of successive IDS attack simulations (refer to figure 4.1), where the model is fed with random anomalous traffic. The model is examined on the basis of standard metrics such as accuracy, precision, recall/sensitivity, F1 score and the Confusion Matrix.

1. Confusion Matrix

A confusion matrix is usually used for binary classification[60], where one set of data belongs to a certain class, while the other does not. This is illustrated by an adapted confusion matrix in table 2.3.

1. True Positive (TP): In this matrix, the traffic flows that were predicted to be anomalous and were actually anomalous are classified as True Positives.


                  Predicted Anomalous    Predicted Benign
Actual Anomalous  True Positive (TP)     False Negative (FN)
Actual Benign     False Positive (FP)    True Negative (TN)

Table 2.3: Confusion Matrix

2. False Positive (FP): Flows that were predicted to be anomalous, but were in fact benign, are False Positives.

3. False Negative (FN): Flows predicted to be benign, but which were actually anomalous, comprise the False Negatives.

4. True Negative (TN): Meanwhile, flows that were correctly predicted to be benign are classified as True Negatives.

2. Precision

This fundamental performance metric, aka the Positive Predictive Value (PPV), is the ratio of correctly predicted attacks against all predicted attacks, whether correctly predicted or not[61]. E.g. in a sample of 200 flows, if there were 25 correctly predicted anomalous packets (TP) and 30 falsely predicted anomalous packets (FP), then the precision would be 25 / (25 + 30) = 0.45, i.e. of all the predictions of anomalous packets, only 45% were actually correct.

Pr = \frac{TP}{TP + FP}

3. Recall

Recall, aka Sensitivity, Hit Rate or True Positive Rate (TPR), is the ratio of correctly predicted attacks against all actual attacks. E.g. in a sample of 200 flows, if there were 100 actually anomalous packets and 25 of them were correctly predicted, then the recall would be 25/100 or 25%, i.e. 25% of the actual anomalous packets were correctly predicted by the model.

Rc = \frac{TP}{TP + FN}


4. F1 Score

This is one of the more practical metrics used, since it is harder to evaluate models using either or both of the precision and sensitivity numbers alone. It could be argued that F1 should be a mean of the two numbers.

But in many cases where the difference between the precision and sensitivity scores is large, a simple arithmetic mean lies midway between the two. E.g. if a model were 2% precise and 100% sensitive, the arithmetic mean would give a score of 51%. Thus, for a model which gets only 2% of all its predictions right, such a moderate looking score gives an incorrect impression of the model.

Therefore the F1 score is calculated as the harmonic mean of Precision and Recall, because the final result is closer to the lower of the two scores. E.g. for the aforementioned model, the F1 score will be about 4%.

F1 = \frac{2 \times Pr \times Rc}{Pr + Rc}

5. Accuracy

Accuracy signifies the percentage of instances in which the model is actually correct overall.

Acc = \frac{TP + TN}{TP + TN + FP + FN}

A model's performance is conventionally proportional to its accuracy, but accuracy is a poor measure for imbalanced data[62]. E.g. if a dataset such as CICIDS2017 has a large class imbalance, then the model will get the dominant class right most of the time and thus give a high accuracy.
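All of these metrics can be computed directly from the predicted and actual labels; a minimal scikit-learn sketch with made-up label vectors (1 = anomalous, 0 = benign) is shown below:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, precision_score, recall_score)

# Made-up labels purely for illustration: 1 = anomalous flow, 0 = benign flow.
y_true = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]

print(confusion_matrix(y_true, y_pred))   # rows: actual class, columns: predicted class
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
print(accuracy_score(y_true, y_pred))     # (TP + TN) / all predictions
```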

2.7 Neuron

Learning is the ability to gain knowledge from experiences and take actions in response to them. In most sentient animals, learning happens in the brain. Teaching inorganic machines to learn on their own is inspired by the biological brain.

A mammalian brain comprises around 10^11 parallel processing units called neurons. A brain essentially does exactly the task that it is required to do: it acquires stimuli filled with noise and generates responses. At its core, a neuron fires if the electric potential across its cell membrane crosses a certain threshold[63].

McCulloch and Pitts[64] created the first mathematical model of a neuron, comprising:-


1. Weighted Inputs: Corresponding to synapses, the set of weighted inputs is the noisy stimulus to a neuron.

2. Adder: This sums all the weighted inputs together.

3. Activation Function: This is analogous to a threshold function, which decides whether the neuron should fire or not. In a Single Layer Perceptron (SLP), this is a simple step function that outputs 1 when the weighted sum crosses the threshold and 0 otherwise.

This has been illustrated in figure 2.6. The circular neuron is fed a stimulus containing n features. Each feature is represented as x1, x2, . . . , xn. The set of n features is the feature vector X. Each of these features has a corresponding weight, or scalar multiple. The set of these weights w1, w2, . . . , wn is W.


Figure 2.6: Neuron

2.8 Single Layer Perceptron

In neural networks, learning is an iterative process of updating the weights. In back-propagation or feed-forward neural networks, the process of learning is not restricted to binary classification. In its simplest form, a back-propagation network is an SLP, depicted in figure 2.7.


Figure 2.7: Single Layer Perceptron

An SLP comprises a set of N neurons in the input layer, represented by layer i. The only other layer is the output layer, represented as j.

The number of nodes in the ith layer is equal to the dimensionality of the feature vector plus 1. Dimensionality in this context refers to the number of features of the input. E.g. if the input describes a box, then its features are length, width and height, so the dimensionality of the feature vector is 3. Additionally, a neuron in the ith layer with a constant input of 1 corresponds to the addition of a bias.

Each neuron of the ith layer is connected to the jth layer through a weight wij. The output layer j comprises as many nodes as there are classes to classify, denoted by Ci.

The job of a neural network is to classify input patterns into different classes. In a supervised method of machine learning, the input features and the expected output are known. If the actual and predicted classes are t and o respectively, then the difference between the prediction and the actual class is given by the error:

e = ||t− o||2 (2.1)


This error is a function of the connection weights Wij and must be minimized by optimizing them. The machine 'learns' by minimizing the error function over successive iterations. This error is propagated in the reverse direction, giving the neural network its name: Back Propagation. To summarize, classification occurs in the forward direction, while learning occurs in the reverse direction, as shown in figure 2.7.

Separability of Data Points

An SLP is a sufficient classifier if the data can be separated by drawing one straight line between the data points. For instance, consider the plot of data points for an OR gate classifier in figure 2.8a and for an AND gate classifier in figure 2.8b. For the OR gate, when the input features X1 and X2 are both 0, the output is plotted as 0 in the graph space, and 1 for all other cases. For the AND gate classifier in figure 2.8b, the output is 1 only when both inputs are 1.

It is evident that there are two output classes, 0 and 1, and that they can be separated by a single straight line. Therefore an SLP, given two input features X1, X2, can successfully predict the output, depending upon the bias, activation function and connection weights.

However for a plot of data points depicting an XOR gate, it is not possibleto separate the two outputs 0 and 1, with a single straight line, as illustrated infigure 2.8c. Therefore an SLP is not sufficient to classify data which cannot belinearly separated. This is where the hidden layers of a neural network come in.These hidden layers represent piecewise linear boundaries.

Such a network comprising multiple hidden layers is called a multi-layer neural network. It is further classified as a Convolutional Neural Network, Recurrent Neural Network, Long Short-Term Memory network etc., based on its architecture. A neural network is also called a Deep Neural Network when it has more than one hidden layer.

2.9 Multi-Layer Perceptron Neural Network

For the non-linear data classification explained in the previous section, it is necessary to have more than one layer.

2.9.1 Learning

Figure 2.9 depicts a simple process of learning through update of connectingweights.


Figure 2.8: Linear Separability of Data Points ((a) OR Classification, (b) AND Classification, (c) XOR Classification)

Figure 2.9: Learning Process

In a typical one-layer neural network, the output of a layer is a linear combination of the input features and the bias, given by equation 2.2.

h_j = Σ_i w_i x_i + b_j    (2.2)

An activation function f(h_j) is applied to the output of each layer, as shown in equation 2.3.

aj = f(hj) (2.3)

This process is repeated until the last layer is reached. The final output of all nodes of the last layer (y_p) is subtracted from the actual output (y) to give an error function E in equation 2.4.

E = y − y_p    (2.4)

A partial derivative is computed using the error value,

δ_0 = E · f′(h_j)    (2.5)

and it is computed recursively for each preceding layer, as shown in equation 2.6.

δ_k = δ_0 · W_k · f′(h_k)    (2.6)

The weights W_k are then updated with new values until the error function is as small as possible.
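The following minimal numpy sketch illustrates one training step of the kind described by equations 2.2 to 2.6, assuming a single hidden layer and a sigmoid activation. It is an illustration of the general back-propagation idea, not the implementation used later in this project.

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_step(x, y, W1, b1, W2, b2, lr=0.1):
    # Forward pass: h_j = sum_i w_i x_i + b_j, then a_j = f(h_j)
    h1 = W1 @ x + b1
    a1 = sigmoid(h1)
    h2 = W2 @ a1 + b2
    y_pred = sigmoid(h2)

    # Error (eq. 2.4) and output-layer delta (eq. 2.5): delta_0 = E * f'(h)
    E = y - y_pred
    d2 = E * y_pred * (1 - y_pred)
    # Hidden-layer delta (eq. 2.6): propagated backwards through W2
    d1 = (W2.T @ d2) * a1 * (1 - a1)

    # Update the weights in the direction that reduces the error
    W2 += lr * np.outer(d2, a1)
    b2 += lr * d2
    W1 += lr * np.outer(d1, x)
    b1 += lr * d1
    return W1, b1, W2, b2
```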

2.9.2 Activation Function

Most real-world functions that map inputs to target outputs are not linear. Moreover, due to the sheer number of input features that influence the output of a system, it is not possible to maintain linearity with respect to all of them.

The activation function introduces non-linear properties to the network. Using it, the mapping from an input feature vector to the output can be made non-linear. The activation function should also be differentiable (refer to equation 2.5).

Three commonly used activation functions — sigmoid, ReLU and softmax — are discussed below. They are depicted in figure 2.10 and have been redrawn from [65].

Sigmoid

The sigmoid activation function (refer to figure 2.10a) was the first activation function used, since it models the firing rate of a neuron, where 1 means fire and 0 means no action, although it is less common today. It has fallen out of favour because, firstly, it is not zero-centered, which makes it difficult to optimize, and secondly, it suffers from the vanishing gradient problem.

ReLU

The Rectified Linear Unit is the most commonly used activation function today, because it gives an almost six times better performance result according to []. However, it should only be applied on the hidden layers. It is simply defined as max(0, x): the value of the activation function is 0 whenever the input is less than 0, and equal to the input when the input is greater than 0. ReLU has also been known to learn faster and to avoid the vanishing gradient [].


Figure 2.10: Activation Functions ((a) Sigmoid, (b) ReLU, (c) Softmax)

Softmax

Softmax activation function is used only in the final layer, if the model is usedfor classification. This is because the softmax function gives the probabilities fordifferent target outputs.

2.9.3 Bias

A bias value is crucial to machine learning: it adds an intercept to the activation response. Figure 2.11 shows the response of a network with one input node and one output node, without a bias, using a sigmoid activation function. As shown in figure 2.11, the output varies between 0 and 1 depending upon the input.

The weights (0.5, 1, 2) in the legend indicate the influence of the weighted input on the response. However, in this figure, the weight can only change the slope of the curves. If the output of such a network needed to be 0 whenever the input is 1, no choice of weight alone could achieve it.

Now consider an alternative, where a bias input of 1.0 is added to every weighted input. This time the input weights are kept equal, while the weights on the bias vary. As seen in figure 2.12, it is now possible to get an output of 0 even when the input is 1.


Figure 2.11: Activation Function without Bias Value

Figure 2.12: Activation Function with Bias Value


The bias thus makes a neural network more responsive to different input values.

2.10 Radial Basis Function Neural Network

An RBFN is a special kind of multi-layer perceptron that executes a transforma-tion on the feature space before feeding it for classification. This transformationby a radial basis function is performed in the following two ways:-

Non Linear Transformation

The input feature space on the left in figure 2.13 shows a bunch of colour codeddata points belonging to two different classes. The distribution indicates it is notpossible to classify them using a single line.

Figure 2.13: Non Linear Transformation

A non-linear transformation will force the feature vectors to cluster in the direction of the arrows shown on the right in figure 2.13. Once the classes are clustered together, it is possible to separate them linearly.

Dimensionality Addition

Consider a feature space as depicted in figure 2.14, adapted from [66]. It depictsdata points belonging to two classes in red and blue. In a regular dimension onthe left, it is not possible to separate these data points by a single line. RadialBasis function transforms this feature space by adding one extra dimension.

In this new extra-dimensional feature space, it is possible to separate the datapoints by inserting a hyperplane.


Figure 2.14: Effect of extra dimension on Separability

2.10.1 Definition of Radial Basis Function

A feature vector in the regular feature space is represented by the function φ(X). It has been established that an RBF transforms the feature space by adding a dimension to it. Let the number of dimensions in the old feature space, which contains the original feature vector, be P, and let the number of dimensions of the new feature vector be N. Then the feature vector of the old feature space, φ(X), is represented in the new feature space as a collection of RBFs, as illustrated in equation 2.7.

φ(X) = [φ_1(X), φ_2(X), φ_3(X), . . . , φ_N(X)]^T    (2.7)

where each element φ_1(X), φ_2(X), . . . gives a real value as its output. Since N > P, the dimensionality of the new feature space is obviously higher.

Every Radial Basis Function (RBF) has a receptor point t. The real value of anRBF is dependent upon its radial distance r from t. As illustrated in figure 2.15.On each of the concentric circles, the value of the RBF remains constant.

To compute the radial basis function for a point X in original feature space,we first compute the radial distance from receptor. It is the Euclidean distancebetween the data point and the receptor t, given by equation 2.8.

r = ||X − t|| (2.8)

The relation between the r and the Radial Basis Function value depends upon thetype of function used.


Figure 2.15: Radial Basis Function distance from Receptor t

2.10.2 Types of Radial Basis Functions

Multiquadrics

φ(r) = √(r² + c²)    (2.9)

where c > 0.

For a point located at the receptor t, r = 0 and therefore φ(r) = c, which is the minimum. As r increases, the RBF value increases. This is illustrated in figure 2.16a.

Inverse Multiquadrics

φ(r) = 1 / √(r² + c²)    (2.10)

where c > 0.

The maximum value, at r = 0, is 1/c. As r increases, φ(r) decreases. This is illustrated in figure 2.16b.

Gaussian

φ(r) = exp(−r² / (2σ²))    (2.11)

where σ > 0.

This is the most commonly used Radial Basis Function [67]. φ(r) is maximum at the receptor location t and decreases exponentially as the point moves away from the receptor. This is illustrated in figure 2.16c.


Figure 2.16: Radial Basis Function Plots ((a) Multiquadrics, (b) Inverse Multiquadrics, (c) Gaussian)

2.10.3 Working with an Example

This section describes how an RBF transforms a feature space, using a familiar example. Consider the problem of designing an XOR gate classifier: it needs to successfully separate the 1 outputs from the 0 outputs. The feature space for its input vectors is illustrated in figure 2.8c.

Now we add the receptors t1 and t2 at the points (0, 0) and (1, 1) respectively, and use the Gaussian Radial Basis Function. Assuming 2σ² = 1, the RBF is given by equations 2.12 and 2.13:

φ_1(X) = exp(−||X − t1||²)    (2.12)

φ_2(X) = exp(−||X − t2||²)    (2.13)

Input is at (0, 0)

When the input feature vector X has the features (0, 0), this coincides with re-ceptor t1. Therefore the Euclidean distance r between the feature vector and thereceptor is 0.

∴ φ_1(X) = exp(−0²) = 1    (2.14)


Meanwhile for φ2(X), the receptor t2 is at a distance of√2 from (0, 0).

∴ φ_2(X) = exp(−(√2)²) = exp(−2) = 0.13    (2.15)

Input is at (0, 1)

The input feature is now at a Euclidean distance of 1 from both t1 and t2. I.e.r = 1.

∴ φ_1(X) = exp(−1²) = exp(−1) = 0.36    (2.16)

Similarly φ2(X) = 0.36

Plotting Values on Graph

From equations 2.14, 2.15 and 2.16 the values of φ1(X) and φ2(X) can be similarlyfound for other combinations of the feature vectors, and recorded as shown intable 2.4.

X1   X2   φ1(X)   φ2(X)
0    0    1       0.13
0    1    0.36    0.36
1    0    0.36    0.36
1    1    0.13    1

Table 2.4: Plot of Feature Space post Transformation

Thus the plot of the old feature space from figure 2.8c can now be transformed as shown in figure 2.17. It is evident that the data points corresponding to inputs (0, 0) and (1, 1), in red, are now linearly separable from the data points for (0, 1) and (1, 0).
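A short sketch of this worked example, assuming the Gaussian RBF of equations 2.12 and 2.13, reproduces the values of table 2.4 (up to rounding; exp(−2) is about 0.135).

```python
import numpy as np

# Receptors t1 = (0, 0) and t2 = (1, 1), with 2*sigma^2 = 1.
t1, t2 = np.array([0.0, 0.0]), np.array([1.0, 1.0])

def phi(x, t):
    # phi(X) = exp(-||X - t||^2)
    return np.exp(-np.sum((np.asarray(x, dtype=float) - t) ** 2))

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, round(phi(x, t1), 2), round(phi(x, t2), 2))
```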

2.11 Neural Network

2.11.1 Architecture

A multi-layer perceptron that employs a radial basis function has the following layers:

1. An input layer: This layer accepts all the raw features after data vectoriza-tion.


Figure 2.17: Transformed Feature Space

2. Hidden Layer/s: In the hidden layer, there are N nodes, where N is the number of dimensions that the original feature space should be upgraded to. Each node in this layer performs the non-linear transformation of the data from the input layer using an RBF.

3. Output Layer: This layer performs the linear combination of the output ofthe hidden layer nodes.

Figure 2.18 depicting a generic neural network model architecture has been adaptedfrom [59].

2.11.2 Training at the First Hidden Layer

There are two instances of training in a Radial Basis Function Neural Network. Each node in the hidden layer represents an RBF. From equations 2.11 and 2.8, the values of the Gaussian spread σ and the receptor t need to be calculated; the receptors are found using clustering. Computing these two parameters is the purpose of training in the hidden layer.

The second instance of training occurs after the hidden layer parameters are fixed: the connection weights between the hidden and output layers must be updated as well. As stated in section 2.8, and specifically in equation 2.1, the error is minimized by updating these connection weights. This training proceeds from the output layer towards the hidden layer, in the backward direction. More on this is covered in section 2.11.3.


Figure 2.18: Deep Neural Network

Clustering

This is one of the more widely used methods to find the locations of the receptors t [52]. The number of clusters is equal to the number of dimensions in the hidden layer feature space, i.e. N. The input feature space is divided into N regions by randomly introducing N center points into the feature space.

Then, depending upon the radial distance between a center point and an input data point, given by equation 2.8, each data point is associated with its nearest center point. In this first iteration of computing the distances, N clusters are created, as indicated in figure 2.19b. This figure depicts an enlarged part of the feature space of figure 2.19a, where two cluster centers C1 and C2 are first created randomly. The Euclidean distances between the cluster centers and the data points surrounding them are then calculated; here it is shown for data points X1 and X2. It is evident that X1 is closer to C1 and X2 to C2, so X1 and X2 will become part of C1 and C2 respectively.

By taking the mean of all data points in a certain cluster, the new cluster center is shifted to this mean value. The whole sequence is repeated until the cluster centers can no longer be moved to a new mean. This state is called convergence.

Computing the Spread σ

The second objective of training the hidden layer is to compute the spread of theGaussian RBF denoted by σ. Once the receptors have been frozen at convergence,the nearest receptor neighbours are calculated. Again the distance to find theproximity is found using the Euclidean distance, refer equation 2.8.


Figure 2.19: Clustering ((a) Input Feature Space, (b) Clustering as per Radial Distances)

If there are N_c closest neighbours, then the minimal spread of a receptor is found by equation 2.17:

σ_j = √( (1/N_c) · Σ_{i=1}^{N_c} (t_j − t_i)² )    (2.17)

where σ_j is the spread of a hidden layer node j with its receptor at t_j, and t_i (i = 1, . . . , N_c) are its N_c closest neighbouring receptors.

An analogy can be drawn from the feature space of figure 2.8c. Consider receptors t1 and t2 at (0, 0) and (1, 1) respectively, and imagine additional receptors at (0, 1) and (1, 0). Then, from the perspective of receptor t1, the receptors at (0, 1) and (1, 0) are its closest neighbours, not t2 at (1, 1).
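As an illustration of equation 2.17, the spread of one receptor can be computed from its N_c nearest neighbouring receptors as in the following sketch; the array name receptors is an assumption, not taken from the thesis code.

```python
import numpy as np

# Spread sigma_j of receptor j from the distances to its nc nearest neighbouring receptors.
# 'receptors' is assumed to be an (M, d) array of receptor locations.
def spread(receptors, j, nc):
    dist = np.linalg.norm(receptors - receptors[j], axis=1)   # distances to receptor j
    nearest = np.sort(dist)[1:nc + 1]                         # skip receptor j itself (distance 0)
    return np.sqrt(np.mean(nearest ** 2))                     # sigma_j, per equation 2.17
```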

2.11.3 Training at Remaining Layers

The next stage of training, post transformation, occurs at the output layer. The model learns to classify different target outputs based on the input data from the previous layers. To reiterate, learning means finding a set of connection weights that help the model map the inputs to the target outputs. This learning occurs in a series of trials in which the neural network model first finds out how far its prediction is from the actual output. This difference is calculated by the loss function, or objective function. For an RBF, this function is the radial distance in equation 2.8. The loss function thus receives the predicted and the actual output, and computes the difference for a particular sample.

The difference, or loss score, is sent back to the model as feedback to adjust the weights. The unit responsible for adjusting the weights is the optimizer. The process is illustrated with the help of figure 2.20, adapted from Chollet [59].

Figure 2.20: Training Process in Neural Network

Let us consider a simple example of this flow. The model has its initial weights assigned randomly, so the score it computes using the loss function has a large value. This value is fed to the optimizer, which uses it to adjust the weights of the layers. These steps are performed in a loop tens to thousands of times, where at each iteration the loss score is reduced by a fraction. This training finally yields the weight parameters that the model uses to minimize the loss.


Chapter 3

Approach

This chapter details how machine learning concepts explained in chapter 2 areused to find answers to the question of efficacy of RBFN in IDS.

3.1 Selection Procedure of Algorithm for IDS

This project was conceptualized by Techson AB[68]. The end point of its majorresearch has been exploring the possibility of using machine learning for predict-ing anomalies. Techson in collaboration with its sister company CogniMem[69],a neuromorphic chip provider, have been interested in R&D of their acceleratorboard, the CM1K[70].

3.1.1 CM1K

The CM1K is a publicly available neuromorphic Application Specific Integrated Circuit (ASIC) that implements two machine learning algorithms, the radial basis function neural network and K-nearest neighbour, in hardware.

Due to its in-built parallelism, the CM1K recognizes classes regardless of its knowledge base [71]. It is therefore used to recognize patterns in data that is audio-visual or in any other vectorizable format [72]. The CM1K can run either an RBFN or a KNN by toggling a chip-select pin. It has 1024 neurons that give it the ability to run parallelized code faster. It has therefore been hypothesized that using an RBFN will accelerate the anomaly prediction procedure.

However before focusing on the speed, it is necessary to focus on the accuracyof the algorithm. Therefore this project solely focuses on the software simulationof the RBFN and measuring its accuracy without implementing it on the CM1K.


Hence the problem statement focuses only on simulation and correctness of thealgorithm and not its speed, as referenced in section 1.2.

3.1.2 Setting up the Environment

To answer the research question, a range of tools and frameworks have been em-ployed for development, testing and evaluation of the proposed RBFN algorithm.Here are the tools used:-

• Python: A scripting language, used for automating experiments and writ-ing the entire algorithm.

• PySpark: A Python Application Programming Interface (API) for processing large, memory-intensive datasets in parallel.

• Scikit-learn: A machine learning library effective for evaluation of the ma-chine learning model.

• Numpy: A Python library that lends support to processing of large multi-dimensional data structures, as well as high level mathematical functions.

• Pandas: An additional Python library that provides high performance, sim-ple data structures called Data Frames.

• Pickle: A Python module that implements an algorithm for serializing and deserializing large data structures. This module saves time by storing large numpy-based data structures in serial format, so that these variables need not be regenerated at run-time on successive code execution attempts.

• Keras: A high-level neural network API written mostly in Python language.It is used for quick implementation of deep neural networks.

Computing Resources

The training and testing of the RBFN is conducted on a Dell desktop workstationwith 32 GB system memory. Table 3.1 illustrates its specifications:-

Device   Property               Value
—        Architecture           x86_64
CPU      Model Name             Intel(R) Core(TM) i5-8600K CPU @ 3.60GHz
CPU      Vendor                 Intel Corporation
CPU      Clock                  100 MHz
CPU      Size                   4192 MHz
CPU      Capacity               4300 MHz
CPU      Width                  64 bits
CPU      No. of cores           6
CPU      No. of enabled cores   6
CPU      No. of threads         6
CPU      L1 Cache Size          384 KiB
CPU      L2 Cache Size          1536 KiB
CPU      L3 Cache Size          9 MiB
GPU      Model Name             GP102 [GeForce GTX 1080 Ti]
GPU      Vendor                 NVIDIA Corporation
GPU      Clock                  33 MHz
GPU      Width                  64 bits

Table 3.1: Computing Resource Features

3.2 Preprocessing

Real world data is often full of undesirable traits such as duplicate values, unreadable formats, inconsistent formats etc. As explained in section 2.5, the available data is in the form of CSV files, comprising 84 columns and over 2 million rows. This raw data needs to be processed so that a machine learning algorithm can read it and use it later for further processing. This stage is called preprocessing, and it is a crucial component of data mining.

Figure 3.1 provides a bird's-eye view of the conversion of the raw data into a format readable by the machine learning model.

Figure 3.1: Block Diagram of Preprocessing Tasks

3.2.1 Feature Selection

Data redundancy sets in when a data instance becomes irrelevant or inconsequential to the final model. This can be due to duplicated data points, or data points containing features that have no impact on the model's prediction. Cleaning up the data is critical, as redundant data otherwise adds unnecessary load on system resources. A large number of features fed to a machine learning model adversely impacts the model's performance without providing any additional practical insight.

[9] used the Random Forest Regressor algorithm to reduce the number of features, selecting the features which best detect each attack. Reducing the number of features to classify from, in table 2.2, not only speeds up the algorithm but also improves the predictive accuracy of classification [73].

Using the pandas library, a new set of features is created, as depicted in table 3.2. These features have the maximum impact on the prediction of intrusion attacks on the dataset, while the excluded features do not. E.g. source and destination IP addresses do not impact the learning, because IP addresses can be spoofed [74] using publicly available software.

3.2.2 Data Vectorization

Vectorization is a technique used to convert the inputs to a neural network into floating point data. The pandas dataframe from section 3.2.1 above is a 2M x 32 feature frame, as shown in the figure below.

The features are stored in one dataframe, and the classes or labels of the last column are stored in a separate dataframe. In feature selection, it was ensured that the X set contains only floating point data. However, Y contains classes, which is textual data. To convert this sub-frame into floating point values, one-hot encoding [56] is used.

The X dataframe and the one-hot encoded Y dataframe are converted into n-dimensional numpy arrays of floating point values, which can be read and processed by the model later.
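A minimal sketch of this vectorization step is shown below; the toy dataframe and its column names are illustrative only and not taken from the thesis code.

```python
import pandas as pd

# Toy dataframe standing in for the cleaned CICIDS2017 frame after feature selection.
df = pd.DataFrame({
    "Flow Duration": [120.0, 4500.0, 87.0],
    "Total Fwd Packets": [3.0, 40.0, 2.0],
    "Label": ["BENIGN", "PortScan", "BENIGN"],
})

X = df.drop(columns=["Label"]).astype("float32")   # feature sub-frame, floating point only
Y = pd.get_dummies(df["Label"])                     # one-hot encode the textual class labels

X_arr = X.to_numpy()                                # numpy arrays read by the model later
Y_arr = Y.to_numpy(dtype="float32")
print(X_arr.shape, Y_arr.shape)
```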


Attack Type Feature Selected

Benign Bwd Packet Length Mean,Subflow Fwd Bytes,Total Length of Fwd Pck,Fwd Pck Length Mean

Port Scan Init_Win_bytes_fwd,Total Backward Packets,PSH Flag Count

DDoS Bwd Packet Length Std,Average Packet Size,Flow Duration,Flow IAT Std

FTP-Patator Init_Win_bytes_fwd,Fwd PSH Flags,SYN Flag Count,Total Fwd Packets

SSH Patator Init_Win_bytes_fwd,Subflow Fwd Bytes,Total Length of Fwd Pck,ACK Flag Count

BotNet Subflow Fwd Bytes,Total Length of Fwd Pck,Fwd Pck Length Mean,Total Backward Packets

Web Attack: Brute Force, Init_Win_bytes_fwd,XSS, Subflow Fwd Bytes,SQL Injection Init_Win_bytes_bwd,

Total Length of Fwd PckInfiltration Subflow Fwd Bytes,

Total Length of Fwd Pck,Flow Duration,Active Mean

Table 3.2: Feature Selection


3.2.3 Sampling

As evident in table 2.1, the CICIDS2017 dataset is heavily imbalanced in favour of benign class labels. More than 80% of the dataset at this stage comprises only the benign class. Such an imbalance gives an incorrect impression of a model's performance, as explained for accuracy in section 2.6.1.

Sampling is a statistical method of analyzing a part of data, considering it rep-resents a microcosm of the entire larger dataset. But since the class is imbalanced,it is difficult to obtain a sample set which accurately represents class distribution.

This can be remedied using oversampling techniques. SMOTE is one such technique, used to create synthetic data points for minority classes. In this project, the linear SMOTE technique of the scikit-learn library [75] is used to synthesize minority samples with a random seed of 42, after which the entire dataset is resampled. The resampled dataset increases in size from about 2M rows of floating point values and class labels to just over 18.6M.
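A sketch of this oversampling step is shown below. It assumes SMOTE is taken from the scikit-learn-compatible imbalanced-learn package, and that X_arr holds the vectorized features of the full dataset with y_labels as the plain (not one-hot) class labels; these names carry over from the earlier sketches.

```python
from imblearn.over_sampling import SMOTE

# Oversample the minority classes with SMOTE, using the seed stated in the text.
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X_arr, y_labels)
print(X_res.shape)        # the resampled dataset is far larger than the original
```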

3.2.4 Splitting Data

If the entire 18.6M-strong dataset is fed to the model, then it will likely overfit. In this context, overfitting implies the model will learn everything about this dataset and will give exceptionally high accuracy, but when it encounters data points outside this dataset, it will falter miserably. Therefore it is crucial that the model be trained on only 70 to 80% of all the available data, and that the remainder be used for testing and validation.

The usual practice in validating a machine learning model is to fit it on the training data and to make predictions on data that the model has not encountered before, i.e. the test data. The scikit-learn library provides a method to split the large dataset into training, testing and validation sets [56]. The ratios used to split the sampled data are illustrated in the figure. In the end, we are left with three arrays: training, testing and validation.
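A sketch of this splitting step, applying scikit-learn's train_test_split twice; the 70/15/15 ratio is an assumption standing in for the ratios shown in the figure.

```python
from sklearn.model_selection import train_test_split

# First split off 30% of the resampled data, then divide that half-and-half
# into validation and test sets (assumed 70/15/15 overall).
X_train, X_tmp, y_train, y_tmp = train_test_split(X_res, y_res, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)
```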

3.3 Model Training

3.3.1 Normalization

A normalized variant of the training array is created as follows:-

1. Find Mean and Standard Deviations of every column of the training ar-ray: This is achieved by passing the array to a numpy library function.


Figure 3.2: Block Diagram of Modeling Tasks

2. Calculate Normalized Value: This is done by the following equation:

normalized_value = (training_value − µ) / σ

where µ is the mean and σ is the standard deviation of the corresponding column.

The process is repeated to create normalized variants of the test and validation arrays. The elements of these arrays are the normalized data points of each of the three sets.
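A sketch of this normalization step; reusing the training-set mean and standard deviation for the test and validation arrays is a common choice assumed here, since the text does not state which statistics are reused.

```python
import numpy as np

# Z-score normalization of every column, using the training-array statistics.
mu = X_train.mean(axis=0)
sigma = X_train.std(axis=0)

X_train_n = (X_train - mu) / sigma
X_val_n = (X_val - mu) / sigma
X_test_n = (X_test - mu) / sigma
```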

3.3.2 K-Means Clustering

We have covered the theory behind the clustering in section 2.11.2.

1. Initialize an array of zeroes: The size of this array is no_of_clusters × no_of_features. In our case it is 10 × 31.

2. Populate this array with random data point values: However the valuesshould lie between the minimum and maximum of normalized data points.

3. Find Euclidean distances between the normalized training array and the K-means clusters: This is obtained using the following equation:

distance = √( Σ_0^N |normalized_value − kmean|² )

where N = number of rows in the array.


4. Find minimal distance: Using the numpy library function argmin() tofind minimum value of the euclidean distances.

5. Cluster data points according to minimal distances: Now that we have aminimal distances between K-means and normalized data points, we clus-ter them accordingly, using the numpy append() function.

6. Search New K-mean: We compute new clusters by finding means of datapoints associated a specific cluster. We use the numpy mean() functionagain to compute new cluster centers.

7. Re-iterate the process until no new cluster centers are found: It has beenfound that convergence was reached after maximum 40 iterations.

In our model, there are 256 cluster centers, or k-mean values. A minimal sketch of the clustering procedure is given below.
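The following is an illustrative numpy sketch of the K-means steps above; it is not memory-optimized for the full multi-million-row training array and is not the thesis implementation.

```python
import numpy as np

# X is the normalized training array; k is the number of cluster centers (256 in the model).
def kmeans(X, k, max_iter=40, seed=42):
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    centers = rng.uniform(lo, hi, size=(k, X.shape[1]))       # random centers within data range
    for _ in range(max_iter):
        # Distance of every point to every center, then nearest-center assignment
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = dist.argmin(axis=1)
        # New centers are the means of the points assigned to each cluster
        new_centers = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                  # convergence: centers stop moving
            break
        centers = new_centers
    return centers
```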

3.3.3 Data Transformation using RBF

As explained in section 2.10, the data points, even when normalized and segregated into clusters, are still not linearly separable. We need to transform them so that the neural network model can later separate the data points linearly.

1. Transform Input Values: We use the Gaussian RBF to transform the normalized data points. The transformation is given as follows:

new_value = exp( − √( Σ_0^N Σ_0^K |i − kmean|² ) / σ² )

where N = number of instances, K = number of clusters, i = a normalized input data point, kmean = the k-mean cluster values calculated before, and σ = a standard deviation constant set to 1.2.

2. Repeat the process for the other sets: Transform all the input values for the training, testing and validation datasets. A sketch of the transformation is given below.
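The sketch below writes the transformation with the per-centre Gaussian RBF of equation 2.11; the variable names follow the earlier sketches, and σ = 1.2 as stated above. It is an illustration, not the thesis code.

```python
import numpy as np

# Map each normalized data point to one value per cluster center:
# phi_j(x) = exp(-||x - c_j||^2 / (2 * sigma^2)), per equation 2.11.
def rbf_transform(X, centers, sigma=1.2):
    # squared Euclidean distance of every point to every center: shape (n_points, n_centers)
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

X_train_rbf = rbf_transform(X_train_n, centers)   # hidden-layer inputs for training
X_val_rbf = rbf_transform(X_val_n, centers)
X_test_rbf = rbf_transform(X_test_n, centers)
```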

3.3.4 Generate a Keras Model

The Keras framework was used, firstly, because the author is familiar with its environment, and secondly, because Keras provides a great deal of flexibility and simplicity. The simplest model is the sequential model, where the layers of the neural network are stacked linearly atop one another, as shown in figure 3.3.


Figure 3.3: Radial Basis Function Feed Forward Neural Network

Model Architecture

The proposed model is a neural network comprising one input, one hidden and one output layer. The input and hidden layers contain 256 and 64 cells/nodes respectively. The output layer has 10 nodes, one for each of the 10 classes the classifier predicts. It is a densely connected network, where each node in a layer is connected to every node of its adjacent layers.

Many aspects of the architecture, such as the number of nodes in a layer, the dropout rate, the learning rate, the optimizer and the number of epochs, were decided after multiple trials of hyperparameter optimization. More on this is covered in section 3.4.

Dropout

Between each dense layer, a 5% dropout rate is set. Here, dropout means randomly dropping out or nullifying a fraction of the features of a layer during training. E.g. if a layer outputs the vector [1.2, 0.4, 0.2, 1.9, 2.0], then after the dropout operation it might return [1.2, 0, 0.2, 1.9, 2.0]. This is indicated by the light-grey cells in figure 3.3.

Activation Function

relu is the most preferred choice of activation function, as covered in section 2.9.2. Therefore, relu is the chosen activation function for the input and hidden layers.

The table adapted from [59] represents the parameters that need to be considered while designing the final layer of a neural network. Accordingly, our model uses a softmax activation function on its final layer.

Loss Function

The RBFN has the task of categorizing input into more than two categories, because there are 10 different types of attacks recorded in the dataset. However, it is also a single-label classification task, because any instance in the dataset is mapped to only one of the attacks at a time. Therefore we use the categorical_crossentropy loss function.

Compiling the Model

After generating the model, it is compiled using the compile() function. Keras is particularly useful because it allows the user to change the configuration of the optimizer even further, giving maximum possible control over the model. For this model, through empirical trials, adamax [76] was found to be the best performing optimizer, with a learning rate of 0.01.
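A minimal Keras sketch of the architecture and compilation described above: the layer sizes, dropout rate, activations, optimizer and loss follow the text, while the input shape of 256 (one value per cluster centre after the RBF transformation) is an assumption.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sequential model: 256-node and 64-node dense layers, 10-class softmax output,
# with 5% dropout between the dense layers.
model = keras.Sequential([
    layers.Dense(256, activation="relu", input_shape=(256,)),  # RBF-transformed features
    layers.Dropout(0.05),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.05),
    layers.Dense(10, activation="softmax"),                    # one node per attack class
])

model.compile(optimizer=keras.optimizers.Adamax(learning_rate=0.01),
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```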

3.3.5 Fit the model

In the training phase, data in the form of numpy arrays is fed to the neural network model; in our case, the 31 network features. The model now needs to learn to associate the features with the classes, or in our case, the attack labels. Therefore we 'fit' the model by invoking the fit() method, passing the RBF-transformed arrays and the one-hot encoded labels to the model object.

At this stage, we also configure the number of epochs, the training batch size and other parameters such as the direction of training. We use 80 epochs and a training batch size of 20 instances. The number of iterations is given by

iterations = epochs × N / batch_size

where N = number of training instances. We have 13,461,840 training instances post sampling and splitting.

∴ iterations = 80 × 13,461,840 / 20 = 53,847,360

3.3.6 Plot the Accuracy and Validation Loss Curves

We use the matplotlib library to plot the following four curves:


1. Training Accuracy

2. Validation Accuracy

3. Training Loss

4. Validation Loss

The plots vary greatly depending upon the parameters of the network. They illustrate the performance of the model by depicting the trends of the curves over time. A model's accuracy should increase with time, and the loss, i.e. the difference between actual and predicted values, should decrease with time.

3.3.7 Make Predictions

Then we make predictions by invoking predict() method with the test datasetas input parameter.

3.4 Hyperparameter Optimization

During the training of a neural network, varying the hyperparameters can greatly influence its performance. Due to the sheer dataset size and training time, grid-search-based tuning was avoided. Random search is not the best alternative either, because it is not possible to state conclusively that the parameters chosen were the best [77]. Therefore, we use intuitive hand tuning to zero in on the most responsive hyperparameters, which make the RBFN model perform best [78].

We conducted a battery of tests to fine-tune the model and achieve the maximum desirable performance from it. The following parameters were modified during the series:

1. Hyperparameters for the network architecture:
   - Number of layers
   - Number of hidden units
   - Dropout
   - Activation function

2. Hyperparameters for the training algorithm:
   - Learning rate
   - Size of training batch
   - Number of epochs
   - Optimizer


Chapter 4

Results and Analyses

This chapter explains how machine learning algorithms are analyzed. It alsosheds some light on the parameters that are required to evaluate the efficiencyof the said algorithms.

The intent of this project is to understand and analyze how efficient RadialBasis Function Neural Networks (RBFN) are in the IDS. Therefore, how well thismodel performs forms the crux of this chapter.

However since it is necessary to quantify the results, this project also comparesthe evaluation parameters of RBFN with the evaluation parameters derived fromalgorithms used in the industry and in academia.

4.1 Results

4.1.1 Analysis of Model Performance during Optimization

This section details the results of different tests conducted on the RBFN model, toassess the model’s performance. The trials began with a series of hyperparame-ters frozen as default. One or two of these parameters were changed successivelyto achieve maximum performance.

E.g. we kept the parameters epoch count, batch size, layer count, node count and optimizer type at their defaults in one trial. We kept the same defaults in the next trial, but changed one hyperparameter: the dropout rate. If a succeeding trial gave worse accuracy than before, we reverted to the previous dropout rate, modified some other parameter, and moved on. Figure 4.1 depicts how well the model performed against the number of trials conducted.


Figure 4.1: Trials behind Hyperparameter Optimization

4.1.2 Model Performance after Gradient Descent

Figure 4.2a illustrates the plot of validation accuracy and training accuracy orhow well the model has performed over time, represented in the form of numberof epochs. Figure 4.2b meanwhile depicts the same for validation loss and train-ing loss. As explained in section 2.11, this loss is the absolute difference betweenthe predicted class and the actual class. It is the error of classification that everymachine learning model seeks to minimize.

It is also expected that with every epoch, the training loss will decrease and the training accuracy will increase, due to gradient descent optimization [59]. If a model is trained well, the validation loss curve should follow the training curve and close in on zero over time, and the validation accuracy should follow and possibly exceed the training accuracy. Note that accuracy is not the best metric for evaluating a model, especially if the dataset used is imbalanced in favour of a few majority classes.

The dotted lines in figure 4.2 indicate the model's performance over time on the training dataset, while the continuous lines show its performance on the hold-out validation dataset.

Figure 4.2: Training vs Validation Performance post Gradient Descent ((a) Accuracy, (b) Loss)

In figure 4.2a, the training accuracy maxes out at 77.5% over 80 epochs. The RBFN model's validation accuracy curve follows this trend closely, exceeds it, and reaches almost 80% in the same time. Unfortunately, the loss on both the training and validation data does not approach zero in figure 4.2b: the dotted training curve ends the 80th epoch at a loss of 0.5, whereas the continuous validation curve has a slightly lower loss of 0.4, albeit not close to zero.

4.1.3 Performance Metrics

Since mere accuracy is often not the best solution for measuring a machine learn-ing model, there are few other metrics used to quantify its worth. They are allbased on the confusion matrix.

The confusion matrix in figure 4.3 gives a succinct summary of how well the RBFN model performs for every type of registered attack. The concept has been sufficiently covered in section 2.6.1. Bear in mind that the number of instances captured in this case is amplified by the SMOTE sampling technique used in preprocessing of the dataset, as explained in section 3.2.3.

If one were to study how well the RBFN detects a BotNet attack, for instance,one can simply calculate the precision and sensitivity metrics for that attack.

The diagonal in green boxes in figure 4.3 illustrates all the correctly predictedinstances for every attack. E.g. the model correctly caught 345 717 instances ofBotNet traffic flows.

The next line in each green box also gives the accuracy for every class; for the benign class, for example, the model was calculated to be 92.45% accurate.

Figure 4.3: Confusion Matrix for RBFN IDS with Accuracy per Class Distribution

However, the model also wrongly predicted 15548 + 413 + 86 + 997 + 1428 + 37140 + 2592 + 38952 + 2264 = 99420 flows of other classes as Bot attacks. Hence, of all its Bot predictions, only 345717 / (345717 + 99420) ≈ 0.78 were correct, making it 78% precise.

It also missed 3066 + 376 + 66 + 10035 + 5332 + 3 + 2958 + 6349 + 388 = 28573 flows that were actually BotNet, predicting them as other classes. Hence, of all the traffic flows known to be BotNet, the RBFN correctly predicted 345717 / (345717 + 28573) ≈ 0.92, giving it a sensitivity of 92% to BotNet attacks.

The RBFN model provides the most accurate results for detecting Port Scan attacks, followed by FTP and BotNet.

A very comprehensive view of the model's performance can be obtained by examining figure 4.4, where the evaluation metrics of the RBFN model have been plotted for every recorded attack. The proposed model fares rather well at predicting FTP and Port Scan attacks, and rather moderately at predicting malware like BotNet. However, it gives a strictly lacklustre performance at detecting the web attacks and at recognizing harmless, benign user behaviour.

It can be surmised that the RBFN is better at analyzing malware than at detecting anomalies in general, judging by how low its accuracy is for benign traffic.

Figure 4.4: Performance Metrics of RBFN in IDS

4.2 Result Comparison and Validation

Validation is a procedure essential to assure stakeholders that the system they are interested in can meet their needs [79][80]. It is essential to question whether the right product is being built [81].

In the software arena, validation can be done through simulations that set a benchmark against which the product can be compared. Since this project is a study of the RBFN, its performance can be assessed by comparing it quantitatively with other algorithms used for the same purpose. This is an established way of measuring a model's worth, as illustrated by research papers [82][83][84], among others.

Secondly, it is also fair to use standard values as benchmark if the dataset isfairly available to research.

As mentioned in section 2.4.1, the Random Forest algorithm is a major classical machine learning algorithm that is used commercially to detect anomalies in IDSes. Therefore, it is a comparable and competitive algorithm for the RBFN to go up against.

Therefore validation of the RBFN algorithm is two-fold:

1. Compare with a commercially tested software algorithm.

2. Evaluate with metrics available in scientific literature.

Page 78: Analyzing Radial Basis Function Neural Networks for

60 CHAPTER 4. RESULTS AND ANALYSES

4.2.1 Comparative Analysis with Random Forest Algorithm

An algorithm is said to be valid if it can be compared with a comparable alternative in a standard environment and with identical inputs. This procedure should also be repeated at different times, and it is expected that the whole endeavour will bear a similar result. With this in mind, the author also implemented a classical Random Forest (RF) algorithm and fed the same dataset to it. Consequently, the RBFN is the proposed model, which is evaluated against an industry-style RF model.

The test has been calibrated by using the identical input dataset, identical eval-uation metrics and an identical test bed environment.

As explained in section 2.6.1, precision and recall are commonly used evaluation metrics for imbalanced data. Precision is the fraction of predicted positives that are actually true; it reflects how many false alarms a model raises for a particular behaviour (i.e. FP).

Recall, or sensitivity, is the fraction of actual instances of a class that are correctly identified. It tells the user how many instances of a particular class the machine learning model missed (i.e. FN).

Precision

Figure 4.5 illustrates the precision scores of the proposed RBFN model vs thecommercial RF model.

It is evident that the RF performs exceedingly well, predicting all of the benign traffic flows correctly. It gives a similar performance at detecting the two Patator attacks. However, it is abysmal at predicting the web attacks of Brute Force, SQL Injection and Cross-Site Scripting (XSS): the RF model could not correctly predict whether a certain traffic flow is a port scan or a web attack, and in all of these instances it wrongly predicted the outcome.

The RBFN model, meanwhile, works far better at predicting the Port Scan, web-based and Bot attacks: it correctly predicts nearly 97% of all Port Scan attacks. But it performs only moderately at detecting the benign flows, which have the lion's share of the data.

It can be inferred that the RF is exceptional at predicting harmless user behaviour, while the RBFN is moderately good at predicting malware-type attacks.

Figure 4.5: Precision Score Comparison of RF vs RBFN

Recall

Figure 4.6 depicts the sensitivity of the proposed RBFN vs. the standard RF. One may notice that the RF is excellent at predicting all cases of Bot attacks. The RBFN is not too far behind, with a very good sensitivity score of 92% itself. A similarly high performance metric is observed when both models detect FTP-Patator based attacks.

However, the RBFN performs very poorly at predicting benign flows and DDoS and Infiltration attacks. The RF was observed to be even worse, again catching none of the web attacks. The RF missed many of the web-based attacks, and the RBFN did not perform well there either. Both were comparable at predicting Bot malware attacks and did not miss too many instances.

Figure 4.6: Recall Score Comparison of RBFN in IDS

F1 Score

The F1 scores in figure 4.7 provide the harmonic means for the RBFN and the RF, stacked up for every attack. It is again clear that the RBFN is only slightly better than the RF at predicting BotNet, FTP and Infiltration.

If an IDS were attacked by Bot or FTP-Patator attempts, both models would catch most of the instances beforehand. However, while the RF would let through all web-based attacks, it would also correctly leave innocuous web traffic alone. The RBFN is better suited to be an all-round classifier with moderate performance, whereas the RF is poised to work extremely well on specific cases.

4.2.2 Comparative Analysis with other Algorithms inLiterature

There is a fair bit of literature that uses the CICIDS2017 dataset in IDS, albeit withdifferent methods of predicting the attacks. This also provides a macroscopicview of how the algorithm fares with the traditionally used classical models ofmachine learning. Most of these models are shallow learning because they donot have more than one hidden abstraction.

KNN[85], Support Vector Machine (SVM)[86], Naïve Bayes[87], Iterative Di-chotomiser 3(ID3)[88] are some of the commonly cited classical ML methodsused. Consequently, the performance metrics illustrated in figure 4.8 are derivedfrom a combination of:-

• Random Forest performance evaluation by the author.

• Performance evaluation metrics from literature study, particularly [9] and[50].

Page 81: Analyzing Radial Basis Function Neural Networks for

CHAPTER 4. RESULTS AND ANALYSES 63

Figure 4.7: F1 Score Comparison of RBFN in IDS

The bar graphs in figure 4.8 represent, in percentages, how well the RBFN (marked blue) performs against other algorithms surveyed in the literature. The other evaluation model, the RF, is also included (marked green). Zhu et al. [50] only tested and published results for their unsupervised AMF-LSTM model in terms of Accuracy and Recall; hence the corresponding Precision and F1 metrics are not available.

The RF gives the highest accuracy, of 96%, as depicted in figure 4.8a. This is striking, especially as the proposed RBFN was accurate only up to 80% of the time. As shown in figure 4.8b, the RBFN could, on average, get 80% of its predicted anomalies right.

This implies that, of all the flows the model predicted as anomalous, 80% were actually anomalous. Here again, the RF beat it with a precision of 96%. In figure 4.8c, unsupervised models such as AMF-LSTM also work rather well, with a sensitivity score of 91%, better than most other state-of-the-art IDS prediction techniques. Predictably, the F1 score of the RF gave it a solid lead over the others at 96%, while the proposed RBFN hovered at 79%. While it is evident that the Iterative Dichotomiser 3 (ID3) algorithm has better numbers, the RF has been shown to be much faster to execute: [9] clocked the execution time of ID3 at 1 min 14 sec, while the author clocked it at 13 min. Regardless, as illustrated in the table, the RBFN took over 9 h to compute the same, albeit with much lower performance scores.

In summary, the graphs in figure 4.8 confirm that the standard shallow-learning-based RF algorithm used by Cognito (section 2.4.1) and a few other commercial IDSes is excellent at predicting anomalous behaviour while leaving harmless behaviour alone. The proposed RBFN model, unfortunately, is not suitable for IDS systems, even though it uses a deep learning architecture.

Figure 4.8: Evaluation Metrics Score comparison of RBFN vs Other Surveyed Models ((a) Accuracy, (b) Precision, (c) Recall, (d) F1 Score)


Chapter 5

Conclusion and Future Work

It has been observed that other machine learning algorithms used to predict anomalies and verified on the same dataset produced far better classification results than the proposed RBFN model. For instance, the RF classifier used by Antigena was verified to be 96% accurate, compared to the 80% accuracy achieved in this project. Conclusively, the proposed method of predicting anomalies in IDSes is not up to the mark compared to contemporary algorithms.

Additionally, the model performed rather poorly with default hyperparameters. Only after hand-tuning over several trials did it reach a comparable accuracy.

On the positive side, the RBFN performed extremely well at detecting specific kinds of attacks; e.g. it was very precise and accurate at detecting Port Scan, BotNet and FTP-Patator attacks. Therefore, an IDS intended specifically to predict these attacks could find the RBFN useful.

The study is, however, delimited by the data of its time. Since cyber attacks evolve on a daily basis, it is necessary to test this model on more instances of recent types of cyber attack, such as crypto-attacks, Heartbleed and various kinds of ransomware. Further studies are needed to prove conclusively whether the RBFN suits the current cyber warfare ecosystem. It would also be wise to improve the computation time of predicting the attacks, by employing parallel processing of the radial basis transformations of the input features.

Deep learning feed-forward neural network methods, especially those employing RBFs, are still not the best approach for predicting anomalies in IDSes, because they have not yet developed far enough to compete with industrial standards. Significant improvements in this prediction model are expected with the advent of adversarial learning.


Lastly, machine learning can be used effectively to automate anomaly prediction tasks. However, two things need to be kept in mind: first, understand the pros and cons of the automation technique; and second, have the system regularly supervised by network administrators.


Bibliography

[1] “Swedish local authority says hit by cyber attack”. en. In: Reuters (May2017). URL: https://www.reuters.com/article/us-britain-security-hospital-sweden-idUSKBN1882OI (visited on 05/04/2019).

[2] Russians Tried to Jam NATO Exercise; Swedes Say They’ve Seen This Before. en-US. URL: https://breakingdefense.com/2018/11/russians-tried- to- jam- nato- exercise- swedes- say- theyve- seen-this-before/ (visited on 05/04/2019).

[3] Internet Security Threat Report. Network Security. Apr. 2016. URL: https://www.symantec.com/content/dam/symantec/docs/reports/istr-21-2016-en.pdf (visited on 05/04/2019).

[4] Intrusion Detection System/Intrusion Prevention System (IDS/IPS) Market IsThriving Worldwide expected to Witness Significant Growth between 2019 to2025 Checkpoint, Cisco, Corero Network Security, Dell, Extreme Networks, HP,IBM – Global Market Research. en-US. URL: https://aglobalmarketresearch.com/intrusion-detection-system-intrusion-prevention-system-ids-ips-market-is-thriving-worldwide-expected-to-witness-significant-growth-between-2019-to-2025-checkpoint-cisco-corero-network-security-de/ (visited on05/04/2019).

[5] Global Market Insights Inc. Intrusion Detection / Prevention System Marketto hit $8bn by 2025: Global Market Insights, Inc. Mar. 2019. URL: http://www.globenewswire.com/news-release/2019/03/26/1767329/0/en/Intrusion-Detection-Prevention-System-Market-to-hit-8bn-by-2025-Global-Market-Insights-Inc.html (visitedon 05/04/2019).

[6] The Middle Market Manufacturer's Roadmap to Industry 4.0. URL: https://www.www.bdo.com/insights/industries/manufacturing-distribution/the-middle-market-manufacturer-s-roadmap-to-in-(1)/the-middle-market-manufacturer-s-roadmap-to-indust (visited on 06/26/2019).

[7] Cyberattacks Skyrocketed in 2018. Are You Ready for 2019? Dec. 2018. URL:https://www.industryweek.com/technology-and-iiot/cyberattacks-skyrocketed-2018-are-you-ready-2019 (visited on 06/26/2019).

[8] Niklas Donges. Pros and Cons of Neural Networks. Apr. 2018. URL: https://towardsdatascience.com/hype-disadvantages-of-neural-networks-6af04904ba5b (visited on 06/30/2019).

[9] Iman Sharafaldin et al. “Towards a Reliable Intrusion Detection BenchmarkDataset”. In: Software Networking 2018.1 (Jan. 2018), pp. 177–200. ISSN: 2445-9739. DOI: 10.13052/jsn2445- 9739.2017.009. URL: https://riverpublishers.com/journal_read_html_article.php?j=JSN/2017/1/9 (visited on 01/12/2019).

[10] Nickolaos Koroniotis et al. “Towards the Development of Realistic Bot-net Dataset in the Internet of Things for Network Forensic Analytics: Bot-IoT Dataset”. In: CoRR abs/1811.00701 (2018). arXiv: 1811.00701. URL:http://arxiv.org/abs/1811.00701.

[11] Ali Shiravi et al. “Toward developing a systematic approach to generatebenchmark datasets for intrusion detection”. In: Computers & Security 31.3(May 2012), pp. 357–374. ISSN: 0167-4048. DOI: 10.1016/j.cose.2011.12.012. URL: http://www.sciencedirect.com/science/article/pii/S0167404811001672 (visited on 04/18/2018).

[12] J. O. Nehinbe. “A critical evaluation of datasets for investigating IDSs andIPSs researches”. In: 2011 IEEE 10th International Conference on CyberneticIntelligent Systems (CIS). Sept. 2011, pp. 92–97. DOI: 10.1109/CIS.2011.6169141.

[13] Paolo Passeri. 2018: A Year of Cyber Attacks. en-US. Jan. 2019. URL: https://www.hackmageddon.com/2019/01/15/2018- a- year- of-cyber-attacks/ (visited on 05/08/2019).

[14] Ian H Witten et al. Data Mining: Practical machine learning tools and techniques.Morgan Kaufmann, 2016.

[15] This text provides general information Statista assumes no liability for theinformation given being complete or correct Due to varying update cyclesand Statistics Can Display More up-to-Date Data Than Referenced in theText. Topic: Internet usage worldwide. en. URL: https://www.statista.com/topics/1145/internet-usage-worldwide/ (visited on 06/04/2019).

Page 89: Analyzing Radial Basis Function Neural Networks for

BIBLIOGRAPHY 71

[16] User-generated internet content per minute 2019 | Statistic. en. URL: https://www.statista.com/statistics/195140/new-user-generated-content-uploaded-by-users-per-minute/ (visited on 06/04/2019).

[17] Public DNS. en. URL: https://developers.google.com/speed/public-dns/ (visited on 06/04/2019).

[18] Preston Gralla. How the Internet Works. en. Google-Books-ID: iCMCwXLLd-scC. Que Publishing, 1998. ISBN: 978-0-7897-1726-9.

[19] HS Venter and JHP Eloff. “Network security: Important issues”. In: NetworkSecurity 2000.6 (2000), pp. 12–16.

[20] Paweł Strumiłło and Władysław Kaminski. “Radial basis function neuralnetworks: theory and applications”. In: Neural Networks and Soft Computing.Springer, 2003, pp. 107–119.

[21] Elike Hodo et al. “Machine Learning Approach for Detection of nonTorTraffic”. In: arXiv:1708.08725 [cs] (2017). arXiv: 1708.08725, pp. 1–6. DOI: 10.1145/3098954.3106068. URL: http://arxiv.org/abs/1708.08725 (visited on 07/12/2018).

[22] Anirudh Janagam and Saddam Hossen. Analysis of Network Intrusion Detec-tion System with Machine Learning Algorithms (Deep Reinforcement LearningAlgorithm). 2018.

[23] Swati Paliwal and Ravindra Gupta. “Denial-of-Service, Probing & Remoteto User (R2L) Attack Detection using Genetic Algorithm”. In: 2012.

[24] Christiaan Beek, Raj Samani, and Alexandre Mundo Alguacil. McAfee LabsThreats Report December 2018. en. Tech. rep. McAfee LLC, Dec. 2018, p. 34.URL: https://www.mcafee.com/enterprise/en-us/assets/reports/rp-quarterly-threats-dec-2018.pdf (visited on 03/02/2018).

[25] Types of Attacks on Routers. en. URL: https://smallbusiness.chron.com/types-attacks-routers-71576.html (visited on 01/12/2019).

[26] Felix Lau et al. “Distributed denial of service attacks”. In: Smc 2000 confer-ence proceedings. 2000 ieee international conference on systems, man and cyber-netics.’cybernetics evolving to systems, humans, organizations, and their complexinteractions’(cat. no. 0. Vol. 3. IEEE. 2000, pp. 2275–2280.

[27] Dimitris Gavrilis and Evangelos Dermatas. “Real-time detection of distributeddenial-of-service attacks using RBF networks and statistical features”. en.In: Computer Networks 48.2 (June 2005), pp. 235–245. ISSN: 13891286. DOI:10.1016/j.comnet.2004.08.014. URL: http://linkinghub.elsevier.com/retrieve/pii/S1389128604003445 (visited on 12/15/2018).

Page 90: Analyzing Radial Basis Function Neural Networks for

72 BIBLIOGRAPHY

[28] Amirhossein Gharib et al. “An evaluation framework for intrusion detec-tion dataset”. In: 2016 International Conference on Information Science and Se-curity (ICISS). IEEE. 2016, pp. 1–6.

[29] “Browser Network Attack Methods Solution Brief”. en. In: (Apr. 2015), p. 3.URL: https://www.mcafee.com/enterprise/en-us/assets/solution-briefs/sb-browser-network-attack-methods.pdf(visited on 02/03/2019).

[30] Daniel J Bernstein. “Understanding brute force”. In: Workshop Record of ECRYPTSTVL Workshop on Symmetric Key Encryption, eSTREAM report. Vol. 36. Cite-seer. 2005, p. 2005.

[31] L. Zhang et al. “A Survey on Latest Botnet Attack and Defense”. In: 2011IEEE10th International Conference on Trust, Security and Privacy in Computing andCommunications. Nov. 2011, pp. 53–60. DOI: 10.1109/TrustCom.2011.11.

[32] Chris Roedel. “Detection and Characterization of Port Scan Attacks”. en.In: (), p. 7.

[33] Shashank Gupta and B. B. Gupta. “XSS-SAFE: A Server-Side Approach toDetect and Mitigate Cross-Site Scripting (XSS) Attacks in JavaScript Code”.In: Arabian Journal for Science and Engineering 41.3 (Mar. 2016), pp. 897–920.ISSN: 2191-4281. DOI: 10.1007/s13369-015-1891-7. URL: https://doi.org/10.1007/s13369-015-1891-7.

[34] William G Halfond, Jeremy Viegas, Alessandro Orso, et al. “A classificationof SQL-injection attacks and countermeasures”. In: Proceedings of the IEEEInternational Symposium on Secure Software Engineering. Vol. 1. IEEE. 2006,pp. 13–15.

[35] Lawrence Orans, Jeremy D’Hoinne, and Sanjit Ganguli. Gartner Reprint. en.Feb. 2019. URL: https://www.gartner.com/doc/reprints?id=1-6BR878V&ct=190305&st=sb&__hstc=&__hssc=&hsCtaTracking=95770f83-6358-4851-9125-c31068d93b79%7Cfede4ca9-da98-4d03-a5ec-143a70908fa3 (visited on 05/19/2019).

[36] The data science behind Vectra AI threat detection models. en. Vectra AI Inc.,2017. URL: https://info.vectranetworks.com/hubfs/White_Papers/the-data-science-behind-vectra-ai-threat-detection-models.pdf (visited on 01/12/2019).

[37] Machine Learning- A Higher Level of Automation[4678].pdf. Darktrace Ltd.,2016.

Page 91: Analyzing Radial Basis Function Neural Networks for

BIBLIOGRAPHY 73

[38] Ankur A. Patel. Hands-On Unsupervised Learning Using Python: How to BuildApplied Machine Learning Solutions from Unlabeled Data. en. "O’Reilly Media,Inc.", Feb. 2019. ISBN: 978-1-4920-3561-9.

[39] Simo Särkkä. Bayesian filtering and smoothing. Vol. 3. Cambridge UniversityPress, 2013.

[40] Ronald Dekker. “The importance of having data-sets.” en. In: (), p. 5.

[41] Razan Abdulhammed et al. “Features Dimensionality Reduction Approachesfor Machine Learning Based Network Intrusion Detection”. en. In: Electron-ics 8.3 (Mar. 2019), p. 322. ISSN: 2079-9292. DOI: 10.3390/electronics8030322.URL: https://www.mdpi.com/2079-9292/8/3/322 (visited on05/03/2019).

[42] Constantinos Kolias et al. “Intrusion detection in 802.11 networks: empir-ical evaluation of threats and a public dataset”. In: IEEE CommunicationsSurveys & Tutorials 18.1 (2016), pp. 184–208.

[43] Markus Ring et al. “Flow-based benchmark data sets for intrusion detec-tion”. In: Proceedings of the 16th European Conference on Cyber Warfare andSecurity. ACPI. 2017, pp. 361–369.

[44] Douglas WFL Vilela et al. “A dataset for evaluating intrusion detection sys-tems in IEEE 802.11 wireless networks”. In: 2014 IEEE Colombian Conferenceon Communications and Computing (COLCOM). Ieee. 2014, pp. 1–5.

[45] Muhamad Erza Aminanto and Kwangjo Kim. “Improving detection of Wi-Fi impersonation by fully unsupervised deep learning”. In: InternationalWorkshop on Information Security Applications. Springer. 2017, pp. 212–223.

[46] Muhamad Erza Aminanto and Kwangjo Kim. “Detecting active attacks inWi-Fi network by semi-supervised deep learning”. In: Conference on Infor-mation Security and Cryptography 2017 Winter. 2016.

[47] Ranjit Panigrahi and Samarjeet Borah. “A detailed analysis of CICIDS2017dataset for designing Intrusion Detection Systems”. In: 7 (Jan. 2018), pp. 479–482.

[48] Archiveddocs. SQL Injection. en-us. URL: https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms161953(v%3dsql.105) (visited on 05/07/2019).

[49] Applications | Research | Canadian Institute for Cybersecurity | UNB. URL:https://www.unb.ca/cic/research/applications.html#CICFlowMeter (visited on 05/07/2019).

Page 92: Analyzing Radial Basis Function Neural Networks for

74 BIBLIOGRAPHY

[50] Mingyi Zhu et al. “A Deep Learning Approach for Network Anomaly De-tection Based on AMF-LSTM”. In: IFIP International Conference on Networkand Parallel Computing. Springer. 2018, pp. 137–141.

[51] Davy Preuveneers et al. “Chained Anomaly Detection Models for Feder-ated Learning: An Intrusion Detection Case Study”. en. In: Applied Sciences8.12 (Dec. 2018), p. 2663. ISSN: 2076-3417. DOI: 10.3390/app8122663.URL: http://www.mdpi.com/2076-3417/8/12/2663 (visited on03/30/2019).

[52] Qichao Que and Mikhail Belkin. “Back to the Future: Radial Basis FunctionNetworks Revisited.” In: AISTATS. 2016, pp. 1375–1383.

[53] Dr. Sebastian Raschka. Predictive modeling, supervised machine learning, andpattern classification. Blog. Aug. 2014. URL: https://sebastianraschka.com/Articles/2014_intro_supervised_learning.html (visitedon 05/16/2019).

[54] Yirey Suh et al. “A Comparison of Oversampling Methods on ImbalancedTopic Classification of Korean News Articles”. en. In: Journal of CognitiveScience 18.4 (Dec. 2017), pp. 391–437. ISSN: 1598-2327. DOI: 10.17791/jcs.2017.18.4.391. URL: http://www.kci.go.kr/kciportal/landing/article.kci?arti_id=ART002309598 (visited on 04/24/2019).

[55] N. V. Chawla et al. “SMOTE: Synthetic Minority Over-sampling Technique”.en. In: Journal of Artificial Intelligence Research 16 (June 2002), pp. 321–357.ISSN: 1076-9757. DOI: 10.1613/jair.953. URL: https://jair.org/index.php/jair/article/view/10302 (visited on 04/24/2019).

[56] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal ofMachine Learning Research 12 (2011), pp. 2825–2830.

[57] Farshad Fathian et al. “Hybrid models to improve the monthly river flowprediction: Integrating artificial intelligence and non-linear time series mod-els”. In: Journal of Hydrology (2019).

[58] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. “Improving deepneural networks for LVCSR using rectified linear units and dropout”. In:2013 IEEE international conference on acoustics, speech and signal processing.IEEE. 2013, pp. 8609–8613.

[59] François Chollet. Deep Learning with Python. Manning, Nov. 2017. ISBN: 9781617294433.

Page 93: Analyzing Radial Basis Function Neural Networks for

BIBLIOGRAPHY 75

[60] C. Madhusudhana Rao and M. M. Naidu. “A Model for Generating Syn-thetic Network Flows and Accuracy Index for Evaluation of Anomaly Net-work Intrusion Detection Systems”. en. In: Indian Journal of Science and Tech-nology 10.14 (Apr. 2017). ISSN: 0974 -5645. DOI: 10.17485/ijst/2017/v10i14 / 106786. URL: http : / / www . indjst . org / index . php /indjst/article/view/106786 (visited on 07/12/2018).

[61] M. Almseidin et al. “Evaluation of machine learning algorithms for intru-sion detection system”. In: 2017 IEEE 15th International Symposium on In-telligent Systems and Informatics (SISY). Sept. 2017, pp. 000277–000282. DOI:10.1109/SISY.2017.8080566.

[62] Jason Brownlee. “Classification accuracy is not enough: More performancemeasures you can use”. In: Machine Learning Mastery 21 (2014).

[63] Stephen Marsland. Machine learning: an algorithmic perspective. Chapmanand Hall/CRC, 2011.

[64] Ling Zhang and Bo Zhang. “A geometrical representation of McCulloch-Pitts neural model and its applications”. In: IEEE Transactions on NeuralNetworks 10.4 (1999), pp. 925–929.

[65] Adeel Ahmad. Activation functions. Nov. 2017. URL: https://github.com/adl1995/adl1995.github.io/blob/master/notebooks/Activation%20functions.ipynb (visited on 05/18/2019).

[66] V Spruyt. “The Curse of Dimensionality in classification”. In: Computer Vi-sion for Dummies 21.3 (2014), pp. 35–40.

[67] Bekir Karlik and A Vehbi Olgac. “Performance analysis of various activa-tion functions in generalized MLP architectures of neural networks”. In:International Journal of Artificial Intelligence and Expert Systems 1.4 (2011),pp. 111–122.

[68] About Us | Techson. en-US. Mar. 2018. URL: https://se.linkedin.com/company/techson-ab (visited on 03/19/2018).

[69] CogniMem Technologies Inc. •Cognitive Computing Technology & Pattern Recog-nition Chip. URL: http://www.cognimem.com/technology/index.html (visited on 01/10/2019).

[70] CogniMem Technologies Inc. • Products: Chips & Modules: CM1K Chip. Mar.2018. URL: http://www.cognimem.com/products/chips-and-modules/CM1K-Chip/ (visited on 03/19/2018).

Page 94: Analyzing Radial Basis Function Neural Networks for

76 BIBLIOGRAPHY

[71] M. Suri et al. “Neuromorphic Hardware Accelerated Adaptive Authentica-tion System”. In: 2015 IEEE Symposium Series on Computational Intelligence.Dec. 2015, pp. 1206–1213. DOI: 10.1109/SSCI.2015.173.

[72] CORTEX Systems. CM1k Breakout Board – Neuromorphic Chip | OpenHard-ware.io - Enables Open Source Hardware Innovation. URL: https://www.openhardware.io/view/208/CM1k-Breakout-Board-Neuromorphic-Chip (visited on 05/25/2019).

[73] Donghwoon Kwon et al. “A survey of deep learning-based network anomalydetection”. en. In: Cluster Computing (Sept. 2017), pp. 1–13. ISSN: 1386-7857,1573-7543. DOI: 10.1007/s10586-017-1117-8. URL: https://link.springer.com/article/10.1007/s10586-017-1117-8 (visited on02/22/2018).

[74] Nelson E Hastings and Paul A McLean. “TCP/IP spoofing fundamentals”.In: Conference Proceedings of the 1996 IEEE Fifteenth Annual International PhoenixConference on Computers and Communications. IEEE. 1996, pp. 218–224.

[75] imblearn.over_sampling.SMOTE — imbalanced-learn 0.4.3 documentation. URL:https : / / imbalanced - learn . readthedocs . io / en / stable /generated / imblearn . over _ sampling . SMOTE . html (visited on04/24/2019).

[76] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic opti-mization”. In: arXiv preprint arXiv:1412.6980 (2014).

[77] Yurii Shevchuk. Hyperparameter optimization for Neural Networks — NeuPy.URL: http://neupy.com/2016/12/17/hyperparameter_optimization_for_neural_networks.html (visited on 05/27/2019).

[78] Elad Hazan, Adam Klivans, and Yang Yuan. “Hyperparameter optimiza-tion: A spectral approach”. In: arXiv preprint arXiv:1706.00764 (2017).

[79] Taisuke Hojo. Quality Management Systems - Process Validation Guidance. Feb.2004. URL: http : / / www . imdrf . org / docs / ghtf / final / sg3 /technical-docs/ghtf-sg3-n99-10-2004-qms-process-guidance-04010.pdf (visited on 05/22/2019).

[80] IEEE Standard for System and Software Verification and Validation. Tech. rep.IEEE. DOI: 10.1109/IEEESTD.2012.6204026. URL: http://ieeexplore.ieee.org/document/6204026/ (visited on 05/22/2019).

[81] B. W. Boehm. “Software Engineering Economics”. In: IEEE Transactions onSoftware Engineering SE-10.1 (Jan. 1984), pp. 4–21. ISSN: 0098-5589. DOI: 10.1109/TSE.1984.5010193.

Page 95: Analyzing Radial Basis Function Neural Networks for

BIBLIOGRAPHY 77

[82] Tiantian Xie, Hao Yu, and Bogdan Wilamowski. “Comparison between tra-ditional neural networks and radial basis function networks”. In: 2011 IEEEInternational Symposium on Industrial Electronics. IEEE. 2011, pp. 1194–1199.

[83] Saeid Soheily-Khah, Pierre-François Marteau, and Nicolas Béchet. “Intru-sion detection in network systems through hybrid supervised and unsu-pervised mining process - a detailed case study on the ISCX benchmarkdataset -”. en. In: (), p. 16.

[84] Yang Yu, Jun Long, and Zhiping Cai. Network Intrusion Detection throughStacking Dilated Convolutional Autoencoders. en. Research article. 2017. DOI:10.1155/2017/4184196. URL: https://www.hindawi.com/journals/scn/2017/4184196/abs/ (visited on 07/12/2018).

[85] Thomas M Cover, Peter E Hart, et al. “Nearest neighbor pattern classifica-tion”. In: IEEE transactions on information theory 13.1 (1967), pp. 21–27.

[86] Lipo Wang. Support vector machines: theory and applications. Vol. 177. SpringerScience & Business Media, 2005.

[87] Mrutyunjaya Panda and Manas Ranjan Patra. “Network intrusion detec-tion using naive bayes”. In: International journal of computer science and net-work security 7.12 (2007), pp. 258–263.

[88] J. R. Quinlan. “Induction of decision trees”. en. In: Machine Learning 1.1(Mar. 1986), pp. 81–106. ISSN: 0885-6125, 1573-0565. DOI: 10.1007/BF00116251.URL: http://link.springer.com/10.1007/BF00116251 (visitedon 05/24/2019).

Page 96: Analyzing Radial Basis Function Neural Networks for

78 BIBLIOGRAPHY

Page 97: Analyzing Radial Basis Function Neural Networks for
Page 98: Analyzing Radial Basis Function Neural Networks for

TRITA-EECS-EX-2019:258

www.kth.se