Copyright 2015, Lakshmi Sindhuri Juturu
Applying Big Data Analytics on Integrated Cybersecurity Datasets
by
Lakshmi Sindhuri Juturu, B.Tech
A Thesis
In
Computer Science
Submitted to the Graduate Faculty
of Texas Tech University in
Partial Fulfillment of
the Requirements for
the Degree of
Master of Science
Approved
Dr. Susan D. Urban
Co-chair of Committee
Dr. Yong Chen
Co-chair of Committee
Mark Sheridan
Dean of the Graduate School
May, 2015
Texas Tech University, Lakshmi Sindhuri Juturu, May 2015
ACKNOWLEDGMENTS
I would like to express my deepest gratitude to my advisor, Dr. Susan D.
Urban, for her untiring support, persistent guidance, and encouragement, without
which this work would not have been possible. I owe her more than I can say for
giving me the opportunity to work under her valuable supervision. This experience
has helped me grow professionally and personally, and become a better person. I
gained technical and functional knowledge, as well as interpersonal skills, that will
stay with me throughout my life. The freedom she gave me is the key factor behind
the success of this work.
I would also like to thank my co-advisor and committee member, Dr. Yong
Chen, for his timely help, support, and advice. I cannot thank him enough for always
ensuring that my journey through the thesis was hassle-free.
I would like to thank my parents, JVNS Prasada Rao and Narayani; my sister,
Raga Madhuri; my husband, Harshvardhan Gazula; and my grandparents. Without
their patience, guidance, understanding, support, and most of all unconditional love
and care, I wouldn’t be what I am today.
I would like to thank my colleagues and very special friends, Mr. Narasimha
Inukollu and Ms. Kalaranjani Vijayakumar, for their constant support and
reassurance. Without their help and suggestions, I would not have been able to put
everything together.
TABLE OF CONTENTS
ACKNOWLEDGMENTS .................................................................................... ii
ABSTRACT .......................................................................................................... iv
LIST OF TABLES ................................................................................................ v
LIST OF FIGURES ............................................................................................. vi
I. INTRODUCTION ............................................................................................. 1
II. RELATED WORK .......................................................................................... 7
Background on Cybersecurity and Big Data Computing .................................. 7
HBase Schema Design Issues for Storing Datasets ........................................ 10
Tools and Methods for Anomaly Detection using HBase .............................. 12
Data Mining and Machine Learning Approaches ........................................... 15
III. SELECTION AND PREPARATION OF CYBERSECURITY
DATASETS .......................................................................................................... 21
IV. HBASE DESIGN ALTERNATIVES FOR STORING DATASETS ....... 27
Use Case Requirements .................................................................................. 27
Design Alternatives ......................................................................................... 28
Design Evaluation ........................................................................................... 32
V. DATA MINING AND MACHINE LEARNING APPLIED TO
CYBERSECURITY DATASETS ...................................................................... 37
Use of Logistic Regression ............................................................................. 37
Use of Fuzzy k-Means .................................................................................... 44
Comparison and Analysis of LR and FKM Algorithms ................................. 49
VI. CONCLUSION AND FUTURE WORK .................................................... 51
Conclusion ...................................................................................................... 51
Future Work .................................................................................................... 51
BIBLIOGRAPHY ............................................................................................... 53
ABSTRACT
With the growing prevalence of cyber threats in the world, various security
monitoring systems are being employed to protect networks and resources from
cyber attacks. The large network datasets generated by these monitoring systems
need an efficient design for integrating and processing them quickly. In this research,
a storage design scheme has been developed using HBase
and Hadoop that can efficiently integrate, store, and retrieve security-related datasets.
The design scheme is a value-based data integration approach, where data is integrated
by columns instead of by rows. Since rowkeys are the most important aspect of HBase
table design and performance, a rowkey design was chosen based on the most
frequently accessed columns associated with use cases for the retrieval of the dataset
statistics. Tests conducted on various schema design alternatives show that the model
designed as part of this research stores and retrieves datasets at a higher rate than the
standard method of storing data in HBase.
Network datasets representing DDoS attacks have been used for integration in
this research. Use case requirements related to the characteristics of attacker IP
addresses in the integrated datasets have been identified in order to generate
statistical data. This statistical data was used to run the Logistic Regression (LR)
classification algorithm for classifying the network traffic data into attack-related and
non-attack related traffic. The Fuzzy k-Means (FKM) algorithm was also used to
create clusters of attackers and non-attackers to segregate the attack-related traffic from
the network datasets. The results obtained from the two algorithms show that both LR
and FKM algorithms can successfully classify the network traffic datasets into
attackers and non-attackers.
LIST OF TABLES
3.1 Sample table structure of ‘from-victim’ dataset converted to .csv format 24
4.1 Test results of loading and running one use case on a single DDoS
attack data file using Hadoop Cluster ........................................... 33
5.1 Sample values (poor results) of the model generated using a
single dataset ................................................................................. 41
5.2 Snapshot of the input dataset prepared for LR algorithm ......................... 42
5.3 Sample values of the model generated using incremented
variables and datasets .................................................................... 43
5.4 Sample values of the model evaluated on the complete input
datasets .......................................................................................... 44
LIST OF FIGURES
3.1 Snapshot of the DDoS attack dataset accessed through
Wireshark ...................................................................................... 22
3.2 Cross-mapping of Source and Destination columns of from-
victim and to-victim files .............................................................. 25
4.1 Method of storing datasets in rows table followed by field tables ............ 29
4.2 Architectural design to identify attackers from network traffic ................ 32
4.3 Avg. use case runtimes of individual passes for each schema
design ............................................................................................ 33
4.4 Time taken to load datasets in HVID and HBaseSchema models ............ 34
4.5 Avg. time taken to run use cases in Standard HBase, HVID and
HBaseSchema models ................................................................... 36
4.6 The three clusters formed using FKM algorithm ...................................... 49
CHAPTER I
INTRODUCTION
With the advent of big data technology, many industrial problems and
challenges related to large volumes of data are now being addressed. Many
industries and companies are able to analyze and process volumes of data that were
once beyond their capability. While many domains have benefited from the use of
big data technologies, cybersecurity is one field that is just beginning to explore the
advantages of big data analytics. The ability to detect and stop cyber attacks can make
or break an enterprise (Harper, 2013). By means of big data, organizations may be
able to detect threats more rigorously, create more defense mechanisms, and improve
security.
Prior to the arrival of big data storage, most security systems were
dedicated to a single type of threat detection. SIEM (Security Information and Event
Management) systems (Cardenas et al., 2013) do exist that are capable of analyzing
data from several log files, but such systems are limited in the amount of data they can
handle. With systems such as Hadoop (Hadoop, 2005), cybersecurity data can now be
stored in a dedicated repository that can not only accommodate more than three
months of data but also combine and analyze real-time data together with historical
data. Big data analytics can be run over long-term patterns to detect advanced
persistent threats (APTs) that manifest over time.
Big data analytics plays an important role in detecting advanced threats and
insider threats (Gartner, 2014). Monitoring systems can potentially minimize false
alarms through smarter analytics. Data analytics can assist such systems by merging
the internal data they collect with relevant external data to detect known patterns and
stay ahead of malicious activities or intruders. Currently, 8% of major global
companies (Gartner, 2014) have adopted big data analytics for one or more use cases
related to security and fraud detection; Gartner predicted that within a year this figure
will increase to 25%, with a positive return on investment within six months of
implementation. Data analysis should be intelligent and timely, as anything that is
delayed will lose its value, especially in the field of cybersecurity. Given that hackers
are well aware of the security and fraud detection measures employed by enterprises,
they are able to attack directly without any reconnaissance phase. Hence, to stay a
step ahead, enterprises can use big data analytics to improve their monitoring and
detection systems with contextual data and apply smarter analytics. Data correlation
techniques can be applied across high-priority alerts and monitoring systems to detect
patterns and obtain a bigger picture of the state of security. Enterprises can also opt
for fast tuning of their rules and models to test against data streaming in close to real
time.
The Teradata report (Ponemon, 2013), based on a survey conducted by
Teradata, states that the traditional methods that fall short in detecting and preventing
threats can be enhanced with big data analytics. Many big data tools and techniques
have emerged that can efficiently handle the volume and complexity of varied kinds
of data, such as machine-generated and network-related data. Hence, big data systems
are becoming part of the cyber defense strategy of enterprises to meet the needs of
complex and large-scale analytics.
A concern with the cybersecurity monitoring process is that when multiple
security monitoring systems are employed, each generating numerous log files (such
as security logs and network traffic logs), there is no well-established system that can
identify the relationships among these log files and integrate them. These related log
files could potentially be useful for identifying attack-related patterns that help in the
early detection of APTs or other malicious attacks. The work in (Labrinidis, 2012)
identifies the challenges in dealing with big data analysis, such as automating the
whole process of locating, identifying, and understanding the data. A suitable
database design is required even when analyzing a single dataset. Similarly, mining
requires data to be integrated, cleaned, and efficiently accessible, which involves the
use of effective mining algorithms and big data computing environments. Labrinidis
(Labrinidis, 2012) also describes that significant research is required in order to
achieve automated integration of data sets. It is also essential that effective mining
techniques are used to extract information from large datasets.
The objective of this research is to investigate the manner in which a system
such as HBase (HBase, 2006) can be used to support the integration and mining of
security-related datasets. In particular, this research aims to 1) create a design for the
integration of cybersecurity datasets in a system developed using Hadoop and HBase,
and 2) experiment with data mining or machine learning techniques on the integrated
files, which may potentially help in identifying attack patterns and segregating attack-
related traffic from the normal traffic.
A unique approach is designed using HBase, which can effectively support the
integration of multiple network-related datasets. The approach is based on the initial
work done by Stearns (Stearns, 2014) that describes a value-based data integration
approach, where data is integrated by columns instead of rows. Separate tables are
pooled into a common collection of columns, which can be effectively treated as one
single table possessing all fields. This work is inspired by neural networks (Nielsen,
2001) and involves an inversion of the column-table structure: data from all sources
is stored by value instead of by row ID, with the values themselves serving as the
keys (row IDs) of the column table. Using this approach, tables can be queried for
rows by direct lookup of column values without scanning the
entire table. Furthermore, an inner join operation using a shared column among two
different tables can be performed using already collected row IDs and grouping them
as required. Also, data can be quickly merged without having to worry about what
columns are used for merging. MapReduce operations can also be performed easily on
the merged data.
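To make the idea concrete, the value-based inversion can be sketched in plain Python, with dictionaries standing in for HBase tables. The table names, columns, and values below are hypothetical illustrations, not the thesis's actual schema:

```python
from collections import defaultdict

def build_value_index(tables):
    """Invert the column-table structure: map each (column, value) pair
    to the set of (table_id, row_id) locations containing it, so rows
    can be found by direct value lookup without scanning any table."""
    index = defaultdict(set)
    for table_id, rows in tables.items():
        for row_id, row in enumerate(rows):
            for column, value in row.items():
                index[(column, value)].add((table_id, row_id))
    return index

def join_on_column(index, column):
    """Sketch of an inner join on a shared column: group the already
    collected row IDs by value, keeping values seen in 2+ tables."""
    groups = {}
    for (col, value), locations in index.items():
        if col == column and len({t for t, _ in locations}) > 1:
            groups[value] = sorted(locations)
    return groups

# Hypothetical 'from-victim' and 'to-victim' style tables.
tables = {
    "from_victim": [{"src": "10.0.0.1", "victim": "10.0.0.9"},
                    {"src": "10.0.0.2", "victim": "10.0.0.9"}],
    "to_victim":   [{"dst": "10.0.0.1", "victim": "10.0.0.9"}],
}
index = build_value_index(tables)
hits = index[("src", "10.0.0.1")]          # direct lookup, no table scan
joined = join_on_column(index, "victim")   # rows sharing a victim IP
```

The dictionary keyed by value plays the role that the value-leading rowkey plays in the HBase design: lookup by value is a direct access rather than a scan.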
In this research, two network-related data files are integrated by identifying
and mapping the relationships between the files. Other design alternatives are taken
into consideration. Since rowkeys are the most important aspect of HBase table
design, and the performance of data extraction from an HBase table is affected by the
row keys (Khurana, 2012), different rowkeys are considered as per the access patterns
required for this work. These data files mainly deal with IP addresses and timestamps.
Hence, potential schema designs are chosen by selecting the most frequently accessed
columns and their combinations as a part of the rowkey and testing performance using
each rowkey. Major use cases such as performing read operations, write operations
and the selection of a particular range of data are tested thoroughly to identify the best
possible schema design that can be attributed to each use case.
For this research, DDoS datasets have been used that are publicly available
from CAIDA (Center for Applied Internet Data Analysis), (The CAIDA, 2007).
CAIDA is a collaborative undertaking among government, research and commercial
organizations to promote research and development in providing robust and secure
internet infrastructure. This organization provides datasets for possible research
purposes while preserving the privacy of individuals and of organizations that donated
the data. The repository contains over a decade of datasets covering various attacks,
such as DDoS (DDOS Attack, 2007), the Code-Red virus (Danyliw, 2001), Conficker
(Conficker Worm, 2008), and the Witty worm (Shannon and Moore, 2004), along with
monitored logs and network traffic datasets. Among these, the DDoS attack datasets
are the focus of this research. The DDoS type of attack attempts to block access to the targeted server by
consuming all the resources on the server and the bandwidth of the network that is
connecting the server to the internet. This DDoS attack dataset contains approximately
one hour of traffic traces that have been anonymized from the original attack. This
entire dataset is divided into individual files, each containing a five-minute interval of
data in the format of .pcap (packet capture) files.
Once the datasets are successfully integrated, data mining techniques are
applied to the statistical data obtained from the integrated datasets. These analytics
help in identifying attack-related traffic among normal traffic and in extracting
attack patterns. The Fuzzy k-Means (FKM) clustering algorithm and the Logistic
Regression (LR) classification algorithm have been performed on the time-related and
connection-related data obtained from the integrated datasets. Since the DDoS datasets
collected from CAIDA contain purely attack-related data, the files have been merged
with normal network traffic, captured on a desktop in the TTU network using
the Wireshark tool. Individual tests were performed on the attack traffic, normal traffic
and on the combination of both the data files. The FKM algorithm was used to create
attacker and non-attacker clusters. For the LR algorithm, inferences drawn initially are
validated against the existing black listings (malicious information) and white listings
from the DNS lookup as part of a training phase. The model generated using the
sample dataset in the training phase is used for running the testing phase on the actual
dataset. Different models are generated by changing the key parameters and the testing
phase is repeated several times for accuracy and efficiency in results. The results
obtained from both algorithms have been validated against each other to verify
the attack-related traffic.
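As an illustration of the training-and-testing idea only (not the thesis's actual implementation or feature set), a minimal logistic regression classifier can be written in pure Python. The per-IP features, their scaling, and the labels below are invented for the sketch:

```python
import math

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid math.exp overflow
    return 1.0 / (1.0 + math.exp(-z))

def train_lr(xs, ys, lr=0.5, epochs=3000):
    """Fit weights (plus a bias term) by batch gradient descent on the
    cross-entropy loss; xs are feature vectors, ys are 0/1 labels."""
    w = [0.0] * (len(xs[0]) + 1)  # last entry is the bias
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1]) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi
            grad[-1] += err
        w = [wi - lr * gi / len(xs) for wi, gi in zip(w, grad)]
    return w

def classify(w, x):
    """1 = attacker, 0 = non-attacker."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + w[-1]) >= 0.5 else 0

# Training phase: hypothetical per-IP statistics (scaled request rate,
# mean inter-arrival time), labeled from black and white listings.
train_x = [(0.90, 0.4), (0.80, 0.5), (0.05, 1.2), (0.03, 1.5)]
train_y = [1, 1, 0, 0]
w = train_lr(train_x, train_y)

# Testing phase: classify unseen IP statistics with the trained model.
verdict = classify(w, (0.85, 0.45))
```

The model generated in the training phase (the weight vector `w`) is then reused on the full dataset, mirroring the train-then-test workflow described above.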
Initially, an HBase storage model designed by Stearns (Stearns, 2014) was
used for storing the DDoS attack datasets; it took 304338 ms to load 220k rows of
data and an average time of 4005.7 ms to query the total number of requests made by
each IP address. This model was improved as part of this research with a new schema
design that took 262682 ms to load 220k rows of data and an average time of 3863.1
ms to run the same query. The statistical data retrieved from the improved HBase
model for all the unique IP addresses (41 in total) was first sent to the LR algorithm to
classify the attackers and non-attackers from the data. The LR algorithm accurately
classified 34 IP addresses as attackers and the remaining 7 IP addresses as non-
attackers from the total 41 unique IP addresses. The statistical data was also sent to the
FKM algorithm for creating attacker and non-attacker clusters. The FKM algorithm
created a cluster-1 of 11 attackers, cluster-2 of 21 attackers, and a cluster-3 of 7 non-
attackers, with clusters 2 and 3 sharing 2 attackers, as the FKM algorithm is known for
creating soft clusters. Although both algorithms gave similar results, the LR
algorithm provided a complete and accurate classification identifying attack-related
traffic from the network traffic.
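A minimal fuzzy k-means (fuzzy c-means) sketch on invented one-dimensional request-rate statistics, not the thesis's actual data, illustrates how FKM's soft memberships let a point belong partially to more than one cluster, as observed in the shared points between clusters 2 and 3:

```python
def fuzzy_k_means(points, k, m=2.0, iters=50):
    """Minimal fuzzy k-means on 1-D data. Returns (centers, u), where
    u[i][j] is the degree to which point i belongs to cluster j; each
    row of u sums to 1, so memberships are "soft"."""
    lo, hi = min(points), max(points)
    # Deterministic init: spread centers evenly over the data range.
    centers = [lo + j * (hi - lo) / (k - 1) for j in range(k)]
    u = []
    for _ in range(iters):
        # Membership update: u_ij = 1 / sum_c (d_ij / d_ic)^(2/(m-1))
        u = []
        for p in points:
            dists = [abs(p - c) or 1e-12 for c in centers]
            u.append([1.0 / sum((dj / dc) ** (2.0 / (m - 1.0))
                                for dc in dists) for dj in dists])
        # Center update: mean of points weighted by u_ij ** m
        centers = [sum(u[i][j] ** m * points[i] for i in range(len(points)))
                   / sum(u[i][j] ** m for i in range(len(points)))
                   for j in range(k)]
    return centers, u

# Hypothetical per-IP request rates: two high-rate attacker groups and
# one low-rate non-attacker group.
rates = [95.0, 96.0, 97.0, 60.0, 61.0, 62.0, 2.0, 3.0, 4.0]
centers, u = fuzzy_k_means(rates, k=3)
```

Because every row of `u` sums to 1 across clusters, a borderline point carries non-trivial membership in two clusters at once, which is why soft clustering can report overlapping attacker groups.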
Since different network security systems and tools generate log files in their
own formats, each of which might contain bits and pieces of attack information, it is
beneficial to have an integrated set of data that can be analyzed for attack patterns.
This research demonstrates the integration and analysis of datasets for identifying
attack-related traffic that can potentially lead to easier threat detection in cases where
attacks occur on multiple targets.
The remaining chapters of this thesis are organized as follows. Chapter 2
discusses the related work and Chapter 3 explains the selection and preparation of
cybersecurity datasets. Chapter 4 describes the HBase design alternatives that are
considered as part of this work. Chapter 5 presents the findings of the data mining
algorithms that are used on the statistical data generated from the integrated datasets,
while Chapter 6 presents conclusions and future research directions.
CHAPTER II
RELATED WORK
This chapter presents the past and current research that is related to this thesis
work. It provides an overview of 1) background research on cybersecurity and big
data computing, 2) current research work that addresses HBase design issues for storing
and accessing big data, 3) existing tools and methods that use Hadoop and HBase for
anomaly detection, and 4) data mining and machine learning approaches that have
been used for cybersecurity research.
Background on Cybersecurity and Big Data Computing
Many big data systems are enabling the storage and analysis of large
heterogeneous data sets at exceptional scale and speed (Big Data, 2013). These
systems have the potential to provide significant advancements in security intelligence
by reducing the time taken for data consolidation, contextualization of security event
information, and correlation of historical data for forensic purposes. Initially, data is
collected at a massive scale from many internal and external sources. Then, deeper
analytics are performed on the data by providing a consolidated view of security-
related information. Big data analytics can also be employed to analyze financial
transactions, log files, and network traffic in identifying anomalies, suspicious
activities and fraud detection. In this context, the more data that is collected, the
greater value that can be derived from the data. However, there are several challenges
such as privacy challenges, legal challenges and technical issues regarding data
collection, storage and analysis that developers have to overcome for performing
potential big data analytics.
HP Labs has investigated big data analytics for security challenges by
introducing large-scale graph inference and analysis of a large collection of DNS
events, which consists of requests and responses. Large-scale graph inference
identifies malware-infected hosts in a network and maps them with the malicious
domains accessed by those hosts (Big Data, 2013). This information is again validated
using an existing black list and white list to identify the likelihood for the host and
domain to be malicious. This experiment was conducted on billions of HTTP requests,
DNS request records, and NIDS (Network Intrusion Detection Systems) data sets
collected worldwide, and it found that high true positive rates and low false positive
rates were achieved; these results can be used to train anomaly detectors.
DNS events are used to identify botnets or any kind of malicious activity in the
network by deriving domain names, time stamps, and DNS response time-to-live
values. Classification techniques such as decision trees and support vector machines
were then used to identify infected hosts and malicious domains. Although the graph
inferences used here are well suited for handling complex types of data, this approach
can consume a lot of space, and the operations performed on such large amounts of
data can be slow (Sherman, 2014). The research presented in this thesis has focused on
integrating datasets through a unique design using HBase instead of using the
traditional way of storing the datasets in HBase. This approach helps in faster retrieval
of data, as anything that is delayed will lose its value, especially in the case of attack
detection.
Big data analytics has the ability to correlate data from a wide range of data
sources across significant time periods. This helps in reducing false alarms and
improving threat detection even when attacks are mixed with authorized user activities (Virvilis et
al., 2013). Also, the analytics do not have to be performed in real-time. An
organization can always perform the analysis within an acceptable time and provide
warnings to the security professionals about potential attacks. The work in (Virvilis et
al., 2013) also describes the importance of offline analysis along with the real time
data in threat detection. Although analyzing the offline data causes a delay in the
attack detection, it is equally important to consider the time the attackers spend in
reaching their objective. For instance, after gaining the initial access, attackers take
significant time to explore the network, navigate across subnets and identify their
desired location. These steps are performed as stealthily as possible to avoid any
detection. Here, big data analytics plays a key role in identifying the correlation of
events across large time scales and from multiple sources, which is crucial for
detecting sophisticated attacks. To meet these needs, big data analytics supports the
dynamic collection, consolidation and correlation of data from diversified data
sources. Unlike SIEM systems, big data technologies are not limited to performing
correlation within a given time window. In fact, they increase the
scope and quantity of data over which correlation can be performed. These data
correlations result in a lower rate of false positives and increase the probability of
detecting the threats.
Although data analytics performed on real-time data is effective against
traditional attacks, most of the unique characteristics of attacks are not addressed.
Hence, the research in this thesis concentrates on processing the datasets that are
previously generated instead of real-time data. Data correlation is performed among
related datasets in order to identify potential relationships between them. For
instance, the DDoS datasets used as part of this
research contain a number of files, with each file spanning a five-minute interval.
The statistics that are generated from each file are matched with the
remaining files that are being considered to obtain the correlation among different
events at a large scale. These events include the total number of attempts made by
one host to connect to the server in a given time, the number of packets being sent
over the network, and their time-to-live values. Correlation helps in identifying
whether two or more hosts are behaving in a similar way by sending the same type
and number of requests in a given time span. This helps in detecting malicious hosts
that are trying to flood a network or make a server unavailable to
authenticated clients.
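The kind of correlation described above can be sketched as a small Python routine that counts connection attempts per host per five-minute window and flags hosts with identical request profiles. The event data and the exact-match similarity criterion are hypothetical simplifications:

```python
from collections import defaultdict

WINDOW = 300  # five-minute windows, in seconds

def request_profiles(events):
    """Count connection attempts per (host, window) from (host, timestamp)
    event tuples, returning each host's per-window request profile."""
    counts = defaultdict(lambda: defaultdict(int))
    for host, ts in events:
        counts[host][ts // WINDOW] += 1
    return {host: dict(windows) for host, windows in counts.items()}

def similar_hosts(profiles):
    """Group hosts whose per-window request counts match exactly --
    a crude correlation signal for coordinated (e.g., DDoS) behavior."""
    groups = defaultdict(list)
    for host, profile in profiles.items():
        groups[tuple(sorted(profile.items()))].append(host)
    return [sorted(hosts) for hosts in groups.values() if len(hosts) > 1]

# Hypothetical events: two hosts flooding in lockstep, one normal host.
events = ([("10.0.0.1", t) for t in range(0, 600, 2)] +
          [("10.0.0.2", t) for t in range(0, 600, 2)] +
          [("10.0.0.7", t) for t in (30, 400)])
suspects = similar_hosts(request_profiles(events))
```

A real system would use a tolerance-based similarity measure rather than exact profile equality, but the grouping-by-window structure is the same.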
Big data analytics still faces a number of practical limitations. Questions also
arise regarding the authenticity and integrity of the data being used for analytics and
the challenges in securing that data. In addition, good visualization tools are needed
to help analysts understand the data. Research is also needed to build a
complete solution that significantly enhances the detection capabilities for uncovering
threats and malicious activity. Using open source implementations of big data systems,
mainly Hadoop (Hadoop, 2005), to test the execution of more complex detection
algorithms would be beneficial.
HBase Schema Design Issues for Storing Datasets
Open source projects like Hadoop and HBase are common platforms for big
data solutions, where Hadoop is a cross-platform framework, built around a
distributed file system, that allows computationally independent systems to process
enormous amounts of data (Big Data, 2012). HBase is an open-source, NoSQL,
highly reliable, efficient, column-oriented, and scalable distributed database system.
HBase utilizes Hadoop HDFS as its own
storage system and runs Hadoop MapReduce to process huge datasets. It can easily
store large amounts of unstructured data and is great for processing large datasets, as
the Hadoop Distributed File System (HDFS) provides a reliable low-level storage
support for HBase (Zhao et al., 2014).
While HBase provides many features and design choices to the user, the key
to best performance lies in the schema design. In schema (table) design, the emphasis
is particularly placed on rowkey design, as the lack of secondary indexes in HBase
forces the use of the rowkey for sorting (George, 2012). Sequential keys are best for
sequential reads but provide poor performance where writes are concerned.
Similarly, random keys are good for performing writes but provide poor performance
in read operations. Based on the
access pattern, sequential rowkeys, random rowkeys, or even the combination of both
can be chosen. Choosing a good rowkey will improve the read and write performance.
HBase provides many options to choose the rowkey, such as salting, hashing,
randomization and key field swapping techniques, to prevent hot-spotting and other
issues, with each of them having their own pros and cons (HBase, 2006). Hot spotting
is a problem in which most of the clients’ requests are directed to a single node or a
small set of nodes of a large cluster, keeping other nodes idle and wasting its
resources. The problem of hot spotting can be eradicated by distributing the data to
create a well load-balanced cluster that can be achieved by changing the rowkey
design. It is possible that a particular rowkey design which provides best write
performance might give the worst read performance and vice versa. Therefore, it is
necessary to choose good rowkey design based on the requirements.
The salting approach to rowkey design is explained in (HBase, 2006), where
random numbers are generated and each number is prepended to the rowkey, which
helps in writing data to multiple regions. Salting provides effective throughput in
performing write operations, but comes with the cost of bad read performance,
especially in scenarios that require reading the data in lexicographic order and reading
all of the different regions. Other approaches to rowkey design include hashing,
appending column names with the row id, and the combination of both. As a result,
there is no particular rowkey design that is suitable for all cases. Rowkey design
depends upon the use case of the client. Major work has been done on designing
rowkeys based on the data and processing requirement (HBase, 2006). For a use case
related to Log Data and Timeseries Data, certain possible rowkey design approaches
are mentioned in (HBase, 2006), with the given columns Hostname, Timestamp, Log
event, and Value/Message. All of this column data is stored in an HBase table called
LOG_DATA. A possible rowkey design could be a combination of the hostname,
timestamp, and log event columns. The rowkey combination of [timestamp] [hostname]
[log event] will cause a monotonically increasing rowkey problem (HBase, 2006).
Modifying this approach by adding buckets at the front of the key is another rowkey
design. Timestamps are distributed to different buckets by performing the modulus of
the timestamps with the total number of buckets. This technique is mainly useful for
time-oriented scans, but makes it difficult to select data for a particular timestamp
range. The rowkey combination [hostname] [log event] [timestamp] is useful in
scenarios where the search criteria are hostname-centric. The rowkey design
[timestamp] or [reverse timestamp] is used to get the most recently captured data
quickly.
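The bucketing idea above can be sketched as follows; the bucket count, the `|` delimiter, and the field order are illustrative assumptions rather than the exact scheme in (HBase, 2006):

```python
# Sketch of two rowkey-design strategies for time-series log data.
# NUM_BUCKETS and the key layout are illustrative assumptions.

NUM_BUCKETS = 4

def salted_rowkey(timestamp: int, hostname: str, event: str) -> str:
    """Prefix the key with a bucket derived from the timestamp so that
    monotonically increasing timestamps spread across regions."""
    bucket = timestamp % NUM_BUCKETS
    return f"{bucket}|{timestamp}|{hostname}|{event}"

def host_centric_rowkey(timestamp: int, hostname: str, event: str) -> str:
    """Lead with the hostname so scans for one host are contiguous."""
    return f"{hostname}|{event}|{timestamp}"
```

The trade-off described in the text is visible here: the salted key distributes writes but makes range scans by timestamp awkward, while the host-centric key clusters all of a host's events together.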
Based on this approach, this thesis research tried three possible combinations of
Table Id, Row Id, column name, value, and timestamp for designing the rowkey,
monitoring the read and write times for each candidate design. Among the three
choices, the optimal design, with a rowkey composed of the value followed by the
hash of the Table Id and the hash of the Row Id, is chosen for the field tables, and the
hash of the Destination column followed by the hash of the Time column is chosen as the
rowkey for the row tables. Further details about the rowkey design, as well as the
field and row tables, are provided in Chapters 3 and 4.
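A minimal sketch of the value-leading rowkey layout described above; the hash function (MD5 truncated to eight hex characters) and the `|` delimiter are assumptions for illustration, not necessarily the exact choices made in this thesis:

```python
import hashlib

def _h(s: str) -> str:
    # Short, stable hash; the real design may use a different function.
    return hashlib.md5(s.encode()).hexdigest()[:8]

def field_table_rowkey(value: str, table_id: str, row_id: str) -> str:
    """Value first, so a direct lookup by value finds matching rows."""
    return f"{value}|{_h(table_id)}|{_h(row_id)}"

def row_table_rowkey(destination: str, time: str) -> str:
    """Hash of Destination followed by hash of Time, as chosen for row tables."""
    return f"{_h(destination)}|{_h(time)}"
```

Because the value leads the field-table key, a prefix scan on a value retrieves every row containing it without scanning the whole table.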
Tools and Methods for Anomaly Detection using HBase
The network data logs received from each data source have different formats, unless
every device employed in the network to generate the logs comes from a single
vendor. Since that is an uncommon scenario, log files generated in different formats,
sometimes known as 'dirty data', cannot be easily correlated and analyzed as a whole.
To use data from different sources holistically, any service wishing to use the data
needs to examine and restructure each of the disjointed formats separately to suit its
analysis approach (Yen et al., 2013). There are certain tools that use the SIEM
approach for logging and analyzing data from multiple sources and efficiently handle
the big data sets they generate using their own distributed storage and processing
techniques. However, these tools make it difficult to change the security monitoring
methods without losing data unification, and they introduce other dependency
challenges (Big Data, 2013).
To combine multiple network monitoring systems at EMC Corporation, the
Beehive system was designed and implemented to handle the large scale logging data
produced by the systems at EMC (Yen et al., 2013). This system analyzes large
volumes of disparate log data collected in organizations to detect malicious activity,
which includes malware infections and policy violations. As described in (Yen et al.,
2013), a lot of pre-processing of the data must be done before performing analysis,
such as timestamp normalization among different datasets, IP address-to-host
mappings, and detection of static IP addresses and dedicated hosts. However, there are
certain challenges in the implementation, such as the need for efficient data reduction
algorithms for timely detection of critical threats and strategies to focus on security-
relevant information in the logs.
This thesis has designed a value-based data integration approach, where data is
integrated by columns instead of rows. Separate tables are pooled into a common
collection of columns, which can be effectively treated as one single table possessing
all fields. However, to perform the join of different fields by a shared field, it is
necessary to scan the entire table for the values of each field and its corresponding
rows. Hence, a storage approach inspired by neural networks (Nielsen, 2001) was
developed to invert the column-table structure. In this approach, data from all sources
are stored by value instead of by row ID in a column-table; the values act as keys that
map back to row IDs. This approach supports directly querying for rows from any
table through a direct lookup of a column value, without scanning the entire table.
Even an inner join by a shared column can be performed using the already collected
row IDs and grouping them as required.
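The inverted, value-keyed storage idea can be illustrated with an in-memory sketch; the Python dictionaries here stand in for HBase tables and are an assumption for illustration:

```python
from collections import defaultdict

# Sketch of value-keyed (inverted) storage: each column value maps to the
# row IDs that contain it, so lookups and equi-joins need no full table scan.

def build_value_index(table, column):
    """table: {row_id: {column_name: value}} -> {value: {row_id, ...}}"""
    index = defaultdict(set)
    for row_id, row in table.items():
        index[row[column]].add(row_id)
    return index

def inner_join_row_ids(index_a, index_b):
    """Join two tables on a shared column using only their value indexes."""
    return {v: (index_a[v], index_b[v]) for v in index_a.keys() & index_b.keys()}
```

A lookup such as `index_a["1.1.1.1"]` returns the matching row IDs directly, and the join intersects the two key sets instead of scanning either table.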
The research in (Lee and Lee, 2011) describes a DDoS anomaly detection
method that was developed by implementing a detection algorithm based on Hadoop
and MapReduce. This technique works against the HTTP GET flooding attack.
Counter-based detection techniques are used that rely on total traffic volume or the
number of page requests. The response rate against page requests is also considered
to reduce the false positive rate. The detection algorithm takes three input parameters:
time interval, threshold, and unbalance ratio. The time interval specifies the
monitoring duration of page requests; the threshold indicates the permitted frequency
of page requests to the server, beyond which an alarm is raised; and the unbalance
ratio denotes the anomaly ratio of responses per page request for a specific client.
These parameters are loaded through a configuration property or the cache
mechanism of MapReduce and help in identifying attackers among the clients.
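A sketch of the counter-based detection logic, with the three parameters as plain constants; the parameter values and the per-client count format are illustrative assumptions, not the values used in (Lee and Lee, 2011):

```python
# Sketch of counter-based DDoS detection. A client is flagged when its
# request count in one window exceeds the threshold AND it largely ignores
# responses (responses-per-request below the unbalance ratio).
# All three parameter values are illustrative assumptions.

TIME_INTERVAL = 60      # monitoring window in seconds
THRESHOLD = 100         # permitted page requests per client per window
UNBALANCE_RATIO = 0.1   # minimum acceptable responses-per-request ratio

def flag_attackers(requests, responses):
    """requests/responses: {client_ip: count} within one TIME_INTERVAL."""
    attackers = set()
    for ip, req in requests.items():
        resp = responses.get(ip, 0)
        if req > THRESHOLD and (resp / req) < UNBALANCE_RATIO:
            attackers.add(ip)
    return attackers
```

The unbalance check is what reduces false positives: a legitimate heavy user still consumes the responses it requests, while a flooding client does not.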
Another method based on access pattern detection is described in (Lee and
Lee, 2011), which separates the attackers from clients by using two MapReduce jobs.
The first job obtains an access sequence of a web page between a client and server and
calculates the time spent and byte count for each request to the URL. The second job
compares this pattern and the time spent in trying to access the server with that of the
infected hosts. This job is based on the assumption that clients infected by the same
bot exhibit similar behavior, thus helping in differentiating them from normal clients.
These simple DDoS attack detection methods are implemented in Hadoop, and a
performance gain is observed when multiple nodes are used in parallel.
A case study published by Zions Bancorporation (DarkReading, 2012) reported that
Hadoop clusters and Business Intelligence (BI) tools parse more data more quickly
than traditional SIEM tools. The HDFS file system makes it easy for administrators to
run Java-based queries against data spread over multiple systems. This helps in
performing analysis of large datasets in a timely manner, which was not possible with
traditional database systems. For example, when searching a month's worth of data,
where a traditional system might take 20 minutes to an hour, a Hadoop system using
Hive (SQL-friendly) to run queries took about a minute (DarkReading, 2012). Using
MapReduce, Hadoop, and Hive, data can be pulled in every two or five minutes
depending on the requirement.
Hive is an open-source Data Warehouse solution built on top of Hadoop which
provides an SQL-like declarative language called HiveQL (Hive, 2009). Queries
written in HiveQL are compiled as MapReduce jobs, which are executed using
Hadoop. At the same time, customized MapReduce jobs can be plugged in through
Hive if programmers find it hard to express their logic in HiveQL. HiveQL provides
query capabilities that allow users to perform various data operations that are
performed on traditional systems, such as filtering rows from a table using a where
clause; storing the results of a table into another table; performing equi-joins between
two tables; managing tables and partitions using operations such as create, drop and
alter; and finally storing the results of a query directly in an HDFS directory. Hive
tables can be organized into partitions and buckets, which provide a quicker way to
access a specific portion of the data. Hive allows different kinds of database tables to
be created over HDFS, and different schemas can be applied to the same dataset. Hive
is also highly extensible, supporting different formats, types, and functions.
IBM DeveloperWorks provided use cases to explain how Hive provides
various advantages when used for data analytics. For most of the use cases, the inner
join is considered the default and most common join operation used in applications
(Gilani and Ul Haq, 2013). For instance, consider a use case involving a join of Call
Detail Records (CDR) with network logs based on a join predicate. Besides running a
direct join query similar to that of SQL, various possibilities are explained in detail,
such as creating a partition on the table that helps in improving speed and efficiency
when dealing with large datasets, or developing a custom user-defined function (UDF)
that can be loaded into a Hive command-line interface and used repeatedly. This
provides better performance because of the use of lazy evaluation and short-circuiting
techniques. It also supports non-primitive parameters and a variable number of
arguments.
Besides using Hadoop and HBase, this thesis research also utilizes Hive
features when working on the network datasets and accessing them by running simple
select queries and identifying unique hosts in each log file. Hive was also used to
calculate certain factors related to the number of connections in a given time period
and the number of flows to the same destination. While more data leads to better
analytics and more value derived from the data, special algorithms and techniques
are required to mine the data effectively and extract the desired results.
Data Mining and Machine Learning Approaches
The conventional methods of providing security against cyber attacks employ
tools such as firewalls, authentication tools, and VPNs. However, these mechanisms
always have vulnerabilities caused by careless design or implementation flaws
(Chandola et al., 2006). Monitoring systems have therefore been developed, but they
require human intervention, depend on signatures, and have other limitations. In
addition, detecting novel attacks and processing huge amounts of data have become
challenging. These circumstances led to increasing development in the area of data
mining for threat detection to address different aspects of cybersecurity.
The research in (Chandola et al., 2006) developed MINDS (Minnesota
Intrusion Detection System), which is a suite of different data mining-based
techniques to detect different types of attacks. The MINDS system contains an
anomaly detection approach, which is effective in detecting anomalies in network
traffic and preventing DoS (denial-of-service) attacks. In this approach, a model is
built with normal data and the deviations in the given data are detected using the
normal model. Anomaly detection algorithms have an advantage over other
techniques in that they can detect threats or attacks that deviate from normal usage
even when there are no signatures or labeled data. Also, unlike other detection
schemes such as misuse detection, MINDS does not require any explicitly labeled
training data set. The MINDS system uses the LOF (local outlier factor) algorithm,
which detects outliers in data by comparing the densities of various regions in the
network data.
In the LOF algorithm, eight features are derived based on 'time-window' and
'connection’. The number of flows to a unique destination, number of flows to a
unique source, number of flows from one source to the same destination, and lastly,
the number of flows to one destination using the same source port are the four
different factors derived based on the last given number of seconds and flows. All of
these features can be extracted without having to look at the packet contents. The LOF
algorithm computes the similarity between pairs of flows, which contain a
combination of categorical and numerical features. The neighborhood around each
data point needs to be constructed. To limit the computational complexity, all data
points are compared against a sample training data set, which not only provides
efficiency but also improves the anomaly detector's output. Since the training data set
contains mostly non-anomalous flows, the LOF score will be high for anomalous
flows and very low for normal flows. For each flow, the nearest-neighbor set is
computed and used to compute the LOF score for that flow. All the flows are then
sorted by score and sent to the analyst for further action; the flows with the highest
scores are the most anomalous. A few methodologies have been developed to
summarize this information and convey the anomalous flows in a smaller but
meaningful representation when displaying results to analysts.
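The LOF intuition, that a point is anomalous when its local density is much lower than that of its neighbors, can be sketched for one-dimensional data; this is a simplified illustration, not the full LOF algorithm used by MINDS:

```python
# Simplified illustration of the LOF intuition: a point's outlier score is
# the ratio of its neighbors' average local density to its own density.
# One-dimensional points and k=2 are assumptions for illustration.

def knn(points, p, k):
    return sorted((q for q in points if q != p), key=lambda q: abs(q - p))[:k]

def local_density(points, p, k):
    neighbors = knn(points, p, k)
    avg_dist = sum(abs(q - p) for q in neighbors) / k
    return 1.0 / (avg_dist + 1e-9)

def outlier_score(points, p, k=2):
    neighbors = knn(points, p, k)
    neighbor_density = sum(local_density(points, q, k) for q in neighbors) / k
    return neighbor_density / local_density(points, p, k)
```

Points inside a dense cluster score near 1, while an isolated point scores far above 1, matching the sorted-score ranking described above.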
In 2013, a novel system, Beehive (Yen et al., 2013) was proposed which works
on the problems involved in automatic mining and extracting knowledge from dirty
log data received from various security systems involved in an enterprise. This system
was evaluated on the data collected over a period of two weeks at EMC and results
were compared with Security Operations Center reports, antivirus software alerts and
feedback from enterprise security specialists. It was found that Beehive detected
malware infections and policy violations that went undetected by these
aforementioned security tools. An algorithm was designed based on the k-Means
clustering algorithm, but without the need to specify the number of clusters. Initially,
it selects a random vector as the first cluster hub, identifies the vector furthest from
the initial hub as a new hub, and reassigns all the vectors to the cluster with the closest hub.
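The hub-selection idea can be sketched as follows; the stopping threshold and the one-dimensional points are assumptions for illustration, and the real Beehive algorithm operates on multi-dimensional feature vectors:

```python
# Sketch of clustering without a preset k, in the spirit of the Beehive
# algorithm: start from one hub, repeatedly promote the point furthest from
# its nearest hub to a new hub, then assign every point to its closest hub.
# The stopping distance is an illustrative assumption.

def cluster(points, stop_dist):
    hubs = [points[0]]
    while True:
        # point whose nearest hub is furthest away
        far = max(points, key=lambda p: min(abs(p - h) for h in hubs))
        if min(abs(far - h) for h in hubs) <= stop_dist:
            break
        hubs.append(far)
    assign = {p: min(hubs, key=lambda h: abs(h - p)) for p in points}
    return hubs, assign
```

The number of hubs emerges from the data and the stopping distance, rather than being fixed in advance as in ordinary k-Means.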
The Beehive system has three layers: 1) parse, filter and normalize log data
using network configuration information, 2) generate distinct features for each host,
and 3) use clustering techniques to group the hosts over features and report any
incidents on hosts if they are identified as outliers or suspicious hosts. All the distinct
features that are generated are classified under the four major categories of
destination-based, host-based, policy-based, and traffic-based (Yen et al., 2013). In
detail, destination-based features keep track of new destinations, new destinations that
have no white-listed HTTP referrer, unpopular raw IP destinations, and the fraction
of IP destinations contacted by a host on that day. Host-based features monitor new
user-agent strings, which contain name, version, capabilities and operating
environment of the application that is making the request. Policy-based features check
for the domains and connections that are uncategorized, unrated, and are blocked due
to host misbehavior. Hence, for each host, the number of domains or connections
contacted by that host is counted if they are blocked or challenged for not being
categorized. Lastly, traffic-based features track all of the ‘spikes’ and ‘bursts’ of
domains and connections, where a ‘spike’ occurs when a host generates more
connections within a one-minute window than the threshold limit, and a ‘burst’ is a
time interval in which every minute contains a connection or domain spike.
These parameters and features help understand all of the factors that need to be
considered when grouping certain hosts as malicious or suspicious in a given dataset.
Although it is the largest-scale case study conducted on a real-life production
network, it still requires automation of the detection tasks for security analysts. Apart from
detecting malicious activity on the network, many studies were conducted on detecting
exploited systems using honeypots on enterprise networks (Levine et al., 2003). Zhang
(Zhang et al., 2012) extended the work of (Chapple et al., 2007) by providing
machine-learning features that automatically detect VPN account compromises in a
university network.
The work in (Giura and Wang, 2012) proposes an attack model for detecting
APTs that are more sophisticated than worms, Trojan horses and other malware. This
model is flexible enough to work with large datasets and can accommodate any
context processing algorithm that is used for threat detection. The attack model uses
the concept of attack trees and attack pyramids to develop models of APT threats,
using a large-scale distributed computing framework to establish event correlations as
well as time correlations. An attack pyramid is a model of an APT, and the detection
framework is based on this model. All the areas where the attack evolves, such as the
user plane, network plane, and application plane, are represented as lateral planes of
the pyramid, while the goal of the attack occupies the top. These planes change based
on the environments where the events are recorded. It is assumed that, in order to
reach the goal, the attacker explores vulnerabilities and navigates from one plane to
another, making the attack look like a tree spanning multiple planes.
Eventually, this model uses all of the events recorded in the environment to
detect the attacks where each individual event causes a security alert. This model
collects candidate events, which do not necessarily represent attack activity but may
potentially contain some traces generated by APT. The model records suspicious
events, which are reported by security mechanisms as events containing suspicious,
abnormal, or unexpected activity. The model also includes attack events reported by
security mechanisms, such as a request to a known domain that is expected to contain
malicious binary code. All these events are correlated with the
events in other planes, where each plane constitutes a group of specific events. All the
entry logs, hiring events, and assets status logs are grouped under the physical plane.
Hierarchy updates, contact updates and affiliation updates are grouped under the user
plane. Firewall logs, IPS/IDS logs and net flow logs are grouped under the network
plane; and DNS logs, email logs, HTTP logs, and authentication logs are collected
under the application plane.
The methodology proposed by (Giura and Wang, 2012) correlates in parallel
all the relevant events across all these pyramid planes into contexts, with various
algorithms run in parallel for each context using the MapReduce framework. This
works best when multiple worker nodes are run in parallel, which also provides the
flexibility to use any detection algorithm that can run in parallel. However, running
this model using Hadoop and MapReduce with more complex detection algorithms is
something that still needs to be investigated.
A DDoS attack detection model based on data mining was proposed by
(Zhong and Guangxue, 2010), which can detect abnormalities present in the network
traffic. This model is designed to reduce the system load and improve the
performance of DDoS attack detection in real time. It uses the FCM clustering and
Apriori association algorithms to generate a network traffic module and a network packet
protocol status module. According to (Zhong and Guangxue, 2010), attackers cause
the DDoS attack by exploring the hosts in the network with security vulnerabilities
and trying to obtain administrator rights to install their control programs. Permissions
are then given to handlers, which in turn control the attack agents. Agents then take
the action of sending a large number of packets, trying to flood the resource and
making it difficult to differentiate the attackers from the normal users. While there are
detection techniques for identifying TCP SYN FLOOD attack, those techniques
cannot identify a UDP or ICMP FLOOD attack. Similarly, other detection techniques
identified by (Gao, Feng, and Xiang, 2006), based on protocol analysis and
clustering, have the advantage of requiring little or no human intervention. However,
the number of network connections cannot be reduced to an optimal level using an
association algorithm. Hence, (Zhong and Guangxue, 2010) proposed this model, which
performs abnormal traffic detection as well as packet protocol status detection for
detecting DDoS attacks.
The network traffic value is captured using the k-Means data mining algorithm, and
the threshold value, also identified through k-Means, is adjusted automatically
depending on the packet protocol status detection. The two modules in this approach
run one after another, with the abnormal traffic detection module run first on the
network traffic. Once the network traffic crosses the threshold value, the network
packet protocol status module starts immediately. The packet protocol status is
detected by this module, and if any abnormal packets are found, an alarm is raised. If
there are no abnormal packets, the current traffic is clustered again by k-Means and a
new threshold value is generated.
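A sketch of this two-stage loop, with a simple 2-means split standing in for the k-Means threshold computation; the clustering details and the abnormal-packet predicate are assumptions for illustration:

```python
# Sketch of the two-stage DDoS detection loop: derive a traffic threshold
# from a 2-means split of recent traffic volumes, then run protocol checks
# only when the threshold is crossed. The 2-means split and the packet
# predicate are simplified stand-ins for the models in the paper.

def two_means_threshold(volumes, iters=10):
    lo, hi = min(volumes), max(volumes)
    for _ in range(iters):
        a = [v for v in volumes if abs(v - lo) <= abs(v - hi)]
        b = [v for v in volumes if abs(v - lo) > abs(v - hi)]
        lo = sum(a) / len(a) if a else lo
        hi = sum(b) / len(b) if b else hi
    return (lo + hi) / 2  # boundary between normal and heavy traffic

def detect(volume, threshold, packets, is_abnormal):
    if volume <= threshold:
        return "normal"
    return "alarm" if any(is_abnormal(p) for p in packets) else "recluster"
```

The "recluster" branch mirrors the described behavior: when heavy traffic contains no abnormal packets, the traffic is clustered again and a new threshold is generated.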
The work in (Zhong and Guangxue, 2010) demonstrated good use
of k-Means for identifying the threshold value. However, the use of association
algorithms in this context may identify too many rules, which are not always
guaranteed to be relevant (Garcia, 2007). Hence, this research uses the Logistic
Regression classification algorithm along with the Fuzzy k-Means algorithm, which
are well suited for fraud detection and for classifying attack traffic within network
traffic. Also, the results obtained from the two algorithms are matched against each
other and inferences are drawn to ensure the accuracy of the results.
CHAPTER III
SELECTION AND PREPARATION OF CYBERSECURITY
DATASETS
This chapter provides details about the cybersecurity datasets that were
chosen as part of this research, the preparation performed on the datasets before
storing them in HBase, and the preliminary processing of the datasets to identify
attack-related information.
The datasets used in this research were provided by the CAIDA (The CAIDA,
2007) organization. The datasets are publicly available for research purposes in
various categories, such as attacks, worms, monitored logs, anonymized internet
traces, and network topology and traffic datasets. CAIDA provides free access to all
the datasets while preserving the privacy of individuals and organizations that donated
the data. Initially, CodeRed (Chien, 2007) and Wittyworm (Schneier, 2004) attack
datasets were analyzed. Both the attack datasets contain similar details, such as start
times, end times, time durations of hosts and machines performing the transmission,
and country distribution of code-redv2-infected and witty-infected computers.
However, both datasets are heavily anonymized, so that no IP-address-related
information is revealed. Since the goal of this research is to identify attack-related
traffic from the integrated datasets, network log files of DDoS attack (DDoS Attack,
2007) datasets were chosen, which can be easily integrated and analyzed for detecting
suspicious hosts or IP addresses.
The DDoS attack datasets are the primary focus of this research. The entire dataset
contains approximately an hour of traffic captured from a DDoS attack that occurred
in 2007 (DDoS Attack, 2007), where each file contains the data generated over a span
of five minutes, stored as a .pcap (packet capture) file. These data files are
classified under two categories, ‘to-victim’ and ‘from-victim’, where ‘to-victim’
specifies the log files that contain requests sent to the victim, and ‘from-victim’
specifies the responses received from the victim. While fourteen files are available
under each category, only four datasets are considered in this research, two data files
from each category. These datasets were chosen in
such a way that both ‘from-victim’ and ‘to-victim’ files have the same start time in
order to cover the traffic from both directions over the same time period. Each data
file contains seven columns of data, namely:
1) S.no, which is a sequence number
2) Time, which denotes the seconds’ value at which the request or response occurred
3) Source, which is the IP address of the source
4) Destination, which is the IP address of the destination
5) Length, length of the frame
6) Protocol, which indicates the protocol of the request or response
7) Info, which provides additional information about the request or response
The ‘Info’ column value contains the flags set for the request or response,
depending on the protocol used (‘TCP’ or ‘ICMP’), and the TTL (Time-To-Live)
value. The flags in a TCP packet can include one or more of
SYN (SYNCHRONIZATION), ACK (ACKNOWLEDGEMENT), FIN (FINISH),
PSH (PUSH) and RST (RESET). Figure 3.1 shows a snapshot of the DDoS attack
dataset retrieved from CAIDA in the form of the .pcap file accessed through
Wireshark.
Figure 3.1: Snapshot of the DDoS attack dataset accessed through Wireshark
In a TCP protocol packet, SYN is used to initiate a connection (Frederick,
2010), ACK is used to acknowledge the validity of the request, FIN is used to
gracefully end the connection, RST is used to abruptly end the connection, and PSH is
used to inform the receiver to distribute the message or data. There are certain key
scenarios that are tested to help in identifying abnormal packets when working on the
network traffic datasets, such as 1) packets that do not contain an ACK flag except for
the initial SYN packet, 2) packets that contain SYN and FIN flags together, with or
without other flags in the same packet, 3) packets that contain the FIN flag alone
without any other flags, 4) a packet with no flag at all, and 5) packets that contain the
source or destination port set to zero. All of these scenarios are considered to be
abnormal and help in identifying attack related traffic at the initial stages. Unlike TCP
packets, ICMP packets are not complicated and hence, do not have many
characteristics that can be considered as abnormal. Error messages between two hosts
or a host and a network are transmitted through ICMP packets. In a normal scenario,
no responses are generated for these error messages in order to avoid error message
loops (Frederick, 2010). However, a redirect message sent to a device that attempts to
convince the device that the sender is the optimal router, so that the device routes
everything to it, can be considered abnormal. As a result, such ICMP packets are considered fake.
Also, ICMP packets are generally composed of a small header and small payload.
Hence, any ICMP packet that contains a significantly large header or payload should
be considered abnormal. This is because attackers causing DDoS and other attacks
may use ICMP packets as 'containers' that hide attack-related traffic, which can make
the header or payload unusually large.
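The TCP-flag scenarios listed above can be sketched as a single predicate; the packet representation (a set of flag strings plus ports) is an assumption for illustration:

```python
# Sketch of the five abnormal-packet checks for TCP described in the text.
# The flag-set/port representation of a packet is an illustrative assumption.

def is_abnormal_tcp(flags: set, src_port: int, dst_port: int,
                    is_initial_syn: bool = False) -> bool:
    if not flags:                                   # (4) no flags at all
        return True
    if "SYN" in flags and "FIN" in flags:           # (2) SYN+FIN together
        return True
    if flags == {"FIN"}:                            # (3) lone FIN
        return True
    if "ACK" not in flags and not is_initial_syn:   # (1) missing ACK except
        return True                                 #     the initial SYN
    if src_port == 0 or dst_port == 0:              # (5) port zero
        return True
    return False
```

Running this predicate over the parsed 'Info' flags gives a first-pass filter for attack-related traffic before the heavier analytics run.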
Since all the network data files used in this research are available in ‘.pcap’
format, Wireshark was used to export the data files into .csv (comma-separated
values) format. The .csv file contains all the columns separated by commas. However, the
‘info’ column internally contains request or response information along with various
fields separated by commas. Hence, these extra comma characters are removed by
replacing them with a ‘space’ character. Also, all the fields exported from the .pcap
file into .csv format are enclosed with quotation marks. These quotation marks are also
removed as part of the data cleaning process. In the ‘to-victim’ data files, all the
requests are sent to the victim address. Hence, the ‘destination’ column has the same
value throughout the file, which is the host that was attacked. Similarly, the ‘source’
column contains the same value in the ‘from-victim’ file. Table 3.1 shows the sample
structure of the ‘from-victim’ dataset converted in .csv format.
Table 3.1: Sample table structure of ‘from-victim’ dataset converted in .csv format
No  | Time      | Source        | Destination     | Protocol | Length | Info
349 | 22.174088 | 71.126.222.64 | 198.241.152.229 | TCP      | 40     | 46426 > http [ACK] Seq=446 Ack=1461 Win=35040 Len=0
350 | 22.17437  | 71.126.222.64 | 198.241.152.229 | TCP      | 40     | 46426 > http [ACK] Seq=446 Ack=2921 Win=46720 Len=0
351 | 22.179922 | 71.126.222.64 | 198.241.152.229 | TCP      | 40     | 46426 > http [ACK] Seq=446 Ack=4381 Win=58400 Len=0
352 | 22.180239 | 71.126.222.64 | 198.241.152.229 | TCP      | 40     | 46426 > http [ACK] Seq=446 Ack=5841 Win=70080 Len=0
353 | 22.180439 | 71.126.222.64 | 198.241.152.229 | TCP      | 40     | 46426 > http [ACK] Seq=446 Ack=7301 Win=81760 Len=0
In this research, all the rows in the ‘from-victim’ file are taken as they are. But,
the rows of ‘to-victim’ file are appended to the ‘from-victim’ file in such a way that
the ‘source’ column values of ‘to-victim’ are appended to the ‘destination’ column of
‘from-victim’ file. Similarly, ‘destination’ column values of ‘to-victim’ file are
appended to the ‘source’ column of ‘from-victim’ file. A new column ‘Flag’ is
introduced to identify whether the row belongs to the ‘from-victim’ file (Flag: Yes) or
the ‘to-victim’ file (Flag: No), and the existing column ‘S.no’ is removed. With this
cross-mapping, the entire dataset can be observed as a unidirectional flow instead of a
bi-directional one, and computing the required factors, such as request, response, and
TTL values, becomes easier since there is no need to distinguish between the
‘Source’ and ‘Destination’ fields of each file.
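The cleaning and cross-mapping steps can be sketched as follows; the exact column handling in the thesis may differ, and replacing commas inside 'Info' with spaces follows the text:

```python
import csv
import io

# Sketch of the cleaning and cross-mapping steps: strip quoting, replace
# commas inside the Info field with spaces, swap Source/Destination for
# 'to-victim' rows, drop S.no, and append the Flag column.

def clean_and_map(csv_text, from_victim):
    rows = []
    for rec in csv.reader(io.StringIO(csv_text)):
        no, time, src, dst, proto, length = rec[:6]
        info = " ".join(rec[6:]).replace(",", " ")   # no stray commas remain
        if not from_victim:
            src, dst = dst, src                      # cross-map to-victim rows
        rows.append([time, src, dst, length, proto, info,
                     "Yes" if from_victim else "No"])  # S.no dropped, Flag added
    return rows
```

After this pass, both file types share one column layout with the victim always on the same side, which is what makes the unidirectional view possible.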
Figure 3.2: Cross-mapping of Source and Destination columns of from-victim and to-
victim files
Figure 3.2 illustrates the cross-mapping of the ‘Source’ and ‘Destination’
columns of ‘to-victim’ data files with that of the ‘from-victim’ data files. Both ‘from-
victim’ and ‘to-victim’ data files contain the columns, 1) S.No (N in Figure 3.2), 2)
Time (T), 3) Source (S), 4) Destination (D), 5) Length (L), 6) Protocol (P), and 7) Info
(I) in .csv format. The entire ‘Source’ column in every ‘from-victim’ file consists of
the same IP address, the victim’s IP address (v in the ‘from-victim’ file in Figure
3.2). Similarly, the ‘Destination’ column of every ‘to-victim’ data file consists of the
victim’s IP address (v in the ‘to-victim’ file in Figure 3.2). Both columns are stored in
the ‘Source’ column of the ‘rows’ table. Similarly, the various ‘Destination’ column values of
‘from-victim’ files and ‘Source’ column values of ‘to-victim’ files are stored in the
‘Destination’ column of ‘rows’ table. As shown in Figure 3.2, the ‘Flag’ column
contains ‘Yes’ (y) for all the rows containing data from ‘from-victim’ file and ‘No’ (n)
for the rows containing data from ‘to-victim’ file. When inserting the datasets into the
HBase table structure, the column names are passed as the run time arguments to a
MapReduce program along with the input and output file paths residing on HDFS.
While ‘from-victim’ column names are passed as they are, ‘to-victim’ files are passed
as T, D, S, L, P, I, and F as ‘No’ in order to achieve this cross-mapping. The data
stored in the ‘rows’ table is internally loaded into the respective ‘field’ tables, where
each table contains both directions of traffic. Further details about this storage model
are provided in the next chapter.
The DDoS attack datasets obtained from CAIDA are completely anonymized
and the non-attack traffic has been removed as much as possible (The CAIDA, 2007).
Hence, to check the strength of the models developed in this research, real network
traffic from a TTU desktop machine connected to the TTU network has been captured
for five minutes and merged with the attack data. Although this captured data can be
considered normal, non-attack-related traffic, to ensure accuracy of the results, the
sources and destinations in the captured data are verified through DNS lookups and
validated against existing blacklists to confirm the authenticity of the hosts.
CHAPTER IV
HBASE DESIGN ALTERNATIVES FOR STORING DATASETS
This chapter describes the design approach chosen for storing the integrated
datasets obtained from the previous chapter. In particular, this chapter 1) lists multiple
use case requirements that are identified for performing data analytics by running data
mining and machine learning techniques on the integrated datasets, 2) describes
various schema designs that were explored for obtaining efficiency in loading the
datasets, as well as retrieving them as per the use case requirements, and 3) illustrates
the evaluations performed on different schema designs and their performance
assessment by comparing the results.
Use Case Requirements
This research has identified certain key use cases that need to be performed on
the network datasets in order to identify the attack-related traffic from the network
traffic. They are:
1) Number of requests made by each unique IP address to the victim
2) Number of responses the same IP address has received from the victim
3) TTL value of each unique IP address that is sending requests to the victim
The TTL values are available in the ‘info’ column for each network packet. These
values are used in calculating the hop-count for each network packet corresponding to
the unique IP address. Hop-count denotes the number of hops a network packet makes
from source to its destination (Gu, 2007). This value is obtained by subtracting the
TTL value available in the ‘info’ column from the standard TTL values. Standard TTL
values are 255, 128, 64, 60, 49, 32, and 30, which depend on the protocol used and the
OS of the machine that generated the packet. Further details on the standard TTL
values and hop-count calculation are provided in the next chapter.
The following use case requirements are related to the ‘Info’ column of ‘to-victim’
data files that are generated while sending requests to the victim (Frederick, 2010).
4) Any network packet that has no flag (such as ACK, SYN, FIN etc.) set
5) Any network packet that contains the suspicious combination of SYN and FIN
flags set along with other flags
6) Any network packet that does not contain an ACK flag at all, except for the initial
connection setup
7) Packets that contain a FIN flag alone without any other flag
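Use cases 4, 5, and 7 above can be checked directly from a packet's parsed flag set, as sketched below; use case 6 requires per-connection state (to exempt the initial connection setup) and is omitted. The function name and returned labels are illustrative assumptions, not the thesis's actual code.

```python
def flag_anomalies(flags):
    """Given the set of TCP flags parsed from a packet's 'Info' column,
    return which of the suspicious conditions (use cases 4, 5, 7) it matches."""
    anomalies = []
    if not flags:                       # use case 4: no flag set at all
        anomalies.append("no-flags")
    if {"SYN", "FIN"} <= flags:         # use case 5: SYN and FIN set together
        anomalies.append("syn-fin")
    if flags == {"FIN"}:                # use case 7: a FIN flag alone
        anomalies.append("fin-only")
    return anomalies
```

A packet with a normal flag set, such as {ACK}, matches none of the conditions and is passed over.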
These use cases help in generating the statistical data for each unique IP
address that communicated with the victim. The statistical data that is generated for
each IP address based on the given factors helps determine whether the IP address is
related to an attacker or not using the data mining algorithms employed in this
research. However, these algorithms cannot be executed directly on a database system
such as HBase or Hive, since each technique expects input in its own type and format.
In this research, Logistic Regression (LR) and Fuzzy k-Means (FKM) algorithms are
used to identify the attack-related traffic. The LR algorithm expects all the factors and
parameters of IP addresses in a .csv format, whereas the FKM algorithm takes input
points represented in an n-dimensional vector space and creates clusters by identifying
the distance among the points. Hence, this research came up with an HBase storage
model where the required data can be efficiently retrieved and used for executing data
mining algorithms.
Design Alternatives
The storage model developed in this research is based on the initial work done
by Stearns (Stearns, 2014), which is a value-based data integration approach that
integrates data by columns instead of rows. Different tables are pooled into a common
collection of columns that can be effectively treated as one single table possessing all
fields. This work is inspired by neural networks (Nielsen, 2001) and involves an
inversion of the column-table structure. In this approach, data from all sources are
stored in a ‘rows’ table, which is in turn divided into many ‘fields’ tables. Here, the
‘rows’ table stores the entire integrated dataset, and the data from each column is also
stored in the corresponding ‘fields’ table, whose rowkeys are the column values followed
by a hash value of ‘Table Id’ and a hash value of ‘Row Id’. This way of storing the
data in HBase is referred to as HVID - Hadoop Value-Oriented Integration of Data.
Similarly, a ‘rows’ table also has rowkeys as a hash value of ‘Table Id’ followed by a
hash value of ‘Row Id’. In the HVID model, tables can be directly queried for rows by
direct lookup of a column value without scanning the entire table. Furthermore, an
inner join operation using a shared column among two different tables can be
performed using already collected rowIds and grouping them as required. Also, data
can be quickly merged without having to worry about what columns are used for
merging.
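The HVID key scheme above can be sketched as follows. The hash function (MD5), the truncation length, and the ‘:’ separator are illustrative assumptions; the thesis does not specify them.

```python
import hashlib

def h(value):
    # Short hash used inside the rowkeys (MD5 and the 8-character
    # truncation are assumed choices).
    return hashlib.md5(value.encode()).hexdigest()[:8]

def rows_table_key(table_id, row_id):
    # 'rows' table rowkey: hash of 'Table Id' followed by hash of 'Row Id'
    return h(table_id) + ":" + h(row_id)

def field_table_key(value, table_id, row_id):
    # 'fields' table rowkey: the column value followed by the two hashes,
    # mirroring a1:T1:row1 in Figure 4.1 (shown un-hashed there for clarity)
    return value + ":" + rows_table_key(table_id, row_id)

# Lookup by a column value: prefix-scan the field table for the value,
# strip the value prefix from each matching rowkey, and use the remainder
# to fetch full rows from the 'rows' table without a full-table scan.
```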
Figure 4.1: Method of storing datasets in rows table followed by field tables
Figure 4.1 shows datasets I and II being inserted into the rows table with the
hash value of ‘Table ID’ followed by a hash value of ‘RowId’ as the rowkey (T1:row1
in the Figure 4.1). A, B, C, D in Figure 4.1 are the column names in the rows table
which stores their corresponding values obtained from the datasets. These values are
inserted into the field tables as well, where each column name in the rows table
becomes the table name of the corresponding field table. Field tables store their data
entirely in the rowkeys, where each rowkey is a value followed by ‘Table ID’ and
‘RowId’. As shown in Figure 4.1, a1:T1:row1 is the rowkey for the field table A. This
way of storing values helps in faster retrieval of rowkeys from the field tables. Values
from multiple columns in a rows table can be fetched by collecting the rowkeys from
the corresponding field tables and then using those rowkeys to fetch the rows from the
rows table. This will avoid the need to scan the entire rows table for collecting the
required data.
The HVID model of Stearns provides an efficient integration approach for
storing as well as retrieving the datasets when compared to the standard method of
storing data in HBase. In a standard method of storing the data in HBase, the values
are stored with a rowkey that has a unique number for each row and all the columns
are grouped under one column family. To query the values, the entire table needs to be
scanned irrespective of what columns are being queried. Querying 360k rows from a
table that is stored in a standard HBase method took 11500 milliseconds (ms), whereas
retrieving all the rowkeys from one field table in HVID model took only 5500 ms.
However, using these rowkeys to fetch the entire data from the rows table takes
additional time. Even so, this model is advantageous over the standard method of
storing and retrieving data in an HBase table.
This research initially experimented with storing the DDoS datasets with the
approach of Stearns using rows table and field tables. However, when running the use
cases to calculate the number of requests made by each IP address to the server, or
when running ‘select’ operations on the entire dataset, where multiple ‘where’
conditions are involved instead of just fetching row ids, the average retrieval time was
high. For instance, running a query to retrieve 38170 rows from a dataset that has 220
k rows of data, the HVID model took 4784 ms. Retrieving 9 rows from the 220 k rows
of data took 4574 ms. This retrieval time was high because of the schema design that
was adopted, and could be improved. For the ‘rows’ table, the rowkey chosen is the combination of
‘Table Id’ followed by ‘Row Id’, appended with the session time to obtain a unique
combination of rowkeys, since the hash values of ‘Table Id’ and ‘Row Id’ are always
the same for every row. Although this creates a unique rowkey for each row, some
common problems such as ‘hot spotting’ (HBase, 2006) are caused when dealing with
large volumes of data. Hot spotting is caused when most of the clients' requests are
directed to a single node or a small set of nodes of a large cluster, keeping other nodes
idle and wasting their resources. Most of the requests that come from the clients are
reads and writes. The problem of hot spotting can lead to consequences such as
unavailability of the hot region, performance degradation, and resource wastage of
cold regions. If all the hot regions reside on a single machine, the other regions on that
machine are most likely unavailable as well. These problems can be mitigated by distributing
the data evenly to achieve a well load-balanced cluster, which can be done by changing
the rowkey design. Hence, rowkeys are the single most important aspect of HBase
table design (Khurana, 2012).
Hence, this research has explored various HBase table schema designs by
experimenting with three different rowkey designs and identifying the best possible
design that meets the use case requirements of this research. The three rowkey designs
that were tested are:
1) The ‘S.no’ field as the rowkey with a unique number for each row
2) The ‘Time’ field concatenated with ‘S.no’ as the rowkey
3) The ‘Destination’ field concatenated with ‘S.no’ as the rowkey
Among these three designs, the third one that has ‘Destination’ concatenated with
‘S.no’ gave the best data retrieval times (as shown in Table 4.1). This design has
proven to take less time to store the datasets as well as less time to retrieve that data
compared to the HVID model. Hence, this rowkey design is integrated with the HVID
model and the datasets are cross-mapped (as shown in Figure 3.2) to increase the
efficiency while retrieving the data (as explained in the Chapter 3). This model is
referred to as the HBaseSchema model. The results of the three schema designs that
were tested and the comparisons of the HVID model and the HBaseSchema model in
storing and retrieving the datasets are provided in the “Design Evaluation” section in
this chapter.
After storing the DDoS datasets in the HBaseSchema model, the statistical data
is obtained by running the use cases (as mentioned in the “Use Case Requirements”
section). This statistical data is sent as an input to the LR and FKM data mining
algorithms used in this research for identifying the attackers from the network traffic.
Figure 4.2 shows the architectural design developed in this research to identify
attackers and non-attackers from the network traffic.
Figure 4.2: Architectural design to identify attackers from network traffic
Design Evaluation
This section presents the tests performed on the three rowkey designs that were
initially chosen, the results obtained from those tests, and the comparisons drawn
between the HVID model and the HBaseSchema model in storing the datasets and
retrieving them as per the use cases. The three rowkey designs that were selected in
the initial stage are:
1) The ‘S.no’ field as the rowkey with a unique number for each row
2) The ‘Time’ field concatenated with ‘S.no’ as the rowkey
3) The ‘Destination’ field concatenated with ‘S.no’ as the rowkey
Similar tests were performed on all the rowkey designs that involve creating
the table as per the rowkey design, loading a single DDoS attack data file, and running
the use case that has a select operation on the ‘Destination’ and ‘Time’ columns. A
simple HBase program has been written in Java to perform the above tests in multiple
passes in an automated manner on a cluster (TTU - Hadoop Cluster) environment. The
average time taken for loading the dataset and running the use case is calculated
among multiple passes and the results obtained are shown in milliseconds (ms) for all
the three schema designs.
Table 4.1: Test results of loading and running one use case on a single DDoS attack
data file using Hadoop Cluster
Schema (rowkey) Design Avg. Load Time (ms) Avg. Use Case Time (ms)
1 (rowkey – S.no) 6426.0 243.70
2 (rowkey – Time + S.no) 5766.0 173.6
3 (rowkey – Destination + S.no) 6136.0 145.32
The tests were performed in three different passes, where each pass contains 50
iterations of loading the dataset and running the use case on all three schema designs.
Table 4.1 shows the average values of all three passes. The results obtained for all
three schema designs in the three different passes are also shown in Figure 4.3.
Figure 4.3: Avg. use case runtimes of individual passes for each schema design
This pattern of iterations was designed to be as simple as possible while also
providing a good degree of resilience against measurement artifacts, such as
caching and indexing effects. Since most of the use cases, such as obtaining the total number
of requests and responses as mentioned above, involve the ‘Destination’ column, and
the best possible performance results are achieved for the schema design that has
‘Destination + S.no’ as the rowkey, this particular rowkey design is selected for this
research and is integrated with the aforementioned ‘rows’ table.
[Figure 4.3 data — avg. use case runtime (ms) per pass:
Schema Design 1: 256.38, 250.48, 224.26;
Schema Design 2: 173.56, 174.68, 172.56;
Schema Design 3: 143.38, 146.08, 146.52]
As described in the previous chapter, four of the DDoS attack datasets have
been selected, with two from the ‘from-victim’ category and two from the ‘to-victim’
category. The files are integrated in a way that cross-maps the ‘Source’ and
‘Destination’ fields of each file where the ‘Source’ column of ‘to-victim’ file is
mapped to the ‘Destination’ column of ‘from-victim’ file and ‘Destination’ column of
‘to-victim’ file is mapped to the ‘Source’ column of ‘from-victim’ file. This process is
done separately for two different rowkey designs, 1) ‘TableId’ followed by ‘RowId’
(HVID model), and 2) ‘Destination’ followed by ‘Time’ (HBaseSchema model),
which is the third rowkey design that was previously identified as the best. The
‘Time’ field is used instead of ‘S.no’ since the ‘S.no’ field is not considered in this
research and has been removed as explained in the previous chapter. Different load
times (in ms) have been observed for all the data files that have been loaded into the
two models, with the HBaseSchema model taking less time to load the datasets than
the HVID model.
The graph in Figure 4.4 shows the time taken by the two models (in ms) to
load the datasets I and II. Here, each dataset consists of the two data files started at the
same time that captured the traffic sent from and to the victim. Both models are loaded
with the same datasets, which contain a total of 217457 rows of data.
Figure 4.4: Time taken to load datasets in HVID and HBaseSchema models
[Figure 4.4 data — load time (ms):
HVID: Dataset I (108674 rows) 150111, Dataset II (108783 rows) 154227;
HBaseSchema: Dataset I 124655, Dataset II 138017]
For this research, a namespace is created using an HBase shell. Using the
namespace, a ‘rows’ table and a ‘meta’ table are created. The ‘rows’ table stores the
datasets in their entirety, with the rowkey as a hash value of the ‘Destination’ column
value appended with a hash value of the ‘Time’ column value. This way of hashing
helps to avoid problems like hot spotting (HBase, 2006) as the data is evenly
distributed and provides faster retrieval of data. The ‘meta’ table stores the
information about the name, type, and length of each column from the dataset. The
‘Field’ tables are created individually for the fields ‘Time’, ‘Source’, ‘Destination’,
‘Length’, ‘Protocol’, ‘Info’, and ‘Flag’, with a datatype of ‘String’ and a rowkey as the
column value concatenated with a hash value of ‘Table Id’ and a hash value of ‘Row
Id’. Datasets are loaded into the ‘rows’ table through a customized MapReduce
program. The datasets are also loaded into their respective ‘field’ tables.
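The hashed rowkey for the ‘rows’ table can be sketched as below. MD5 is an assumed choice here, since the thesis does not name the hash function used.

```python
import hashlib

def _h(value):
    # MD5 is an illustrative assumption; the thesis does not name the hash.
    return hashlib.md5(value.encode()).hexdigest()

def hbase_schema_rowkey(destination, time):
    """'rows' table rowkey in the HBaseSchema model: a hash of the
    'Destination' value appended with a hash of the 'Time' value."""
    return _h(destination) + _h(time)

# Hashing the 'Destination' prefix spreads otherwise-adjacent rows across
# regions, which is what avoids the hot-spotting problem described earlier.
k1 = hbase_schema_rowkey("192.168.0.1", "0.000001")
k2 = hbase_schema_rowkey("192.168.0.1", "0.000002")
```

Rows for the same destination share a hash prefix (so related data stays co-located), while different destinations land in different regions.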
Among the four datasets that are collected, 35 unique IP addresses have been
identified which have flooded the victim with false requests and caused the DDoS
attack. For all of these unique IP addresses, request and response counts are calculated
and the time (in ms) taken to obtain these results is compared against the traditional
way of storing datasets in HBase and the HVID model. After calculating the average
time taken to run these use cases for all three models, it is noticed that the
HBaseSchema model took slightly less time than that of the HVID model, and much
less time when compared to the standard HBase model. Figure 4.5 depicts the average
use case times (in ms) for all three models.
Figure 4.5: Avg. time taken to run use cases in Standard HBase, HVID and
HBaseSchema models
Using this model, required factors such as the total number of requests and the
total number of responses in the datasets are fetched and calculated per second in
order to obtain the request/response ratio per second. TTL values are also fetched for
all the unique IP addresses that were identified previously, in order to identify the
network packets that have abnormal TTL values. Data mining and machine learning
techniques are then used on these factors to identify the attack-related traffic from the
network traffic. The details of these results are provided in the next chapter.
All of these tests have been performed on IBM BigInsights v2.1.2 that comes
with an installation of Hadoop, HBase and other big data tools, running on a 64-bit
virtual Linux machine. This was installed on a standalone Windows-7 64 bit machine
that has 8 GB RAM and a 2.99 GHz processor.
[Figure 4.5 data — avg. use case time (ms):
Standard HBase: 5250.1; HVID: 4005.7; HBaseSchema: 3863.1]
CHAPTER V
DATA MINING AND MACHINE LEARNING APPLIED TO
CYBERSECURITY DATASETS
This research aims to create a model using Hadoop and HBase for storing the
integrated datasets and efficiently retrieve statistical data associated with the datasets,
which can be used to run data mining and machine learning algorithms to identify
attack-related traffic from regular traffic. In the previous Chapters 3 and 4, various
HBase models and schema designs have been discussed that provide efficient ways for
data storage and retrieval. This chapter provides details about the data mining and
machine learning techniques such as the Logistic Regression (LR) classification
algorithm (Komarek, 2004) and the Fuzzy k-Means (FKM) clustering algorithm
(Bezdek, 1981), applied on the statistical data to identify the attack related traffic. This
research has experimented with the two algorithms to attain accuracy in the results and
to identify the best algorithm that is potentially capable of identifying attackers in the
network datasets. Comparisons drawn between the results obtained from the LR and
FKM algorithms are also described in this chapter.
Use of Logistic Regression
The LR algorithm is a Supervised Learning algorithm, whereas the FKM
algorithm is an Unsupervised Learning algorithm. Machine Learning is a concept of
Artificial Intelligence by which a machine can learn from data without the
need for an explicitly written program (Kovahi, 1988). Machine learning algorithms
operate by generating a model from the sample data provided to them. This type of
algorithm where sample input and output are provided, and rules are generated to map
the input to the output is called Supervised Learning (Russell, 2003). Examples of
Supervised Learning algorithms are Logistic Regression, Decision Trees, Support
Vector Machine (SVM) and k-Nearest Neighbor (KNN). On the other hand, machine
learning algorithms that do not take any labeled data but identify a hidden structure
and model in the data are termed Unsupervised Learning algorithms. Examples of
Unsupervised Learning algorithms are k-Means, Fuzzy k-Means algorithms and
hidden Markov models.
This research preferred the use of LR for its ability to provide quick and
accurate results while performing classification, as well as its simplicity over other
supervised learning algorithms such as Decision Trees (Mohri, 2012). The LR
algorithm has been employed in the past in cases of fraud detection or spam detection
(Owen, 2012). It is best suited to scenarios that produce output in a binary form of
0 or 1. For instance, classifying an email as spam or non-spam is denoted as 1 or 0.
This research has also used Apache Mahout (Mahout, 2009), which is an open
source platform that provides implementations for various classification, clustering,
association, and recommendation-related algorithms. Since Mahout runs on top of
Hadoop, the algorithms can be run on a single machine or as a MapReduce program
that takes input files from an HDFS cluster. Because no single algorithm
works best for all problems, the user must identify which algorithm best suits a
given use case. This research utilizes the LR classification algorithm provided by
Mahout that helps in easy and automatic processing of large datasets to classify each
IP address either as an attacker or a non-attacker.
LR provided by Mahout uses a simple, sequential (non-parallel) method called
Stochastic Gradient Descent (SGD) that has a low overhead when compared to other
methods when working on datasets (Owen et al., 2012). Although SGD is non-
parallel, it is fast and can handle large training sets that contain millions of
samples. SGD provided by Mahout is an online learning algorithm, which means that
models can be learned in an incremental fashion and the performance of each model
can be tested while the system continues to run. In this algorithm, every training
sample is used to tweak the model until the best possible model is attained. This
process is repeated over all the available training examples. Mahout has built-in
classes such as CrossFoldLearner that can perform the online evaluation, including
cross-validation. Cross-validation refers to dividing the dataset into 5 equal sets (20%
each) out of which, one set (20%) is used for evaluation and the remaining 80% data is
used for training. This is repeated in a round-robin fashion such that for all the
possible five runs, five different sets are used for evaluation, while the rest is used for
training the model. At the end of the five runs, the best model that fits the dataset can
be identified. This method also avoids the problem of overfitting. Overfitting refers to
the scenario, where the model works fine on the training data that is used for
generating the model, but fails to classify new samples when a new set of data is
introduced for evaluation (Pedregosa, 2011).
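The round-robin cross-validation scheme described above can be sketched as follows. This is a simplified illustration of the splitting pattern only, not of Mahout's CrossFoldLearner internals.

```python
def five_fold_splits(samples):
    """Yield the 5 round-robin cross-validation splits: each run holds out
    one 20% slice for evaluation and trains on the remaining 80%."""
    k = 5
    fold = len(samples) // k
    for i in range(k):
        held_out = samples[i * fold:(i + 1) * fold]   # the 20% evaluation slice
        training = samples[:i * fold] + samples[(i + 1) * fold:]  # the other 80%
        yield training, held_out
```

Over the five runs, every sample is used for evaluation exactly once, which is what lets the best-fitting model be identified while guarding against overfitting.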
Since the datasets obtained from CAIDA contain purely attack traffic, with all
non-attack traffic removed, network traffic has been captured for five minutes from a
TTU desktop machine connected to the TTU network and merged with the attack data
as described in the previous section. As noted, the data files
obtained from CAIDA contain purely attack data, and the data files generated from the
TTU machine contain purely non-attack data. However, it helps to generate a robust
model using the LR algorithm by passing this mixture of data files as training samples
for the learning phase of the algorithm. Moreover, this model can be tested by running
it on any network traffic dataset just by using the required statistical data for all the
suspicious IP addresses as the input file.
The datasets that were previously integrated and stored in the HBase model, as
described in the previous section, contain over 200k rows of data. Among these
various rows of data that have both requests and responses, only 34 unique IP
addresses are found. For all 34 unique IP addresses, the number of requests and
responses generated per second are calculated and an input file in the form of .csv is
created. The same process is repeated for the non-attack traffic captured from the TTU
machine, by identifying 7 unique IP addresses and calculating their request and
response counts per second. The statistical data obtained from the non-attack data file
is merged with that of the attack data file.
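The per-IP statistics described above (requests per second, responses per second, and their ratio) can be sketched as follows. The tuple layout and function name are illustrative assumptions; in the cross-mapped ‘rows’ table, the ‘Flag’ value ‘No’ marks to-victim traffic (a request) and ‘Yes’ marks from-victim traffic (a response).

```python
from collections import defaultdict

def per_ip_stats(packets, duration_sec):
    """Build the per-IP statistics used as LR input. Each packet is a
    (remote_ip, flag) pair: flag 'No' counts as a request to the victim,
    'Yes' as a response from the victim."""
    req = defaultdict(int)
    res = defaultdict(int)
    for remote_ip, flag in packets:
        if flag == "No":
            req[remote_ip] += 1
        else:
            res[remote_ip] += 1
    stats = {}
    for ip in set(req) | set(res):
        r = req[ip] / duration_sec
        s = res[ip] / duration_sec
        # request/response ratio; an IP with no responses gets infinity,
        # a hallmark of flooding traffic
        stats[ip] = (r, s, r / s if s else float("inf"))
    return stats
```

Each resulting row (one per unique IP) is then written to the .csv input file for the training phase.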
The Logistic Regression algorithm is executed through Mahout in two stages,
namely TrainLogistic and RunLogistic. During the TrainLogistic phase, a training
dataset that constitutes about 80% of the original input dataset is passed and this
algorithm takes several key arguments as inputs. The arguments are:
1) input - path for the input file (.csv format preferred)
2) output - path and name of the model, where the generated model is placed
3) predictors - list of predictor variables, which help in classifying the target
4) types - data types of each predictor variable that can be a numeric, word, or text
5) categories - number of categories the target variable contains
6) passes - number of times the input file should be re-examined
(small input files may require a dozen passes)
7) target - a field in the input file that contains the target variable
8) rate - sets the initial learning rate (usually higher value required for large input
files), and
9) features - sets the size of the internal feature vector that is used in building the
model
Once the model is generated, it outputs the coefficients (B0, B1, B2 ... Bn)
calculated for all the predictor variables (x1, x2 … xn), as well as the intercept term (which
is a built-in predictor variable for LR). This model is used to perform the evaluation of
classifying the samples from the given input dataset by running the RunLogistic
algorithm through Mahout. This algorithm takes fewer arguments
compared to that of the training phase. The arguments are:
1) input - path for the input file that needs to be evaluated
2) model - path for the model that is used to evaluate the input data
3) auc - (area under curve) that determines the strength and quality of the model,
which ranges from 0 to 1, with 1 considered as the best
4) scores - prints the target variable value along with their scores for each input value
5) confusion - prints the confusion matrix that determines the number of correctly and
incorrectly classified samples.
All the coefficient values obtained from the model and predictor values fetched
from the input data file are substituted in the formula below and the function values
are computed. The output values that are close to 1 are labeled as 1 and the values that
are close to 0 are labeled as 0 under the target column.
Logistic Regression Formula: 1 / [1 + e^−(B0 + B1X1 + B2X2 + … + BnXn)]
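As an illustration of applying the formula, the sketch below evaluates the logistic function for one sample; the coefficient values used here are made up for illustration, not ones produced by the model.

```python
import math

def logistic(x, b0, b):
    """Evaluate 1 / (1 + e^-(B0 + B1*x1 + ... + Bn*xn)) for one sample
    with predictor values x and coefficients b0 (intercept) and b."""
    z = b0 + sum(bi * xi for bi, xi in zip(b, x))
    return 1.0 / (1.0 + math.exp(-z))

# Outputs close to 1 are labeled 1 and outputs close to 0 are labeled 0
# under the target column. Coefficients below are illustrative only.
p = logistic([62.156, 1.0, 62.156, 47], b0=0.5, b=[-0.1, 0.2, -0.1, 0.01])
```

With these made-up coefficients, the heavy request rate drives the score toward 0, i.e., the attacker label.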
Initially, two factors are considered as predictor variables for generating the
model, 1) requests per second calculated for each unique IP address, and 2)
request/response ratio per second. However, the auc that determines the strength and
quality of the model was 0.61, which is considered above average. For a model, the
auc values range from 0 to 1, where 1 indicates that the model is best, 0 indicates that
it is a perverse model, and 0.5 is considered as average or random. Out of 23 samples,
20 were classified by the model when run on the same input dataset that is used to
generate the model. Upon running the model on newly added samples in addition to
the existing ones, the auc value dropped to 0.5, which is considered an average
value.
Table 5.1: Sample values (poor results) of the model generated using a single dataset
Target Model-Output Log-Likelihood
0 0.000 -0.000000
0 0.000 -0.000000
1 1.000 -0.000000
1 0.000 -100.000000
AUC = 0.61
Confusion: [[19.0, 3.0], [0.0, 1.0]] //Confusion Matrix
Table 5.1 shows four rows out of the 23 rows obtained by running the model
on the same sample dataset that was used to generate this model. The Target variable
shows whether each IP address is classified as an attacker (0) or non-attacker (1).
Model-Output presents the probability predicted by the model for each input sample,
and Log-Likelihood is the logarithm of the probability the model assigns to the
correct class of that sample (here, an IP address).
LR algorithm is the Confusion Matrix, which determines the number of correctly
classified and incorrectly classified samples. Using this sample model that was
generated, 20 (19+1 in the matrix) samples are correctly classified, with 19 as
attackers (0) and 1 as a non-attacker (1), while 3 (3+0) samples are incorrectly classified:
these are false positives, meaning 3 non-attacker samples are incorrectly
classified as attackers. Although the model classifies 20 samples out of 23
correctly, it has poor auc, model-output, and log-likelihood values, and is therefore
considered an unfit model.
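The reading of the confusion matrix can be sketched as follows; treating the diagonal entries as correct classifications and the off-diagonal entries as misclassifications follows the interpretation given above, and is an assumption about the exact layout Mahout prints.

```python
def confusion_summary(matrix):
    """Summarize a 2x2 confusion matrix: diagonal entries are correctly
    classified samples, off-diagonal entries are misclassified ones."""
    correct = matrix[0][0] + matrix[1][1]
    incorrect = matrix[0][1] + matrix[1][0]
    total = correct + incorrect
    return correct, incorrect, correct / total
```

For the matrix above, [[19.0, 3.0], [0.0, 1.0]], this yields 20 correct and 3 incorrect classifications out of 23.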
To improve the quality of the model, all of the 41 unique IP addresses are
considered (34 attacker + 7 non-attacker IP addresses) and the number of predictor
variables is increased by considering TTL values of all the IP addresses along with the
request counts, response counts and request/response ratio per second. Table 5.2
shows a snapshot of the dataset that was prepared for executing the LR algorithm.
Table 5.2: Snapshot of the input dataset prepared for LR algorithm
IP Req_per_Sec Res_per_Sec Req/Res Hop-count
2 0.013 0.026 0.5 9
2 62.156 1 62.156 47
2 82.841 1 82.841 48
2 7.175 1.001 7.167 54
2 0.145 0.251 0.577 20
2 61.545 1 61.545 56
2 0.006 0.006 1 24
2 1.146 0.983 1.165 50
2 0.256 0.3 0.853 11
2 58.943 1 58.943 50
1 0.03 0.03 1 0
1 0.046 0.023 2 0
1 0.016 0.016 1 0
1 0.003 0.003 1 0
1 0.016 0.003 5 0
1 0.003 0.003 1 0
1 0.003 0.003 1 0
For any network packet, the standard TTL values generated from
Windows-based, Unix-based, or Linux-based machines are 255, 128, 64, 60, 32, and 30 (using
the TCP, UDP, or ICMP protocol), and 49 (for the HTTP protocol) (Gu, 2007). For any packet
that has a TTL value other than these given ones, its hop-count is said to be greater
than 0. Hop-count indicates the number of times a packet was passed through another
router or device before reaching the destination. Every time a packet is passed to
another device, its TTL is reduced by one and its hop-count increases by 1. Since the
dataset contains only the latest TTL values and the original TTL value with which the
packet was initiated is not known to us, the hop-count can be calculated by subtracting
the latest TTL value from the smallest value of the standard TTL values that is greater
than the latest TTL. For instance, if the TTL value that is obtained from the packet is
106, then it should be subtracted from 128 as it is the least value among the standard
values that are greater than 106. Hence, the hop-count is 22. Similarly, hop-counts are
calculated for all the IP addresses. The input file that contains req (requests per
second), res (responses per second), req/res ratio, and hop-count in .csv format for all
41 IP addresses is prepared.
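The hop-count rule above can be sketched as follows; this is a minimal sketch using the standard TTL values listed, not the thesis's actual implementation, which is not shown.

```python
# Standard initial TTL values, per the list above.
STANDARD_TTLS = [255, 128, 64, 60, 49, 32, 30]

def hop_count(observed_ttl):
    """Subtract the observed TTL from the smallest standard TTL value
    greater than it; a TTL that is itself standard means zero hops."""
    if observed_ttl in STANDARD_TTLS:
        return 0
    ceiling = min(t for t in STANDARD_TTLS if t > observed_ttl)
    return ceiling - observed_ttl
```

For the worked example above, an observed TTL of 106 falls under the ceiling 128, giving a hop-count of 22.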
Out of these 41 rows, 33 (28 attacker + 5 non-attacker rows) are passed to the training phase as the learning dataset, with the same parameters as mentioned previously plus the additional predictor variables. The model generated from the training phase is then evaluated on the same dataset to assess its quality and classification performance. Table 5.3 shows the output generated from the evaluation phase.
Table 5.3: Sample values of the model generated using incremented variables and
datasets
Target Model-Output Log-Likelihood
0 0.296 -0.351396
0 0.176 -0.193691
0 0.060 -0.061848
1 0.976 -0.023981
1 0.997 -0.003243
AUC = 0.90
Confusion: [[28.0, 0.0], [0.0, 5.0]] //Confusion Matrix
The confusion matrix shows that all 33 samples of the training dataset have been correctly classified: 28 attackers (0) and 5 non-attackers (1), with an AUC of 0.90, the best value obtained across the training runs. Hence, this model is used for assessing the whole input dataset of 41 samples. Table 5.4 shows the values obtained when the model is run on the complete dataset.
Table 5.4: Sample values of the model evaluated on the complete input dataset
Target Model-Output Log-Likelihood
0 0.177 -0.194467
0 0.209 -0.234551
0 0.060 -0.061848
1 0.976 -0.023981
1 0.998 -0.002318
AUC = 0.89
Confusion: [[34.0, 0.0], [0.0, 7.0]] //Confusion Matrix
Similarly, when the same model is run on the full input dataset, which includes samples beyond those used for training, the AUC value is 0.89 and all samples are correctly identified: 34 attackers and 7 non-attackers.
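The Log-Likelihood column in Tables 5.3 and 5.4 is the standard per-sample log-likelihood of a binary classifier: ln(p) when the target is 1 and ln(1 - p) when the target is 0. Because the model outputs in the tables are rounded to three decimals, recomputing from them agrees with the reported values only approximately. A minimal sketch:

```python
import math

def log_likelihood(target, model_output):
    """Per-sample log-likelihood: ln(p) if target is 1, ln(1 - p) if target is 0."""
    p = model_output if target == 1 else 1.0 - model_output
    return math.log(p)

# First and fourth rows of Table 5.3: (target, rounded model output, reported LL).
for target, output, reported in [(0, 0.296, -0.351396), (1, 0.976, -0.023981)]:
    print(f"{log_likelihood(target, output):.6f} vs reported {reported}")
```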
Use of Fuzzy k-Means
As a data mining technique, this research has used the Fuzzy k-Means (FKM) clustering algorithm (Bezdek, 1981) for grouping IP addresses that exhibit similar characteristics into one cluster, which helps segregate attack-related traffic from regular network traffic. Fuzzy k-Means was chosen for its ability to identify soft clusters, unlike the original k-Means algorithm (MacQueen, 1967), which identifies only hard clusters. In a soft cluster, any given point can be placed in more than one cluster, which creates overlapping clusters in the dataset. In a hard cluster, each point belongs to exactly one cluster, ruling out the possibility of identifying attack-related characteristics in normal traffic. The FKM algorithm instead assigns each point an affinity (membership) value toward every cluster; this affinity decreases as the point's distance from a cluster's centroid increases. Hence, the FKM algorithm creates soft clusters as an extension of the k-Means algorithm (Owen, 2012). FKM works best when clusters overlap or mix with each other and the separation between points is not clear-cut, cases in which k-Means cannot provide a good solution.
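The affinity value can be illustrated with the standard fuzzy k-means membership formula, in which a point's membership in a cluster grows as its distance to that cluster's centroid shrinks. This is a generic one-dimensional sketch, not Mahout's implementation, and the centroids used here are illustrative.

```python
def memberships(point, centroids, m=2.0):
    """Fuzzy k-means memberships of a 1-D point in each cluster:
    u_i = 1 / sum_j (d_i / d_j)^(2 / (m - 1)), with d_i the distance
    to centroid i. Memberships are positive and always sum to 1."""
    dists = [abs(point - c) for c in centroids]
    if 0.0 in dists:  # the point sits exactly on a centroid
        return [1.0 if d == 0.0 else 0.0 for d in dists]
    exp = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** exp for d_j in dists) for d_i in dists]

# A point lying between two centroids belongs partly to both clusters.
print(memberships(5.0, [1.0, 15.0, 84.0]))
```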
The FKM algorithm has built-in methods for measuring distance and for representing each point in an n-dimensional vector space. The input file for the FKM algorithm uses the same characteristics as the LR algorithm's input file: the number of requests per second, the number of responses per second, and the hop-count value for all 41 unique IP addresses. Unlike the LR algorithm, however, FKM does not take predictor variables; it depends solely on the input points represented in vector space and on the distance of each point from its neighboring
points. Hence, in order to represent all the key characteristics of each IP address through a single point that can be used in forming clusters, the observed characteristics for each IP address (requests per second, responses per second, and hop-count) are summed to form a single value. The sum gives equal weight to all the parameters, as they all play an equal role in determining whether an IP address is an attacker or non-attacker. The algorithm is executed in four steps, whether it runs on a single machine or via MapReduce in a cluster environment.
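The summation described above can be checked against the data: the first row of Table 5.2 (0.013 requests/s, 0.026 responses/s, hop-count 9) collapses to 9.039, one of the values in the FKM input list. A minimal sketch, with `combine` as a hypothetical helper name:

```python
def combine(req_per_sec, res_per_sec, hop_count):
    """Sum the three observed characteristics of an IP address into the
    single scalar used as that address's FKM input point."""
    return req_per_sec + res_per_sec + hop_count

print(round(combine(0.013, 0.026, 9), 3))  # first row of Table 5.2 -> 9.039
```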
Step 1:
In the first analysis step, the input file is converted into a sequence file. Since Mahout runs on top of Hadoop, it provides Hadoop classes that convert input files into sequence files automatically. This sequence file is in turn converted into a vector file, which is taken as the argument for running the FKM algorithm; that conversion is explained in Step 2. The SequenceFile class provided by Mahout converts every value into a (key, value) pair, where the key is the document name and the value is a number obtained from the input file. The arguments passed to this class are:
1) input - path that specifies the directory that contains all the input files
2) output - path that specifies the directory where the sequence file is created
3) method - method that is used to execute this step, which can be sequential or
mapreduce
All the remaining parameters are optional and this step can be executed with
the default values provided by the Java class.
Step 2:
The next step of the analysis process is to generate vectors from the sequence
file created in the previous step. The Seq2sparse class takes the sequence file as an
input and creates tokenized documents in the form of <docID, tokenizedDoc> pairs
and vectorized documents in the form of <docID, TF-IDF vector>. TF-IDF refers to
Term Frequency and Inverse Document Frequency, which is used to identify the
importance and frequency of each value in the input file (or each word, in the case of text documents). TF gives the number of times each value occurs in a document, and IDF weighs how frequently the same value occurs across the other documents, in order to identify whether it is a common or a rare occurrence in the input files. This factor is not
applicable for the given input file, since it is a single file. In addition to the tokenized documents, the seq2sparse class also creates sequence files for:
1) dictionary - index for each value followed by the value <valueIndex,
value>
2) word frequency count - index for each value followed by its count
<valueIndex, count>, and
3) document frequency count - index for each value followed by DF count
<valueIndex, DFCount>.
The DFCount is not applicable for the given input file since it is a single file.
This step takes several input parameters other than the usual input and output
arguments. For the 'weight' argument, the value can be specified as TF instead of TF-
IDF, as IDF is not applicable for a single input file. The named vector argument can
be set to true, so that the output vectors that are created will be named.
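For this single-file input with TF weighting, the term frequency step amounts to counting how often each value occurs; a generic sketch (not Mahout's seq2sparse internals), using a few illustrative tokens:

```python
from collections import Counter

# Term frequency: occurrences of each token within the one input document.
tokens = "0.006 0.019 0.006 0.006 9.039".split()
tf = Counter(tokens)
print(tf["0.006"], tf["9.039"])
```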
Step 3:
The third analysis step is to run the FKM algorithm using the TF-IDF vectors
created in the previous step, along with a few key arguments, such as:
1) input - path to the directory that contains the vector files created in the
previous steps
2) output - path to the directory where output files are created
3) clusters - path to a new directory where the output clusters are created.
4) distance measure - classname of the distance measure that is preferred
(Euclidean or Cosine)
5) x - maximum number of iterations that is preferred.
6) m - fuzziness argument that specifies the coefficient of normalization. If specified as 1, the algorithm behaves like plain k-Means.
7) k - number of clusters that need to be created.
Upon successful completion of this step, the given number of clusters, k, is created under the specified clusters directory, with a separate folder for each cluster, named cluster-0, cluster-1, and so on up to cluster-N. Each cluster folder contains a sequence file that has an identifier followed by the values present in that soft cluster. The specified output folder also contains another sequence file that holds all the clustered points in (key, value) format, where each key is a clusterId.
Step 4:
The final step uses the clusterdump tool, which converts the sequence files of each cluster into a readable format, listing the centroid and all the values present in each cluster. For this step, the input path is set to the final cluster folder; if an output path is specified, the clusters are written to that file, otherwise the output is printed on the shell. The dictionary file created in the second step can optionally be provided as an argument. Also, 'dictionaryType' should be specified as 'sequencefile', since the input file used for FKM is not a text file.
The input file passed to the first step of the FKM run is the list of 41 combined values, one per IP address (shown below), which are placed linearly, without commas, in a text file:
[15.838, 9.039, 110.156, 131.841, 62.176, 9.293, 9.105, 9.458, 20.396, 118.545,
24.012, 52.129, 11.556, 109.943, 54.616, 17.04, 17.07, 20.053, 20.059, 4.26, 112.616,
64.755, 61.63, 19.252, 22.06, 15.039, 9.009, 20.15, 5.483, 10.48, 15.02, 52.006,
16.006, 11.106, 0.06, 0.069, 0.032, 0.006, 0.019, 0.006, 0.006]
Using the FKM algorithm, three clusters C1, C2, and C3 are created, with the three centroids highlighted with larger markers in Figure 4.6. The points grouped in cluster C1 are non-attackers, and the remaining points grouped in clusters C2 and C3 are all attackers. In Figure 4.6, clusters C1 and C2 share the two points 4.26 and 5.483: because the clusters are soft, these two points resemble the points in both clusters. Although these two points are grouped with cluster C1, which contains non-attacker IP values, they are attackers and should be placed in cluster C2. This is both an advantage and a disadvantage of the FKM algorithm. The disadvantage is that the points are placed in cluster C1 in spite of being attackers. However, FKM is still advantageous because these two points also appear in cluster C2, flagging them for attention. Cluster C3 is effectively a hard cluster, as it shares no values with any other cluster. Also, if any common points were shared between clusters C2 and C3, the sharing would be acceptable, since both clusters contain attackers. When the same input file is run through the k-Means algorithm, it produces disjoint clusters C1 and C2, with these two points placed only in cluster C1 rather than cluster C2. In a real scenario, these two points would have gone unnoticed because they are grouped with non-attackers, which is dangerous, as they are false negatives.
Cluster C1 contains 9 points with 1 as the centroid:
[0.06, 0.069, 0.032, 0.006, 0.019, 0.006, 0.006, 4.26, 5.483]
Cluster C2 contains 23 points with a centroid of 15, sharing the points 4.26 and 5.483 with cluster C1:
[15.838, 9.039, 9.293, 9.105, 9.458, 20.396, 24.012, 11.556, 17.04, 17.07, 20.053,
20.059, 19.252, 22.06, 15.039, 9.009, 20.15, 10.48, 15.02, 16.006, 11.106, 4.26,
5.483]
Cluster C3 contains the remaining 11 points with a centroid of 84:
[110.156, 131.841, 62.176, 118.545, 52.129, 109.943, 54.616, 112.616, 64.755, 61.63,
52.006]
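Given the reported centroids of roughly 1, 15, and 84, the hard (nearest-centroid) assignment of the 41 input values can be checked directly. This is a plain nearest-centroid pass, not a rerun of Mahout's FKM: it yields groups of 9, 21, and 11 points, because the two shared points 4.26 and 5.483 sit nearer centroid 1 and are counted only once here, whereas the soft clustering above also lists them in C2.

```python
values = [15.838, 9.039, 110.156, 131.841, 62.176, 9.293, 9.105, 9.458, 20.396,
          118.545, 24.012, 52.129, 11.556, 109.943, 54.616, 17.04, 17.07, 20.053,
          20.059, 4.26, 112.616, 64.755, 61.63, 19.252, 22.06, 15.039, 9.009,
          20.15, 5.483, 10.48, 15.02, 52.006, 16.006, 11.106, 0.06, 0.069,
          0.032, 0.006, 0.019, 0.006, 0.006]
centroids = {"C1": 1.0, "C2": 15.0, "C3": 84.0}

# Assign every value to its nearest centroid (hard assignment).
clusters = {name: [] for name in centroids}
for v in values:
    nearest = min(centroids, key=lambda name: abs(v - centroids[name]))
    clusters[nearest].append(v)

print({name: len(points) for name, points in clusters.items()})
```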
Figure 4.6: The three clusters formed using the FKM algorithm
Comparison and Analysis of LR and FKM Algorithms
This research has experimented with two algorithms that are capable of segregating attack-related traffic from regular network traffic. Logistic Regression is a classification algorithm and an example of a supervised learning technique, whereas Fuzzy k-Means is a clustering algorithm and an example of an unsupervised learning technique, requiring no labeled data or training phase. Each algorithm has its strengths, but Logistic Regression took 250 to 300 ms for the combined training and execution phases, whereas the FKM algorithm took around 700 to 900 ms for each of its stages, so the combined effort for all four stages approaches 4,000 ms. Although finalizing the LR training model may take many passes and considerable time, once the model is finalized, Logistic Regression classifies attackers and non-attackers in less time than the FKM algorithm.
Besides taking less time for execution, Logistic Regression gives about 90%
accuracy when all the training scenarios are taken into consideration. The finalized
model mentioned in the previous section classified all 41 samples correctly as 34 attackers and 7 non-attackers, a 100% identification rate. The FKM algorithm accurately grouped 39 of the 41 samples, with 2 samples being shared between an attacker cluster and a non-attacker cluster. Since the FKM algorithm creates clusters of closely related points, a few attacker IP addresses whose counts are similar to those of non-attacker IP addresses are grouped together with them. As a result, 2 attackers are incorrectly grouped with non-attackers, and the FKM algorithm thus exhibited 95% accuracy in creating clusters of attackers and non-attackers on the given datasets. In this research, only five datasets (1 non-attacker + 4 attacker datasets) are considered, comprising about 220k rows of data, in which only 41 unique IP addresses are found. If tens of thousands or millions of IP addresses were considered for classification and clustering, these performance measures and efficiency rates would vary.
Nevertheless, both algorithms are adaptive in that more parameters or contributing factors can be added, during the training phase of the LR algorithm or while preparing the input in the first step of the FKM algorithm, for separating attack-related traffic from regular network traffic.
CHAPTER VI
CONCLUSION AND FUTURE WORK
Conclusion
This research has explored DDoS attack datasets and developed a unique
column-based design approach using Hadoop and HBase for efficiently integrating,
storing, and retrieving the datasets. Use case requirements related to these datasets have been identified, and statistical data for each unique IP address has been obtained from the integrated datasets according to those use cases. This statistical data was used for running machine learning algorithms: the Logistic Regression (LR) classification algorithm and the Fuzzy k-Means (FKM) clustering algorithm. The LR algorithm accurately classified attackers and non-attackers from the datasets, and the FKM algorithm successfully created clusters of attackers and non-attackers.
Various attacks, worms, and viruses related to cybersecurity datasets have been
explored and DDoS attack datasets were finalized after the initial research. Major
schema design alternatives in HBase were tested to develop an efficient HBase
rowkey design scheme for storing the integrated datasets. These datasets contained purely attack-related traffic; hence, regular network traffic from a TTU desktop machine connected to the TTU network was captured using Wireshark and
merged with attack-related traffic. The statistical data obtained from the integrated
datasets contain a mixture of attack and non-attack traffic, which was successfully
classified and clustered by LR and FKM algorithms as attackers and non-attackers.
Future Work
In this research, four DDoS attack datasets are integrated, utilizing all of the columns
that are common to the datasets, such as Time, Source, Destination, Length, Protocol,
and Info. Future work is needed to explore the utilization of the additional columns
that are unique to each dataset, in addition to the common fields. The common fields
would, in effect, serve as join columns for the purpose of exploring attack patterns
related to the additional columns of each dataset. This research has also integrated the
existing datasets in a batch mode. In the future, this work should be extended to live,
streaming datasets that can be integrated and analyzed on a real-time basis. Another long-term goal would be to automate the dataset integration process and automatically feed the statistical data to the machine learning and data mining algorithms, creating a complete end-to-end process for identifying attack-related traffic in network datasets.
BIBLIOGRAPHY
Big Data: A Workshop Report. Washington, D.C.: National Academies Press, 2012.
Bezdek, James C. Pattern recognition with fuzzy objective function algorithms.
Kluwer Academic Publishers, 1981.
Big Data Working Group. "Big Data Analytics for Security Intelligence." Cloud
Security Alliance, 2013.
Cardenas, Alvaro, Pratyusa Manadhata and Sreeranga Rajan. "Big Data Analytics for
Security." IEEE Security and Privacy 2013. Document.
Chandola, Varun, et al. Data Warehousing and Data Mining Techniques for Computer
Security. Springer, 2006.
Chapple, Michael J., Nitesh Chawla, and Aaron Striegel. "Authentication anomaly
detection: A case study on a virtual private network." Proceedings of the 3rd
annual ACM workshop on Mining network data. ACM, 2007.
Chien, Eric. CodeRed Worm - Symantec Enterprise. 13 February 2007. Report.
Conficker, Worm. Microsoft Safety and Security Center - Protection from Conficker
Worm. November 2008. Report.
Danyliw, Roman and Allen Householder. CERT - Code Red Worm Exploiting Buffer
Overflow In IIS Indexing Service DLL. 19 July 2001. Report. 17 January 2002.
Dark Reading. Dark Reading: Security Monitoring. 9 March 2012. Case Study.
Frederick, Karen Kent. Abnormal IP Packets: Symantec Connect. 12 October 2000.
Article. 3 November 2010.
Gao, Neng, Deng-Guo Feng, and Ji Xiang. "A data-mining based DoS detection
technique." Jisuanji Xuebao(Chinese Journal of Computers) 29.6, 2006.
García, Enrique, et al. "Drawbacks and solutions of applying association rule mining
in learning management systems." Proceedings of the International Workshop
on Applying Data Mining in e-Learning (ADML), Crete, Greece. 2007.
Gartner. Security Announcements - Retrieved from Gartner. February 2014.
Document.
George, Lars. "HBase Schema Design and Cluster Sizing Notes." ApacheCon Europe.
Sinsheim, Germany, 2012.
Gilani, Zafar, and Salman Ul Haq. "Analyzing Large Datasets with Hive." July 2013.
IBM Developer Works.
Giura, Paul, and Wei Wang. "Using large scale distributed computing to unveil
advanced persistent threats." SCIENCE 1.3, 2013.
Gu, Qijun, and Peng Liu. "Denial of service attacks." Department of Computer
Science Texas State University–San Marcos School of Information Sciences
and Technology Pennsylvania State University Denial of Service Attacks
Outline, 2007.
Hadoop, Apache. Apache Hadoop - Apache Software Foundation. 2005.
Harper, Jelani. "Enterprise Threats: Big Data and Cyber Security." 11 June 2013.
Dataversity Education.
HBase, Apache. Apache HBase - Apache Software Foundation. 2006.
Hive, Apache. Apache Hive - Apache Software Foundation. 2009.
ICANN, Community. Root Server Attack: ICANN Factsheet. California, 2007.
Report.
Juturu, Sindhuri, Noah Metzger and Lakhan Jhawar. Design of Efficient Schema for
Performing Read/Write Operations on Big Datasets Using HBase. Research
Paper. Texas, 2014.
Khurana, Amandeep. "Introduction to HBase Schema Design.", USENIX, Vol. 37. No.
5, Oct. 2012. Print.
Komarek, Paul. "Logistic regression for data mining and high-dimensional
classification." Robotics Institute, 2004.
Kohavi, Ron, and Foster Provost. "Glossary of terms." Machine Learning 30.2-3,
1998.
Labrinidis, Alexandros, and H. V. Jagadish. "Challenges and opportunities with big
data." Proceedings of the VLDB Endowment 5.12, 2012.
Lee, Yeonhee, and Youngseok Lee. "Detecting DDoS attacks with Hadoop." Proceedings of the ACM CoNEXT Student Workshop. ACM, 2011.
Levine, John, et al. "The use of honeynets to detect exploited systems across large
enterprise networks." Information Assurance Workshop, 2003. IEEE Systems,
Man and Cybernetics Society. IEEE, 2003.
MacQueen, James. "Some methods for classification and analysis of multivariate
observations." Proceedings of the fifth Berkeley symposium on mathematical
statistics and probability. Vol. 1. No. 14. 1967.
Mahout, Apache. Apache Mahout - Apache Software Foundation. 2009.
Mahoney, Matthew V., and Philip K. Chan. "An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection." Recent Advances in Intrusion Detection. Springer Berlin Heidelberg, 2003.
Mohri, Mehryar, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of
machine learning. MIT press, 2012.
Nielsen, Fiona. "Neural Networks algorithms and applications." Niels Brock Business
College, 2001.
Owen, Sean, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action. Manning, 2011.
Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." The Journal of
Machine Learning Research 12, 2011.
Ponemon, Institute. "Big Data Analytics in Cyber Defense." Ponemon Institute
Research Report. 2013.
Russell, Stuart, and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, 1995.
Sangster, Benjamin, et al. "Toward Instrumenting Network Warfare Competitions to
Generate Labeled Datasets." CSET. 2009.
Schneier, Bruce. "The Witty Worm: A New Chapter in Malware." 2 June 2004. Computerworld: Malware and Vulnerabilities.
Shannon, Colleen and David Moore. The Spread of the Witty Worm. March 2004.
Document.
Sherman, Michael. Old Data - New Databases. 14 August 2014. Texas Enterprise.
Stearns, Bryan, Susan Urban and Sindhuri Juturu. Integrating Cybersecurity Log Data
Analysis in Hadoop. Research Paper. Texas, 2014.
The CAIDA UCSD "DDoS Attack 2007" Dataset,
http://www.caida.org/data/passive/ddos-20070804_dataset.xml
Virvilis, Nikos, Oscar Serrano, and Luc Dandurand. "Big Data Analytics for Sophisticated Attack Detection." 2013.
Yen, Ting-Fang, et al. "Beehive: Large-scale log analysis for detecting suspicious
activity in enterprise networks." Proceedings of the 29th Annual Computer
Security Applications Conference. ACM, 2013.
Zhang, Jing, et al. "Safeguarding academic accounts and resources with the university
credential abuse auditing system." Dependable Systems and Networks (DSN),
2012 42nd Annual IEEE/IFIP International Conference on. IEEE, 2012.
Zhao, Teng, et al. "Problem Solving Hands-on Labware for Teaching Big Data
Cybersecurity Analysis." Proceedings of the World Congress on Engineering
and Computer Science. Vol. 1. 2014.
Zhong, Rui, and Guangxue Yue. "DDoS detection system based on data
mining." Proceedings of the Second International Symposium on Networking
and Network Security, Jinggangshan, China. 2010.