

Information Sciences 278 (2014) 488–497


Big Data Analytics framework for Peer-to-Peer Botnet detection using Random Forests

http://dx.doi.org/10.1016/j.ins.2014.03.066
0020-0255/© 2014 Elsevier Inc. All rights reserved.

* Corresponding author.
E-mail addresses: [email protected] (K. Singh), [email protected] (S.C. Guntuku), [email protected] (A. Thakur), [email protected] (C. Hota).
1 Authors 1 and 2 contributed equally.

Kamaldeep Singh a,1, Sharath Chandra Guntuku b,*,1, Abhishek Thakur a, Chittaranjan Hota a

a Department of Computer Science, BITS – Pilani, Hyderabad Campus, AP 500078, India
b School of Computer Engineering, Nanyang Technological University, 50 Nanyang Drive, Singapore 639798, Singapore

Article info

Article history:
Received 8 August 2013
Received in revised form 9 February 2014
Accepted 11 March 2014
Available online 25 March 2014

Keywords:
Hadoop
Mahout
Peer-to-Peer
Botnet detection
Machine learning
Network security

Abstract

Network traffic monitoring and analysis-related research has struggled to scale to massive amounts of data in real time. Some of the vertical scaling solutions provide good implementations of signature-based detection. Unfortunately, these approaches treat network flows across different subnets in isolation and cannot apply anomaly-based classification when attacks originate from multiple machines at a lower rate, as is the case with Peer-to-Peer Botnets.

In this paper the authors build upon the progress of open source tools like Hadoop, Hive and Mahout to provide a scalable implementation of a quasi-real-time intrusion detection system. The implementation is used to detect Peer-to-Peer Botnet attacks using a machine learning approach. The contributions of this paper are as follows: (1) building a distributed framework using Hive for sniffing and processing network traces, enabling extraction of dynamic network features; (2) using the parallel processing power of Mahout to build a Random Forest based Decision Tree model which is applied to the problem of Peer-to-Peer Botnet detection in quasi-real-time. The implementation setup and performance metrics are presented as initial observations, and future extensions are proposed.

© 2014 Elsevier Inc. All rights reserved.

1. Introduction

Botnet attacks are one of the biggest challenges that security researchers and analysts face today on an international scale. The economic losses caused by breaches of computer networks run into several billions of dollars. Just a few months prior to this writing, a massive DDoS attack targeted the admin accounts of numerous WordPress users. Online security communities and blogs suspect it to be a Peer-to-Peer (P2P) based Botnet [43]. It was revealed that at least 90,000 unique IPs were utilized to carry out the attack [40]. It is suspected that this attack was part of a bigger plan, where the attacker wanted to compromise numerous WordPress servers and then use the army of bots to launch an even larger DDoS attack. Hence, even though P2P Botnets have been on the rise for a decade now, detecting and mitigating their attacks is still a challenge.

For detecting and mitigating such attacks, network traces and packet captures are amongst the most valuable resources for network analysts and security researchers. With such attacks increasing every day, the magnitude of network traces handled is expanding exponentially. However, individual computer systems lack the hardware and are fundamentally limited by device fabrication restrictions when accommodating the gigantic size of these datasets. This has led to an upsurge of interest in distributed algorithms that take advantage of multi-core architectures and distributed computing.

In the past, researchers have used various techniques like Signature- and Anomaly-based Intrusion Detection Systems (IDS) to deal with the issue of network security threat detection. But these solutions have scalability issues while dealing with the large datasets discussed above. Though kernel scaling methods have been proposed [28], there were issues with datasets having high variance. It was found that when a model suffers from high variance, the larger the dataset, the better the accuracy of the model [31]. In a high variance scenario, which happens when the data is over-fit, the training error will be low and the cross-validation error will be much higher than the training error. The intuition behind this is that training is over-fitting the data: the model memorizes the data but then generalizes poorly to new observations, leading to a much higher cross-validation error. This can be corrected by increasing the training data set size, which lowers the cross-validation error in scenarios where variance is high. Such datasets therefore demand the use of Big Data technologies. Several large datasets containing the malicious activity of various bots have been captured and released by CAIDA and other organisations. One of these traces, a 40 GB capture by UCSD [48], was used for this research, and this necessitated a scalable framework to train the classification module.
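As a rough, non-authoritative illustration of the high-variance diagnosis described above (not part of the original study), the following sketch plots training and cross-validation error against training-set size using scikit-learn; a large gap that shrinks as more data is added is the pattern that motivates training on larger datasets.

```python
# Illustrative sketch only: diagnosing high variance via learning curves.
# Assumes a generic feature matrix X and binary labels y; not the paper's dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

def variance_diagnosis(X, y):
    sizes, train_scores, cv_scores = learning_curve(
        RandomForestClassifier(n_estimators=50, random_state=0),
        X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
    for n, tr, cv in zip(sizes, train_scores.mean(axis=1), cv_scores.mean(axis=1)):
        # High variance: training error (1 - tr) stays low while CV error (1 - cv)
        # is much higher; the gap narrows as the training size n grows.
        print(f"n={int(n):6d}  train_error={1 - tr:.3f}  cv_error={1 - cv:.3f}")
```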

Hence, in this research a scalable and distributed intrusion detection system is proposed which can handle heavy network bandwidths. This framework is built on top of Hadoop, an open-source software framework that supports data-intensive distributed applications and leverages libraries designed to use the power of clusters of commodity machines. The framework utilizes Apache Mahout, which provides machine learning algorithms for building predictive models.

The rest of the paper is organized as follows. Section 2 describes the related work in the realm of P2P Botnet detection based on machine learning algorithms and the application of Hadoop in this area. Section 3 describes the experimental setup and the methodology used by the framework to achieve security threat detection in real time. Section 4 deals with the specific application of Peer-to-Peer Botnet detection using this framework, and Section 5 concludes with results and scope for future work.

2. Related work

Over the last few years, several researchers have proposed machine learning based solutions for mitigating security threats. Research has systematically drifted from signature-based methods to more semantic-based methods, like developing ontological models to handle web application attacks [36] and Hidden Markov Models for spam detection [37]. In [5] researchers have proposed security aware agent based systems which can manage their own security at runtime. There has also been research on using large network traces for mitigating security threats [24], where the authors proposed a Hadoop based DDoS attack detection method. They have developed a novel traffic monitoring system that performs NetFlow analysis on multi-terabytes of Internet traffic in a scalable manner [23]. They have devised a MapReduce algorithm with a new input format capable of manipulating libpcap files in parallel. But their approach hardcodes the features that can be extracted from the libpcap files, and thereby the user is not given the flexibility to decide on the feature set based on the problem instance. To the current knowledge of the authors, the area of network security analytics severely lacks prior research addressing the issue of Big Data.

On the contrary, there has been significant research in the area of security threat detection using machine learning techniques. Authors in [52] differentiate network flow records based on certain features related to traffic volume and categorize them as malicious or benign. They also present how the plotters could change their behavior to evade their detection technique, which was observed in Nugache, a bot known to randomly change its behavior. Authors in [17] show that P2P Botnet detection can be achieved with high accuracy based on a novel Bayesian Regularized Neural Network.

In [9] the authors provide an overview of adversarial attacks against IDSs and highlight the most promising research directions for the design of adversary-aware, harder to defeat IDS solutions. Researchers [42] have also worked on building anomaly detection systems which are unsupervised in nature. A Markovian IDS has been developed to shield wireless sensor networks based on mining attack patterns [20].

Researchers [39] report in their work on detecting Sinit and Nugache that these bots communicate on the same port. They also indicate that a large number of destination unreachable error messages and connection reset error messages were observed. But bots which encrypt their payloads will prevent the authors' technique from working, as their method depends heavily on payload signatures. Researchers [11] explain the challenges and features encountered when dealing with the Nugache P2P Botnet. Authors in [45] conclude that it is impossible for a static Intrusion Detection System (IDS) to detect Nugache traffic.

Researchers [38] worked on understanding the role of reputation and trust in the P2P domain. There have also been works [21] trying to classify applications based on the traffic generated in order to characterize them, and work [7] on developing privacy preserving packet anonymization frameworks for transforming packets to suit research purposes. The authors' analysis of Storm in [45] concludes that the bot can be detected by configuring an IDS to find the configuration file used by the bot. But it is difficult to distinguish between legitimate P2P communication and the Storm bot's communication, as the behavior of Storm can be camouflaged to look like legitimate P2P communication. To address this issue, authors in [19] develop ways to mitigate the Storm worm and introduce an active measurement technique to enumerate the number of infected hosts by reverse engineering the bot's binary executable, in order to identify the function which generates the key that is used for searching for other infected machines and bots.


Authors in [44] report an in-depth analysis of Peacomm and how it spams a large number of emails to many accounts, holding an executable attachment. To mitigate this, authors in [25] proposed adaptive mechanisms to detect a variety of P2P Botnets. But their solutions can be applied only in the attacking stage of the Botnet. Their study was successful on Trojan Peacomm, but the implications are not clear when dealing with other categories/types of Botnets.

Authors in [15] retrieve the hashes of malware and use them for locating a zombie node's activities in a P2P network. They argue that if a peer searches for the hash of a malware, it must be a zombie. Authors in [29] devise techniques to localize Botnet members based on the unique communication patterns arising from the structured overlay topologies used for command and control. They examine whether ISPs can detect the efficient communication structures of P2P bots. Their algorithm isolates these based on information about which pairs of nodes communicate with one another.

Authors in [46] use P2P flow identification techniques to monitor and filter traffic flows, isolating hosts when they connect to the Botnet. They use a Bayes classifier and a Neural Network classifier to detect the IPs of the infected systems. Authors in [39] present an anomaly detection approach based on the analysis of non-stationary properties and "hidden" recurrence patterns occurring in aggregated IP traffic flows.

Researchers in [16] present a framework named BotMiner, which detects both centralized IRC and P2P Botnets using an anomaly-based detection system. The assumption in this regard is that bots are coordinated malware that exhibit similar communication patterns and behaviors. BotMiner targets a group of compromised systems belonging to a monitored network, whereas it fails to detect a single system which might be part of a Botnet that is not in the monitored network's zone.

Researchers [32] detect Botnet activity by detecting similar flows occurring between a group of hosts in the network on a regular basis. Flows with similar behaviors are labeled into groups, and a transition model of the grouped flows is constructed using a probability matrix. The authors compute a 'likelihood ratio' and use that ratio for the detection of bots.

A self-organizing map algorithm was applied by authors in [22] to detect P2P Botnets, in which they assume that there would be numerous failed connection attempts from the exterior to the interior at the firewall. And in [13] the authors have presented a model based on a Discriminative Restricted Boltzmann Machine to combine the expressive power of generative models with good classification accuracy, in order to infer knowledge from incomplete training data.

In most of the previous work, researchers have focused on detecting a specific Botnet activity, and their methods were not reported to be successful in detecting bots whose traffic characteristics were not used in the training set. Clearly, a machine learning based approach is far superior in detecting malicious traffic when compared to a traditional signature based approach, as the bot masters redesign the bots from time to time and the functionality and behavior of the Botnet varies quite significantly with each version release of the bot, while signature based detectors rely heavily upon existing Botnet signatures to detect any activity. Especially when handling zero-day attacks, the signature based approach fails completely as there is no account of previous activity for that bot. Therefore, a machine learning approach is preferred to detect seemingly suspicious activity based on the anomalous behavior of the network. In previous work, there has not been much research on deploying the detection module in a real-time scenario to monitor and mitigate Botnet activity in a network. In this work, this aspect of handling large-scale network traffic in a very short time is addressed and a solution is proposed to deal with Botnet detection in quasi-real time under heavy bandwidths of data traffic.

3. Scalable framework for P2P Botnet detection

The technologies underlying this framework are Libpcap, Hadoop, MapReduce and Mahout. Tshark is used to extract the required fields out of the packets; these are required for the generation of the feature set for the Machine Learning Module. Tshark is a network protocol analyzer that uses the Libpcap library and captures packet data from a live network. It also enables printing a decoded and customizable form of the captured packets to the standard output or to a file. After the extraction of the desired information by the Sniffer Module, the MapReduce paradigm is used for feature extraction. This is accomplished using Apache Hive, which provides a mechanism to query the data using a SQL-like language called HiveQL.

In this section the components of the proposed scalable network threat detection framework are described in detail. The framework consists of the following components:

a. Traffic Sniffer Module for pre-processing packets,
b. Feature Extraction Module for generating the feature set, and
c. Machine Learning Module for learning and detecting the malicious traffic.

These components are described in the following three subsections, highlighting the challenges faced and solutions proposed in their implementation.

3.1. Traffic Sniffer

Dumpcap [12] is used for sniffing the packets from the network interface, while Tshark [50] is used for extracting the fields related to the feature set out of the packets and submitting them to the Hadoop Distributed File System (HDFS). Even though Tshark could have been used for sniffing the packets from the interface, dumpcap provides better performance when dealing with long term captures [30], since it is the lowest level abstraction over libpcap, whereas the limited buffers of Tshark present more overhead and consume more time. Thus dumpcap's capture ring buffer option is used to capture traffic onto successive pcaps, which are then handled by multiple Tshark instances, thereby achieving a greater degree of parallelism.
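As an illustrative sketch of this capture pipeline (not the authors' actual Perl tooling), the following Python wrapper starts a dumpcap ring-buffer capture and runs one Tshark field-extraction process per completed pcap; the interface name, field list, paths and the 50 MB rollover size are assumptions for the example.

```python
# Sketch of the sniffer stage: dumpcap ring buffer + parallel Tshark field extraction.
# Paths, interface and field list are illustrative assumptions, not the paper's exact setup.
import glob
import subprocess

FIELDS = ["ip.src", "tcp.srcport", "ip.dst", "tcp.dstport", "ip.proto",
          "frame.len", "frame.time_epoch", "tcp.flags.push", "tcp.flags.urg"]

def start_capture(interface="eth0", prefix="/data/capture.pcap", filesize_kb=51200):
    # -b filesize:N rolls over to a new pcap every N kilobytes (~50 MB here).
    return subprocess.Popen(["dumpcap", "-i", interface,
                             "-b", f"filesize:{filesize_kb}", "-w", prefix])

def extract_fields(pcap_path, out_path):
    # Print one '|'-delimited record per packet, suitable for loading into Hive.
    cmd = ["tshark", "-r", pcap_path, "-T", "fields", "-E", "separator=|"]
    for field in FIELDS:
        cmd += ["-e", field]
    with open(out_path, "w") as out:
        return subprocess.Popen(cmd, stdout=out)

def process_completed_pcaps(pattern="/data/capture*.pcap"):
    # One Tshark instance per rolled-over pcap, running in parallel.
    procs = [extract_fields(p, p + ".psv") for p in sorted(glob.glob(pattern))]
    for proc in procs:
        proc.wait()
```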

The Traffic Sniffer Module saves the traffic from the wire into successive pcap files of a specified size using the capture ring buffer option. While sniffing, some packets are dropped and not written to the file system. In order to achieve lower packet losses, a few experiments were performed varying the buffer sizes.

The Ubuntu 12.04 (x64) systems in the cluster had a limitation on the TCP/IP buffer size, which prevented dumpcap from writing all the packets to a pcap file when the incoming traffic significantly mismatches the buffer size [34]. To verify this, the TCP/IP buffer sizes were increased from 4 MB to 12 MB and observations were made at different bandwidths, which show a lower packet drop percentage upon increasing these buffer sizes.

During these experiments, a trend was also observed between the packets dropped and the chosen pcap file size. This is due to the rate at which the packets are written to the hard disk and the performance of the hard disk; it depends on the configuration of the hard disk – the sector size, file buffer size, etc.

The packets pass through a number of buffers (queues) while they get saved [41]. The first buffer is the memory on the NIC itself, which cannot be altered. The first in-RAM queue is the RX ring buffer in the NIC driver. In most of the drivers one NIC queue can be assigned to one CPU [4]. In the case of non-NAPI drivers, the next queue is the backlog in the network stack. This is a per-CPU buffer storing the packets before they get processed by the kernel. Its size can be specified by the kernel parameter net.core.netdev_max_backlog. Further buffers may be present on the processing path, depending on the upper layer protocols. If any of these buffers becomes full, packets will be lost. Thus the packet drop rate is a function of the bandwidth chosen. The results of these experiments are summarized in Table 1. The variation of packet drop with bandwidth is shown in Fig. 1.
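For concreteness, a minimal sketch of how such buffer limits can be inspected and raised on a Linux capture host is shown below; the specific sysctl values (12 MB socket buffers, a larger backlog) mirror the experiment above but are assumptions, not the authors' published configuration.

```python
# Sketch: raise kernel receive buffers before a long capture (requires root).
# Values are illustrative; the paper reports testing 4 MB and 12 MB buffers.
SYSCTLS = {
    "net.core.rmem_max": str(12 * 1024 * 1024),      # max socket receive buffer (bytes)
    "net.core.rmem_default": str(12 * 1024 * 1024),  # default socket receive buffer
    "net.core.netdev_max_backlog": "5000",           # per-CPU backlog queue length
}

def apply_sysctls(sysctls=SYSCTLS):
    for key, value in sysctls.items():
        path = "/proc/sys/" + key.replace(".", "/")
        with open(path) as f:
            print(f"{key}: {f.read().strip()} -> {value}")
        with open(path, "w") as f:   # needs root privileges
            f.write(value)
```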

Automation of the entire traffic sniffing process is achieved via Feature Extraction Perl scripts which run as daemon jobs and periodically spawn multiple instances of Tshark, each extracting the fields necessary for feature set generation from the resultant pcap files. Here two processing time delays are introduced, one by the generation of pcaps at regular intervals by dumpcap and the other by the processing time of the Perl script on each pcap. With the chosen file size of 50 MB, the first delay is seen to be 0.5 s per file, cumulating to 10 s for processing 1 GB of data completely, and the second delay is seen to be 41 s per session. Thus features from a 1 GB pcap file are extracted in 51 s, which is dramatically less than other open source tools like Netmate [3], which took 25 min for the same task. The details of the feature extraction are given in Section 3.2.

The delimited files from each of these Tshark instances, which run in parallel, are all submitted to the HDFS upon completion. They are then loaded into a table via the "LOAD DATA" command in HQL.

The test-bed for collecting the malicious traffic traces for this research work consists of an isolated network of Linux systems. The systems were connected to an access switch in order to form the isolated network. On top of each of these physical machines, the authors ran virtual machines with Windows XP as the operating system. On these virtual machines, samples of Kelihos–Hlux, Zeus and Waledac were deployed. This setup is shown in Fig. 2.

The samples of these malware were obtained from [8,33]. The network activity of these malware samples was monitored and captured by Wireshark for 48 h each. After collecting packet traces of the network activity of each of the malware, right from the infection stage to the attack stage, the pcaps were stored in the Hadoop cluster for further analysis. As executable files were not available for all bots, some already captured packet traces from online communities (CAIDA, ContagioDump, etc.) were used. For the bots whose source code was available, the bots were compiled to generate executables which were run on the test bed network; for others, OpenMalware provided executables which were likewise run on the network. The network captures obtained from other sources were individual network sessions, but for the bots which were run on the test bed network, captures of both individual bot activity and multiple bots acting together were taken, to make the framework more robust in detecting any kind of bot activity in the network. It was ensured that the difference in network characteristics between the ground truth and the training set was based exclusively on the presence of Botnet activity for all the exercises.

An optimal file size of 50 MB was chosen for the pcaps, and all the results given in this paper are benchmarked on a 1 GB pcap played out on the interface using TCP Replay [47], where the 20 resultant pcap files of 50 MB each were taken into consideration. The malicious dataset contained 55,824 flows. To achieve proper classification, benign traffic consisting of traffic from P2P applications, FTP transfers, telnet sessions, video streaming and mobile updates was also collected by the authors. The entire dataset was then merged and given to the machine learning algorithm for model generation.

Table 1
Difference in packets dropped after varying the buffer size.

Bandwidth (Mbps)    Packets dropped, 4 MB buffer (%)    Packets dropped, 12 MB buffer (%)
220                 2.70                                0.85
130                 2.0                                 0.61
40                  1                                   0.01

Fig. 1. Surface depicting the variation of packet drop with bandwidth and file size.

Fig. 2. Test bed for capturing malicious network activity.


Thorough sampling was also done with the datasets collected from CAIDA, as mentioned in the introduction, to ensure that the Botnet detection module works not just on the datasets collected from the infected test-bed network but also on standard Botnet datasets.

3.2. Feature Extraction Module

Once the delimited files are submitted to HDFS, Apache Hive [1] is used to extract the features out of them. One of the important features of this framework is the ability to change the feature set at runtime, which is enabled by Apache Hive and Tshark. The Feature Extraction Perl Script, mentioned above, lets the user decide which fields to extract from each packet using Tshark and then creates the Hive table accordingly. This gives the user the flexibility to extract different features relevant to different problem instances and also avoids the painstaking job of changing the entire code, as would be required if the features were extracted using a MapReduce program, as in a previous study [23]. If a different feature set is chosen in this script, a different table is created automatically.
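A minimal Python sketch of this runtime-configurable step is shown below (the authors use Perl; the table and column names here are hypothetical): given the field list chosen for a problem instance, it emits the corresponding Hive DDL and the "LOAD DATA" statement for the delimited files already sitting in HDFS.

```python
# Sketch: generate HiveQL for a runtime-chosen field list (illustrative names only).
FIELD_TYPES = {"srcip": "STRING", "srcport": "INT", "dstip": "STRING",
               "dstport": "INT", "proto": "INT", "pktlen": "INT",
               "ts": "DOUBLE", "psh": "INT", "urg": "INT"}

def create_table_hql(table, fields):
    cols = ",\n  ".join(f"{f} {FIELD_TYPES[f]}" for f in fields)
    return (f"CREATE TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
            "ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'\n"
            "STORED AS TEXTFILE;")

def load_data_hql(table, hdfs_dir):
    return f"LOAD DATA INPATH '{hdfs_dir}' INTO TABLE {table};"

if __name__ == "__main__":
    chosen = ["srcip", "srcport", "dstip", "dstport", "proto", "pktlen", "ts", "psh", "urg"]
    print(create_table_hql("packets", chosen))
    print(load_data_hql("packets", "/user/ids/tshark_output"))
```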

The Apache Hive data warehouse software, which is built on top of Apache Hadoop [2], facilitates easy extract/transform/load (ETL) and the management of large datasets. It provides a language called HQL (Hive Query Language) that is very similar to SQL and can therefore be easily understood by anyone familiar with SQL [49]. These queries are translated into MapReduce programs by Hive and executed at runtime.

Table 2
Flow statistics that can be extracted from the software.

Feature          Description of the feature
Srcip            Source IP address (string)
Srcport          Source port number
Dstip            Destination IP address (string)
Dstport          Destination port number
Proto            Protocol (i.e. TCP = 6, UDP = 17)
total_fpackets   Total packets in forward direction
total_fvolume    Total bytes in forward direction
total_bpackets   Total packets in backward direction
total_bvolume    Total bytes in backward direction
min_fpktl        Size of smallest packet sent in forward direction (in bytes)
mean_fpktl       Mean size of packets sent in forward direction (in bytes)
max_fpktl        Size of largest packet sent in forward direction (in bytes)
std_fpktl        Standard deviation from mean of packets sent in forward direction (in bytes)
min_bpktl        Size of smallest packet sent in backward direction (in bytes)
mean_bpktl       Mean size of packets sent in backward direction (in bytes)
max_bpktl        Size of largest packet sent in backward direction (in bytes)
std_bpktl        Standard deviation from mean of packets sent in backward direction (in bytes)
min_fiat         Minimum amount of time between two packets sent in forward direction (in microseconds)
mean_fiat        Mean amount of time between two packets sent in forward direction (in microseconds)
max_fiat         Maximum amount of time between two packets sent in forward direction (in microseconds)
std_fiat         Standard deviation from mean amount of time between two packets sent in forward direction (in microseconds)
min_biat         Minimum amount of time between two packets sent in backward direction (in microseconds)
mean_biat        Mean amount of time between two packets sent in backward direction (in microseconds)
max_biat         Maximum amount of time between two packets sent in backward direction (in microseconds)
std_biat         Standard deviation from mean amount of time between two packets sent in backward direction (in microseconds)
duration         Duration of flow (in microseconds)
fpsh_cnt         Number of times the PSH flag was set in packets travelling in forward direction (0 for UDP)
bpsh_cnt         Number of times the PSH flag was set in packets travelling in backward direction (0 for UDP)
furg_cnt         Number of times the URG flag was set in packets travelling in forward direction (0 for UDP)
burg_cnt         Number of times the URG flag was set in packets travelling in backward direction (0 for UDP)
total_bhlen      Total bytes used for headers in backward direction
total_fhlen      Total bytes used for headers in forward direction


Since most of the features extracted for this problem, which are shown in Table 2, are flow based statistics such as the size of the largest packet in a flow, they are extracted from the table using the "group by" clause in HQL. The group by is based on the MapReduce algorithm. The map phase generates the key-value pairs which are passed on to the reduce phase. Here the reducer groups all the values based on the key passed to it. That is, the MapReduce framework operates exclusively on <key, value> pairs, where the input to the framework is a set of <key, value> pairs and the output produced by the job is also another set of <key, value> pairs.

$(\text{input})\ \langle k_1, v_1\rangle \rightarrow \text{map} \rightarrow \langle k_2, v_2\rangle \rightarrow \text{combine} \rightarrow \langle k_2, v_2\rangle \rightarrow \text{reduce} \rightarrow \langle k_3, v_3\rangle\ (\text{output})$    (1)

Since this problem is based on flows, the data is grouped by source IP, source port, destination IP and destination port to extract flows from the raw packet data. So the key here is a combination of source IP, source port, destination IP and destination port, and the value consists of the remaining fields which were part of the delimited files generated by Tshark. The mapper generates the key-value pairs, which are then passed to the reducer, which groups all the packets for a particular key (flow) and generates features such as the total bytes transferred or the average inter-arrival time in a flow.
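The grouping logic can be mimicked outside Hive with a small map/reduce-style sketch; the snippet below is an illustration (not the authors' HiveQL) that emits a flow key per packet and reduces each group to a handful of the Table 2 statistics, ignoring the forward/backward direction split for brevity.

```python
# Sketch: group packets into flows by (srcip, srcport, dstip, dstport) and
# aggregate per-flow statistics, mirroring the HQL "group by" described above.
from collections import defaultdict
from statistics import mean, pstdev

def map_packet(pkt):
    key = (pkt["srcip"], pkt["srcport"], pkt["dstip"], pkt["dstport"])
    return key, (pkt["ts"], pkt["len"])

def reduce_flow(records):
    records.sort()                       # order packets by timestamp
    times = [t for t, _ in records]
    sizes = [s for _, s in records]
    iats = [b - a for a, b in zip(times, times[1:])]
    return {
        "total_fpackets": len(sizes),
        "total_fvolume": sum(sizes),
        "max_fpktl": max(sizes),
        "mean_fpktl": mean(sizes),
        "std_fpktl": pstdev(sizes),
        "mean_fiat": mean(iats) if iats else 0.0,
        "duration": times[-1] - times[0],
    }

def extract_flows(packets):
    groups = defaultdict(list)
    for pkt in packets:                  # "map" phase: emit (flow key, value)
        key, value = map_packet(pkt)
        groups[key].append(value)
    return {key: reduce_flow(vals) for key, vals in groups.items()}  # "reduce" phase
```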

Malware samples of Kelihos–Hlux, Zeus and Waledac, which were deployed on the setup shown in Fig. 2, were monitored for 48 h each and their activity was captured by Wireshark. After collecting packet traces of the network activity of each of the malware, right from the infection stage to the attack stage, the pcaps were processed using the Feature Extraction Module. The pcaps from the standard datasets released by CAIDA were also processed in parallel.

Table 3
Features selected using the Information Gain Ranking Algorithm.

Rank      Feature
0.814     total_bhlen
0.81      std_bpktl
0.7886    fpsh_cnt
0.7751    total_fhlen
0.7654    bpsh_cnt
0.7568    min_biat
0.7438    min_fiat
0.7182    max_bpktl


For this study the authors did not consider features like source IP and destination IP as such, since they are completely dependent on the network configuration on which the bots function. There is prior work on scalable feature selection approaches in [14].

Information Gain Attribute Evaluation was then performed using the Ranker algorithm in order to find the most influential features of the entire feature set. This method evaluates the worth of an attribute by measuring the Information Gain with respect to the class, where Information Gain is described by the following equation:

$\mathrm{InformationGain}(\mathit{Class}, \mathit{Attribute}) = H(\mathit{Class}) - H(\mathit{Class} \mid \mathit{Attribute})$    (2)

The selected features, along with their corresponding Information Gain values, are presented in Table 3. The extracted features are derived from [3].
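As a brief illustration of the ranking criterion in Eq. (2) (not the exact Ranker implementation used by the authors), the sketch below computes the information gain of a single discretized feature against the class labels.

```python
# Sketch: information gain of one (discretized) feature, IG = H(Class) - H(Class | Attribute).
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    h_class = entropy(labels)
    h_cond = 0.0
    for v in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == v]
        h_cond += (len(subset) / len(labels)) * entropy(subset)
    return h_class - h_cond

# Toy usage: a feature that separates the classes well has high gain.
print(information_gain(["hi", "hi", "lo", "lo"],
                       ["malicious", "malicious", "benign", "benign"]))
```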

3.3. Machine Learning Module

In order to achieve scalability in the Machine Learning Module, Mahout, a machine learning library built on top of Hadoop, was used. Since each of its core algorithms for classification and clustering runs as MapReduce jobs, the high computational power of the cluster is harnessed to attain optimized results [27].

Deploying a non-distributed classifier would prove quite heavy for a single system, since the data alone could consume the entire heap space of the JVM on top of which most classifier APIs run (e.g. Weka [18]). Even other implementations (C/C++) of classifiers which run on standalone systems require a lot of resources and often run out of memory when dealing with Big Data. Therefore the distributed implementation in Mahout plays a very significant role in Big Data analysis.

Mahout's native implementation of Random Forests is used for model training and classification. Random Forest is increasingly being used by researchers [51] for predictive data modeling due to its high accuracy and time efficiency. First, a file descriptor is created for the dataset, i.e. the set of features extracted by the previous module, recording the type of each attribute (numeric, categorical, or label). Then 100 trees are built, each using 5 randomly selected attributes per node. The predicted results for each of the test instances are stored on the HDFS, from where they can be accessed using the Hue Web UI and the malicious nodes identified accordingly.

The number of partitions is controlled by the -Dmapred.max.split.size argument, which indicates to Hadoop the maximum size of each partition, in this case 1/10 of the size of the dataset, so that 10 partitions are used [26].
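A sketch of this training step is shown below, driving Mahout's partial Random Forest implementation from Python; the class names and options follow the Mahout partial-implementation guide [26], but the jar name, HDFS paths and attribute descriptor are assumptions and vary with the Mahout version.

```python
# Sketch: build the dataset descriptor and train 100 trees (5 random attributes per split)
# with Mahout's partial Random Forest implementation. Paths and jar name are illustrative.
import subprocess

MAHOUT_JAR = "mahout-examples-0.7-job.jar"   # assumed version/location
DATA = "/user/ids/flows.data"                # delimited flow features on HDFS
DESC = "/user/ids/flows.info"                # generated dataset descriptor
FOREST = "/user/ids/forest"

def describe_dataset(attribute_spec=("28", "N", "L")):
    # Attribute spec ("N" numeric, "C" categorical, "L" label) must match the feature file;
    # the counts used here are placeholders.
    subprocess.check_call(["hadoop", "jar", MAHOUT_JAR,
                           "org.apache.mahout.classifier.df.tools.Describe",
                           "-p", DATA, "-f", DESC, "-d", *attribute_spec])

def build_forest(split_size_bytes):
    # -p selects the partial (MapReduce) builder; split size ~ dataset size / 10.
    subprocess.check_call(["hadoop", "jar", MAHOUT_JAR,
                           "org.apache.mahout.classifier.df.mapreduce.BuildForest",
                           f"-Dmapred.max.split.size={split_size_bytes}",
                           "-d", DATA, "-ds", DESC, "-sl", "5", "-p", "-t", "100",
                           "-o", FOREST])
```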

The overview of the entire process is given in Fig. 3.

The Random Forest algorithm was chosen because the problem of Botnet detection requires high prediction accuracy, the ability to handle diverse bots, the ability to handle data characterized by a very large number and diverse types of descriptors, ease of training, and computational efficiency. Models based on k-Nearest Neighbors, Artificial Neural Networks and nonlinear Support Vector Machines have very high prediction rates and are very flexible in taking into consideration the diversity of the training data while building the model. However, high-dimensional data without dimension reduction or pre-selection of descriptors renders ANN and k-NN less efficient, while nonlinear SVM is not robust to the presence of a large number of irrelevant descriptors, thus necessitating descriptor preselection. There have been attempts at addressing this problem using Independent Component Analysis, as in [35]; however, it might not always be effective for handling lower dimensional data.

Fig. 3. Overview of scalable P2P Botnet detection framework.


Thus the Decision Tree comes closest to having the desired combination of features. It is good at handling high-dimensional data and ignores irrelevant descriptors by pruning them. However, one drawback of the Decision Tree is that it has relatively low prediction accuracy when compared to other machine learning algorithms. This drawback is resolved by using ensembles of trees [10]; one such algorithm built on ensemble methods is the Random Forest.

4. Results

The classification modules have been trained on capture files from known bot attacks such as Conficker, Kelihos–Hlux, Zeus, Storm and Waledac. These datasets are captures, in PCAP (libpcap) format, of the bot attacks together with some benign traffic, which is also needed by the classifier.

For the purpose of testing the framework, the entire campus traffic was mirrored to the test bed. An equivalent test bed is to set up systems in different subnets of the whole network. The datasets from CAIDA were also replayed to validate the efficacy of the module on standard datasets. This ensures that any differences in the behavior of the bots on the infected nets and in the collected traces do not affect the performance evaluation of the module.

To verify the validity of the classifier, the predicted vs. experimental results were compared using the Pearson product-moment correlation coefficient, which is defined by the formula:

$r = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$    (3)

A total of 84,030 instances of mixed traffic was used as the dataset, from which 90% was taken as the training set and 10% as the testing set. The classifier gave an accuracy of 99.7% when implemented using the Random Forest algorithm with 10 trees used to build the forest. Table 4 summarizes the accuracy of the classifier. The dataset was also evaluated with other learning algorithms, and the ROC (receiver operating characteristic) curves of the classifiers are presented in Fig. 4. From the areas under the curves, it can be seen that Random Forest outperformed the other machine learning algorithms (SVM, Naive Bayes, etc.), whereas algorithms like ANN suffered a heavy setback with the size of the data. Therefore, the parallel implementation of Random Forests was chosen for this problem of P2P Botnet detection.
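The per-class figures in Table 4 and the correlation in Eq. (3) can be reproduced from a prediction file with a short script such as the hedged sketch below (the label encoding and toy data are assumptions; the authors read predictions from HDFS via the Hue Web UI).

```python
# Sketch: compute TPR, FPR, precision, recall and Pearson r from predicted vs. actual labels.
from math import sqrt

def per_class_metrics(actual, predicted, positive):
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    tn = len(actual) - tp - fp - fn
    return {"tpr": tp / (tp + fn), "fpr": fp / (fp + tn),
            "precision": tp / (tp + fp), "recall": tp / (tp + fn)}

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs)) * sqrt(sum((y - my) ** 2 for y in ys))
    return num / den

# Toy example: encode "malicious" as 1 and "non-malicious" as 0 before computing r.
actual    = [1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 1, 0]
print(per_class_metrics(actual, predicted, positive=1), pearson_r(actual, predicted))
```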

The results of training the framework on the above mentioned datasets are presented by replaying the capture files onto the network using TCP Replay. As discussed above, the optimum file size was chosen as 50 MB and Tshark was further run on each of these pcaps, with a processing time of 41 s per pcap.

The performance of the system is benchmarked by varying the network bandwidth from 20 Mbps to 300 Mbps. The results are presented in Table 5.

The Hive query time is really dependent on the number of unique flows in the network and is not a function of the network bandwidth alone. However, since the output of Hive is directly fed to the Machine Learning Module, a directly proportional relationship is observed between them.

Table 4
Accuracy measures of the classifier.

True positive rate    False positive rate    Precision    Recall    Class
0.998                 0.003                  0.999        0.998     Malicious
0.997                 0.002                  0.996        0.997     Non-malicious

Fig. 4. Relative performance of classifiers.

Table 5
Time taken by each of the components of the framework.

Network bandwidth (Mbps)    Drop (%)    Total time (s)    Hive Query (s)    Mahout MR job (s)
79.5                        0.03        57                6                 6
152                         1.47        72                9                 10
223.8                       1.626       85                9                 10
383.8                       10.41       98                20                18


Again, it has to be emphasized that this work deals with building a scalable intrusion detection system, with the emphasis on a novel way of handling large network traces (which improves the prediction of the ML techniques used here) using Big Data technologies and dynamic flow feature extraction. The efficacy of these features for the problem of P2P Botnet detection has been studied extensively in the works presented in the related work section.

5. Conclusions and scope of future work

This paper offers the following contributions:

1. A scalable packet capture module to process high bandwidths of data in quasi-real-time (within a 5–30 s delay).
2. A distributed dynamic feature extraction framework to characterize flow statistics of packet captures.
3. A Peer-to-Peer security threat detection module which classifies malicious traffic on a cluster.

Packet drops were observed due to the usage of TCP Replay; the authors have not delved into this issue, as it would shift the focus from the research problem. The authors are actively working towards further reducing the drop rates during packet capture. The present solution has a detection time in the order of tens of seconds; by adding more machines to the cluster and tweaking the Hadoop deployment, the team targets bringing this below 10 s. The architecture will be expanded to integrate with firewalls and domain controllers so as to block traffic from Botnets and isolate compromised machines on the internal network, and to implement self-learning modules that tune performance using the false positives and false negatives reported during runtime.

Many of the recent security threats, like Stuxnet, stay dormant and communicate at a really low frequency during stealth mode. There also exist Botnets, as mentioned in [6], that rely on swarm intelligence and in particular on stigmergic communication, ensuring spontaneous, implicit coordination and collaboration among the independent bot agents. This new Botnet architecture presents improved fault tolerance and dynamic adaptation to varying network conditions by propagating control messages to any bot node through multiple short-range hops structured according to a dynamically built Degree Constrained Minimum Spanning Tree, whose distributed calculation is inspired by ant colonies' foraging behavior. Since the majority of IDS/IPS solutions throttle reporting to happen only if some thresholds are exceeded, low frequency communication goes undetected. An extension of this work targets exploring such stealthy communication.

Acknowledgment

This work was supported by Department of Information Technology, New Delhi under Grant No. 12(13)/2011-ESD.

References

[1] Apache Hive, 2013. <hive.apache.com>.
[2] Apache Hadoop, 2013. <http://hadoop.apache.org/>.
[3] D. Arndt, Netmate, 2011. <http://dan.arndt.ca/nims/calculating-flow-statistics-using-netmate/>.
[4] Benvenuti, Understanding Linux Network Internals, 2009.
[5] G. Beydoun, G. Low, Generic modelling of security awareness in agent based systems, Inf. Sci. 239 (2013) 62–71.
[6] A. Castiglione, R. De Prisco, A. De Santis, U. Fiore, F. Palmieri, A botnet-based command and control approach relying on swarm intelligence, J. Netw. Comput. Appl. 38 (2014) 22–33.
[7] S. Chandra, V. Nunia, A. Thakur, Packet Anonymization: Preserving Privacy for DPI in P2P Realm, Security and Privacy Symposium 2013, IIT Kanpur, March 2013.
[8] ContagioDump Blogspot, 2013. <http://contagiodump.blogspot.in/>.
[9] I. Corona, G. Giacinto, F. Roli, Adversarial attacks against intrusion detection systems: taxonomy, solutions and open issues, Inf. Sci. 239 (2013) 201–225.
[10] T.G. Dietterich, Ensemble learning, in: The Handbook of Brain Theory and Neural Networks, MIT Press, 2002.
[11] D. Dittrich, S. Dietrich, P2P as Botnet command and control: a deeper insight, in: 3rd International Conference on Malicious and Unwanted Software, 2008 (MALWARE 2008), IEEE, October 2008, pp. 41–48.
[12] Dumpcap, 2013. <http://www.wireshark.org/docs/man-pages/dumpcap.html>.
[13] U. Fiore, F. Palmieri, A. Castiglione, A. De Santis, Network anomaly detection with the restricted Boltzmann machine, Neurocomputing 122 (2013) 13–23.
[14] N. García-Pedrajas, A. de Haro-García, J. Pérez-Rodríguez, A scalable approach to simultaneous evolutionary instance and feature selection, Inf. Sci. 228 (2013) 150–174.
[15] J.B. Grizzard, V. Sharma, C. Nunnery, B.B. Kang, D. Dagon, Peer-to-peer botnets: overview and case study, in: Proceedings of the First Conference on First Workshop on Hot Topics in Understanding Botnets, April 2007, pp. 1–1.
[16] G. Gu, R. Perdisci, J. Zhang, W. Lee, BotMiner: clustering analysis of network traffic for protocol- and structure-independent botnet detection, in: USENIX Security Symposium, July 2008, pp. 139–154.
[17] S.C. Guntuku, P. Narang, C. Hota, Real-Time Peer-to-Peer Botnet Detection Framework based on Bayesian Regularized Neural Network, 2013, arXiv preprint arXiv:1307.7464.
[18] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, ACM SIGKDD Explor. Newslett. 11 (1) (2009) 10–18.
[19] T. Holz, M. Steiner, F. Dahl, E. Biersack, F.C. Freiling, Measurements and mitigation of peer-to-peer-based botnets: a case study on Storm worm, LEET 8 (2008) 1–9.
[20] J.Y. Huang, I.E. Liao, Y.F. Chung, K.T. Chen, Shielding wireless sensor network using Markovian intrusion detection system with attack pattern mining, Inf. Sci. 231 (2013) 32–44.
[21] N.F. Huang, G.Y. Jai, H.C. Chao, Y.J. Tzang, H.Y. Chang, Application traffic classification at the early stage by characterizing application rounds, Inf. Sci. 232 (2013) 130–142.
[22] C. Langin, H. Zhou, S. Rahimi, B. Gupta, M. Zargham, M.R. Sayeh, A self-organizing map and its modeling for discovering malignant network traffic, in: IEEE Symposium on Computational Intelligence in Cyber Security (CICS'09), IEEE, March 2009, pp. 122–129.
[23] Y. Lee, Y. Lee, Toward scalable internet traffic measurement and analysis with Hadoop, ACM SIGCOMM Comput. Commun. Rev. 43 (1) (2012) 5–13.
[24] Y. Lee, W. Kang, Y. Lee, A Hadoop-based packet trace processing tool, in: Traffic Monitoring and Analysis, Springer, Berlin Heidelberg, 2011, pp. 51–63.
[25] D. Liu, Y. Li, Y. Hu, Z. Liang, A P2P-botnet detection model and algorithms based on network streams analysis, in: 2010 International Conference on Future Information Technology and Management Engineering (FITME), vol. 1, IEEE, October 2010, pp. 55–58.
[26] Mahout Partial Implementation, 2013. <https://cwiki.apache.org/MAHOUT/partial-implementation.html>.
[27] Mahout, 2013. <http://mahout.apache.org/>.
[28] A. Maratea, A. Petrosino, M. Manzo, Adjusted F-measure and kernel scaling for imbalanced data learning, Inf. Sci. 257 (2014) 331–341.
[29] S. Nagaraja, P. Mittal, C.Y. Hong, M. Caesar, N. Borisov, BotGrep: finding P2P bots with structured graph analysis, in: USENIX Security Symposium, August 2010, pp. 95–110.
[30] Network Protocol Specialists, 2008. <http://new.networkprotocolspecialists.com/tips/long-term-captures-with-tshark/>.
[31] A. Ng, Advice for Applying Machine Learning, 2013.
[32] S.K. Noh, J.H. Oh, J.S. Lee, B.N. Noh, H.C. Jeong, Detecting P2P botnets using a multi-phased flow model, in: Third International Conference on Digital Society (ICDS'09), IEEE, February 2009, pp. 247–253.
[33] Open Malware, 2013. <http://openmalware.org/>.
[34] P. Orosz, T. Skopko, Software-based packet capturing with high precision timestamping for Linux, in: 2010 Fifth International Conference on Systems and Networks Communications (ICSNC), IEEE, August 2010, pp. 381–386.
[35] F. Palmieri, U. Fiore, A. Castiglione, A distributed approach to network anomaly detection based on independent component analysis, Concurr. Comput.: Pract. Exp. (2013), http://dx.doi.org/10.1002/cpe.3061.
[36] A. Razzaq, K. Latif, H.F. Ahmad, A. Hur, Z. Anwar, P.C. Bloodsworth, Semantic security against web application attacks, Inf. Sci. 254 (2014) 19–38.
[37] F. Salcedo-Campos, J. Díaz-Verdejo, P. García-Teodoro, Segmental parameterisation and statistical modelling of e-mail headers for spam detection, Inf. Sci. 195 (2012) 45–61.
[38] M. Sánchez-Artigas, B. Herrera, Understanding the effects of P2P dynamics on trust bootstrapping, Inf. Sci. 236 (2013) 33–55.
[39] R. Schoof, R. Koning, Detecting peer-to-peer botnets, Report, University of Amsterdam, 2007, pp. 1–7.
[40] SCMagazine, WordPress2, 2013. <http://www.scmagazine.com/wordpress-attacks-showcase-botnet-owners-expanding-tricks/article/288947/> (accessed: 16.05.13).
[41] T. Skopko, Loss analysis of the software-based packet capturing, Carpathian J. Electron. Comput. Eng. 5 (107) (2012).
[42] J. Song, H. Takakura, Y. Okabe, K. Nakao, Toward a more practical unsupervised anomaly detection system, Inf. Sci. 231 (2013) 4–14.
[43] StackExchange, WordPressAttack, 2013. <http://security.stackexchange.com/questions/34482/understanding-what-bot-was-used-for-a-botnet-attack> (accessed: 16.05.13).
[44] J. Stewart, Peacomm DDOS Attack, Report, 2007.
[45] S. Stover, D. Dittrich, J. Hernandez, S. Dietrich, Analysis of the Storm and Nugache trojans: P2P is here, USENIX ;login: 32 (6) (2007) 18–27.
[46] W. Tarng, L.Z. Den, K.L. Ou, M. Chen, The analysis and identification of P2P botnet's traffic flows, Int. J. Commun. Netw. Inf. Secur. (IJCNIS) 3 (2) (2011).
[47] TCP Replay, 2012. <http://tcpreplay.synfin.net/>.
[48] The CAIDA UCSD Dataset 2008-11-21, 2008. <https://data.caida.org/datasets/security/telescope-3days-conficker/>.
[49] A. Thusoo, J.S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, ..., R. Murthy, Hive – a petabyte scale data warehouse using Hadoop, in: 2010 IEEE 26th International Conference on Data Engineering (ICDE), IEEE, March 2010, pp. 996–1005.
[50] TShark, 2013. <http://www.wireshark.org/docs/man-pages/tshark.html>.
[51] C.C. Yeh, D.J. Chi, Y.R. Lin, Going-concern prediction using hybrid random forests and rough set approach, Inf. Sci. 254 (2014) 98–110.
[52] T.F. Yen, M.K. Reiter, Are your hosts trading or plotting? Telling P2P file-sharing and bots apart, in: 2010 IEEE 30th International Conference on Distributed Computing Systems (ICDCS), IEEE, June 2010, pp. 241–252.