
Research Article
Pattern Extraction Algorithm for NetFlow-Based Botnet Activities Detection

Rafał Kozik and Michał Choraś

UTP University of Science and Technology in Bydgoszcz, Bydgoszcz, Poland

Correspondence should be addressed to Rafał Kozik; rkozik@utp.edu.pl

Received 31 May 2017; Accepted 14 September 2017; Published 17 October 2017

Academic Editor: Pedro García-Teodoro

Copyright © 2017 Rafał Kozik and Michał Choraś. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

As computer and network technologies evolve, the complexity of cybersecurity has dramatically increased. Advanced cyber threats have led to current approaches to cyber-attack detection becoming ineffective. Many currently used computer systems and applications have never been deeply tested from a cybersecurity point of view and are an easy target for cyber criminals. The paradigm of security by design is still more of a wish than a reality, especially in the context of constantly evolving systems. On the other hand, protection technologies have also improved. Recently, Big Data technologies have given network administrators a wide spectrum of tools to combat cyber threats. In this paper, we present an innovative system for network traffic analysis and anomaly detection that utilises these tools. The system's architecture is based on a Big Data processing framework, data mining, and innovative machine learning techniques. So far, the proposed system implements pattern extraction strategies that leverage batch processing methods. As a use case, we consider the problem of botnet detection by means of data in the form of NetFlows. Results are promising and show that the proposed system can be a useful tool to improve cybersecurity.

1. Introduction

The cyber ecosystem is constantly changing as new technology stacks are being created [1]. However, many existing solutions have vulnerabilities and are frequent targets of cyber criminals. Signature-based detection techniques are ineffective when faced with current cyber threats and the sophistication of precisely crafted malware software that constantly evolves and changes. Of course, protection technologies have also improved. For example, Big Data, distributed data mining, and machine learning are increasingly common candidate technologies to counter cyber attacks and cyber crime. Nowadays, one of the major cybersecurity challenges is to counter malicious software [2]. Usually, malware samples are carefully crafted pieces of computer programs that stay dormant while performing detailed surveillance of infected infrastructures and assets. Infected computers commonly connect via a telecommunication network and form a so-called botnet that can be easily centrally controlled by cybercriminals for different malicious purposes such as DDoS attacks, SPAM distribution, ransomware, sensitive data thefts, and extortion attacks.

Advancements in machine learning and data mining techniques in the area of Big Data introduce new possibilities to fight against the latest malware. In this paper, we propose a cybersecurity system to counter botnets, based on the Big Data concept and architecture. The main contribution of this work is the proposal of an innovative tool enhancing the cybersecurity of a local area network. The tool supports the network administrator in network traffic analysis by providing a distributed framework for visualisation, data mining, and feature extraction. Moreover, we propose scientific contributions in the form of an application of a distributed random forest classifier and an algorithm for solving the problem of imbalanced data.

The paper is structured as follows. First, we provide an overview of existing solutions and methods for botnet detection. Next, we describe the proposed system architecture as well as pattern extraction and classification methods for NetFlow analysis. Then, in the section devoted to experiments, we present the evaluation methodology and the obtained results. The paper is concluded with final remarks and plans for future work.


[Figure 1: Overview of the proposed system for Big Data patterns extraction for cybersecurity. The diagram shows network data feeding a distributed file system on a cluster (a master node and worker nodes 1 ... N), batch and stream processing, and the resulting outputs: visualisation and statistics, predictive models, anomaly/novelty detection, and an admin dashboard.]

2. Related Work in Countering Botnets and Malware

There are two approaches for cyber-attack detection: signature-based and anomaly-based. The signatures (in the form of reactive rules) of an attack used by software like Snort [3] are provided by experts from the cyber community. Typically, for deterministic attacks, it is easy to develop patterns that will identify the specific attack. Then, those patterns (in the form of signatures) are added to the database, and whenever they match the examined traffic, the attack is recognised and the alarm is raised. However, the task of developing new signatures becomes more complicated when it comes to polymorphic worms or viruses. Such software commonly modifies and obfuscates its code (without changing the internal algorithms) to be less predictable and harder to detect. In such situations, signatures do not work as an efficient approach to attack detection.

The development of an efficient and scalable method for malware detection is currently challenging also due to the general unavailability of raw network data. This is largely due to the trend of protecting users' privacy for administrative and legal reasons (such as the General Data Protection Regulation in Europe). Unfortunately, these important regulations create difficulties for research and development [4, 5]. A common alternative that addresses the above-mentioned lack of traffic data is NetFlow [6] data, which is often captured by ISPs for auditing and performance monitoring purposes. NetFlow samples do not contain sensitive information and therefore are widely available to the research community. However, the disadvantage is that such samples do not contain the raw content of network packets.

In the literature, there are different approaches focusing on the analysis of NetFlow data. In [7], the authors focused on host dependency modelling using NetFlow data analysis and malware detection. However, they focus only on peer-to-peer communication schemes. On the other hand, the BClus algorithm proposed in [8] uses a more generic approach based on behavioral analysis for botnet detection. In particular, the BClus method uses (similarly to our proposed approach) low-level features to identify potentially malicious traffic. The algorithm aggregates NetFlows for specific IP addresses and clusters them according to statistical characteristics. The properties of the clusters are described and used for further botnet detection. However, the EM-based clustering procedure may lead to a situation where some significant information is missing or the Gaussian model is not accurate enough (due to the limited amount of the data in the cluster).

The CCDetector method presented in [9] uses a state-based behavioral model of the known command and control channels. The author of this algorithm proposes to use a Markov Chain to model malware behavior and to detect similar traffic in unknown real networks. The key difference from both BClus and our own approach is that, instead of analysing the complete traffic of an infected computer as a whole, the author separates each individual connection from each IP address and treats it as an independent connection. The results obtained with this method are very promising. However, one of the concerns is the complex and time-consuming learning phase.

Another approach is used in the BotHunter [10] tool. It monitors the two-way communication flows between hosts within both an internal network and the Internet. BotHunter employs the Snort intrusion detection system. It models an infection sequence as a composition of participants and a loosely ordered sequence of network information exchanges. However, as shown in [8, 11], the effectiveness of this tool is unsatisfactory in some setups.

In this paper, we propose another approach, based on a Big Data architecture and distributed machine learning methodologies. The details of our contribution are given in the next section.

3. Architecture Overview

In Figure 1, the general overview of the system design is presented. The information flow follows the typical data processing pipeline used in Big Data processing. Hereby, we propose to deploy such an approach as a cybersecurity solution.


3.1. Apache Spark. The proposed system adapts the Big Data lambda architecture built on top of the scalable data processing framework named Apache Spark. It provides an engine that processes Big Data workloads. There are several key elements of the architecture that facilitate distributed computing. These terms will be further used in this paper; thus, here we briefly explain the definitions:

(i) Node (or host) is a machine that belongs to the computing cluster.
(ii) Driver (an application) is a module that contains the application code and communicates with the cluster manager.
(iii) Master node (a cluster manager) is the element communicating with the driver and responsible for allocating resources across applications.
(iv) Worker node is a machine that can execute application code and holds an executor.
(v) Context (SparkContext) is the main entry for Spark functionalities, which provides an API for data manipulations (e.g., variables broadcasting, data creations).
(vi) Executor is a process on a worker node that runs tasks.
(vii) Task is a unit of work sent to the executor.

Apache Spark uses the data abstraction called Resilient Distributed Dataset (RDD) to manage the data distributed in the cluster and to implement fault tolerance.

3.2. The Analysed Data. The data acquired by the system have the form of NetFlows. NetFlow is a standardised format for describing bidirectional communication and contains information such as IP source and destination address, destination port, and amount of bytes exchanged.

The collected NetFlows are stored in the HDFS (Hadoop Distributed File System [12]) for further processing. In the current version of the proposed system, the data mining and feature extraction methods work in a batch processing mode. However, in the future we plan to allow the system to analyse the streams of data containing the raw NetFlows directly as they are received.

It must also be noted that, in this paper, the problem of realistic testbed construction is not considered. This is because, in order to evaluate the effectiveness of the proposed algorithms, we have used the benchmark CTU-13 [8] dataset, and we followed the experimental setup of its authors in order to methodologically compare our results. The dataset contains different scenarios representing different infections and malware communication schemes with the command and control centre. More details on the datasets are given in the following sections.

3.3. Feature Extraction. A single NetFlow usually does not provide enough evidence to decide if a particular machine is infected or if a particular request has malicious symptoms. Therefore, it is quite common [8, 9] that NetFlows are aggregated in so-called time windows, so that more contextual data can be extracted and malicious behavior recorded (e.g., port scanning, packet flooding effects).

In such approaches, various statistics are extracted for each time window. In the current version of the proposed system, the SparkSQL [13] language is used to extract such statistics. Following the information pipeline, this step is related to the "data processing" stage. The result of the SQL query is a data abstraction called DataFrame, which can be further processed, stored, or converted to any other specific format that will satisfy the machine learning or pattern extraction algorithms.

In general, the proposed feature extraction method aggregates the NetFlows within each time window. For each time window, we group the NetFlows by the IP source address. For each group (containing NetFlows with the same time window and IP source address), we calculate the following statistics:

(i) Number of flows
(ii) Sum of transferred bytes
(iii) Average sum of bytes per NetFlow
(iv) Average communication time with each unique IP address
(v) Number of unique destination IP addresses
(vi) Number of unique destination ports
(vii) Most frequently used protocol (e.g., TCP, UDP)

The advantages of such an approach are that it allows the network administrator to identify possibly infected IP addresses, and the statistics are efficiently calculated with Apache Spark SQL queries.
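To make the aggregation step concrete, the snippet below is a minimal PySpark sketch of the per-window, per-source-IP grouping described above; it is not the authors' exact SparkSQL query. The input path, the 60-second window width, and the simplified handling of statistics (iv) and (vii) are assumptions, and the column names follow the CTU-13 NetFlow fields listed later in Section 5.2.

```python
# Hedged sketch of the feature extraction stage: group NetFlows by
# (time window, source IP) and compute the listed statistics.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("netflow-features").getOrCreate()

# Placeholder path; in the described system the NetFlows reside in HDFS.
flows = spark.read.csv("hdfs:///data/netflows.csv", header=True, inferSchema=True)

features = (
    flows
    .withColumn("ts", F.to_timestamp("StartTime"))
    .groupBy(F.window("ts", "60 seconds").alias("win"), "SrcAddr")
    .agg(
        F.count("*").alias("num_flows"),                     # (i)
        F.sum("TotBytes").alias("sum_bytes"),                # (ii)
        F.avg("TotBytes").alias("avg_bytes_per_flow"),       # (iii)
        F.avg("Dur").alias("avg_communication_time"),        # (iv), simplified: not per unique destination
        F.countDistinct("DstAddr").alias("unique_dst_ips"),  # (v)
        F.countDistinct("Dport").alias("unique_dst_ports"),  # (vi)
        F.first("Proto").alias("protocol"),                  # (vii), simplified: a true mode needs an extra aggregation
    )
)
```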

3.4. Multiscale Analysis. When it comes to malware behaviors recorded over a time period, it may be noticed that, for different kinds of malware, different time-scales may be observed. For example, some time-intensive malicious activities can be observed in short time windows (e.g., spam sending or scanning), and these usually exhibit a significant volume of data of a specific type (e.g., a number of opened connections, a number of distinct port numbers). On the other hand, more sophisticated attacks will be stretched over time and may exhibit relevant information in time windows of a varying length.

To address the above-mentioned issues, we have adopted a sort of multiscale analysis. This technique is often used in the image processing area to recognise an object at different scales. In this process, the original image is scaled by different magnification factors and an analysis is conducted for each of them. Moreover, the features calculated at specific scales can be used to build a more complex model (e.g., we can first detect wheels and a frame to help detect a bike). A similar approach can be used in our case, but instead of scaling the original signal, we scale the size of the time window over which the feature vectors are calculated.

As shown in Figure 2, multidimensional NetFlows can be seen as a time series. Moreover, as described in the previous section, the NetFlows are grouped by time windows, and for each time window statistics are calculated.


[Figure 2: Overview of the technique used for multiscale NetFlows time series analysis. NetFlows form a time series; feature vectors computed over two time window widths (labelled t = 1m and t = 2m) are concatenated into a single feature vector.]

For simplicity, in the presented example we are using two scales (one- and two-second-wide time windows), but this approach can be easily extended to any number of scale factors. In the presented example, the final feature vector is a concatenation of the vectors that are calculated for both of the time windows. From a technical point of view, the calculations for each of the time windows happen in parallel on the Apache Spark cluster.
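As an illustration of this concatenation, the sketch below (reusing the flows DataFrame from the previous sketch) computes aggregates at two window widths and joins them per source IP. The helper name, the chosen widths (1 and 3 minutes, matching configurations evaluated later in Table 2), and the bucket-based windowing are simplifying assumptions.

```python
import pyspark.sql.functions as F

def extract_features(flows, width_s, tag):
    """Aggregate a couple of example statistics per source IP over
    non-overlapping windows of width_s seconds (hypothetical helper)."""
    return (
        flows
        .withColumn("ts", F.to_timestamp("StartTime"))
        # index of the window this flow falls into
        .withColumn("bucket", (F.unix_timestamp("ts") / width_s).cast("long"))
        .groupBy("bucket", "SrcAddr")
        .agg(F.count("*").alias(f"flows_{tag}"),
             F.sum("TotBytes").alias(f"bytes_{tag}"))
    )

fine = extract_features(flows, 60, "1m")      # fine-grained scale
coarse = extract_features(flows, 180, "3m")   # coarser scale

# Map each 1-minute bucket onto the 3-minute bucket containing it and
# concatenate the two feature sets column-wise, as in Figure 2.
fine = fine.withColumn("bucket_3m", (F.col("bucket") * 60 / 180).cast("long"))
multiscale = fine.join(
    coarse.withColumnRenamed("bucket", "bucket_3m"),
    on=["SrcAddr", "bucket_3m"],
    how="left",
)
```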

4. Distributed Machine Learning

4.1. Decision Trees Bagging. Decision trees have often been used in various machine learning tasks [14]. However, a common problem of this classifier is its accuracy. It is often related to the issue of overfitting, as the structure of the tree may grow deep. Typically, decision trees have low bias but high variance. One of the solutions to overcome this issue is to use the bagging algorithm, which allows for averaging multiple decision trees, each trained on a different portion of the data.

In general, training the set of decision trees on the same data will produce highly correlated (or, in many cases, identical) trees. Therefore, the idea behind the bagging algorithm is to achieve an ensemble [15, 16] of decorrelated trees.

Let $L$ represent the learning dataset, which is a collection of $N$ pairs of input vectors and label values $(x_1, y_1), \ldots, (x_N, y_N)$, where $x_i \in X$ and $y_i \in Y$. Following the bagging procedure (Algorithm 1), the learning dataset $L$ is sampled (with replacement) $B$ times. For each data sample the decision tree is trained, producing a classifier $D_b: X \rightarrow Y$. As we have $B$ classifiers, the average is computed using the following formula:

$$D(x) = \frac{1}{B} \sum_{b=1}^{B} D_b(x). \quad (1)$$

Input:
  A training set $X, Y = \{x_i, y_i\}_{i=1}^{N}$
Output:
  An ensemble of decision trees $D: X \rightarrow Y$
Training phase:
For $b = 1, \ldots, B$:
  (1) Sample $B$ training samples from $X$ and $Y$
  (2) Train the decision tree $D_b$ on the sampled data

Algorithm 1: Decision trees bagging.
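For illustration, a minimal, self-contained Python sketch of Algorithm 1 is given below; it uses scikit-learn decision trees on in-memory numpy arrays and is not the distributed Spark implementation discussed later. Class labels are assumed to be non-negative integers.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagged_trees(X, y, B=30, seed=0):
    """Train B decision trees, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    trees, n = [], len(X)
    for _ in range(B):
        idx = rng.integers(0, n, size=n)               # sample with replacement
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_majority(trees, X):
    """Combine the ensemble by majority vote, cf. (2) below."""
    votes = np.stack([t.predict(X) for t in trees])    # shape (B, n_samples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), axis=0, arr=votes)
```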

4.2. Classical Random Forest Classifier. The random forest (RF) classifier adapts a modification of the bagging algorithm. The difference is in the process of growing the trees. The classical implementation of the RF classifier works according to the procedure presented in Algorithm 2.

The final prediction obtained from the $B$ trained trees is calculated using the average in (1) or the majority vote represented by (2), where $I$ stands for the indicator function and $D_b$ indicates the $b$-th random tree in the ensemble:

$$D(x) = \arg\max_i \sum_{b=1}^{B} I(D_b(x) = i). \quad (2)$$

4.3. Distributed Random Forest Classifier. The implementation of the random forest classifier in the MLlib Apache Spark library in general follows the classical algorithm presented in the previous section. However, it heavily leverages the distributed computing environment.

In principle, the algorithm works in so-called batch mode, meaning that a large portion of the data needs to be available to begin learning the classifier. This is one of the drawbacks of the current implementation (v2.1.0), as nowadays significant value is put into the ability to perform online and stream classification. In fact, this is our plan for further work.

In general, the dataset provided to Apache Spark is arranged into rows of training instances (feature vectors with labels) and distributed across a number of nodes and partitions. The training algorithm can benefit from the distributed environment since the learning process for each decision tree can be performed in parallel.

When the random forest is trained in the Apache Spark environment, the algorithm samples the learning data and assigns it to the decision tree which is trained on that portion of the data. To optimise the memory and space usage, the instances are not replicated explicitly but instead augmented with additional data that hold the probability that the given instance belongs to the specific data partition used for training.

The workload is distributed among a master node and workers. The main loop manages a queue of nodes and runs iteratively. In each iteration, the algorithm searches for the best split for a node. In this procedure, the statistics are collected by a worker node.


Input:
  $N$ training samples of the form $(x_1, y_1), \ldots, (x_N, y_N)$, such that $x_i \in \mathbb{R}^M$ is the feature vector and $y_i$ is its label
Output:
  Random forest classifier $D: X \rightarrow Y$

The $N$ training samples (each with $M$ input variables) are sampled with replacement to produce $B$ partitions.
(1) Each of the $B$ partitions is used to train one of the trees.
(2) In the process of splitting, $m$ out of the $M$ variables are selected randomly.
(3) One variable out of $m$ is selected according to the impurity measure (e.g., Gini index or entropy [21]).
(4) The procedure is terminated either when the maximal depth of a tree is achieved or when there is no data to split.

Algorithm 2: Random forest training.

The feature used for splitting is chosen among the sampled list using the impurity criterion

$$\sum_{i=1}^{C} f_i (1 - f_i), \quad (3)$$

where $C$ is the number of unique labels and $f_i$ is the frequency of label $i$. The algorithm terminates when the maximum height of the decision tree is reached or whenever there is no data point that is misclassified.

Another option for this classification task is the entropy measure

$$\sum_{i=1}^{C} -f_i \log(f_i). \quad (4)$$
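A small numerical illustration of the impurity measures (3) and (4) is given below; the node contents are made-up label counts.

```python
import math
from collections import Counter

def gini(labels):
    freqs = [c / len(labels) for c in Counter(labels).values()]
    return sum(f * (1 - f) for f in freqs)            # equation (3)

def entropy(labels):
    freqs = [c / len(labels) for c in Counter(labels).values()]
    return sum(-f * math.log(f) for f in freqs)       # equation (4)

# Example: a tree node holding mostly "normal" windows and one "botnet" window.
node = ["normal"] * 9 + ["botnet"]
print(gini(node), entropy(node))   # both values are low because the node is nearly pure
```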

The final output produced by the ensemble is the majority vote of the results produced by all decision trees.

Another drawback of the current version of the Apache Spark random forest classifier is that it does not handle cost-sensitive learning (e.g., assigning higher importance to data samples indicating an anomaly or cyber attack). As a result, it may be biased towards the majority class. Finding the right balance between detection effectiveness and the number of false alarms is important from the perspective of anomaly detection or intrusion detection systems. In principle, such systems should have a high recognition ratio but should not overwhelm the administrator with a large number of false alarms. Therefore, our proposals to handle this issue and improve the detection effectiveness are presented in the next section.
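For reference, the snippet below is a sketch of how the distributed random forest described in this section can be trained with the MLlib (pyspark.ml) API; the feature column names, the features_labeled DataFrame (aggregated windows with a numeric label column), and the parameter values are illustrative assumptions, not the exact configuration used in the experiments.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

feature_cols = ["num_flows", "sum_bytes", "avg_bytes_per_flow",
                "unique_dst_ips", "unique_dst_ports"]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_df = assembler.transform(features_labeled)   # features_labeled is a hypothetical labelled DataFrame

rf = RandomForestClassifier(
    labelCol="label",
    featuresCol="features",
    numTrees=30,          # cf. the RF_30_* configurations in Section 6
    maxDepth=10,          # illustrative value
    impurity="gini",      # or "entropy", cf. (3) and (4)
)
model = rf.fit(train_df)
predictions = model.transform(train_df)
```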

4.4. Data Imbalance Problem and Cost-Sensitive Learning. The problem of data imbalance has recently been deeply studied [17-19] in the areas of machine learning and data mining. In many cases, this problem negatively impacts the machine learning algorithms and deteriorates the effectiveness of the classifier. Typically, classifiers in such cases will achieve higher predictive accuracy for the majority class but poorer predictive accuracy for the minority class.

This phenomenon is caused by the fact that the classifier will tend to bias towards the majority class. Therefore, the challenge here is to retain the classification effectiveness even if the proportion of class labels is not equal. The imbalance of labels in our case is significant, as we may expect that, in common cases of cybersecurity incidents, only a few machines in the network will be infected and produce malicious traffic, while the majority will behave normally. In other words, most data contains clean traffic, while only a few data samples indicate malware.

The solutions for solving such a problem can be categorised as data-related and algorithm-related. The methods belonging to the data-related category use data oversampling and undersampling techniques, while the algorithm-related ones introduce a modification to the training procedures. This group can be further classified into categories using cost-sensitive classification (e.g., assigning a higher cost to the majority class) or methods that use different performance metrics (e.g., the Kappa metric).

Cost-sensitive learning is an effective solution for class imbalance in large-scale settings. The procedure can be expressed with the following optimisation formula:

$$\min_{\beta} \; \frac{1}{2}\|\beta\|^2 + \frac{1}{2}\sum_{i=1}^{N} W_i \|e_i\|^2, \quad (5)$$

where $\beta$ indicates the classifier parameters, $e_i$ the error in the classifier response for the $i$-th (out of $N$) data samples, and $W_i$ the importance of the $i$-th data sample. In cost-sensitive learning, the idea is to give a higher importance to the minority class so that the bias towards the majority class is reduced.

In general, our approach to the cost-sensitive random forest classifier can be categorised as data-related. In our case, we have heavily imbalanced data: in some scenarios we have more than 28M normal samples while having only 40k samples related to malware activities. Therefore, we have adopted the undersampling technique within the process of learning the distributed random forest. As shown in Algorithm 2, the training procedure already adapts the sampling technique. The Apache Spark implementation adapts Poisson sampling in order to produce subportions of the data for training the random tree. In our case, before introducing the data to the classifier, we undersample the majority class with an experimentally chosen probability.
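A sketch of this undersampling step is shown below, reusing the train_df DataFrame and the rf estimator from the previous sketch; the per-class sampling fractions are illustrative stand-ins for the experimentally chosen probability.

```python
# Keep only a fraction of the majority ("normal", label 0.0) windows and all
# botnet windows before fitting the distributed random forest.
fractions = {0.0: 0.05, 1.0: 1.0}   # illustrative sampling rates
balanced_df = train_df.sampleBy("label", fractions=fractions, seed=42)
model = rf.fit(balanced_df)
```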

5. Experiments

5.1. Evaluation Methodology. To prove the effectiveness of the proposed method, the test methodology exploiting time-based metrics (defined in [8]) has been used. In order to keep this paper self-contained, the methodology is briefly introduced in this section. The authors of the methodology have created and published a tool called Botnet Detectors Comparer. It is publicly available for download [20]. We decided to follow their experiments and compare our results directly to their work. Their tool analyses the NetFlow file augmented with prediction labels produced by different methods and executes the following steps to produce the final effectiveness results [8]:

(1) NetFlows are separated into comparison time windows (we have used the default time windows of 300 s length).
(2) The ground-truth NetFlow labels are compared to the predicted labels, and the TP (true positive), TN (true negative), FP (false positive), and FN (false negative) values are accumulated.
(3) At the end of each comparison time window, the following performance indicators are calculated: FPR (false positive rate), TPR (true positive rate), TNR (true negative rate), FNR (false negative rate), precision, accuracy, error rate, and F-measure for that time window.
(4) When the whole file with NetFlows is processed, the final error metrics are calculated and produced. As explained in [8], calculating the TP count at the NetFlow level does not make practical sense, because the administrator would be overwhelmed by the high amount of information (the number of NetFlows per second is high). Instead, IP-based performance metrics are used, as follows:

(i) TP: a true positive counter is incremented during the comparison time window whenever a botnet IP address is detected as botnet at least once.
(ii) TN: a true negative counter is incremented during the comparison time window whenever a normal IP address is detected as nonbotnet during the entire time window.
(iii) FP: a false positive counter is incremented during the comparison time window whenever a normal IP address is detected as botnet at least once.
(iv) FN: a false negative counter is incremented during the comparison time window whenever a botnet IP address is detected as nonbotnet during the entire time window.

The authors of the tool have also incorporated time into the above-mentioned metrics to emphasise the fact that detecting a malicious IP earlier is better than detecting it later. The time-based component is called the correction function (cf) and is calculated using the following formula:

$$\mathrm{cf}(n) = 1 + e^{-\alpha n}, \quad (6)$$

where $n$ indicates a comparison time window and $\alpha$ is a constant value set to 0.01 (as suggested in [9]). Using the correction function, the error metrics are calculated as follows:

$$\mathrm{tTP}(n) = \frac{\mathrm{TP} \cdot \mathrm{cf}(n)}{N_{\mathrm{BOT}}}, \quad (7)$$

where tTP indicates the time-based true positive number, cf the correction function, and $N_{\mathrm{BOT}}$ the number of unique botnet IP addresses present in the comparison time window. Similarly, we can define

$$\mathrm{tFN}(n) = \frac{\mathrm{FN} \cdot \mathrm{cf}(n)}{N_{\mathrm{BOT}}}. \quad (8)$$

tFP and tTN do not depend on the time and are defined as follows:

$$\mathrm{tFP} = \frac{\mathrm{FP}}{N_{\mathrm{NORM}}}, \quad (9)$$

where $N_{\mathrm{NORM}}$ indicates the number of unique normal IP addresses present in the comparison time window. Similarly, the authors of the evaluation framework define

$$\mathrm{tTN} = \frac{\mathrm{TN}}{N_{\mathrm{NORM}}}. \quad (10)$$

Following the previous notations, the authors of [9] defined the final error metrics as follows (a small numerical sketch of these formulas is given after the list):

(i) True positives rate:
$$\mathrm{TPR} = \frac{\mathrm{tTP}}{\mathrm{tFN} + \mathrm{tTP}}. \quad (11)$$

(ii) False positives rate:
$$\mathrm{FPR} = \frac{\mathrm{tFP}}{\mathrm{tTN} + \mathrm{tFP}}. \quad (12)$$

(iii) Precision:
$$P = \frac{\mathrm{tTP}}{\mathrm{tTP} + \mathrm{tFP}}. \quad (13)$$

(iv) F-measure:
$$\mathrm{FM} = \frac{2 \cdot P \cdot \mathrm{TPR}}{P + \mathrm{TPR}}. \quad (14)$$

(v) Accuracy:
$$\mathrm{ACC} = \frac{\mathrm{tTP} + \mathrm{tTN}}{\mathrm{tTP} + \mathrm{tTN} + \mathrm{tFP} + \mathrm{tFN}}. \quad (15)$$

(vi) Error rate:
$$\mathrm{ERR} = \frac{\mathrm{tFP} + \mathrm{tFN}}{\mathrm{tTP} + \mathrm{tTN} + \mathrm{tFP} + \mathrm{tFN}}. \quad (16)$$
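To make the formulas concrete, the helper below computes the time-based metrics (6)-(16) for a single comparison time window given the raw counters; the input values in the example call are made up.

```python
import math

def window_metrics(TP, TN, FP, FN, n_bot, n_norm, n, alpha=0.01):
    cf = 1 + math.exp(-alpha * n)                 # correction function (6)
    tTP = TP * cf / n_bot                         # (7)
    tFN = FN * cf / n_bot                         # (8)
    tFP = FP / n_norm                             # (9)
    tTN = TN / n_norm                             # (10)
    tpr = tTP / (tFN + tTP)                       # (11)
    fpr = tFP / (tTN + tFP)                       # (12)
    prec = tTP / (tTP + tFP)                      # (13)
    fm = 2 * prec * tpr / (prec + tpr)            # (14)
    acc = (tTP + tTN) / (tTP + tTN + tFP + tFN)   # (15)
    err = (tFP + tFN) / (tTP + tTN + tFP + tFN)   # (16)
    return {"TPR": tpr, "FPR": fpr, "Precision": prec,
            "F-measure": fm, "Accuracy": acc, "Error rate": err}

# Example with made-up counters for one 300 s comparison window.
print(window_metrics(TP=3, TN=40, FP=2, FN=1, n_bot=4, n_norm=45, n=5))
```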


5.2. Evaluation Dataset. For the evaluation, we have used the CTU-13 dataset [8] and the same experimental setup as its authors. This dataset includes different scenarios which represent various types of attacks, including several types of botnets. Each of these scenarios contains traffic collected in the form of NetFlows. The data was collected to create a realistic testbed. Each of the scenarios has been recorded in a separate file as a NetFlow using CSV notation. Each of the rows in a file has the following attributes (columns):

(i) StartTime: start time of the recorded NetFlow
(ii) Dur: duration
(iii) Proto: IP protocol (e.g., UDP, TCP)
(iv) SrcAddr: source address
(v) Sport: source port
(vi) Dir: direction of the recorded communication
(vii) DstAddr: destination address
(viii) Dport: destination port
(ix) State: protocol state
(x) sTos: source type of service
(xi) dTos: destination type of service
(xii) TotPkts: total number of packets that have been exchanged between source and destination
(xiii) TotBytes: total bytes exchanged
(xiv) SrcBytes: number of bytes sent by the source
(xv) Label: label assigned to this NetFlow (e.g., background, normal, and botnet)

It must be noted that the "Label" field is an additional attribute provided by the authors of the dataset. Normally, the NetFlow will have 14 attributes and the "Label" will be assigned by the classifier. The dataset consists of 13 scenarios. In order to obtain results comparable with those presented in [8], the evaluation procedure must be preserved. Therefore, the scenarios available in the CTU-13 dataset have been used in the same manner as proposed by its authors, in which the system is trained on one set of scenarios and evaluated on different ones.
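As an illustration, loading one scenario file into Spark might look like the sketch below; the path is a placeholder, the column list simply mirrors the attributes above, and the binary label mapping assumes (as in CTU-13) that botnet flows carry the substring "Botnet" in their label.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ctu13-load").getOrCreate()

ctu_columns = ["StartTime", "Dur", "Proto", "SrcAddr", "Sport", "Dir",
               "DstAddr", "Dport", "State", "sTos", "dTos",
               "TotPkts", "TotBytes", "SrcBytes", "Label"]

scenario = (
    spark.read
    .option("header", True)
    .csv("hdfs:///data/ctu13/scenario-1.csv")   # placeholder path
    .toDF(*ctu_columns)
)

# Collapse the detailed CTU-13 labels into a binary target for the classifier.
scenario = scenario.withColumn(
    "label",
    F.when(F.col("Label").contains("Botnet"), 1.0).otherwise(0.0),
)
```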

6. Results

6.1. Comparison with Other Methods. During the experiments, different configurations of the proposed pattern extraction methods have been analysed. The extracted feature vectors have been used to feed the random forest machine learning algorithms, which were also evaluated for different configurations. The following notation is used in Table 1 to describe a given configuration of the proposed method:

$$\mathrm{RF}\_N\_W\_S, \quad (17)$$

where $N$ indicates the number of trees used by the random forest classifier, $W$ indicates the width of the disjoint time windows, and $S$ represents the fraction of vectors undersampled from the majority class in order to balance the data before learning (cf. Algorithm 3 below).

Input:
  $N$ training samples of the form $(x_1, y_1), \ldots, (x_N, y_N)$, such that $x_i \in \mathbb{R}^M$ is a feature vector and $y_i \in L$ is its class label
  $P = \{p_i \mid i \in L\}$: sampling rates of labels $L$, where $p_i$ indicates the sampling probability of the $i$-th label
Output:
  Random forest classifier $D: X \rightarrow Y$

For $t = 1, \ldots, T$:
  (1) Sample $X, Y$ using the $P$ sampling rates
  (2) Train the random forest classifier $D_t$
  (3) Calculate the effectiveness $E_t$ of the $D_t$ classifier on the remaining training data
Return the $D_{t^*}$ classifier with the highest accuracy, where $t^* = \arg\max_t E_t$

Algorithm 3: Cost-sensitive random forest training.
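A compact sketch of Algorithm 3, built from the PySpark pieces used in the earlier sketches (train_df, rf, and per-class sampling via sampleBy), is given below; the sampling rates, the number of rounds T, and the holdout construction via set difference are simplifying assumptions.

```python
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

P = {0.0: 0.05, 1.0: 1.0}   # per-class sampling rates (majority class undersampled)
T = 5
evaluator = MulticlassClassificationEvaluator(labelCol="label", metricName="accuracy")

best_model, best_acc = None, -1.0
for t in range(T):
    sampled = train_df.sampleBy("label", fractions=P, seed=t)   # step (1)
    holdout = train_df.subtract(sampled)                        # "remaining training data" (approximated by set difference)
    model_t = rf.fit(sampled)                                   # step (2)
    acc_t = evaluator.evaluate(model_t.transform(holdout))      # step (3)
    if acc_t > best_acc:                                        # keep the best D_t
        best_model, best_acc = model_t, acc_t
```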

Table 1 also includes other algorithms that have been mentioned in the related work section, namely, BClus [8], CCD [9], and BotHunter [10]. The effectiveness of these methods (for the CTU-13 dataset) has already been presented in [8] and [9], respectively. We are able to compare our approach with these tools since we have used the same experimental setup. In Table 1, we report the performance metrics defined in the previous section, such as precision, accuracy, and F-measure.

Table 1 presents results obtained for five scenarios recorded in the testing dataset containing botnet activities. Scenario 1 corresponds to an IRC-based botnet that sends spam. In this scenario, CCD outperforms our proposed approach in terms of TPR, but it suffers a higher FPR. Moreover, in this scenario, the proposed algorithm for cost-sensitive random forest classifier learning allows us to find an improved balance between the TP and FP rates. For all experiments, we report the results for the same sampling ratios $P$ (Algorithm 3).

In Scenario 2, the same IRC-based botnet as in Scenario 1 sends spam, but for a shorter timespan. Similarly to the previous scenario, the cost-sensitive learning allows us to decrease the number of false positives in contrast to the original classifier. Moreover, for the same number of false positives (2%), our approach allowed us to achieve higher values of TPR.

In Scenario 6, the botnet scans SMTP mail servers for several hours and connects to several remote desktop services. However, it does not send any spam. The proposed approach performs very well for this type of attack when compared to the other methods.

In Scenario 8, the botnet contacts different C&C hosts and receives some encrypted data. All of the methods achieve a relatively low value of TPR. However, the cost-sensitive random forest gives a higher detection rate in contrast to CCD.


Table 1: Effectiveness of the proposed cost-sensitive distributed random forest classifiers compared to the other benchmark methods reported by the authors of the benchmark dataset in [8, 9].

Scenario 1
Algorithm       TPR   FPR   Precision  Accuracy  Error rate  F-measure
RF_30_1m_0.7    0.81  0.01  0.96       0.95      0.06        0.88
RF_30_1m_0.01   0.90  0.14  0.68       0.87      0.13        0.77
CCD             1.00  0.05  0.86       0.96      0.03        0.92
BClus           0.40  0.40  0.50       0.50      0.40        0.48
BH              0.01  0.00  0.80       0.40      0.50        0.02

Scenario 2
Algorithm       TPR   FPR   Precision  Accuracy  Error rate  F-measure
RF_30_1m_0.7    1.00  0.02  0.97       0.99      0.01        0.99
RF_30_1m_0.01   0.95  0.25  0.73       0.83      0.17        0.82
CCD             0.74  0.02  0.96       0.88      0.11        0.92
BClus           0.30  0.20  0.60       0.50      0.40        0.41
BH              0.02  0.00  0.90       0.30      0.60        0.04

Scenario 6
Algorithm       TPR   FPR   Precision  Accuracy  Error rate  F-measure
RF_30_1m_0.7    1.00  0.00  1.00       1.00      0.00        1.00
RF_30_1m_0.01   0.96  0.19  0.74       0.86      0.14        0.83
CCD             0.00  0.00  --         0.64      0.35        0.00
BClus           0.00  0.00  0.40       0.40      0.50        0.04
BH              0.06  0.00  0.98       0.38      0.61        0.11

Scenario 8
Algorithm       TPR   FPR   Precision  Accuracy  Error rate  F-measure
RF_30_1m_0.7    0.20  0.04  0.94       0.81      0.19        0.33
RF_30_1m_0.01   0.25  0.13  0.37       0.73      0.27        0.30
CCD             0.00  0.00  --         0.64      0.35        0.00
BClus           0.00  0.04  0.00       0.66      0.33        --
BH              0.00  0.00  0.00       0.42      0.57        --

Scenario 9
Algorithm       TPR   FPR   Precision  Accuracy  Error rate  F-measure
RF_30_1m_0.7    0.92  0.01  0.99       0.95      0.05        0.96
RF_30_1m_0.01   0.94  0.09  0.99       0.94      0.06        0.95
CCD             0.38  0.04  0.93       0.59      0.40        0.54
BClus           0.10  0.20  0.40       0.40      0.50        0.25
BH              0.02  0.00  0.90       0.40      0.50        0.03

In Scenario 9, some hosts are infected with the Neris malware, which actively starts sending spam e-mails. As in the previous scenarios, our approach also allows us to achieve higher effectiveness than the other methods.

In summary, the results of our system are good in comparison to the state of the art. It is worth noting that, even if the TPR is not higher, usually the FPR is lower, which satisfies the requirements of security administrators.

The first aspect that should be commented on here is the poor performance of BClus and CCD in Scenarios 6 and 8. Upon consulting the original authors regarding this, it turned out that their results reported in [8, 9] were achieved for unidirectional NetFlows, while the published dataset contains only bidirectional NetFlows. This may cause deterioration of the results. It may also be caused by the fact that (e.g., in the case of Scenario 6) the malware has used only a few and quite specific communication channels to communicate with the C&C, which has not been reflected in the training data. On the other hand, the poor performance of BClus could also be related to the fact that this method builds a statistical model that utilises GMM clustering in the NetFlow parameters feature space, and probably the dataset (at least for some scenarios) cannot be modelled well by a Gaussian distribution.

It must be noted that high error rates are also reported for BotHunter. BotHunter (BH) is a popular tool that uses a mix of anomaly- and signature-based techniques. As was noted in [22], the reasons for such poor performance could be (a) the fact that BH is not able to detect machines that have already been infected before the BH deployment and (b) the fact that the data from the external bots are not directly observed and are not sufficient for dialog analysis.


Table 2: Comparison of TP and FP obtained for different time window scaling factors.

Feature vectors  Metric  Scenario 1  Scenario 2  Scenario 6  Scenario 8  Scenario 9
1m               TPR     0.81        1.00        1.00        0.20        0.92
                 FPR     0.01        0.02        0.00        0.04        0.01
1m + 4m          TPR     0.75        0.90        1.00        0.20        0.82
                 FPR     0.01        0.07        0.00        0.01        0.01
1m + 3m          TPR     0.86        0.92        1.00        0.20        0.94
                 FPR     0.02        0.05        0.01        0.01        0.01
1m + 2m          TPR     0.84        0.95        1.00        0.20        0.92
                 FPR     0.01        0.07        0.01        0.01        0.01

The second aspect is the statistical significance of the results. The reason why we did not include confidence levels for the presented results is the procedure proposed by the authors of the CTU benchmark dataset in [9]. The dataset consists of 13 cybersecurity-related scenarios, which have been divided into two parts: one for training and one for validation. Therefore, we can train the system on one part of the data and present the results (generated with the dedicated software tool) for another part. In a similar way, other authors present their results in [8, 22]. In such an experimental protocol, it is not relevant to provide confidence levels.

6.2. Evaluation of the Multiscale Feature Extraction Technique. In this section, we present results for our feature extraction algorithm, which uses a multiscale analysis technique. In particular, we have investigated under which circumstances the concatenation of feature vectors obtained for several time windows of different lengths can further improve the performance of our approach.

Results are presented in Table 2. In all experiments, we trained the proposed cost-sensitive random forest algorithm using feature vectors created from two time windows. The first row in the table presents the basic algorithm (using only one time window), indicated in Table 1 as RF_30_1m_0.7. It can be noted that the proposed multiscale feature extraction technique improves results in 3 of the 5 analysed scenarios. Concatenating the feature vectors calculated for 1-minute time windows with feature vectors calculated for 3-minute time windows (1m + 3m) allows us to improve the TPR in Scenarios 1 and 9 by 5% and 2%, respectively. In both of these scenarios, the algorithm was able to achieve similar values of FPR. In the case of Scenario 8, the proposed method was also able to decrease the FPR. However, we observed a deterioration of results for Scenario 2.

When concatenating 1-minute time windows with 2-minute time windows, an improvement in results can also be observed; however, it is less significant.

7. Conclusions

In this paper, the results of a proposed malware detection distributed system based on a Big Data analytics concept and architecture have been presented. The proposed approach analyses the malware activity that is captured by means of NetFlows. This paper presented the architecture of the proposed system, the implementation details, and an analysis of promising experimental results. Moreover, we contributed a method utilising a distributed random forest classifier to address the problem of data imbalance (typical in cybersecurity). Future work is dedicated to further improvements towards the online machine learning concept (e.g., to analyse online streams).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] R. Kozik, M. Choraś, A. Flizikowski, M. Theocharidou, V. Rosato, and E. Rome, "Advanced services for critical infrastructures protection," Journal of Ambient Intelligence and Humanized Computing, vol. 6, no. 6, pp. 783-795, 2015.
[2] M. Choraś, R. Kozik, R. Renk, and W. Hołubowicz, "A practical framework and guidelines to enhance cyber security and privacy," in International Joint Conference, vol. 369 of Advances in Intelligent Systems and Computing, pp. 485-495, Springer International Publishing, Cham, 2015.
[3] SNORT project homepage, http://www.snort.org.
[4] T. Andrysiak, Ł. Saganowski, M. Choraś, and R. Kozik, "Network traffic prediction and anomaly detection based on ARFIMA model," in International Joint Conference SOCO'14-CISIS'14-ICEUTE'14, vol. 299 of Advances in Intelligent Systems and Computing, pp. 545-554, Springer International Publishing, 2014.
[5] M. Choraś, R. Kozik, D. Puchalski, and W. Hołubowicz, "Correlation approach for SQL injection attacks detection," Advances in Intelligent Systems and Computing, vol. 189, pp. 177-185, 2013.
[6] B. Claise, "Cisco Systems NetFlow Services Export Version 9," Tech. Rep. RFC 3954, 2004.
[7] J. Francois, S. Wang, and R. State, "BotTrack: tracking botnets using NetFlow and PageRank," in Proceedings of IFIP TC6 Networking, pp. 1-14, Springer.
[8] S. García, M. Grill, J. Stiborek, and A. Zunino, "An empirical comparison of botnet detection methods," Journal of Computers and Security, vol. 45, pp. 100-123, 2014.
[9] S. García, Identifying, Modeling and Detecting Botnet Behaviors in the Network [Ph.D. thesis], 2014.
[10] BotHunter homepage, http://www.bothunter.net/about.html.
[11] F. Haddadi, D. Le Cong, L. Porter, and A. N. Zincir-Heywood, "On the effectiveness of different botnet detection approaches," Lecture Notes in Computer Science, vol. 9065, pp. 121-135, 2015.
[12] HDFS: Hadoop Distributed File System - Architecture Guide, https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.
[13] SparkSQL: Apache Spark SQL query language - overview, http://spark.apache.org/docs/latest/sql-programming-guide.html#sql.
[14] S. K. Jayanthi and S. Sasikala, "Reptree classifier for identifying link spam in web search engines," ICTACT Journal on Soft Computing, vol. 03, no. 02, pp. 498-505, 2013.
[15] G. Folino and F. S. Pisani, "Combining ensemble of classifiers by using genetic programming for cyber security applications," Lecture Notes in Computer Science, vol. 9028, pp. 54-66, 2015.
[16] R. Kozik and M. Choraś, "Adapting an ensemble of one-class classifiers for a web-layer anomaly detection system," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2015), pp. 724-729, November 2015.
[17] M. Woźniak, "Data and knowledge hybridization," in Hybrid Classifiers, vol. 519 of Studies in Computational Intelligence, pp. 59-93, Springer, Berlin, Heidelberg, 2013.
[18] R. Kozik and M. Choraś, "Solution to data imbalance problem in application layer anomaly detection systems," in Hybrid Artificial Intelligent Systems, vol. 9648 of Lecture Notes in Computer Science, pp. 441-450, Springer International Publishing, 2016.
[19] B. Krawczyk, "Learning from imbalanced data: open challenges and future directions," Progress in Artificial Intelligence, vol. 5, no. 4, pp. 221-232, 2016.
[20] Botnet Detectors Comparer, download page of the tool, http://downloads.sourceforge.net/project/botnetdetectorscomparer/BotnetDetectorsComparer-0.9.
[21] Y. Wang and S. Xia, "Unifying attribute splitting criteria of decision trees by Tsallis entropy," in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2507-2511, New Orleans, LA, March 2017.
[22] J. Wang and I. C. Paschalidis, "Botnet detection based on anomaly and community detection," IEEE Transactions on Control of Network Systems, vol. 4, no. 2, pp. 392-404, 2017.


Page 2: Pattern Extraction Algorithm for NetFlow-Based …downloads.hindawi.com/journals/scn/2017/6047053.pdfPattern Extraction Algorithm for NetFlow-Based Botnet ... we describe the proposed

2 Security and Communication Networks

Distributed le system

Node1

Node2

Node3

Node

Master

Batch and stream processing

Cluster

Visualisationand statistics

Predictive models

Anomalynovelty detection

Admin dashboard

Network data

Nmiddot middot middot

Figure 1 Overview of the proposed system for Big Data patterns extraction for cybersecurity

2 Related Work in CounteringBotnets and Malware

There are two approaches for cyber-attack detectionsignature-based and anomaly-based The signatures (in theform of reactive rules) of an attack used by software like Snort[3] are provided by experts from the cyber community Typ-ically for deterministic attacks it is easy to develop patternsthat will identify the specific attack Then those patterns (inthe form of signatures) are added to the database and when-ever they match the examined traffic the attack is recognisedand the alarm is raised However the task of developingnew signatures becomes more complicated when it comesto polymorphic worms or viruses Such software commonlymodifies and obfuscates its code (without changing theinternal algorithms) to be less predictive and harder to detectIn such situations signatures do not work as an efficientapproach to attack detectionThe development of an efficientand scalable method for malware detection is currently chal-lenging also due to the general unavailability of raw networkdata This is largely due to the trend of protecting usersprivacy for administrative and legal reasons (such as GeneralData Protection Regulation in Europe) Unfortunately theseimportant regulations create difficulties for research anddevelopment [4 5] A common alternative to solve theabove-mentioned problem of the lack of the traffic data iscalled NetFlow [6] data which is often captured by ISPs forauditing and performance monitoring purposes NetFlowsamples do not contain sensitive information and thereforeare widely available to the research community Howeverthe disadvantage is that such samples do not contain the rawcontent of network packets In the literature there are differ-ent approaches focusing on the analysis of NetFlow data In[7] the authors focused on host dependencymodelling usingNetFlow data analysis and malware detection However theyfocus only on peer-to-peer communication schemes On theother hand the BClus algorithm proposed in [8] uses a moregeneric approach based on behavioral analysis for botnetdetection In particular the BClus method uses (similar

to our proposed approach) low-level features to identifypotentially malicious traffic The algorithm aggregatesNetFlows for specific IP addresses and clusters themaccording to statistical characteristics The properties of theclusters are described and used for further botnet detectionHowever the EM-based clustering procedure may lead toa situation where some significant information is missingor the Gaussian model is not accurate enough (due to thelimited amount of the data in the cluster) The CCDetectormethod presented in [9] uses a state-based behavioral modelof the known command and control channels The authorof this algorithm proposes to use a Markov Chain to modelmalware behavior and to detect similar traffic in unknownreal networks The key difference from both BClus and ourown approach is that instead of analysing the completetraffic of an infected computer as a whole the authorsseparate each individual connection from each IP addressand treat them as an independent connection The resultsobtained with this method are very promising However oneof the concerns is the complex and time-consuming learningphase Another approach is used in the BotHunter [10] toolIt monitors the two-way communication flows between hostswithin both an internal network and the Internet BotHunteremploys the Snort intrusion detection system It models aninfection sequence as a composition of participants and aloosely ordered sequence of network information exchangesHowever as shown in [8 11] the effectiveness of this toolis unsatisfactory in some setups In this paper we proposeanother approach based on a Big Data architecture anddistributed machine learning methodologies The details ofour contribution are given in the next section

3 Architecture Overview

In Figure 1 the general overview of the system design ispresented The information flow follows the typical dataprocessing pipeline used in Big Data processing Herebywe propose to deploy such an approach as a cybersecuritysolution

Security and Communication Networks 3

31 Apache Spark The proposed system adapts the BigData lambda architecture built on top of the scalable dataprocessing framework named Apache Spark It provides anengine that processes Big Data workloads There are severalkey elements of the architecture that facilitate distributedcomputing These terms will be further used in this paperthus here we briefly explain the definitions

(i) Node (or host) is a machine that belongs to thecomputing cluster

(ii) Driver (an application) is a module that contains theapplication code and communicates with the clustermanager

(iii) Master node (a cluster manager) is the elementcommunicating with the driver and responsible forallocating resources across applications

(iv) Worker node is a machine that can execute applica-tion code and holds an executor

(v) Context (SparkContext) is the main entry for Sparkfunctionalities which provides API for data manipu-lations (eg variables broadcasting data creations)

(vi) Executor is a process on aworker node that runs tasks(vii) Task is a unit of work sent to the executor

Apache Spark uses the data abstraction called ResilientDistributed Dataset (RDD) to manage the data distributed inthe cluster and to implement fault-tolerance

32 The Analysed Data The data acquired by the systemhave a form of NetFlows NetFlow is a standardised formatfor describing bidirectional communication and containsinformation such as IP source and destination addressdestination port and amount of bytes exchanged

The collected NetFlows are stored in the HDFS (HadoopDistributed File System [12]) for further processing In thecurrent version of the proposed system the data mining andfeature extractionmethods work in a batch processing modeHowever in the future we plan to allow the system to analysethe streams of data containing the raw NetFlows directly asthey are received

It must also be noted that in this paper the problemof realistic testbed construction is not considered This isbecause in order to evaluate the effectiveness of the proposedalgorithms we have used the benchmark CTU-13 [8] datasetand we followed the experimental setup of its authors inorder to methodologically compare our results The datasetcontains different scenarios representing different infectionsand malware communication schemes with command andcontrol centre More details on the datasets are given in thefollowing sections

3.3. Feature Extraction. A single NetFlow usually does not provide enough evidence to decide whether a particular machine is infected or whether a particular request has malicious symptoms. Therefore, it is quite common [8, 9] for NetFlows to be aggregated in so-called time windows, so that more contextual data can be extracted and malicious behavior recorded (e.g., port scanning or packet flooding effects).

In such approaches, various statistics are extracted for each time window. In the current version of the proposed system, the SparkSQL [13] language is used to extract such statistics. Following the information pipeline, this step is related to the "data processing" stage. The result of the SQL query is a data abstraction called DataFrame, which can be further processed, stored, or converted to any other specific format that will satisfy the machine learning or pattern extraction algorithms.

In general, the proposed feature extraction method aggregates the NetFlows within each time window. For each time window, we group the NetFlows by the IP source address. For each group (containing NetFlows with the same time window and IP source address), we calculate the following statistics:

(i) Number of flows
(ii) Sum of transferred bytes
(iii) Average sum of bytes per NetFlow
(iv) Average communication time with each unique IP address
(v) Number of unique destination IP addresses
(vi) Number of unique destination ports
(vii) Most frequently used protocol (e.g., TCP, UDP)

The advantages of such an approach are that it allows the network administrator to identify possibly infected IP addresses and that the statistics are efficiently calculated with Apache Spark SQL queries, as sketched below.
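A possible realisation of this aggregation step, continuing the loading sketch above, is shown below. The column names follow the CTU-13 layout described later; the temporary view name, the 60-second window width, and the simplified aggregates are our own assumptions rather than the system's actual query:

```python
# Illustrative SparkSQL sketch: aggregate NetFlows per (time window, source IP)
# and compute statistics similar to those listed above. 60 s windows are assumed.
netflows.createOrReplaceTempView("netflows")

features = spark.sql("""
    SELECT floor(unix_timestamp(StartTime) / 60)  AS time_window,
           SrcAddr                                 AS src_ip,
           COUNT(*)                                AS num_flows,
           SUM(TotBytes)                           AS sum_bytes,
           AVG(TotBytes)                           AS avg_bytes_per_flow,
           AVG(Dur)                                AS avg_duration,
           COUNT(DISTINCT DstAddr)                 AS num_dst_ips,
           COUNT(DISTINCT Dport)                   AS num_dst_ports,
           FIRST(Proto)                            AS dominant_proto
    FROM netflows
    GROUP BY floor(unix_timestamp(StartTime) / 60), SrcAddr
""")
# Notes on simplifications: unix_timestamp() may need an explicit format string
# for the CTU-13 timestamp layout; FIRST(Proto) only approximates the "most
# frequently used protocol"; AVG(Dur) approximates the average communication time.
features.show(5)
```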

3.4. Multiscale Analysis. When it comes to malware behaviors recorded over a time period, it may be noticed that different kinds of malware manifest at different time-scales. For example, some intensive malicious activities can be observed in short time windows (e.g., spam sending or scanning), and these usually exhibit a significant volume of data of a specific type (e.g., a number of opened connections or a number of distinct port numbers). On the other hand, more sophisticated attacks are stretched over time and may exhibit relevant information in time windows of varying length.

To address the above-mentioned issues, we have adopted a form of multiscale analysis. This technique is often used in the image processing area to recognise an object at different scales. In this process, the original image is scaled by different magnification factors and an analysis is conducted for each of them. Moreover, the features calculated at specific scales can be used to build a more complex model (e.g., we can first detect wheels and a frame to help detect a bike). A similar approach can be used in our case, but instead of scaling the original signal, we scale the size of the time window over which the feature vectors are calculated.

As shown in Figure 2, multidimensional NetFlows can be seen as a time series. Moreover, as described in the previous section, the NetFlows are grouped by time windows, and for each time window statistics are calculated.


[Figure 2: Overview of the technique used for multiscale NetFlow time-series analysis. NetFlows on the time axis t are aggregated in windows of width t = 1m and t = 2m, and the resulting feature vectors are concatenated.]

For simplicity, in the presented example we are using two scales (one- and two-minute-wide time windows), but this approach can be easily extended to any number of scale factors. In the presented example, the final feature vector is a concatenation of the vectors calculated for both of the time windows. From a technical point of view, the calculations for each of the time windows happen in parallel on the Apache Spark cluster.
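The sketch below illustrates the concatenation idea in plain Python on already-aggregated per-window statistics; the helper functions, the simple two-value statistics, and the example flows are our own illustration, not the system's implementation:

```python
# Illustrative sketch of multiscale feature extraction: feature vectors computed
# for two time-window widths are concatenated into a single vector per source IP.
from collections import defaultdict

def window_features(flows, width_s):
    """Aggregate (timestamp, src_ip, bytes) tuples into per-window [n_flows, sum_bytes]."""
    stats = defaultdict(lambda: [0, 0])
    for ts, ip, size in flows:
        key = (int(ts // width_s), ip)
        stats[key][0] += 1
        stats[key][1] += size
    return stats

def multiscale_vector(flows, ip, ts, scales=(60, 120)):   # 1-minute and 2-minute scales
    """Concatenate the per-scale statistics describing `ip` at time `ts`."""
    vector = []
    for width in scales:
        per_window = window_features(flows, width)
        vector.extend(per_window.get((int(ts // width), ip), [0, 0]))
    return vector

flows = [(3.0, "10.0.0.5", 420), (65.0, "10.0.0.5", 128), (70.0, "10.0.0.7", 99)]
print(multiscale_vector(flows, "10.0.0.5", ts=65.0))   # -> [1, 128, 2, 548]
```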

4. Distributed Machine Learning

4.1. Decision Trees Bagging. Decision trees have often been used in various machine learning tasks [14]. However, a common problem of this classifier is its limited accuracy, which is often related to overfitting, as the structure of the tree may grow deep. Typically, decision trees have low bias but high variance. One of the solutions to overcome this issue is to use the bagging algorithm, which allows for averaging multiple decision trees, each trained on a different portion of the data.

In general, training the set of decision trees on the same data will produce highly correlated (or in many cases identical) trees. Therefore, the idea behind the bagging algorithm is to achieve an ensemble [15, 16] of decorrelated trees.

Let $L$ represent the learning dataset, which is a collection of $N$ pairs of input vectors and label values $(x_1, y_1), \ldots, (x_N, y_N)$, where $x_i \in X$ and $y_i \in Y$. Following the bagging procedure (Algorithm 1), the learning dataset $L$ is sampled (with replacement) $B$ times. For each data sample, the decision tree is trained, producing a classifier $D_b: X \rightarrow Y$. As we have $B$ classifiers, the average is computed using the following formula:

$$D(x) = \frac{1}{B} \sum_{b=1}^{B} D_b(x). \quad (1)$$

Input: a training set $X, Y = \{x_i, y_i\}_{i=1}^{N}$
Output: an ensemble of decision trees $D: X \rightarrow Y$

Training phase:
For $b = 1, \ldots, B$:
(1) Draw a bootstrap sample (with replacement) from $X$ and $Y$
(2) Train the decision tree $D_b$ on the sampled data

Algorithm 1: Decision trees bagging.
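A compact Python sketch of Algorithm 1, written with scikit-learn decision trees purely for illustration (the paper's system does not state that it uses scikit-learn, and the synthetic data below is hypothetical):

```python
# Illustrative bagging sketch (Algorithm 1): train B decision trees on bootstrap
# samples and average their class-probability outputs, as in formula (1).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_bagged_trees(X, y, B=10, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(B):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample (with replacement)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def predict_bagged(trees, X):
    # D(x) = (1/B) * sum_b D_b(x), here averaged over predicted class probabilities
    probs = np.mean([t.predict_proba(X) for t in trees], axis=0)
    return probs.argmax(axis=1)

X = np.random.rand(200, 7)                  # 7 synthetic features for the example
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # synthetic labels
trees = train_bagged_trees(X, y, B=25)
print(predict_bagged(trees, X[:5]))
```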

4.2. Classical Random Forest Classifier. The random forest (RF) classifier adapts a modification of the bagging algorithm. The difference lies in the process of growing the trees. The classical implementation of the RF classifier works according to the procedure shown in Algorithm 2.

The final prediction obtained from the $B$ trained trees is calculated using the average in (1) or the majority vote represented by (2), where $I$ stands for the indicator function and $D_b$ indicates the $b$-th random tree in the ensemble:

$$D(x) = \arg\max_i \sum_{b=1}^{B} I(D_b(x) = i). \quad (2)$$

4.3. Distributed Random Forest Classifier. The implementation of the random forest classifier in the MLlib Apache Spark library in general follows the classical algorithm presented in the previous section. However, it heavily leverages the distributed computing environment.

In principle, the algorithm works in so-called batch mode, meaning that a large portion of the data needs to be available to begin learning the classifier. This is one of the drawbacks of the current implementation (v2.1.0), as nowadays significant value is placed on the ability to perform online and stream classification. In fact, this is our plan for further work.

In general, the dataset provided to Apache Spark is arranged into rows of training instances (feature vectors with labels) and distributed across a number of nodes and partitions. The training algorithm can benefit from the distributed environment, since the learning process for each decision tree can be performed in parallel.

When the random forest is trained in the Apache Spark environment, the algorithm samples the learning data and assigns it to the decision tree, which is trained on that portion of the data. To optimise memory and space usage, the instances are not replicated explicitly but are instead augmented with additional data that hold the probability that the given instance belongs to the specific data partition used for training.

The workload is distributed among a master node and workers. The main loop manages a queue of nodes and runs iteratively. In each iteration, the algorithm searches for the best split for a node. In this procedure, the statistics are collected by a worker node.
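For illustration, training such a distributed forest on the extracted feature vectors might look like the PySpark sketch below; the column names, the labeled DataFrame `features_labeled`, and the parameter values (except the 30 trees used later in Table 1) are our assumptions, not the exact configuration used in the experiments:

```python
# Illustrative PySpark MLlib sketch: assemble feature columns and train a
# distributed random forest. Column names and parameters are assumptions.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

feature_cols = ["num_flows", "sum_bytes", "avg_bytes_per_flow", "avg_duration",
                "num_dst_ips", "num_dst_ports"]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train_df = assembler.transform(features_labeled)   # hypothetical labeled DataFrame

rf = RandomForestClassifier(labelCol="label", featuresCol="features",
                            numTrees=30, maxDepth=10, impurity="gini")
model = rf.fit(train_df)                           # trees are trained in parallel
predictions = model.transform(train_df).select("src_ip", "prediction")
predictions.show(5)
```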


Input: $N$ training samples of the form $(x_1, y_1), \ldots, (x_N, y_N)$, such that $x_i \in R^M$ is the feature vector and $y_i$ is its label
Output: random forest classifier $D: X \rightarrow Y$

The $N$ training samples (each with $M$ input variables) are sampled with replacement to produce $B$ partitions:
(1) Each of the $B$ partitions is used to train one of the trees.
(2) In the process of splitting, $m$ out of the $M$ variables are selected randomly.
(3) One variable out of $m$ is selected according to the impurity measure (e.g., Gini index or entropy [21]).
(4) The procedure is terminated either when the maximal depth of a tree is achieved or when there is no data to split.

Algorithm 2: Random forest training.

The feature used for splitting is chosen among the sampled list using the impurity criterion:

$$\sum_{i=1}^{C} f_i (1 - f_i), \quad (3)$$

where $C$ is the number of unique labels and $f_i$ is the frequency of label $i$. The algorithm terminates when the maximum height of the decision tree is reached or whenever there is no data point that is misclassified.

Another option for this classification task is the entropy measure:

$$\sum_{i=1}^{C} -f_i \log (f_i). \quad (4)$$
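As a small worked illustration of these two impurity measures (our own example, not taken from the paper), consider the label frequencies of one tree node:

```python
# Illustrative computation of the Gini impurity (3) and entropy (4)
# from label frequencies f_i of a single decision-tree node.
import math

def gini(freqs):
    return sum(f * (1.0 - f) for f in freqs)

def entropy(freqs):
    return sum(-f * math.log(f) for f in freqs if f > 0)

# Example node with 90% "normal" and 10% "botnet" NetFlow feature vectors:
freqs = [0.9, 0.1]
print(gini(freqs))      # 0.18  -> fairly pure node
print(entropy(freqs))   # ~0.325 (natural logarithm)
```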

The final output produced by the ensemble is the majority vote of the results produced by all decision trees.

Another drawback of the current version of the Apache Spark random forest classifier is that it does not handle cost-sensitive learning (e.g., assigning higher importance to data samples indicating an anomaly or cyber attack). As a result, it may be biased towards the majority class. Finding the right balance between detection effectiveness and the number of false alarms is important from the perspective of anomaly detection and intrusion detection systems. In principle, such systems should have a high recognition ratio but should not overwhelm the administrator with a large number of false alarms. Therefore, our proposals to handle this issue and improve the detection effectiveness are presented in the next section.

4.4. Data Imbalance Problem and Cost-Sensitive Learning. The problem of data imbalance has recently been deeply studied [17-19] in the areas of machine learning and data mining. In many cases, this problem negatively impacts machine learning algorithms and deteriorates the effectiveness of the classifier. Typically, classifiers in such cases will achieve higher predictive accuracy for the majority class but poorer predictive accuracy for the minority class.

This phenomenon is caused by the fact that the classifier tends to bias towards the majority class. Therefore, the challenge here is to retain the classification effectiveness even if the proportion of class labels is not equal. The imbalance of labels in our case is significant, as we may expect that in common cases of cybersecurity incidents only a few machines in the network will be infected and produce malicious traffic, while the majority will behave normally. In other words, most data contain clean traffic, while only a few data samples indicate malware.

The solutions for such a problem can be categorised as data-related and algorithm-related. The methods belonging to the data-related category use data oversampling and undersampling techniques, while the algorithm-related ones introduce modifications to the training procedures. The latter group can be further divided into methods using cost-sensitive classification (e.g., assigning a higher cost to errors on the minority class) and methods that use different performance metrics (e.g., the Kappa metric).

Cost-sensitive learning is an effective solution for class imbalance in large-scale settings. The procedure can be expressed with the following optimisation formula:

$$\min_{\beta} \; \frac{1}{2}\|\beta\|^2 + \frac{1}{2}\sum_{i=1}^{N} W_i \|e_i\|^2, \quad (5)$$

where $\beta$ indicates the classifier parameters, $e_i$ the error in the classifier response for the $i$-th (out of $N$) data samples, and $W_i$ the importance of the $i$-th data sample. In cost-sensitive learning, the idea is to give a higher importance to the minority class so that the bias towards the majority class is reduced.

In general, our approach to the cost-sensitive random forest classifier can be categorised as data-related. In our case, we have heavily imbalanced data: in some scenarios there are more than 28M normal samples but only 40k samples related to malware activities. Therefore, we have adopted the undersampling technique within the process of learning the distributed random forest. As shown in Algorithm 2, the training procedure already adopts


the sampling technique. The Apache Spark implementation uses Poisson sampling in order to produce subportions of the data for training each random tree. In our case, before introducing the data to the classifier, we undersample the majority class with an experimentally chosen probability.
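A possible way to perform this undersampling step before training, continuing the earlier PySpark sketch, is stratified sampling on the label column; the label encoding and the sampling fractions below are hypothetical, chosen only to mirror the imbalance described above:

```python
# Illustrative sketch: undersample the majority ("normal") class before training.
# Label encoding and sampling fractions are assumptions for the example.
fractions = {0.0: 0.01,   # keep ~1% of normal-traffic vectors (majority class)
             1.0: 1.0}    # keep all botnet-related vectors (minority class)

balanced_df = train_df.sampleBy("label", fractions=fractions, seed=42)
balanced_df.groupBy("label").count().show()   # inspect the resulting class ratio

model = rf.fit(balanced_df)                   # train the forest on the balanced data
```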

5. Experiments

5.1. Evaluation Methodology. To prove the effectiveness of the proposed method, the test methodology exploiting time-based metrics (defined in [8]) has been used. In order to keep this paper self-contained, the methodology is briefly introduced in this section. The authors of the methodology have created and published a tool called Botnet Detectors Comparer, which is publicly available for download [20]. We decided to follow their experiments and compare our results directly to their work. Their tool analyses the NetFlow file augmented with prediction labels produced by different methods and executes the following steps to produce the final effectiveness results [8]:

(1) NetFlows are separated into comparison time windows (we have used the default time window length of 300 s).

(2) The ground-truth NetFlow labels are compared to the predicted labels, and the TP (true positive), TN (true negative), FP (false positive), and FN (false negative) values are accumulated.

(3) At the end of each comparison time window, the following performance indicators are calculated for that time window: FPR (false positive rate), TPR (true positive rate), TNR (true negative rate), FNR (false negative rate), precision, accuracy, error rate, and F-measure.

(4) When the whole file with NetFlows is processed, the final error metrics are calculated and produced. As explained in [8], calculating the TP count at the NetFlow level does not make practical sense, because the administrator would be overwhelmed by the high amount of information (the number of NetFlows per second is high). Instead, IP-based performance metrics are used:

(i) TP: a true positive counter is incremented during the comparison time window whenever a Botnet IP address is detected as botnet at least once.

(ii) TN: a true negative counter is incremented during the comparison time window whenever a Normal IP address is detected as nonbotnet during the entire time window.

(iii) FP: a false positive counter is incremented during the comparison time window whenever a Normal IP address is detected as botnet at least once.

(iv) FN: a false negative counter is incremented during the comparison time window whenever a Botnet IP address is detected as nonbotnet during the entire time window.

The authors of the tool have also incorporated time into the above-mentioned metrics to emphasise the fact that detecting a malicious IP earlier is better than detecting it later. The time-based component is called the correction function (cf) and is calculated using the following formula:

$$\text{cf}(n) = 1 + e^{-\alpha n}, \quad (6)$$

where $n$ indicates a comparison time window and $\alpha$ is a constant value set to 0.01 (as suggested in [9]). Using the correction function, the error metrics are calculated as follows:

$$\text{tTP}(n) = \frac{\text{TP} \cdot \text{cf}(n)}{N_{\text{BOT}}}, \quad (7)$$

where tTP indicates the time-based true positive number, cf the correction function, and $N_{\text{BOT}}$ the number of unique botnet IP addresses present in the comparison time window. Similarly, we can define

$$\text{tFN}(n) = \frac{\text{FN} \cdot \text{cf}(n)}{N_{\text{BOT}}}. \quad (8)$$

tFP and tTN do not depend on the time and are defined as follows:

$$\text{tFP} = \frac{\text{FP}}{N_{\text{NORM}}}, \quad (9)$$

where $N_{\text{NORM}}$ indicates the number of unique normal IP addresses present in the comparison time window. Similarly, the authors of the evaluation framework define

$$\text{tTN} = \frac{\text{TN}}{N_{\text{NORM}}}. \quad (10)$$

Following the previous notations, the authors of [9] defined the final error metrics as follows:

(i) True positives rate:
$$\text{TPR} = \frac{\text{tTP}}{\text{tFN} + \text{tTP}}. \quad (11)$$

(ii) False positives rate:
$$\text{FPR} = \frac{\text{tFP}}{\text{tTN} + \text{tFP}}. \quad (12)$$

(iii) Precision:
$$P = \frac{\text{tTP}}{\text{tTP} + \text{tFP}}. \quad (13)$$

(iv) F-measure:
$$\text{FM} = \frac{2 \cdot P \cdot \text{TPR}}{P + \text{TPR}}. \quad (14)$$

(v) Accuracy:
$$\text{ACC} = \frac{\text{tTP} + \text{tTN}}{\text{tTP} + \text{tTN} + \text{tFP} + \text{tFN}}. \quad (15)$$

(vi) Error rate:
$$\text{ERR} = \frac{\text{tFP} + \text{tFN}}{\text{tTP} + \text{tTN} + \text{tFP} + \text{tFN}}. \quad (16)$$


5.2. Evaluation Dataset. For the evaluation, we have used the CTU-13 dataset [8] and the same experimental setup as its authors. This dataset includes different scenarios which represent various types of attacks, including several types of botnets. Each of these scenarios contains traffic collected in the form of NetFlows. The data was collected to create a realistic testbed. Each of the scenarios has been recorded in a separate file as NetFlows using CSV notation. Each of the rows in a file has the following attributes (columns):

(i) StartTime: start time of the recorded NetFlow
(ii) Dur: duration
(iii) Proto: IP protocol (e.g., UDP, TCP)
(iv) SrcAddr: source address
(v) Sport: source port
(vi) Dir: direction of the recorded communication
(vii) DstAddr: destination address
(viii) Dport: destination port
(ix) State: protocol state
(x) sTos: source type of service
(xi) dTos: destination type of service
(xii) TotPkts: total number of packets exchanged between source and destination
(xiii) TotBytes: total bytes exchanged
(xiv) SrcBytes: number of bytes sent by the source
(xv) Label: label assigned to this NetFlow (e.g., background, normal, and botnet)

It must be noted that the "Label" field is an additional attribute provided by the authors of the dataset. Normally, a NetFlow will have 14 attributes, and the "Label" will be assigned by the classifier. The dataset consists of 13 scenarios. In order to obtain results comparable to those presented in [8], the evaluation procedure must be preserved. Therefore, the scenarios available in the CTU-13 dataset have been used in the same manner as proposed by its authors, in which the system is trained on one set of scenarios and evaluated on different ones.
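When loading these files, an explicit schema can be declared instead of relying on schema inference; the sketch below lists the 15 columns described above, while the chosen data types and the file path are our assumptions:

```python
# Illustrative explicit schema for the CTU-13 NetFlow CSV files.
# Column names follow the dataset description; the type choices are assumptions.
from pyspark.sql.types import (StructType, StructField,
                               StringType, DoubleType, LongType)

ctu13_schema = StructType([
    StructField("StartTime", StringType()),
    StructField("Dur",       DoubleType()),
    StructField("Proto",     StringType()),
    StructField("SrcAddr",   StringType()),
    StructField("Sport",     StringType()),
    StructField("Dir",       StringType()),
    StructField("DstAddr",   StringType()),
    StructField("Dport",     StringType()),
    StructField("State",     StringType()),
    StructField("sTos",      DoubleType()),
    StructField("dTos",      DoubleType()),
    StructField("TotPkts",   LongType()),
    StructField("TotBytes",  LongType()),
    StructField("SrcBytes",  LongType()),
    StructField("Label",     StringType()),   # ground truth, not used as a feature
])

netflows = (spark.read.option("header", True).schema(ctu13_schema)
            .csv("hdfs:///data/ctu13/scenario1.binetflow"))   # hypothetical path
```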

6. Results

6.1. Comparison with Other Methods. During the experiments, different configurations of the proposed pattern extraction methods have been analysed. The extracted feature vectors have been used to feed the random forest machine learning algorithms, which were also evaluated for different configurations. The following notation is used in Table 1 to describe a particular configuration of the proposed method:

$$\text{RF}_{N,\,W,\,S}, \quad (17)$$

where $N$ indicates the number of trees used by the random forest classifier, $W$ indicates the width of the disjoint time windows, and $S$ represents the fraction of vectors undersampled from the majority class (in order to balance the data before learning).

Input: $N$ training samples of the form $(x_1, y_1), \ldots, (x_N, y_N)$, such that $x_i \in R^M$ is a feature vector and $y_i \in L$ is its class label; $P = \{p_i \mid i \in L\}$: sampling rates of labels $L$, where $p_i$ indicates the sampling probability of the $i$-th label
Output: random forest classifier $D: X \rightarrow Y$

For $t = 1, \ldots, T$:
(1) Sample $X, Y$ using the $P$ sampling rates
(2) Train the random forest classifier $D_t$
(3) Calculate the effectiveness $E_t$ of the $D_t$ classifier on the remaining training data
Return the $D_{t^*}$ classifier with the highest accuracy, where $t^* = \arg\max_t E_t$

Algorithm 3: Cost-sensitive random forest training.
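A minimal sketch of Algorithm 3, continuing the PySpark snippets above; the sampling rates, the number of trials, the 80/20 split used to approximate the "remaining training data" of step (3), and the accuracy metric are assumptions for illustration:

```python
# Illustrative sketch of Algorithm 3: repeat stratified undersampling, train a
# forest on each sample, and keep the classifier with the best held-out accuracy.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

P = {0.0: 0.01, 1.0: 1.0}      # assumed per-label sampling rates (majority vs. minority)
T = 5                          # assumed number of trials
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              metricName="accuracy")

# Hold out part of the training data to approximate the "remaining training data".
train_part, eval_part = train_df.randomSplit([0.8, 0.2], seed=7)

best_model, best_score = None, -1.0
for t in range(T):
    sampled = train_part.sampleBy("label", fractions=P, seed=t)    # step (1)
    model_t = rf.fit(sampled)                                      # step (2)
    score_t = evaluator.evaluate(model_t.transform(eval_part))     # step (3)
    if score_t > best_score:
        best_model, best_score = model_t, score_t

print("best accuracy on held-out training data:", best_score)
```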

Table 1 also includes other algorithms that have been mentioned in the related work section, namely, BClus [8], CCD [9], and BotHunter [10]. The effectiveness of these methods (for the CTU-13 dataset) has already been presented in [8] and [9], respectively. We are able to compare our approach with these tools directly, since we have used the same experimental setup. In Table 1 we report the performance metrics defined in the previous section, such as precision, accuracy, and F-measure.

Table 1 presents the results obtained for five scenarios recorded in the testing dataset containing botnet activities. Scenario 1 corresponds to an IRC-based botnet that sends spam. In this scenario, CCD outperforms our proposed approach in terms of TPR, but it suffers from a higher FPR. Moreover, in this scenario, the proposed algorithm for cost-sensitive random forest classifier learning allows us to find an improved balance between the TP and FP rates. For all experiments, we report the results for the same sampling ratios $P$ (Algorithm 3).

In Scenario 2, the same IRC-based botnet as in Scenario 1 sends spam, but for a shorter timespan. As in the previous scenario, the cost-sensitive learning allows us to decrease the number of false positives in contrast to the original classifier. Moreover, for the same false positive rate (2%), our approach allowed us to achieve higher values of TPR.

In Scenario 6, the botnet scans SMTP mail servers for several hours and connects to several remote desktop services. However, it does not send any spam. The proposed approach performs very well for this type of attack when compared to the other methods.

In Scenario 8, the botnet contacts different C&C hosts and receives some encrypted data. All of the methods achieve a relatively low value of TPR. However, the cost-sensitive random forest gives a higher detection rate than CCD.


Table 1: Effectiveness of the proposed cost-sensitive distributed random forest classifiers compared to the other benchmark methods reported by the authors of the benchmark dataset in [8, 9].

Scenario 1
Algorithm           TPR   FPR   Precision  Accuracy  Error rate  F-measure
RF_{30, 1m, 0.7}    0.81  0.01  0.96       0.95      0.06        0.88
RF_{30, 1m, 0.01}   0.90  0.14  0.68       0.87      0.13        0.77
CCD                 1.00  0.05  0.86       0.96      0.03        0.92
BClus               0.40  0.40  0.50       0.50      0.40        0.48
BH                  0.01  0.00  0.80       0.40      0.50        0.02

Scenario 2
Algorithm           TPR   FPR   Precision  Accuracy  Error rate  F-measure
RF_{30, 1m, 0.7}    1.00  0.02  0.97       0.99      0.01        0.99
RF_{30, 1m, 0.01}   0.95  0.25  0.73       0.83      0.17        0.82
CCD                 0.74  0.02  0.96       0.88      0.11        0.92
BClus               0.30  0.20  0.60       0.50      0.40        0.41
BH                  0.02  0.00  0.90       0.30      0.60        0.04

Scenario 6
Algorithm           TPR   FPR   Precision  Accuracy  Error rate  F-measure
RF_{30, 1m, 0.7}    1.00  0.00  1.00       1.00      0.00        1.00
RF_{30, 1m, 0.01}   0.96  0.19  0.74       0.86      0.14        0.83
CCD                 0.00  0.00  —          0.64      0.35        0.00
BClus               0.00  0.00  0.40       0.40      0.50        0.04
BH                  0.06  0.00  0.98       0.38      0.61        0.11

Scenario 8
Algorithm           TPR   FPR   Precision  Accuracy  Error rate  F-measure
RF_{30, 1m, 0.7}    0.20  0.04  0.94       0.81      0.19        0.33
RF_{30, 1m, 0.01}   0.25  0.13  0.37       0.73      0.27        0.30
CCD                 0.00  0.00  —          0.64      0.35        0.00
BClus               0.00  0.04  0.00       0.66      0.33        —
BH                  0.00  0.00  0.00       0.42      0.57        —

Scenario 9
Algorithm           TPR   FPR   Precision  Accuracy  Error rate  F-measure
RF_{30, 1m, 0.7}    0.92  0.01  0.99       0.95      0.05        0.96
RF_{30, 1m, 0.01}   0.94  0.09  0.99       0.94      0.06        0.95
CCD                 0.38  0.04  0.93       0.59      0.40        0.54
BClus               0.10  0.20  0.40       0.40      0.50        0.25
BH                  0.02  0.00  0.90       0.40      0.50        0.03

In Scenario 9, some hosts are infected with the Neris malware, which actively starts sending spam e-mails. As in the previous scenarios, our approach also allows us to achieve higher effectiveness than the other methods.

In summary, the results of our system are good in comparison to the state of the art. It is worth noting that even when the TPR is not higher, the FPR is usually lower, which satisfies the requirements of security administrators.

The first aspect that should be commented on here is the poor performance of BClus and CCD in Scenarios 6 and 8. Upon consulting the original authors regarding this, it turned out that their results reported in [8, 9] were achieved for unidirectional NetFlows, while the published dataset contains only bidirectional NetFlows. This may cause deterioration of the results. It may also be caused by the fact

that (e.g., in the case of Scenario 6) the malware has used only a few quite specific communication channels to communicate with the C&C, which has not been reflected in the training data. On the other hand, the poor performance of BClus could also be related to the fact that this method builds a statistical model that utilises GMM clustering in the NetFlow parameters feature space, and probably the dataset (at least for some scenarios) cannot be modelled well by a Gaussian distribution.

It must be noted that high error rates are also reported for BotHunter. BotHunter (BH) is a popular tool that uses a mix of anomaly- and signature-based techniques. As noted in [22], the reasons for such poor performance could be (a) the fact that BH is not able to detect machines that had already been infected before the BH deployment and (b)


Table 2: Comparison of TPR and FPR obtained for different time window scaling factors.

Feature vectors  Error metric  Scenario 1  Scenario 2  Scenario 6  Scenario 8  Scenario 9
1m               TPR           0.81        1.00        1.00        0.20        0.92
1m               FPR           0.01        0.02        0.00        0.04        0.01
1m + 4m          TPR           0.75        0.90        1.00        0.20        0.82
1m + 4m          FPR           0.01        0.07        0.00        0.01        0.01
1m + 3m          TPR           0.86        0.92        1.00        0.20        0.94
1m + 3m          FPR           0.02        0.05        0.01        0.01        0.01
1m + 2m          TPR           0.84        0.95        1.00        0.20        0.92
1m + 2m          FPR           0.01        0.07        0.01        0.01        0.01

the fact that the data from the external bots are not directly observed and are not sufficient for dialog analysis.

The second aspect is the statistical significance of the results. The reason why we did not include confidence levels for the presented results is the procedure proposed by the authors of the CTU benchmark dataset in [9]. The dataset consists of 13 cybersecurity-related scenarios, which have been divided into two parts: one for training and one for validation. Therefore, we train the system on one part of the data and present the results (generated with the dedicated software tool) for the other part. Other authors present their results in [8, 22] in a similar way. In such an experimental protocol, providing confidence levels is not relevant.

6.2. Evaluation of Multiscale Feature Extraction Technique. In this section, we present results for our feature extraction algorithm, which uses a multiscale analysis technique. In particular, we have investigated under which circumstances the concatenation of feature vectors obtained for several time windows of different lengths can further improve the performance of our approach.

Results are presented in Table 2. In all experiments, we trained the proposed cost-sensitive random forest algorithm using feature vectors created from two time windows. The first row in the table presents the basic algorithm (using only one time window), indicated in Table 1 as RF_{30, 1m, 0.7}. It can be noted that the proposed multiscale feature extraction technique improves the results in 3 of the 5 analysed scenarios. Concatenating the feature vectors calculated for 1-minute time windows with feature vectors calculated for 3-minute time windows (1m + 3m) allows us to improve the TPR in Scenarios 1 and 9 by 5 and 2 percentage points, respectively. In both of these scenarios, the algorithm was able to achieve similar values of FPR. In the case of Scenario 8, the proposed method was also able to decrease the FPR. However, we observed a deterioration of the results for Scenario 2.

When concatenating 1-minute time windows with 2-minute time windows, an improvement in the results can also be observed; however, it is less significant.

7. Conclusions

In this paper, the results of a proposed malware detection distributed system based on a Big Data analytics concept and architecture have been presented. The proposed approach analyses the malware activity that is captured by means of NetFlows. This paper presented the architecture of the proposed system, the implementation details, and an analysis of promising experimental results. Moreover, we contributed a method utilising a distributed random forest classifier to address the problem of data imbalance (typical in cybersecurity). Future work is dedicated to further improvements towards the online machine learning concept (e.g., to analyse online streams).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] R. Kozik, M. Choras, A. Flizikowski, M. Theocharidou, V. Rosato, and E. Rome, "Advanced services for critical infrastructures protection," Journal of Ambient Intelligence and Humanized Computing, vol. 6, no. 6, pp. 783–795, 2015.

[2] M. Choras, R. Kozik, R. Renk, and W. Hołubowicz, "A practical framework and guidelines to enhance cyber security and privacy," in International Joint Conference, vol. 369 of Advances in Intelligent Systems and Computing, pp. 485–495, Springer International Publishing, Cham, 2015.

[3] SNORT project homepage, http://www.snort.org.

[4] T. Andrysiak, Ł. Saganowski, M. Choras, and R. Kozik, "Network traffic prediction and anomaly detection based on ARFIMA model," in International Joint Conference SOCO'14-CISIS'14-ICEUTE'14, vol. 299 of Advances in Intelligent Systems and Computing, pp. 545–554, Springer International Publishing, 2014.

[5] M. Choras, R. Kozik, D. Puchalski, and W. Hołubowicz, "Correlation approach for SQL injection attacks detection," Advances in Intelligent Systems and Computing, vol. 189, pp. 177–185, 2013.

[6] B. Claise, "Cisco Systems NetFlow Services Export Version 9," Tech. Rep. RFC 3954, 2004.

[7] J. Francois, S. Wang, and R. State, "BotTrack: tracking botnets using NetFlow and PageRank," in Proceedings of IFIP TC6 Networking, pp. 1–14, Springer.

[8] S. Garcia, M. Grill, J. Stiborek, and A. Zunino, "An empirical comparison of botnet detection methods," Computers and Security, vol. 45, pp. 100–123, 2014.


[9] S. Garcia, Identifying, Modeling and Detecting Botnet Behaviors in the Network, PhD thesis, 2014.

[10] BotHunter homepage, http://www.bothunter.net/about.html.

[11] F. Haddadi, D. Le Cong, L. Porter, and A. N. Zincir-Heywood, "On the effectiveness of different botnet detection approaches," Lecture Notes in Computer Science, vol. 9065, pp. 121–135, 2015.

[12] HDFS: Hadoop Distributed File System architecture guide, https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.

[13] SparkSQL: Apache Spark SQL query language overview, http://spark.apache.org/docs/latest/sql-programming-guide.html#sql.

[14] S. K. Jayanthi and S. Sasikala, "Reptree classifier for identifying link spam in web search engines," ICTACT Journal on Soft Computing, vol. 3, no. 2, pp. 498–505, 2013.

[15] G. Folino and F. S. Pisani, "Combining ensemble of classifiers by using genetic programming for cyber security applications," Lecture Notes in Computer Science, vol. 9028, pp. 54–66, 2015.

[16] R. Kozik and M. Choras, "Adapting an ensemble of one-class classifiers for a web-layer anomaly detection system," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2015), pp. 724–729, Poland, November 2015.

[17] M. Wozniak, "Data and knowledge hybridization," in Hybrid Classifiers, vol. 519 of Studies in Computational Intelligence, pp. 59–93, Springer, Berlin, Germany, 2013.

[18] R. Kozik and M. Choras, "Solution to data imbalance problem in application layer anomaly detection systems," in Hybrid Artificial Intelligent Systems, vol. 9648 of Lecture Notes in Computer Science, pp. 441–450, Springer International Publishing, 2016.

[19] B. Krawczyk, "Learning from imbalanced data: open challenges and future directions," Progress in Artificial Intelligence, vol. 5, no. 4, pp. 221–232, 2016.

[20] Botnet Detectors Comparer, download page of the tool, http://downloads.sourceforge.net/project/botnetdetectorscomparer/BotnetDetectorsComparer-0.9.

[21] Y. Wang and S. Xia, "Unifying attribute splitting criteria of decision trees by Tsallis entropy," in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2507–2511, New Orleans, LA, March 2017.

[22] J. Wang and I. C. Paschalidis, "Botnet detection based on anomaly and community detection," IEEE Transactions on Control of Network Systems, vol. 4, no. 2, pp. 392–404, 2017.

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal of

Volume 201

Submit your manuscripts athttpswwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 201

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 3: Pattern Extraction Algorithm for NetFlow-Based …downloads.hindawi.com/journals/scn/2017/6047053.pdfPattern Extraction Algorithm for NetFlow-Based Botnet ... we describe the proposed

Security and Communication Networks 3

31 Apache Spark The proposed system adapts the BigData lambda architecture built on top of the scalable dataprocessing framework named Apache Spark It provides anengine that processes Big Data workloads There are severalkey elements of the architecture that facilitate distributedcomputing These terms will be further used in this paperthus here we briefly explain the definitions

(i) Node (or host) is a machine that belongs to thecomputing cluster

(ii) Driver (an application) is a module that contains theapplication code and communicates with the clustermanager

(iii) Master node (a cluster manager) is the elementcommunicating with the driver and responsible forallocating resources across applications

(iv) Worker node is a machine that can execute applica-tion code and holds an executor

(v) Context (SparkContext) is the main entry for Sparkfunctionalities which provides API for data manipu-lations (eg variables broadcasting data creations)

(vi) Executor is a process on aworker node that runs tasks(vii) Task is a unit of work sent to the executor

Apache Spark uses the data abstraction called ResilientDistributed Dataset (RDD) to manage the data distributed inthe cluster and to implement fault-tolerance

32 The Analysed Data The data acquired by the systemhave a form of NetFlows NetFlow is a standardised formatfor describing bidirectional communication and containsinformation such as IP source and destination addressdestination port and amount of bytes exchanged

The collected NetFlows are stored in the HDFS (HadoopDistributed File System [12]) for further processing In thecurrent version of the proposed system the data mining andfeature extractionmethods work in a batch processing modeHowever in the future we plan to allow the system to analysethe streams of data containing the raw NetFlows directly asthey are received

It must also be noted that in this paper the problemof realistic testbed construction is not considered This isbecause in order to evaluate the effectiveness of the proposedalgorithms we have used the benchmark CTU-13 [8] datasetand we followed the experimental setup of its authors inorder to methodologically compare our results The datasetcontains different scenarios representing different infectionsand malware communication schemes with command andcontrol centre More details on the datasets are given in thefollowing sections

33 Feature Extraction A single NetFlow usually does notprovide enough evidence to decide if a particular machine isinfected or if a particular request has malicious symptomsTherefore it is quite common [8 9] that NetFlows are aggre-gated in so-called timewindows so thatmore contextual data

can be extracted and malicious behavior recorded (eg portscanning packet flooding effects)

In such approaches various statistics are extracted foreach time window In the current version of the proposedsystem the SparkSQL [13] language is used to extract suchstatistics Following the information pipeline this step isrelated to the ldquodata processingrdquo stage The result of the SQLquery is data abstraction called DataFrame which can befurther processed stored or converted to any other specificformat that will satisfy the machine learning or patternextraction algorithms

In general the proposed feature extractionmethod aggre-gates the NetFlows within each time window For each timewindow we group the NetFlows by the IP source address Foreach group (containingNetFlows with the same timewindowand IP source address) we calculate the following statistics

(i) Number of flows(ii) Sum of transferred bytes(iii) Average sum of bytes per NetFlow(iv) Average communication time with each unique IP

addresses(v) Number of unique destination IP addresses(vi) Number of unique destination ports(vii) Most frequently used protocol (eg TCP UDP)

The advantages of such an approach are that it allowsthe network administrator to identify possibly infected IPaddresses and statistics are efficiently calculated with theApache Spark SQL queries

34 Multiscale Analysis When it comes to malware behav-iors recorded over a time period it may be noticed thatfor different kinds of malware different time-scales maybe observed For example some time-intensive maliciousactivities can be observed in a short time windows (eg spamsending or scanning) and these usually exhibit significantvolume of data that is of a specific type (eg a number ofopened connections a number of distinct port numbers) Onthe other hand more sophisticated attacks will be stretchedover time and may exhibit relevant information in timewindows of a varying length

To address the above-mentioned issues we have adopteda sort of multiscale analysis This technique is often used inthe image processing area to recognise an object at differentscales In this process the original image is scaled by differentmagnification factors and an analysis is conducted for eachof them Moreover the features calculated at specific scalescan be used to build a more complex model (eg we can firstdetect wheels and a frame to help detect a bike) A similarapproach can be used in our case but instead of scaling theoriginal signal we scale the size of the time window wherethe feature vectors are calculated

As it is shown in Figure 2 multidimensional NetFlowscan be seen as a time series Moreover as it was describedin the previous section the NetFlows are grouped by timewindows and for each time window statistics are calculated

4 Security and Communication Networks

t = 2m

t = 1m

t (s)

NetFlows

Concatenated feature vector

Figure 2 Overview of the technique used to multiscale NetFlowstime series analysis

For simplicity in the presented example we are using twoscales (one and two seconds wide time windows) but thisapproach can be easily extended to any number of scalefactors In the presented example the final feature vectoris a concatenation of vectors that are calculated for bothof the time windows From a technical point of view thecalculations for each of the time windows are happening inparallel on the Apache Spark cluster

4 Distributed Machine Learning

41 Decision Trees Bagging Decision trees have often beenused in various machine learning tasks [14] However acommon problem of this classifier is its accuracy It is oftenrelated to the issue of the overfitting as the structure of thetreemay grow deep Typically the decision trees have low biasbut high variance One of the solutions to overcome this issueis to use the bagging algorithm which allows for averagingmultiple decision trees each trained on different portions ofthe data

In general training the set of decision trees on the samedata will produce highly correlated (or in many cases identi-cal) treesTherefore the idea behind the bagging algorithm isto achieve an ensemble [15 16] of decorrelated trees

Let 119871 represent the learning dataset which is a col-lection of 119873 pairs of input vectors and label values(119909119894 119910119894) (119909119899 119910119899) where 119909119894 isin 119883 and 119910119894 isin 119884Following the bagging Algorithm 1 the learning dataset 119871 issampled (with replacement) 119861 times For each data samplethe decision tree is trained producing a classifier119863119887 119883 rarr 119884As we have 119861 classifiers the average is computed using thefollowing formula

119863 (119909) = 1119861

119861

sum119887=1

119863119887 (119909) (1)

InputA training set 119883 119884 = 119909119894 119910119894

119873119894=1

OutputAn ensemble of Decision Trees 119863 119883 rarr 119884

Training phaseFor 119887 = 1 119861(1) Sample 119861 training samples from119883 and 119884(2) Train the Decision Tree119863119887 on the sampled data

Algorithm 1 Decision trees bagging

42 Classical Random Forest Classifier The random forest(RF) classifier adapts amodification of the bagging algorithmThe difference is in the process of growing the trees Theclassical implementation of the RF classifier works accordingto the following procedure

The final prediction obtained from the 119861 trained trees iscalculated using the average in (1) or majority vote repre-sented by (2) where 119868 stands for indicator function and 119863119887indicates the 119887-th random tree in the ensemble

119863(119909) = argmax119894

119861

sum119887=1

119868 (119863119887 = 119894) (2)

43 Distributed Random Forest Classifier The implementa-tion of the random forest classifier in theMLlib Apache Sparklibrary in general follows the classical algorithm presentedin the previous section However it heavily leverages thedistributed computing environment

In principle the algorithmworks in so-called batchmodemeaning that a large portion of the data needs to be availableto begin learning the classifierThis is one of the drawbacks ofthe current implementation (v210) as nowadays significantvalue is put into the availability to perform online and streamclassification In fact this is our plan for the further work

In general the dataset provided to Apache Spark isarranged into rows of training instances (feature vectorswith labels) and distributed across a number of nodes andpartitions The training algorithm can benefit from thedistributed environment since the learning process for eachdecision tree can be performed in parallel

When the random forest is trained in the Apache Sparkenvironment the algorithm samples the learning data andassigns it to the decision tree which is trained on thatportion of the data To optimise the memory and spaceusage the instances are not replicated explicitly but insteadaugmented with additional data that hold the probability thatthe given instance belongs to the specific data partition usedfor training

The workload is distributed among a master node andworkers The main loop manages a queue of nodes and runsiteratively In each iteration the algorithm searches for thebest split for a node In this procedure the statistics arecollected by a worker node

Security and Communication Networks 5

Input119873 training samples of the form (1199091 1199101) (119909119873 119910119873)

such that 119909119894 isin 119877119872 is the feature vector and 119910119894 is its label

OutputRandom Forest Classifier 119863 119883 rarr 119884

The119873 training samples (each with119872 input variables) aresampled with replacement to produce 119861 partitions

(1) Each of the 119861 partitions is used to train one of thetrees

(2) In the process of splitting119898 out of the119872 variablesare selected randomly

(3) One variable out of119898 is selected according to theimpurity measure (eg Gini index or entropy [21])

(4) The procedure is terminated either when themaximal depth of a tree is achieved or when thereis no data to split

Algorithm 2 Random forest training

The feature used for splitting is chosen among thesampled list using the impurity criterion

119862

sum119894=1

119891119894 (1 minus 119891119894) (3)

where119862 is the number of unique labels and119891119894 is the frequencyof label 119894 The algorithm terminates when the maximumheight of the decision tree is reached or whenever there isno data point that is misclassified

Another option for this classification task is the entropymeasure

119862

sum119894=1

minus119891119894 log (119891119894) (4)

The final output produced by the ensemble is themajorityvote of the results produced by all decision trees

Another drawback of the current version of the ApacheSpark random forest classifier is that it does not handle cost-sensitive learning (eg assigning higher importance to datasamples indicating an anomaly or cyber attack) As a resultit may be biased towards the majority class Finding the rightbalance between detection effectiveness and the number falsealarms is important from the perspective of anomaly detec-tion or intrusion detection systems In principle such systemsshould have high recognition ratio but should not overwhelmthe administrator with a large number of false alarmsTherefore our proposals to handle this issue and improve thedetection effectiveness are presented in the next section

44 Data Imbalance Problem and Cost-Sensitive LearningThe problem of data imbalance has recently been deeplystudied [17ndash19] in the areas of machine learning and datamining In many cases this problem negatively impacts themachine learning algorithms and deteriorates the effective-ness of the classifier Typically classifiers in such cases willachieve higher predictive accuracy for the majority class butpoorer predictive accuracy for the minority class

This phenomenon is caused by the fact that the classifierwill tend to bias towards the majority class Therefore thechallenge here is to retain the classification effectiveness evenif the proportion of class labels is not equal The imbalanceof labels in our case is significant as we may expect that incommon cases of cybersecurity incidents only a fewmachinesin the network will be infected and produce malicious trafficwhile themajority will behave normally In other words mostdata contains clean traffic while only a few data samplesindicate malware

The solutions for solving such a problem can be cate-gorised as data-related and algorithm-related The methodsbelonging to the data-related category use data oversamplingand undersampling techniques while the algorithm-relatedones introduce a modification to training procedures Thisgroup can be further classified into categories using cost-sensitive classification (eg assigning a higher cost to major-ity class) or methods that use different performance metrics(eg Kappa metric)

Cost-sensitive learning is an effective solution for class-imbalance in large-scale settings The procedure can beexpressed with the following optimisation formula

min120573

12100381710038171003817100381712057310038171003817100381710038172 + 1

2

119873

sum119894=1

119882119894100381710038171003817100381711989011989410038171003817100381710038172 (5)

where 120573 indicates classifier parameters 119890119894 the error in clas-sifier response for 119894-th (out of 119873) data samples and 119882119894 theimportance of the 119894-th data sample In cost-sensitive learningthe idea is to give a higher importance to the minority classso that the bias towards the majority class is reduced

In general our approach to cost-sensitive random forestclassifier can be categorised as data-related In our casewe have heavily imbalanced data In some scenarios wehave more than 28M normal samples while having only40k of samples related to malware activities Thereforewe have adopted the undersampling technique within theprocess of learning the distributed random forest As it isshown in Algorithm 2 the training procedure already adapts

6 Security and Communication Networks

the sampling technique The Apache Spark implementationadapts Poisson sampling in order to produce subportions ofthe data for training the random tree In our case beforeintroducing the data to the classifier we undersample themajority class with an experimentally chosen probability

5 Experiments

51 Evaluation Methodology To prove the effectiveness ofthe proposed method the test methodology exploiting time-based metrics (defined in [8]) have been used In order tokeep this paper self-contained the methodology is brieflyintroduced in this section The authors of the methodologyhave created and published a tool called Botnet DetectorsComparer It is publicly available for download [20] Wedecided to follow their experiments and compare our resultsdirectly to their work Their tool analyses the NetFlowfile augmented with prediction labels produced by differentmethods and executes the following steps to produce the finaleffectiveness results [8]

(1) NetFlows are separated into comparison time win-dows (we have used default time windows of 300 slength)

(2) The ground-truth NetFlow labels are compared to thepredicted labels and the TP (true positive) TN (truenegative) FP (false positive) and FN (false negative)values are accumulated

(3) At the end of each comparison time window thefollowing performance indicators are calculated FPR(false positive rate) TPR (true positive rate) TNR(true negatives rate) FNR (false negatives rate) preci-sion accuracy error rate and F-measure for that timewindow

(4) When the whole file with NetFlows is processed thefinal error metrics are calculated and produced Asexplained in [8] calculating the TP count at theNetFlow level does not make practical sense becausethe administrator would be overwhelmed by the highamount of information (the number of NetFlowsper second is high) Instead IP-based performancemetrics are used accordingly

(i) TP a true positive counter is incremented dur-ing the comparison time window whenever aBotnet IP address is detected as Botnet at leastonce

(ii) TN a true negative counter is incrementedduring the comparison time window whenevera Normal IP address is detected as nonbotnetduring the entire time window

(iii) FP a false positive counter is incremented dur-ing the comparison time window whenever aNormal IP address is detected as botnet at leastonce

(iv) FN a false negative counter is incrementedduring the comparison time window whenever

a Botnet IP address is detected as nonbotnetduring the entire time window

The authors of the tool have also incorporated timeinto the above-mentioned metrics to emphasise the fact thatdetecting a malicious IP earlier is better than later The time-based component is called the correction function (cf) and iscalculated using the following formula

cf (119899) = 1 + 119890minus120572119899 (6)where 119899 indicates a comparison time window and 120572 isa constant value set to 001 (as suggested in [9]) Usingthe correction function the error metrics are calculated asfollows

tTP (119899) = TP lowast cf (119899)119873BOT

(7)

where tTP indicates time-based true positive number cfthe correction function and 119873BOT the number of uniquebotnet IP addresses present in the comparison time windowSimilarly we can define

tFN (119899) = FN lowast cf (119899)119873BOT

(8)

tFP and tTN do not depend on the time and are defined asfollows

tFP = FP119873NORM

(9)

where 119873NORM indicates the number of unique normal IPaddresses present in the comparison time window Similarlythe authors of the evaluation framework define

tTN = TN119873NORM

(10)

Following the previous notations the authors of [9]defined the final error metrics as follows

(i) True positives rate is as follows

TPR = tTPtFN + tTP

(11)

(ii) False positives rate is as follows

FPR = tFPtTN + tFP

(12)

(iii) Precision is as follows

119875 = tTPtTP + tFP

(13)

(iv) F-measure is as follows

FM = 2 lowast 119875 lowast TPR119875 + TPR

(14)

(v) Accuracy is as follows

ACC = tTP + tTNtTP + tTN + tFP + tFN

(15)

(vi) Error rate is as follows

ERR = tFP + tFNtTP + tTN + tFP + tFN

(16)

Security and Communication Networks 7

52 Evaluation Dataset For the evaluation we have usedthe CTU-13 dataset [8] and the same experimental setup asits authors This dataset includes different scenarios whichrepresent various types of attacks including several types ofbotnets Each of these scenarios contains collected traffic inthe form of NetFlows The data was collected to create arealistic testbed Each of the scenarios has been recorded ina separate file as a NetFlow using CSV notation Each of therows in a file has the following attributes (columns)

(i) StartTime start time of the recorded NetFlow(ii) Dur duration(iii) Proto IP protocol (eg UTP TCP)(iv) SrcAddr source address(v) Sport source port(vi) Dir direction of the recorded communication(vii) DstAddr destination address(viii) Dport destination port(ix) State protocol state(x) sTos source type of service(xi) dTos destination type of service(xii) TotPkts total number of packets that have been

exchanged between source and destination(xiii) TotBytes total bytes exchanged(xiv) SrcBytes number of bytes sent by source(xv) Label label assigned to this NetFlow (eg back-

ground normal and botnet)

It must be noted that the ldquoLabelrdquo field is an additionalattribute provided by the authors of the dataset Normallythe NetFlow will have 14 attributes and the ldquoLabelrdquo will beassigned by the classifier The dataset consists of 13 scenariosIn order to obtain comparable results presented in [8]the evaluation procedure must be preserved Therefore thescenarios available in the CTU-13 dataset have been used inthe same manner as proposed by its author in which thesystem is trained on one set of scenarios and evaluated ondifferent ones

6 Results

61 Comparison with Other Methods During the exper-iments different configurations of the proposed patternextractionmethods have been analysedThe extracted featurevectors have been used to feed the random forest machinelearning algorithms which were also evaluated for differentconfigurations The following notation is used in Table 1 todescribe a different configuration of the proposed method

RF 119873 119882 119878 (17)

where 119873 indicates the number of trees used by the randomforest classifier 119882 indicates the width of disjoint time win-dows and 119878 represents the friction of vectors undersampled

Input119873 training samples of the form (1199091 1199101) (119909119873 119910119873)

such that 119909119894 isin 119877119872 is a feature vector and 119910119894 isin 119871 is its

class label119875 = 119901119894 | 119894 isin 119871 sampling rates of labels 119871 where 119901119894

indicates the sampling probability of the 119894-th labelOutput

Random Forest Classifier119863 119883 rarr 119884For 119905 = 1 119879

(1) Sample 119883 119884 using 119875 sampling rates(2) Train Random Forest Classifier119863119905(3) Calculate the effectiveness 119864119905 of the119863119905 classifier

on the remaining training dataReturn the119863119905lowast classifier with the highest accuracy where119905lowast = argmax119905119864119905

Algorithm 3 Cost-sensitive random forest training

from the majority class (in order to balance the data beforelearning)

Table 1 also includes other algorithms that have beenmentioned in the related work section namely BClus [8]CCD [9] and BotHunter [10] The effectiveness of thesemethods (for the CTU-13 dataset) has already been presentedin [8] and [9] respectively We are able to also compareour approach with these tools since we have used the sameexperimental setup In Table 1 we report the performancemetrics defined in the previous section such as precisionaccuracy and F-measure

Table 1 presents results obtained for five scenariosrecorded in the testing dataset containing botnet activitiesScenario 1 corresponds to an IRC-based botnet that sendsspam In this scenario CCD outperforms our proposedapproach in terms of TPR but it suffers a higher FPR rateMoreover in this scenario the proposed algorithm for cost-sensitive random forest classifier learning allows us to findan improved balance between the TP and FP rates For allexperiments we report the results for the same samplingratios 119875 (Algorithm 3)

In Scenario 2 the same IRC-based botnet as Scenario1 sends spam but for a shorter timespan Similarly withthe previous scenario the cost-sensitive learning allows usto decrease the number of false positives in contrast to theoriginal classifier Moreover for the same number of falsepositives (2) our approach allowed us to achieve highervalues of TPR

In Scenario 6, the botnet scans SMTP mail servers for several hours and connects to several remote desktop services. However, it does not send any spam. The proposed approach performs very well for this type of attack when compared to the other methods.

In Scenario 8, the botnet contacts different C&C hosts and receives some encrypted data. All of the methods achieve a relatively low value of TPR. However, the cost-sensitive random forest gives a higher detection rate than CCD.


Table 1: Effectiveness of the proposed cost-sensitive distributed random forest classifiers compared to the other benchmark methods reported by the authors of the benchmark dataset in [8, 9].

Scenario 1
Algorithm          TPR    FPR    Precision   Accuracy   Error rate   F-measure
RF 30, 1m, 0.7     0.81   0.01   0.96        0.95       0.06         0.88
RF 30, 1m, 0.01    0.90   0.14   0.68        0.87       0.13         0.77
CCD                1.00   0.05   0.86        0.96       0.03         0.92
BClus              0.40   0.40   0.50        0.50       0.40         0.48
BH                 0.01   0.00   0.80        0.40       0.50         0.02

Scenario 2
Algorithm          TPR    FPR    Precision   Accuracy   Error rate   F-measure
RF 30, 1m, 0.7     1.00   0.02   0.97        0.99       0.01         0.99
RF 30, 1m, 0.01    0.95   0.25   0.73        0.83       0.17         0.82
CCD                0.74   0.02   0.96        0.88       0.11         0.92
BClus              0.30   0.20   0.60        0.50       0.40         0.41
BH                 0.02   0.00   0.90        0.30       0.60         0.04

Scenario 6
Algorithm          TPR    FPR    Precision   Accuracy   Error rate   F-measure
RF 30, 1m, 0.7     1.00   0.00   1.00        1.00       0.00         1.00
RF 30, 1m, 0.01    0.96   0.19   0.74        0.86       0.14         0.83
CCD                0.00   0.00   —           0.64       0.35         0.00
BClus              0.00   0.00   0.40        0.40       0.50         0.04
BH                 0.06   0.00   0.98        0.38       0.61         0.11

Scenario 8
Algorithm          TPR    FPR    Precision   Accuracy   Error rate   F-measure
RF 30, 1m, 0.7     0.20   0.04   0.94        0.81       0.19         0.33
RF 30, 1m, 0.01    0.25   0.13   0.37        0.73       0.27         0.30
CCD                0.00   0.00   —           0.64       0.35         0.00
BClus              0.00   0.04   0.00        0.66       0.33         —
BH                 0.00   0.00   0.00        0.42       0.57         —

Scenario 9
Algorithm          TPR    FPR    Precision   Accuracy   Error rate   F-measure
RF 30, 1m, 0.7     0.92   0.01   0.99        0.95       0.05         0.96
RF 30, 1m, 0.01    0.94   0.09   0.99        0.94       0.06         0.95
CCD                0.38   0.04   0.93        0.59       0.40         0.54
BClus              0.10   0.20   0.40        0.40       0.50         0.25
BH                 0.02   0.00   0.90        0.40       0.50         0.03

In Scenario 9, some hosts are infected with the Neris malware, which actively starts sending spam e-mails. As in the previous scenarios, our approach also allows us to achieve higher effectiveness than the other methods.

In summary, the results of our system are good in comparison to the state of the art. It is worth noting that even where the TPR is not higher, the FPR is usually lower, which satisfies the requirements of security administrators.

The first aspect that should be commented on here is the poor performance of BClus and CCD in Scenarios 6 and 8. Upon consulting the original authors regarding this, it turned out that their results reported in [8, 9] were achieved for unidirectional NetFlows, while the published dataset contains only bidirectional NetFlows. This may cause a deterioration of the results. It may also be caused by the fact that (e.g., in the case of Scenario 6) the malware used only a few, quite specific communication channels to communicate with the C&C, which has not been reflected in the training data. On the other hand, the poor performance of BClus could also be related to the fact that this method builds a statistical model that utilises GMM clustering in the NetFlow parameter feature space, and probably the dataset (at least for some scenarios) cannot be modelled well by a Gaussian distribution.

It must be noted that high error rates are also reported for BotHunter. BotHunter (BH) is a popular tool that uses a mix of anomaly- and signature-based techniques. As was noted in [22], the reasons for such poor performance could be (a) the fact that BH is not able to detect machines that had already been infected before the BH deployment and (b) the fact that the data from the external bots are not directly observed and are not sufficient for dialog analysis.


Table 2: Comparison of TP and FP obtained for different time window scaling factors.

Feature vectors   Metric   Scenario 1   Scenario 2   Scenario 6   Scenario 8   Scenario 9
1m                TPR      0.81         1.00         1.00         0.20         0.92
                  FPR      0.01         0.02         0.00         0.04         0.01
1m + 4m           TPR      0.75         0.90         1.00         0.20         0.82
                  FPR      0.01         0.07         0.00         0.01         0.01
1m + 3m           TPR      0.86         0.92         1.00         0.20         0.94
                  FPR      0.02         0.05         0.01         0.01         0.01
1m + 2m           TPR      0.84         0.95         1.00         0.20         0.92
                  FPR      0.01         0.07         0.01         0.01         0.01

The second aspect is the statistical significance of the results. The reason why we did not include confidence levels for the presented results is the evaluation procedure proposed by the authors of the CTU benchmark dataset in [9]. The dataset consists of 13 cybersecurity-related scenarios, which have been divided into two parts: one for training and one for validation. Therefore, we train the system on one part of the data and present the results (generated with the dedicated software tool) for the other part. Other authors present their results in [8, 22] in a similar way. In such an experimental protocol, confidence levels are not relevant.

6.2. Evaluation of the Multiscale Feature Extraction Technique. In this section we present results for our feature extraction algorithm, which uses a multiscale analysis technique. In particular, we have investigated under which circumstances the concatenation of feature vectors obtained for several time windows of different lengths can further improve the performance of our approach.
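As an illustration only, the sketch below shows one way such a concatenation could be assembled: NetFlows are aggregated per source address over 1-minute and 3-minute windows, and the resulting vectors are joined column-wise. The aggregates used here (flow count, total bytes, mean packets) are simple placeholders rather than the paper's actual feature set, and the helper names window_features and multiscale_features are assumptions.

```python
import pandas as pd

def window_features(flows: pd.DataFrame, window_col: str) -> pd.DataFrame:
    # Simple per-source aggregates inside each time window (placeholder features).
    agg = flows.groupby(["SrcAddr", window_col]).agg(
        n_flows=("TotBytes", "size"),
        total_bytes=("TotBytes", "sum"),
        mean_pkts=("TotPkts", "mean"))
    return agg.add_suffix("_" + window_col)

def multiscale_features(flows: pd.DataFrame) -> pd.DataFrame:
    flows = flows.copy()
    ts = pd.to_datetime(flows["StartTime"])
    flows["win_1m"] = ts.dt.floor("1min")
    flows["win_3m"] = ts.dt.floor("3min")
    f1 = window_features(flows, "win_1m")
    f3 = window_features(flows, "win_3m")
    # One row per (source address, 1-minute window), with the features of the
    # enclosing 3-minute window concatenated column-wise.
    keys = flows[["SrcAddr", "win_1m", "win_3m"]].drop_duplicates()
    out = keys.merge(f1, left_on=["SrcAddr", "win_1m"], right_index=True)
    out = out.merge(f3, left_on=["SrcAddr", "win_3m"], right_index=True)
    return out
```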

Results are presented in Table 2. In all experiments we trained the proposed cost-sensitive random forest algorithm using feature vectors created from two time windows. The first row in the table presents the basic algorithm (using only one time window), indicated in Table 1 as RF 30, 1m, 0.7. It can be noted that the proposed multiscale feature extraction technique improves results in 3 of the 5 analysed scenarios. Concatenating the feature vectors calculated for 1-minute time windows with feature vectors calculated for 3-minute time windows (1m + 3m) allows us to improve the TPR in Scenarios 1 and 9, by 5% and 2%, respectively. In both of these scenarios the algorithm was able to achieve similar values of FPR. In the case of Scenario 8, the proposed method was also able to decrease the FPR. However, we observed a deterioration of results for Scenario 2.

When concatenating 1-minute time windows with 2-minute time windows, an improvement in results can be observed; however, it is less significant.

7. Conclusions

In this paper, a distributed malware detection system based on a Big Data analytics concept and architecture has been presented. The proposed approach analyses the malware activity that is captured by means of NetFlows. The paper presented the architecture of the proposed system, the implementation details, and an analysis of promising experimental results. Moreover, we contributed a method utilising a distributed random forest classifier to address the problem of data imbalance (typical in cybersecurity). Future work is dedicated to further improvements towards the online machine learning concept (e.g., to analyse online streams).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] R. Kozik, M. Choras, A. Flizikowski, M. Theocharidou, V. Rosato, and E. Rome, "Advanced services for critical infrastructures protection," Journal of Ambient Intelligence and Humanized Computing, vol. 6, no. 6, pp. 783–795, 2015.

[2] M. Choras, R. Kozik, R. Renk, and W. Hołubowicz, "A Practical Framework and Guidelines to Enhance Cyber Security and Privacy," in International Joint Conference, vol. 369 of Advances in Intelligent Systems and Computing, pp. 485–495, Springer International Publishing, Cham, 2015.

[3] SNORT, project homepage, http://www.snort.org.

[4] T. Andrysiak, Ł. Saganowski, M. Choras, and R. Kozik, "Network Traffic Prediction and Anomaly Detection Based on ARFIMA Model," in International Joint Conference SOCO'14-CISIS'14-ICEUTE'14, vol. 299 of Advances in Intelligent Systems and Computing, pp. 545–554, Springer International Publishing, 2014.

[5] M. Choras, R. Kozik, D. Puchalski, and W. Hołubowicz, "Correlation approach for SQL injection attacks detection," Advances in Intelligent Systems and Computing, vol. 189, pp. 177–185, 2013.

[6] B. Claise, "Cisco Systems NetFlow Services Export Version 9," Tech. Rep. RFC 3954, 2004.

[7] J. Francois, S. Wang, and R. State, "BotTrack: tracking botnets using NetFlow and PageRank," in Proceedings of IFIP TC6 Networking, pp. 1–14, Springer.

[8] S. García, M. Grill, J. Stiborek, and A. Zunino, "An empirical comparison of botnet detection methods," Journal of Computers and Security, vol. 45, pp. 100–123, 2014.


[9] S. García, Identifying, Modeling and Detecting Botnet Behaviors in the Network [Ph.D. thesis], 2014.

[10] BotHunter homepage, http://www.bothunter.net/about.html.

[11] F. Haddadi, D. Le Cong, L. Porter, and A. N. Zincir-Heywood, "On the effectiveness of different botnet detection approaches," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9065, pp. 121–135, 2015.

[12] HDFS, Hadoop Distributed File System – Architecture Guide, https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.

[13] SparkSQL, Apache Spark SQL query language – overview, http://spark.apache.org/docs/latest/sql-programming-guide.html#sql.

[14] S. K. Jayanthi and S. Sasikala, "Reptree classifier for identifying link spam in web search engines," ICTACT Journal on Soft Computing, vol. 03, no. 02, pp. 498–505, 2013.

[15] G. Folino and F. S. Pisani, "Combining ensemble of classifiers by using genetic programming for cyber security applications," Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9028, pp. 54–66, 2015.

[16] R. Kozik and M. Choras, "Adapting an Ensemble of One-Class Classifiers for a Web-Layer Anomaly Detection System," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2015), pp. 724–729, Poland, November 2015.

[17] M. Wozniak, "Data and Knowledge Hybridization," in Hybrid Classifiers, vol. 519 of Studies in Computational Intelligence, pp. 59–93, Springer, Berlin, Germany, 2013.

[18] R. Kozik and M. Choras, "Solution to Data Imbalance Problem in Application Layer Anomaly Detection Systems," in Hybrid Artificial Intelligent Systems, vol. 9648 of Lecture Notes in Computer Science, pp. 441–450, Springer International Publishing, 2016.

[19] B. Krawczyk, "Learning from imbalanced data: open challenges and future directions," Progress in Artificial Intelligence, vol. 5, no. 4, pp. 221–232, 2016.

[20] Botnet Detectors Comparer, download page of the tool, http://downloads.sourceforge.net/project/botnetdetectorscomparer/BotnetDetectorsComparer-0.9.

[21] Y. Wang and S. Xia, "Unifying attribute splitting criteria of decision trees by Tsallis entropy," in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2507–2511, New Orleans, LA, March 2017.

[22] J. Wang and I. C. Paschalidis, "Botnet detection based on anomaly and community detection," IEEE Transactions on Control of Network Systems, vol. 4, no. 2, pp. 392–404, 2017.




Page 5: Pattern Extraction Algorithm for NetFlow-Based …downloads.hindawi.com/journals/scn/2017/6047053.pdfPattern Extraction Algorithm for NetFlow-Based Botnet ... we describe the proposed

Security and Communication Networks 5

Input119873 training samples of the form (1199091 1199101) (119909119873 119910119873)

such that 119909119894 isin 119877119872 is the feature vector and 119910119894 is its label

OutputRandom Forest Classifier 119863 119883 rarr 119884

The119873 training samples (each with119872 input variables) aresampled with replacement to produce 119861 partitions

(1) Each of the 119861 partitions is used to train one of thetrees

(2) In the process of splitting119898 out of the119872 variablesare selected randomly

(3) One variable out of119898 is selected according to theimpurity measure (eg Gini index or entropy [21])

(4) The procedure is terminated either when themaximal depth of a tree is achieved or when thereis no data to split

Algorithm 2 Random forest training

The feature used for splitting is chosen among thesampled list using the impurity criterion

119862

sum119894=1

119891119894 (1 minus 119891119894) (3)

where119862 is the number of unique labels and119891119894 is the frequencyof label 119894 The algorithm terminates when the maximumheight of the decision tree is reached or whenever there isno data point that is misclassified

Another option for this classification task is the entropymeasure

119862

sum119894=1

minus119891119894 log (119891119894) (4)

The final output produced by the ensemble is themajorityvote of the results produced by all decision trees

Another drawback of the current version of the ApacheSpark random forest classifier is that it does not handle cost-sensitive learning (eg assigning higher importance to datasamples indicating an anomaly or cyber attack) As a resultit may be biased towards the majority class Finding the rightbalance between detection effectiveness and the number falsealarms is important from the perspective of anomaly detec-tion or intrusion detection systems In principle such systemsshould have high recognition ratio but should not overwhelmthe administrator with a large number of false alarmsTherefore our proposals to handle this issue and improve thedetection effectiveness are presented in the next section

44 Data Imbalance Problem and Cost-Sensitive LearningThe problem of data imbalance has recently been deeplystudied [17ndash19] in the areas of machine learning and datamining In many cases this problem negatively impacts themachine learning algorithms and deteriorates the effective-ness of the classifier Typically classifiers in such cases willachieve higher predictive accuracy for the majority class butpoorer predictive accuracy for the minority class

This phenomenon is caused by the fact that the classifierwill tend to bias towards the majority class Therefore thechallenge here is to retain the classification effectiveness evenif the proportion of class labels is not equal The imbalanceof labels in our case is significant as we may expect that incommon cases of cybersecurity incidents only a fewmachinesin the network will be infected and produce malicious trafficwhile themajority will behave normally In other words mostdata contains clean traffic while only a few data samplesindicate malware

The solutions for solving such a problem can be cate-gorised as data-related and algorithm-related The methodsbelonging to the data-related category use data oversamplingand undersampling techniques while the algorithm-relatedones introduce a modification to training procedures Thisgroup can be further classified into categories using cost-sensitive classification (eg assigning a higher cost to major-ity class) or methods that use different performance metrics(eg Kappa metric)

Cost-sensitive learning is an effective solution for class-imbalance in large-scale settings The procedure can beexpressed with the following optimisation formula

min120573

12100381710038171003817100381712057310038171003817100381710038172 + 1

2

119873

sum119894=1

119882119894100381710038171003817100381711989011989410038171003817100381710038172 (5)

where 120573 indicates classifier parameters 119890119894 the error in clas-sifier response for 119894-th (out of 119873) data samples and 119882119894 theimportance of the 119894-th data sample In cost-sensitive learningthe idea is to give a higher importance to the minority classso that the bias towards the majority class is reduced

In general our approach to cost-sensitive random forestclassifier can be categorised as data-related In our casewe have heavily imbalanced data In some scenarios wehave more than 28M normal samples while having only40k of samples related to malware activities Thereforewe have adopted the undersampling technique within theprocess of learning the distributed random forest As it isshown in Algorithm 2 the training procedure already adapts

6 Security and Communication Networks

the sampling technique The Apache Spark implementationadapts Poisson sampling in order to produce subportions ofthe data for training the random tree In our case beforeintroducing the data to the classifier we undersample themajority class with an experimentally chosen probability

5 Experiments

51 Evaluation Methodology To prove the effectiveness ofthe proposed method the test methodology exploiting time-based metrics (defined in [8]) have been used In order tokeep this paper self-contained the methodology is brieflyintroduced in this section The authors of the methodologyhave created and published a tool called Botnet DetectorsComparer It is publicly available for download [20] Wedecided to follow their experiments and compare our resultsdirectly to their work Their tool analyses the NetFlowfile augmented with prediction labels produced by differentmethods and executes the following steps to produce the finaleffectiveness results [8]

(1) NetFlows are separated into comparison time win-dows (we have used default time windows of 300 slength)

(2) The ground-truth NetFlow labels are compared to thepredicted labels and the TP (true positive) TN (truenegative) FP (false positive) and FN (false negative)values are accumulated

(3) At the end of each comparison time window thefollowing performance indicators are calculated FPR(false positive rate) TPR (true positive rate) TNR(true negatives rate) FNR (false negatives rate) preci-sion accuracy error rate and F-measure for that timewindow

(4) When the whole file with NetFlows is processed thefinal error metrics are calculated and produced Asexplained in [8] calculating the TP count at theNetFlow level does not make practical sense becausethe administrator would be overwhelmed by the highamount of information (the number of NetFlowsper second is high) Instead IP-based performancemetrics are used accordingly

(i) TP a true positive counter is incremented dur-ing the comparison time window whenever aBotnet IP address is detected as Botnet at leastonce

(ii) TN a true negative counter is incrementedduring the comparison time window whenevera Normal IP address is detected as nonbotnetduring the entire time window

(iii) FP a false positive counter is incremented dur-ing the comparison time window whenever aNormal IP address is detected as botnet at leastonce

(iv) FN a false negative counter is incrementedduring the comparison time window whenever

a Botnet IP address is detected as nonbotnetduring the entire time window

The authors of the tool have also incorporated timeinto the above-mentioned metrics to emphasise the fact thatdetecting a malicious IP earlier is better than later The time-based component is called the correction function (cf) and iscalculated using the following formula

cf (119899) = 1 + 119890minus120572119899 (6)where 119899 indicates a comparison time window and 120572 isa constant value set to 001 (as suggested in [9]) Usingthe correction function the error metrics are calculated asfollows

tTP (119899) = TP lowast cf (119899)119873BOT

(7)

where tTP indicates time-based true positive number cfthe correction function and 119873BOT the number of uniquebotnet IP addresses present in the comparison time windowSimilarly we can define

tFN (119899) = FN lowast cf (119899)119873BOT

(8)

tFP and tTN do not depend on the time and are defined asfollows

tFP = FP119873NORM

(9)

where 119873NORM indicates the number of unique normal IPaddresses present in the comparison time window Similarlythe authors of the evaluation framework define

tTN = TN119873NORM

(10)

Following the previous notations the authors of [9]defined the final error metrics as follows

(i) True positives rate is as follows

TPR = tTPtFN + tTP

(11)

(ii) False positives rate is as follows

FPR = tFPtTN + tFP

(12)

(iii) Precision is as follows

119875 = tTPtTP + tFP

(13)

(iv) F-measure is as follows

FM = 2 lowast 119875 lowast TPR119875 + TPR

(14)

(v) Accuracy is as follows

ACC = tTP + tTNtTP + tTN + tFP + tFN

(15)

(vi) Error rate is as follows

ERR = tFP + tFNtTP + tTN + tFP + tFN

(16)

Security and Communication Networks 7

52 Evaluation Dataset For the evaluation we have usedthe CTU-13 dataset [8] and the same experimental setup asits authors This dataset includes different scenarios whichrepresent various types of attacks including several types ofbotnets Each of these scenarios contains collected traffic inthe form of NetFlows The data was collected to create arealistic testbed Each of the scenarios has been recorded ina separate file as a NetFlow using CSV notation Each of therows in a file has the following attributes (columns)

(i) StartTime start time of the recorded NetFlow(ii) Dur duration(iii) Proto IP protocol (eg UTP TCP)(iv) SrcAddr source address(v) Sport source port(vi) Dir direction of the recorded communication(vii) DstAddr destination address(viii) Dport destination port(ix) State protocol state(x) sTos source type of service(xi) dTos destination type of service(xii) TotPkts total number of packets that have been

exchanged between source and destination(xiii) TotBytes total bytes exchanged(xiv) SrcBytes number of bytes sent by source(xv) Label label assigned to this NetFlow (eg back-

ground normal and botnet)

It must be noted that the ldquoLabelrdquo field is an additionalattribute provided by the authors of the dataset Normallythe NetFlow will have 14 attributes and the ldquoLabelrdquo will beassigned by the classifier The dataset consists of 13 scenariosIn order to obtain comparable results presented in [8]the evaluation procedure must be preserved Therefore thescenarios available in the CTU-13 dataset have been used inthe same manner as proposed by its author in which thesystem is trained on one set of scenarios and evaluated ondifferent ones

6 Results

61 Comparison with Other Methods During the exper-iments different configurations of the proposed patternextractionmethods have been analysedThe extracted featurevectors have been used to feed the random forest machinelearning algorithms which were also evaluated for differentconfigurations The following notation is used in Table 1 todescribe a different configuration of the proposed method

RF 119873 119882 119878 (17)

where 119873 indicates the number of trees used by the randomforest classifier 119882 indicates the width of disjoint time win-dows and 119878 represents the friction of vectors undersampled

Input119873 training samples of the form (1199091 1199101) (119909119873 119910119873)

such that 119909119894 isin 119877119872 is a feature vector and 119910119894 isin 119871 is its

class label119875 = 119901119894 | 119894 isin 119871 sampling rates of labels 119871 where 119901119894

indicates the sampling probability of the 119894-th labelOutput

Random Forest Classifier119863 119883 rarr 119884For 119905 = 1 119879

(1) Sample 119883 119884 using 119875 sampling rates(2) Train Random Forest Classifier119863119905(3) Calculate the effectiveness 119864119905 of the119863119905 classifier

on the remaining training dataReturn the119863119905lowast classifier with the highest accuracy where119905lowast = argmax119905119864119905

Algorithm 3 Cost-sensitive random forest training

from the majority class (in order to balance the data beforelearning)

Table 1 also includes other algorithms that have beenmentioned in the related work section namely BClus [8]CCD [9] and BotHunter [10] The effectiveness of thesemethods (for the CTU-13 dataset) has already been presentedin [8] and [9] respectively We are able to also compareour approach with these tools since we have used the sameexperimental setup In Table 1 we report the performancemetrics defined in the previous section such as precisionaccuracy and F-measure

Table 1 presents results obtained for five scenariosrecorded in the testing dataset containing botnet activitiesScenario 1 corresponds to an IRC-based botnet that sendsspam In this scenario CCD outperforms our proposedapproach in terms of TPR but it suffers a higher FPR rateMoreover in this scenario the proposed algorithm for cost-sensitive random forest classifier learning allows us to findan improved balance between the TP and FP rates For allexperiments we report the results for the same samplingratios 119875 (Algorithm 3)

In Scenario 2, the same IRC-based botnet as in Scenario 1 sends spam, but for a shorter timespan. Similarly to the previous scenario, the cost-sensitive learning allows us to decrease the number of false positives in contrast to the original classifier. Moreover, for the same false positive rate (2%), our approach allowed us to achieve higher values of TPR.

In Scenario 6, the botnet scans SMTP mail servers for several hours and connects to several remote desktop services; however, it does not send any spam. The proposed approach performs very well for this type of attack when compared to the other methods.

In Scenario 8, the botnet contacts different C&C hosts and receives some encrypted data. All of the methods achieve a relatively low value of TPR; however, the cost-sensitive random forest gives a higher detection rate in contrast to CCD.


Table 1: Effectiveness of the proposed cost-sensitive distributed random forest classifiers compared to the other benchmark methods reported by the authors of the benchmark dataset in [8, 9].

Scenario 1
Algorithm          TPR    FPR    Precision  Accuracy  Error rate  F-measure
RF(30, 1m, 0.7)    0.81   0.01   0.96       0.95      0.06        0.88
RF(30, 1m, 0.01)   0.90   0.14   0.68       0.87      0.13        0.77
CCD                1.00   0.05   0.86       0.96      0.03        0.92
BClus              0.40   0.40   0.50       0.50      0.40        0.48
BH                 0.01   0.00   0.80       0.40      0.50        0.02

Scenario 2
Algorithm          TPR    FPR    Precision  Accuracy  Error rate  F-measure
RF(30, 1m, 0.7)    1.00   0.02   0.97       0.99      0.01        0.99
RF(30, 1m, 0.01)   0.95   0.25   0.73       0.83      0.17        0.82
CCD                0.74   0.02   0.96       0.88      0.11        0.92
BClus              0.30   0.20   0.60       0.50      0.40        0.41
BH                 0.02   0.00   0.90       0.30      0.60        0.04

Scenario 6
Algorithm          TPR    FPR    Precision  Accuracy  Error rate  F-measure
RF(30, 1m, 0.7)    1.00   0.00   1.00       1.00      0.00        1.00
RF(30, 1m, 0.01)   0.96   0.19   0.74       0.86      0.14        0.83
CCD                0.00   0.00   —          0.64      0.35        0.00
BClus              0.00   0.00   0.40       0.40      0.50        0.04
BH                 0.06   0.00   0.98       0.38      0.61        0.11

Scenario 8
Algorithm          TPR    FPR    Precision  Accuracy  Error rate  F-measure
RF(30, 1m, 0.7)    0.20   0.04   0.94       0.81      0.19        0.33
RF(30, 1m, 0.01)   0.25   0.13   0.37       0.73      0.27        0.30
CCD                0.00   0.00   —          0.64      0.35        0.00
BClus              0.00   0.04   0.00       0.66      0.33        —
BH                 0.00   0.00   0.00       0.42      0.57        —

Scenario 9
Algorithm          TPR    FPR    Precision  Accuracy  Error rate  F-measure
RF(30, 1m, 0.7)    0.92   0.01   0.99       0.95      0.05        0.96
RF(30, 1m, 0.01)   0.94   0.09   0.99       0.94      0.06        0.95
CCD                0.38   0.04   0.93       0.59      0.40        0.54
BClus              0.10   0.20   0.40       0.40      0.50        0.25
BH                 0.02   0.00   0.90       0.40      0.50        0.03

In Scenario 9, some hosts are infected with the Neris malware, which actively starts sending spam e-mails. As in the previous scenarios, our approach also allows us to achieve higher effectiveness than the other methods.

In summary, the results of our system are good in comparison to the state of the art. It is worth noting that even when the TPR is not higher, the FPR is usually lower, which satisfies the requirements of security administrators.

The first aspect that should be commented on here is the poor performance of BClus and CCD in Scenarios 6 and 8. Upon consulting the original authors, it turned out that their results reported in [8, 9] were achieved for unidirectional NetFlows, while the published dataset contains only bidirectional NetFlows, which may cause a deterioration of the results. It may also be caused by the fact that (e.g., in the case of Scenario 6) the malware used only a few, quite specific communication channels to communicate with the C&C, which is not reflected in the training data. On the other hand, the poor performance of BClus could also be related to the fact that this method builds a statistical model that uses GMM clustering in the NetFlow parameters feature space, and probably the dataset (at least for some scenarios) cannot be modelled well by a Gaussian distribution.

It must also be noted that high error rates are reported for BotHunter (BH), a popular tool that uses a mix of anomaly- and signature-based techniques. As was noted in [22], the reasons for such poor performance could be (a) the fact that BH is not able to detect machines that had already been infected before the BH deployment and (b) the fact that the data from the external bots are not directly observed and are not sufficient for dialog analysis.


Table 2: Comparison of the TPR and FPR obtained for different time window scaling factors.

Feature vectors   Metric   Scenario 1   Scenario 2   Scenario 6   Scenario 8   Scenario 9
1m                TPR      0.81         1.00         1.00         0.20         0.92
1m                FPR      0.01         0.02         0.00         0.04         0.01
1m + 4m           TPR      0.75         0.90         1.00         0.20         0.82
1m + 4m           FPR      0.01         0.07         0.00         0.01         0.01
1m + 3m           TPR      0.86         0.92         1.00         0.20         0.94
1m + 3m           FPR      0.02         0.05         0.01         0.01         0.01
1m + 2m           TPR      0.84         0.95         1.00         0.20         0.92
1m + 2m           FPR      0.01         0.07         0.01         0.01         0.01

The second aspect is the statistical significance of the results. The reason why we do not include confidence levels for the presented results is the procedure proposed by the authors of the CTU benchmark dataset in [9]. The dataset consists of 13 cybersecurity-related scenarios, which have been divided into two parts, one for training and one for validation. Therefore, we train the system on one part of the data and present the results (generated with the dedicated software tool) for the other part. Other authors present their results in [8, 22] in a similar way. Under such an experimental protocol, reporting confidence levels is not meaningful.

6.2. Evaluation of Multiscale Feature Extraction Technique. In this section we present results for our feature extraction algorithm, which uses a multiscale analysis technique. In particular, we have investigated under which circumstances the concatenation of feature vectors obtained for several time windows of different lengths can further improve the performance of our approach.

Results are presented in Table 2. In these experiments we trained the proposed cost-sensitive random forest algorithm using feature vectors created from two time windows. The first row in the table presents the basic algorithm (using only one time window), indicated in Table 1 as RF(30, 1m, 0.7). It can be noted that the proposed multiscale feature extraction technique improves the results in 3 of the 5 analysed scenarios. Concatenating the feature vectors calculated for 1-minute time windows with feature vectors calculated for 3-minute time windows (1m + 3m) allows us to improve the TPR in Scenarios 1 and 9 by 5% and 2%, respectively. In both of these scenarios the algorithm was able to achieve similar values of FPR. In the case of Scenario 8, the proposed method was also able to decrease the FPR. However, we observed a deterioration of the results for Scenario 2.

When concatenating 1-minute time windows with 2-minute time windows, an improvement in the results can also be observed; however, it is less significant.
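To illustrate the multiscale idea, the following sketch concatenates per-source aggregate features computed over 1-minute windows with those computed over the 3-minute windows that contain them. The aggregate statistics used here (flow count, total bytes, number of distinct destination ports) are placeholders chosen for illustration; they are not the exact features extracted by the proposed method.

```python
# Sketch of multiscale feature extraction: per-source features from a short
# window are concatenated with features from a longer window covering it.
import pandas as pd

def window_features(flows, window):
    # flows: DataFrame with columns StartTime (datetime), SrcAddr, TotBytes, Dport.
    grouped = flows.groupby([pd.Grouper(key="StartTime", freq=window), "SrcAddr"])
    feats = grouped.agg(
        n_flows=("TotBytes", "size"),
        total_bytes=("TotBytes", "sum"),
        n_dports=("Dport", "nunique"),
    )
    # Suffix the column names with the window length, e.g. n_flows_1min.
    return feats.add_suffix("_" + window)

def multiscale_features(flows, short="1min", long="3min"):
    short_f = window_features(flows, short).reset_index()
    long_f = window_features(flows, long).reset_index()
    # Align each short window with the long window that contains it, then
    # concatenate the two feature sets per source address.
    short_f["StartTime"] = short_f["StartTime"].dt.floor(long)
    return short_f.merge(long_f, on=["StartTime", "SrcAddr"], how="left")
```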

7. Conclusions

In this paper, the results of a proposed distributed malware detection system based on a Big Data analytics concept and architecture have been presented. The proposed approach analyses the malware activity that is captured by means of NetFlows. This paper presented the architecture of the proposed system, the implementation details, and an analysis of promising experimental results. Moreover, we contributed a method utilising a distributed random forest classifier to address the problem of data imbalance (typical in cybersecurity). Future work is dedicated to further improvements towards online machine learning (e.g., to analyse online streams).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] R. Kozik, M. Choraś, A. Flizikowski, M. Theocharidou, V. Rosato, and E. Rome, "Advanced services for critical infrastructures protection," Journal of Ambient Intelligence and Humanized Computing, vol. 6, no. 6, pp. 783–795, 2015.

[2] M. Choraś, R. Kozik, R. Renk, and W. Hołubowicz, "A Practical Framework and Guidelines to Enhance Cyber Security and Privacy," in International Joint Conference, vol. 369 of Advances in Intelligent Systems and Computing, pp. 485–495, Springer International Publishing, Cham, 2015.

[3] SNORT project homepage, http://www.snort.org.

[4] T. Andrysiak, Ł. Saganowski, M. Choraś, and R. Kozik, "Network Traffic Prediction and Anomaly Detection Based on ARFIMA Model," in International Joint Conference SOCO'14-CISIS'14-ICEUTE'14, vol. 299 of Advances in Intelligent Systems and Computing, pp. 545–554, Springer International Publishing, 2014.

[5] M. Choraś, R. Kozik, D. Puchalski, and W. Hołubowicz, "Correlation approach for SQL injection attacks detection," Advances in Intelligent Systems and Computing, vol. 189, pp. 177–185, 2013.

[6] B. Claise, "Cisco Systems NetFlow Services Export Version 9," Tech. Rep. RFC 3954, 2004.

[7] J. Francois, S. Wang, and R. State, "BotTrack: tracking botnets using NetFlow and PageRank," in Proceedings of IFIP TC6 Networking, pp. 1–14, Springer.

[8] S. García, M. Grill, J. Stiborek, and A. Zunino, "An empirical comparison of botnet detection methods," Journal of Computers and Security, vol. 45, pp. 100–123, 2014.

[9] S. García, Identifying, Modeling and Detecting Botnet Behaviors in the Network [Ph.D. thesis], 2014.

[10] BotHunter homepage, http://www.bothunter.net/about.html.

[11] F. Haddadi, D. Le Cong, L. Porter, and A. N. Zincir-Heywood, "On the effectiveness of different botnet detection approaches," Lecture Notes in Computer Science, vol. 9065, pp. 121–135, 2015.

[12] HDFS: Hadoop Distributed File System, Architecture Guide, https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.

[13] SparkSQL: Apache Spark SQL query language, overview, http://spark.apache.org/docs/latest/sql-programming-guide.html#sql.

[14] S. K. Jayanthi and S. Sasikala, "Reptree classifier for identifying link spam in web search engines," ICTACT Journal on Soft Computing, vol. 03, no. 02, pp. 498–505, 2013.

[15] G. Folino and F. S. Pisani, "Combining ensemble of classifiers by using genetic programming for cyber security applications," Lecture Notes in Computer Science, vol. 9028, pp. 54–66, 2015.

[16] R. Kozik and M. Choraś, "Adapting an Ensemble of One-Class Classifiers for a Web-Layer Anomaly Detection System," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2015), pp. 724–729, Poland, November 2015.

[17] M. Woźniak, "Data and Knowledge Hybridization," in Hybrid Classifiers, vol. 519 of Studies in Computational Intelligence, pp. 59–93, Springer, Berlin, Heidelberg, 2013.

[18] R. Kozik and M. Choraś, "Solution to Data Imbalance Problem in Application Layer Anomaly Detection Systems," in Hybrid Artificial Intelligent Systems, vol. 9648 of Lecture Notes in Computer Science, pp. 441–450, Springer International Publishing, 2016.

[19] B. Krawczyk, "Learning from imbalanced data: open challenges and future directions," Progress in Artificial Intelligence, vol. 5, no. 4, pp. 221–232, 2016.

[20] Botnet Detectors Comparer, download page, http://downloads.sourceforge.net/project/botnetdetectorscomparer/BotnetDetectorsComparer-0.9.

[21] Y. Wang and S. Xia, "Unifying attribute splitting criteria of decision trees by Tsallis entropy," in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2507–2511, New Orleans, LA, March 2017.

[22] J. Wang and I. C. Paschalidis, "Botnet detection based on anomaly and community detection," IEEE Transactions on Control of Network Systems, vol. 4, no. 2, pp. 392–404, 2017.





6 Results

61 Comparison with Other Methods During the exper-iments different configurations of the proposed patternextractionmethods have been analysedThe extracted featurevectors have been used to feed the random forest machinelearning algorithms which were also evaluated for differentconfigurations The following notation is used in Table 1 todescribe a different configuration of the proposed method

RF 119873 119882 119878 (17)

where 119873 indicates the number of trees used by the randomforest classifier 119882 indicates the width of disjoint time win-dows and 119878 represents the friction of vectors undersampled

Input119873 training samples of the form (1199091 1199101) (119909119873 119910119873)

such that 119909119894 isin 119877119872 is a feature vector and 119910119894 isin 119871 is its

class label119875 = 119901119894 | 119894 isin 119871 sampling rates of labels 119871 where 119901119894

indicates the sampling probability of the 119894-th labelOutput

Random Forest Classifier119863 119883 rarr 119884For 119905 = 1 119879

(1) Sample 119883 119884 using 119875 sampling rates(2) Train Random Forest Classifier119863119905(3) Calculate the effectiveness 119864119905 of the119863119905 classifier

on the remaining training dataReturn the119863119905lowast classifier with the highest accuracy where119905lowast = argmax119905119864119905

Algorithm 3 Cost-sensitive random forest training

from the majority class (in order to balance the data beforelearning)

Table 1 also includes other algorithms that have beenmentioned in the related work section namely BClus [8]CCD [9] and BotHunter [10] The effectiveness of thesemethods (for the CTU-13 dataset) has already been presentedin [8] and [9] respectively We are able to also compareour approach with these tools since we have used the sameexperimental setup In Table 1 we report the performancemetrics defined in the previous section such as precisionaccuracy and F-measure

Table 1 presents results obtained for five scenariosrecorded in the testing dataset containing botnet activitiesScenario 1 corresponds to an IRC-based botnet that sendsspam In this scenario CCD outperforms our proposedapproach in terms of TPR but it suffers a higher FPR rateMoreover in this scenario the proposed algorithm for cost-sensitive random forest classifier learning allows us to findan improved balance between the TP and FP rates For allexperiments we report the results for the same samplingratios 119875 (Algorithm 3)

In Scenario 2 the same IRC-based botnet as Scenario1 sends spam but for a shorter timespan Similarly withthe previous scenario the cost-sensitive learning allows usto decrease the number of false positives in contrast to theoriginal classifier Moreover for the same number of falsepositives (2) our approach allowed us to achieve highervalues of TPR

In Scenario 6 the botnet scans the SMPT mail serversfor several hours and connects to several remote desktopservices However it does not send any spam The proposedapproach performs very well for this type of attack whencompared to the other methods

In Scenario 8 the botnet contacts different CampC hostsand receives some encrypted data All of themethods achievea relatively low value of TPR However the cost-sensitiverandom forest gives a higher detection rate in contrast toCCD

8 Security and Communication Networks

Table 1 Effectiveness of the proposed cost-sensitive distributed random forest classifiers compared to the other benchmarkmethods reportedby the authors of the benchmark dataset in [8 9]

Algorithm Scenario 1TPR FPR Precision Accuracy Error rate F-measure

RF30 1m 07 081 001 096 095 006 088RF30 1m 001 090 014 068 087 013 077CCD 100 005 086 096 003 092BClus 040 040 050 050 040 048BH 001 000 080 040 050 002

Algorithm Scenario 2TPR FPR Precision Accuracy Error rate F-measure

11987711986530 1m 07 100 002 097 099 001 099RF30 1m 001 095 025 073 083 017 082CCD 074 002 096 088 011 092BClus 030 020 060 050 040 041BH 002 000 090 030 060 004

Algorithm Scenario 6TPR FPR Precision Accuracy Error rate F-measure

11987711986530 1m 07 100 000 100 100 000 100RF30 1m 001 096 019 074 086 014 083CCD 000 000 mdash 064 035 000BClus 000 000 040 040 050 004BH 006 000 098 038 061 011

Algorithm Scenario 8TPR FPR Precision Accuracy Error rate F-measure

11987711986530 1m 07 020 004 094 081 019 033RF30 1m 001 025 013 037 073 027 030CCD 000 000 mdash 064 035 000BClus 000 004 000 066 033 mdashBH 000 000 000 042 057 mdash

Algorithm Scenario 9TPR FPR Precision Accuracy Error rate F-measure

11987711986530 1m 07 092 001 099 095 005 096RF30 1m 001 094 009 099 094 006 095CCD 038 004 093 059 040 054BClus 010 020 040 040 050 025BH 002 000 090 040 050 003

In Scenario 9 some hosts are infected with the Nerismalware which actively starts sending spam e-mails As inthe previous scenarios our approach also allows us to achievehigher effectiveness than the other methods

In summary the results of our system are good incomparison to the state of the art It is worth noting that evenif TPR is not higher usually the FPR is lower which satisfiesthe requirements of security administrators

The first aspect that should be commented on hereis the poor performance of BClus and CCD in Scenarios6 and 8 Upon consulting the original authors regardingthis it turned out that their results reported in [8 9] areachieved for unidirectional NetFlows while the publisheddataset contains only bidirectional NetFlows This may causedeterioration of the results It may also be caused by the fact

that (eg in the case of Scenario 6) the malware has usedonly a few and quite specific communication channels tocommunicate with CampC which has not been reflected in thetraining data On the other hand the poor performance ofBClus could also be related to the fact that this method buildsa statistical model that utilises GMM clustering in NetFlowparameters feature space and probably the dataset (at leastfor some scenarios) cannot be modelled well by a Gaussiandistribution

It must be noted that high error rates are also reported forBotHunter The BotHunter (BH) is a popular tool that usesa mix of anomaly- and signature-based techniques As wasnoted in [22] the reason for such poor performance couldbe (a) the fact that BH is not able to detect machines thathave already been infected before the BH deployment and (b)

Security and Communication Networks 9

Table 2 Comparison of TP and FP obtained for different time window scaling factors

Feature Error ScenariosVectors Metrics 1 2 6 8 9

1m TPR 081 100 100 020 092FPR 001 002 000 004 001

1m + 4m TPR 075 090 100 020 082FPR 001 007 000 001 001

1m + 3m TPR 086 092 100 020 094FPR 002 005 001 001 001

1m + 2m TPR 084 095 100 020 092FPR 001 007 001 001 001

the fact that the data from the external bots are not directlyobserved and are not sufficient for dialog analysis

The second aspect is the statistical significance of theresults The reason why we did not include the confidencelevels for the presented results is due to the procedureproposed by the authors of the CTU benchmark dataset in[9]The dataset consists of 13 cybersecurity related scenarioswhich have been divided into two parts one for training andone for validation Therefore we can train the system onone part of the data and present the results (generated withthe dedicated software tool) for another part In a similarway other authors present their results in [8 22] In suchexperimental protocol it is not relevant to provide confidencelevels

62 Evaluation of Multiscale Feature Extraction TechniqueIn this section we present results for our feature extractionalgorithm which uses a multiscale analysis technique Inparticular we have investigated under which circumstancesthe concatenation of feature vectors obtained for severaltime windows of different lengths can further improve theperformance of our approach

Results are presented in Table 2 In all experiments wetrained the proposed cost-sensitive random forest algorithmusing feature vectors created from two time windows Thefirst row in the table presents the basic (using only one timewindow) algorithm indicated in Table 1 as RF30 1m 07 Itcan be noted that the proposed multiscale feature extractiontechnique improves results in 3 of the 5 analysed scenariosConcatenating the feature vectors calculated for 1-minutetime windows with feature vectors calculated for 3-minutetime windows (1m + 3m) allows us to improve the TPR inScenarios 1 and 9 by 5 and 2 respectively In both ofthese scenarios the algorithm was able to achieve similarvalues of FPR In the case of Scenario 8 the proposedmethodwas also able to decrease the FPR However we observed adeterioration of results for Scenario 2

When concatenating 1-minute time windows with 2-minute time windows an improvement in results can beobserved however it is less significant

7 Conclusions

In this paper the results of a proposed malware detectiondistributed system based on a Big Data analytics concept and

architecture have been presented The proposed approachanalyses the malware activity that is captured by meansof NetFlows This paper presented the architecture of theproposed system the implementation details and an analysisof promising experimental results Moreover we contributeda method utilising a distributed random forest classifier toaddress the problem of data imbalance (typical in cyberse-curity) Future work is dedicated to further improvementstowards online machine learning concept (eg to analyseonline streams)

Conflicts of Interest

The authors declare that there are no conflicts of interestregarding the publication of this paper

References

[1] R Kozik M Choras A Flizikowski M Theocharidou VRosato and E Rome ldquoAdvanced services for critical infrastruc-tures protectionrdquo Journal of Ambient Intelligence and Human-ized Computing vol 6 no 6 pp 783ndash795 2015

[2] M Choras R Kozik R Renk andW Hołubowicz ldquoA PracticalFramework and Guidelines to Enhance Cyber Security andPrivacyrdquo in International Joint Conference vol 369 of Advancesin Intelligent Systems and Computing pp 485ndash495 SpringerInternational Publishing Cham 2015

[3] SNORT Project homepage httpwwwsnortorg[4] T Andrysiak Ł Saganowski M Choras and R Kozik ldquoNet-

work Traffic Prediction and Anomaly Detection Based onARFIMA Modelrdquo in International Joint Conference SOCOrsquo14-CISISrsquo14-ICEUTErsquo14 vol 299 of Advances in Intelligent Systemsand Computing pp 545ndash554 Springer International Publish-ing 2014

[5] M Choras R Kozik D Puchalski andWHołubowicz ldquoCorre-lation approach for SQL injection attacks detectionrdquo Advancesin Intelligent Systems and Computing vol 189 pp 177ndash185 2013

[6] B Claise ldquoCisco Systems NetFlow Services Export Version 9rdquoTech Rep RFC3954 2004

[7] J Francois S Wang and R State ldquoBotTrack tracking botnetsusing NetFlow and PageRankrdquo in Proceedings of the IFIPTC6Networking pp 1ndash14 Springer

[8] S Garcıa M Grill J Stiborek and A Zunino ldquoAn empiricalcomparison of botnet detectionmethodsrdquo Journal of Computersand Security vol 45 pp 100ndash123 2014

10 Security and Communication Networks

[9] S Garcia Identifying Modeling and Detecting Botnet Behaviorsin the Network [PhD thesis] 2014

[10] BotHunter homepage url httpwwwbothunternetabouthtml

[11] F Haddadi D Le Cong L Porter and A N Zincir-HeywoodldquoOn the effectiveness of different botnet detection approachesrdquoLecture Notes in Computer Science (including subseries LectureNotes in Artificial Intelligence and Lecture Notes in Bioinformat-ics) Preface vol 9065 pp 121ndash135 2015

[12] HDFS Hadoop Distirbuted File System ndashArchitecture Guideurl httpshadoopapacheorgdocsr121hdfs designhtml

[13] SparkSQL Apache Spark SQL query language ndashoverviewurl httpsparkapacheorgdocslatestsql-programming-guidehtmlsql

[14] S K Jayanthi and S Sasikala ldquoReptree classifier for identifyinglink spam in web search enginesrdquo ICTACT Journal on SoftComputing vol 03 no 02 pp 498ndash505 2013

[15] G Folino and F S Pisani ldquoCombining ensemble of classifiersby using genetic programming for cyber security applicationsrdquoLecture Notes in Computer Science (including subseries LectureNotes in Artificial Intelligence and Lecture Notes in Bioinformat-ics) Preface vol 9028 pp 54ndash66 2015

[16] R Kozik and M Choras ldquoAdapting an Ensemble of One-ClassClassifiers for a Web-Layer Anomaly Detection Systemrdquo inProceedings of the 10th International Conference on P2P ParallelGrid Cloud and Internet Computing 3PGCIC 2015 pp 724ndash729pol November 2015

[17] M Wozniak ldquoData and Knowledge Hybridizationrdquo in HybridClassifiers vol 519 of Studies in Computational Intelligence pp59ndash93 Springer Berlin Heidelberg Berlin Germany 2013

[18] R Kozik and M Choras ldquoSolution to Data Imbalance Problemin Application Layer Anomaly Detection Systemsrdquo in HybridArtificial Intelligent Systems vol 9648 of Lecture Notes in Com-puter Science pp 441ndash450 Springer International Publishing2016

[19] B Krawczyk ldquoLearning from imbalanced data open challengesand future directionsrdquo Progress in Artificial Intelligence vol 5no 4 pp 221ndash232 2016

[20] Botnet Detectors Comparer The download page of the toolurlhttpdownloadssourceforgenetprojectbotnetdetector-scomparerBotnetDetectorsComparer-09

[21] Y Wang and S Xia ldquoUnifying attribute splitting criteria ofdecision trees by Tsallis entropyrdquo in Proceedings of the 2017IEEE International Conference on Acoustics Speech and SignalProcessing (ICASSP) pp 2507ndash2511 New Orleans LA March2017

[22] J Wang and I C Paschalidis ldquoBotnet detection based onanomaly and community detectionrdquo IEEE Transactions onControl of Network Systems vol 4 no 2 pp 392ndash404 2017

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal of

Volume 201

Submit your manuscripts athttpswwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 201

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of Antennas and

Propagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

DistributedSensor Networks

International Journal of

Page 8: Pattern Extraction Algorithm for NetFlow-Based …downloads.hindawi.com/journals/scn/2017/6047053.pdfPattern Extraction Algorithm for NetFlow-Based Botnet ... we describe the proposed

8 Security and Communication Networks

Table 1 Effectiveness of the proposed cost-sensitive distributed random forest classifiers compared to the other benchmarkmethods reportedby the authors of the benchmark dataset in [8 9]

Algorithm Scenario 1TPR FPR Precision Accuracy Error rate F-measure

RF30 1m 07 081 001 096 095 006 088RF30 1m 001 090 014 068 087 013 077CCD 100 005 086 096 003 092BClus 040 040 050 050 040 048BH 001 000 080 040 050 002

Algorithm Scenario 2TPR FPR Precision Accuracy Error rate F-measure

11987711986530 1m 07 100 002 097 099 001 099RF30 1m 001 095 025 073 083 017 082CCD 074 002 096 088 011 092BClus 030 020 060 050 040 041BH 002 000 090 030 060 004

Algorithm Scenario 6TPR FPR Precision Accuracy Error rate F-measure

11987711986530 1m 07 100 000 100 100 000 100RF30 1m 001 096 019 074 086 014 083CCD 000 000 mdash 064 035 000BClus 000 000 040 040 050 004BH 006 000 098 038 061 011

Algorithm Scenario 8TPR FPR Precision Accuracy Error rate F-measure

11987711986530 1m 07 020 004 094 081 019 033RF30 1m 001 025 013 037 073 027 030CCD 000 000 mdash 064 035 000BClus 000 004 000 066 033 mdashBH 000 000 000 042 057 mdash

Algorithm Scenario 9TPR FPR Precision Accuracy Error rate F-measure

11987711986530 1m 07 092 001 099 095 005 096RF30 1m 001 094 009 099 094 006 095CCD 038 004 093 059 040 054BClus 010 020 040 040 050 025BH 002 000 090 040 050 003

In Scenario 9 some hosts are infected with the Nerismalware which actively starts sending spam e-mails As inthe previous scenarios our approach also allows us to achievehigher effectiveness than the other methods

In summary the results of our system are good incomparison to the state of the art It is worth noting that evenif TPR is not higher usually the FPR is lower which satisfiesthe requirements of security administrators

The first aspect that should be commented on hereis the poor performance of BClus and CCD in Scenarios6 and 8 Upon consulting the original authors regardingthis it turned out that their results reported in [8 9] areachieved for unidirectional NetFlows while the publisheddataset contains only bidirectional NetFlows This may causedeterioration of the results It may also be caused by the fact

that (eg in the case of Scenario 6) the malware has usedonly a few and quite specific communication channels tocommunicate with CampC which has not been reflected in thetraining data On the other hand the poor performance ofBClus could also be related to the fact that this method buildsa statistical model that utilises GMM clustering in NetFlowparameters feature space and probably the dataset (at leastfor some scenarios) cannot be modelled well by a Gaussiandistribution

It must be noted that high error rates are also reported forBotHunter The BotHunter (BH) is a popular tool that usesa mix of anomaly- and signature-based techniques As wasnoted in [22] the reason for such poor performance couldbe (a) the fact that BH is not able to detect machines thathave already been infected before the BH deployment and (b)

Security and Communication Networks 9

Table 2 Comparison of TP and FP obtained for different time window scaling factors

Feature Error ScenariosVectors Metrics 1 2 6 8 9

1m TPR 081 100 100 020 092FPR 001 002 000 004 001

1m + 4m TPR 075 090 100 020 082FPR 001 007 000 001 001

1m + 3m TPR 086 092 100 020 094FPR 002 005 001 001 001

1m + 2m TPR 084 095 100 020 092FPR 001 007 001 001 001

the fact that the data from the external bots are not directlyobserved and are not sufficient for dialog analysis

The second aspect is the statistical significance of theresults The reason why we did not include the confidencelevels for the presented results is due to the procedureproposed by the authors of the CTU benchmark dataset in[9]The dataset consists of 13 cybersecurity related scenarioswhich have been divided into two parts one for training andone for validation Therefore we can train the system onone part of the data and present the results (generated withthe dedicated software tool) for another part In a similarway other authors present their results in [8 22] In suchexperimental protocol it is not relevant to provide confidencelevels

62 Evaluation of Multiscale Feature Extraction TechniqueIn this section we present results for our feature extractionalgorithm which uses a multiscale analysis technique Inparticular we have investigated under which circumstancesthe concatenation of feature vectors obtained for severaltime windows of different lengths can further improve theperformance of our approach

Results are presented in Table 2. In all experiments, we trained the proposed cost-sensitive random forest algorithm using feature vectors created from two time windows. The first row of the table presents the basic algorithm (using only one time window), indicated in Table 1 as RF30 1m 0.7. It can be noted that the proposed multiscale feature extraction technique improves results in 3 of the 5 analysed scenarios. Concatenating the feature vectors calculated for 1-minute time windows with feature vectors calculated for 3-minute time windows (1m + 3m) allows us to improve the TPR in Scenarios 1 and 9 by 5 and 2 percentage points, respectively. In both of these scenarios, the algorithm achieved similar values of FPR. In the case of Scenario 8, the proposed method was also able to decrease the FPR. However, we observed a deterioration of results for Scenario 2.

When concatenating 1-minute time windows with 2-minute time windows, an improvement in results can also be observed; however, it is less significant.
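The distributed, cost-sensitive random forest itself is not reproduced here; a rough single-machine analogue, assuming scikit-learn and illustrative parameter values (the class weight and decision threshold below are assumptions, not values taken from the paper), could look as follows:

from sklearn.ensemble import RandomForestClassifier

# Cost sensitivity approximated with class weights: misclassifying the
# minority (botnet) class is penalised more heavily. The 30 trees mirror
# the "RF30" setting; the remaining values are assumptions for illustration.
clf = RandomForestClassifier(n_estimators=30, class_weight={0: 1, 1: 10}, random_state=42)
# clf.fit(X_train, y_train)                  # multiscale feature vectors and labels
# scores = clf.predict_proba(X_test)[:, 1]   # botnet-class probability per host/window
# predictions = (scores >= 0.7).astype(int)  # illustrative decision threshold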

7. Conclusions

In this paper, the results of the proposed distributed malware detection system, based on a Big Data analytics concept and architecture, have been presented. The proposed approach analyses the malware activity that is captured by means of NetFlows. This paper presented the architecture of the proposed system, the implementation details, and an analysis of promising experimental results. Moreover, we contributed a method utilising a distributed random forest classifier to address the problem of data imbalance (typical in cybersecurity). Future work is dedicated to further improvements towards the online machine learning concept (e.g., to analyse online streams).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

[1] R. Kozik, M. Choraś, A. Flizikowski, M. Theocharidou, V. Rosato, and E. Rome, "Advanced services for critical infrastructures protection," Journal of Ambient Intelligence and Humanized Computing, vol. 6, no. 6, pp. 783–795, 2015.

[2] M. Choraś, R. Kozik, R. Renk, and W. Hołubowicz, "A Practical Framework and Guidelines to Enhance Cyber Security and Privacy," in International Joint Conference, vol. 369 of Advances in Intelligent Systems and Computing, pp. 485–495, Springer International Publishing, Cham, 2015.

[3] SNORT, project homepage, http://www.snort.org.

[4] T. Andrysiak, Ł. Saganowski, M. Choraś, and R. Kozik, "Network Traffic Prediction and Anomaly Detection Based on ARFIMA Model," in International Joint Conference SOCO'14-CISIS'14-ICEUTE'14, vol. 299 of Advances in Intelligent Systems and Computing, pp. 545–554, Springer International Publishing, 2014.

[5] M. Choraś, R. Kozik, D. Puchalski, and W. Hołubowicz, "Correlation approach for SQL injection attacks detection," Advances in Intelligent Systems and Computing, vol. 189, pp. 177–185, 2013.

[6] B. Claise, "Cisco Systems NetFlow Services Export Version 9," Tech. Rep. RFC 3954, 2004.

[7] J. Francois, S. Wang, and R. State, "BotTrack: tracking botnets using NetFlow and PageRank," in Proceedings of IFIP TC6 Networking, pp. 1–14, Springer.

[8] S. García, M. Grill, J. Stiborek, and A. Zunino, "An empirical comparison of botnet detection methods," Computers and Security, vol. 45, pp. 100–123, 2014.


[9] S. García, Identifying, Modeling and Detecting Botnet Behaviors in the Network [Ph.D. thesis], 2014.

[10] BotHunter, project homepage, http://www.bothunter.net/about.html.

[11] F. Haddadi, D. Le Cong, L. Porter, and A. N. Zincir-Heywood, "On the effectiveness of different botnet detection approaches," Lecture Notes in Computer Science, vol. 9065, pp. 121–135, 2015.

[12] HDFS, Hadoop Distributed File System – Architecture Guide, https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html.

[13] SparkSQL, Apache Spark SQL query language – overview, http://spark.apache.org/docs/latest/sql-programming-guide.html#sql.

[14] S. K. Jayanthi and S. Sasikala, "Reptree classifier for identifying link spam in web search engines," ICTACT Journal on Soft Computing, vol. 3, no. 2, pp. 498–505, 2013.

[15] G. Folino and F. S. Pisani, "Combining ensemble of classifiers by using genetic programming for cyber security applications," Lecture Notes in Computer Science, vol. 9028, pp. 54–66, 2015.

[16] R. Kozik and M. Choraś, "Adapting an Ensemble of One-Class Classifiers for a Web-Layer Anomaly Detection System," in Proceedings of the 10th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (3PGCIC 2015), pp. 724–729, Poland, November 2015.

[17] M. Woźniak, "Data and Knowledge Hybridization," in Hybrid Classifiers, vol. 519 of Studies in Computational Intelligence, pp. 59–93, Springer Berlin Heidelberg, Berlin, Germany, 2013.

[18] R. Kozik and M. Choraś, "Solution to Data Imbalance Problem in Application Layer Anomaly Detection Systems," in Hybrid Artificial Intelligent Systems, vol. 9648 of Lecture Notes in Computer Science, pp. 441–450, Springer International Publishing, 2016.

[19] B. Krawczyk, "Learning from imbalanced data: open challenges and future directions," Progress in Artificial Intelligence, vol. 5, no. 4, pp. 221–232, 2016.

[20] Botnet Detectors Comparer, download page of the tool, http://downloads.sourceforge.net/project/botnetdetectorscomparer/BotnetDetectorsComparer-0.9.

[21] Y. Wang and S. Xia, "Unifying attribute splitting criteria of decision trees by Tsallis entropy," in Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2507–2511, New Orleans, LA, March 2017.

[22] J. Wang and I. C. Paschalidis, "Botnet detection based on anomaly and community detection," IEEE Transactions on Control of Network Systems, vol. 4, no. 2, pp. 392–404, 2017.
