botnet detection based on anomaly and …sites.bu.edu/paschalidis/files/2016/07/botnets-tcns...1...

1

Botnet Detection based on Anomaly andCommunity Detection ∗

Jing Wang† and Ioannis Ch. Paschalidis‡

Abstract—We introduce a novel two-stage approach forthe important cyber-security problem of detecting thepresence of a botnet and identifying the compromised nodes(the bots), ideally before the botnet becomes active. Thefirst stage detects anomalies by leveraging large deviationsof an empirical distribution. We propose two approachesto create the empirical distribution: a flow-based approachestimating the histogram of quantized flows, and a graph-based approach estimating the degree distribution of nodeinteraction graphs, encompassing both Erdos-Renyi graphsand scale-free graphs. The second stage detects the botsusing ideas from social network community detection ina graph that captures correlations of interactions amongnodes over time. Community detection is done by maximiz-ing a modularity measure in this graph. The modularitymaximization problem is non-convex. We propose a convexrelaxation, an effective randomization algorithm, and es-tablish sharp bounds on the suboptimality gap. We applyour method to real-world botnet traffic and compare itsperformance with other methods.

Index Terms—Anomaly detection, cyber-security, bot-nets, social networks, random graphs, optimization.

I. INTRODUCTION

A botnet is a network of compromised computerscontrolled by a “botmaster.” Botnets are typically usedfor Distributed Denial-of-Service (DDoS) attacks, clickfraud, or spamming. DDoS attacks flood the victim withpackets/requests from many bots, effectively consumingcritical resources and denying service to legitimate users.Botnet attacks are widespread. In a recent survey, 300 outof 1000 surveyed businesses have suffered from DDoSattacks and 65% of the attacks cause up to $10,000 lossper hour [1]. Both click fraud and spamming are harmfulto the web economy.

Because of these losses, botnet detection has receivedconsiderable attention. Common intrusion detection fo-cuses on individual hosts but is often ineffective inpreventing botnet formation because not all hosts arezealously monitored and protected.

Early botnets used IRC, a protocol initially designedfor Internet chat, to Command and Control (C&C) their

Research partially supported by the NSF under grants CNS-1239021,IIS-1237022, and CCF-1527292, by the ARO under grants W911NF-11-1-0227 and W911NF-12-1-0390, and by the ONR under grantN00014-10-1-0952.† Center for Information and Systems Eng., Boston University, 15

St. Mary’s St., Brookline, MA 02446, [email protected].‡ Dept. of Electrical & Computer Eng., and Division of Systems

Eng., Boston University, 8 Mary’s St., Boston, MA 02215, [email protected], http://sites.bu.edu/paschalidis/.

bots (infected machines). As a result, a lot of botnet de-tection methods exploited this feature [2], [3]. Recently,though, botnets have evolved to bypass these detectionmethods by using more flexible C&C channels, such asHTTP and P2P protocols. In addition, more types of C&Cchannels are emerging, including Twitter.

Some methods have been proposed to handle thesenovel botnets with more flexible C&C mechanisms byanalyzing the communication patterns among hosts. [4]proposes a method, named BotMagnifier, that deducesbots through their communication with a set of seed IPs.However, only spam bots can be handled by BotMagni-fier and the seed IPs need to be given as input data.An alternative approach called BotHunter [5] modelsthe infection process using a state transition diagram.A variety of methods are used to detect these transitionsand determine whether a node is infected or not. Despiteits popularity, BotHunter has the drawback that it cannotdetect bots that were infected before the deploymentof the system, and its infection state diagram can onlydescribe a small set of bot behaviors.

In this paper, we propose a two-stage approach forbotnet detection. The first stage detects and collectsnetwork anomalies that are associated with the presenceof a botnet while the second stage identifies the botsby analyzing these anomalies (see Fig. 1). Our approachexploits the following two observations: (1) botmastersor attack targets are easier to detect because they com-municate with many other nodes, and (2) the activitiesof infected machines are more correlated with each otherthan those of normal machines [3].

For the first stage, we propose two anomaly detectionmethods, both of which leverage the theory of large de-viations [6]. Based on the stochastic model-free methodproposed in [7], [8], [9], the first anomaly detectionmethod quantizes flow-level data (e.g., Cisco NetFlows)and monitors the histogram of quantized flows. Thesecond anomaly detection method aggregates packet-level data to graphs and monitors their empirical degreedistribution. Compared to our preliminary work in [10]that considered only graph-based anomaly detection forErdos-Renyi graphs, here we also handle scale-freegraphs and draw on hypothesis testing to select theappropriate model.

For the second stage, we first identify a set ofhighly interactive nodes, which are referred to as pivotalnodes. Both botmasters and targets are very likely to

This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication.The final version of record is available at http://dx.doi.org/10.1109/TCNS.2016.2532804

Copyright (c) 2016 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing [email protected].

2

be pivotal nodes because they need to interact withbots frequently. These interactions correspond to C&Ctraffic for botmasters and attacking traffic for targets. Ineither case, the interactions between each bot and pivotalnodes are correlated. To characterize this correlation,we construct a Social Correlation Graph (SCG), whoseformal definition is in Sec. IV-B1. We can detect botsby detecting the community that exhibits high interactionwith pivotal nodes in the SCG. We propose a novel com-munity detection method based on a refined modularitymeasure. This problem is posed as a maximization of themodularity measure, which is NP-complete. We developa convex relaxation scheme and establish bounds on itsperformance using ideas for the MAXCUT [11] problem.

Figure 1. Overview of our method.

Notation: Throughout the paper all vectors are as-sumed to be column vectors. We use lower case boldfaceletters to denote vectors and for economy of space wewrite x = (x0, . . . , xn) for the column vector x. Wedefine [x]i =

∑ij=0 xj . We use upper case boldface

letters to denote matrices, script letters for sets, anddenote by |A| the cardinality of set A. log denotes thenatural logarithm. Tr(·) denotes the trace of a matrix and� 0 positive semi-definiteness. Throughout the paper weuse n to denote the number of nodes in the network.

II. LARGE DEVIATIONS OF NETWORK PROCESSES

The anomaly detection stage (Stage 1) is based onanalyzing network processes such as network flows andthe degree of graphs representing node interactions. Thissection presents the large deviations results on whichanomaly detection will be based. We start with a formaldefinition of the Large Deviations Principle (LDP) for afamily of probability measures {µ(n)}.

Definition 1For every closed set B of probability vectors,

limn→∞

sup1n

logPn(µ(n) ∈ B) ≤ − infµ∈B

I(µ),

limn→∞

inf1n

logPn(µ(n) ∈ B) ≥ − infµ∈B◦

I (µ) ,

where B◦ denotes the interior of B and Pn is theprobability law.

More intuitively, Def. 1 states that when n is largeenough, the distribution Pn behaves as

Pn(µ(n) ≈ µ) � e−nI(µ). (1)

The function I (µ) characterizes the exponential decayrate of this probability and is called the rate function.

A. LDP for discrete random variables

Given a discrete random variable X whose alphabetis Σ = (σ1, . . . , σ|Σ|), the probability distribution of Xcan be written as a vector p = (p1, . . . , p|Σ|), where piis the probability of X being equal to σi.

Given n samples X = {x1, . . . , xn} of X , theempirical distribution is a vector µ(n) = (µ(n)

1 , . . . , µ(n)|Σ| )

where µ(n)i = (1/n)

∑nj=1 1(xj = σi). µ(n) satisfies an

LDP with rate function

I(µ) = D(µ ‖ p), (2)

where D(µ ‖ p) =∑i µi log(µi/pi) is the Kullback-

Leibler (KL) divergence of two probability vectors [6].

B. LDP for the degree distribution of random graphs

Let Gn denote the space of all undirected graphs withn vertices. For any graph G ∈ Gn, let d = (d1, . . . , dn)denote the labeled degree sequence of G, with di denot-ing the degree of node i. Let m = 1

2

∑nj=1 dj denote the

number of edges in G. We assume that any two nodesare connected by at most one edge, which implies thatthe node degree in G is less than n.

For 0 ≤ i ≤ n − 1, let hi =∑nj=1 1 (dj = i) be the

number of vertices in G of degree i, where 1(·) is theindicator function. Henceforth, h = (h0, . . . , hn−1), aquantity not dependent on the ordering of vertices, willbe referred to as the degree frequency vector of graphG. The empirical distribution of the degree sequence d,defined by µ(n), is a probability measure on N0 = N ∪{0} that puts mass hi/n at i, for 0 ≤ i ≤ n − 1. Inthe following sections, we present an LDP for empiricaldegree distributions µ(n) of two types of random graphs.

C. Erdos-Renyi model

In the Erdos-Renyi (ER) model, a graph is constructedby connecting nodes randomly. Each edge is included inthe graph with probability p, independent from everyother edge. We will use G(n, p) to denote this model.



3

The distribution of the degree of any particular vertex vis binomial. Namely,

P (dv = k) =(n− 1k

)pk(1− p)n−1−k.

It it well known that when the number of nodes n→∞and np is constant, the binomial distribution converges tothe Poisson distribution. Let λ = np denote the constant.Then, in the limiting case, the probability that the degreeof a node is k equals

PER(k;λ) = pλk = λke−λ/k!, (3)

which is independent of the node label. Let pλ =(pλ0, . . . , pλ∞) be the Poisson distribution viewed as avector parametrized by λ.

Let P(N0) be the space of all probability measuresdefined on N0. We view any probability measure µ ∈P(N0) as an infinite vector µ = (µ0, . . . , µ∞). LetS = {µ ∈ P(N0) | µ :=

∑∞i=0 iµi <∞} be the set of

all probability measures on N0 with finite mean.It is easy to verify that pλ ∈ S . Let Pn denote the

degree distribution of the ER model G(n, λ/n) on thespace Gn. [12] proves that the ER model satisfies anLDP for the empirical degree distribution µ(n) in subsetsof S under Pn with the following rate function.

Definition 2For the ER model G(n, λ/n) define the rate functionIER : S → [−∞,∞] as

IER (µ;λ) =

D(µ ‖ pλ) +12

(µ− λ) +µ

2log λ− µ

2log µ,

where D(µ ‖ pλ) =∑i µi log(µi/pλi) is the KL

divergence of µ with respect to pλ.

D. Preferential attachment model

Preferential Attachment (PA) processes are graph net-works that evolve in time by linking new nodes progres-sively to existing nodes, where the probability of eachexisting node to be linked depends on its degree [13].We view a PA process as a sequence of random graphsG = {G1, . . . ,Gn}, where Gj is the random graph attime j. We assume that only one new node is attachedeach time, i.e., |Gj+1|−|Gj | = 1 for all j = 1, . . . , n−1.

Initially, the graph G1 consists of two nodes with asingle (undirected) edge between them. At time j + 1,a new node is linked to node x ∈ Gj with probabilityproportional to some function w(dx(j)), where dx(j) isthe degree of node x at time j. The graph is called “scale-free” when w(d) = d+α and α > −1. Here, “scale-free”means that the degree distribution of Gn converges to apower-law (i.e., µ(n)

k ∼ k−θ for some θ > 0) as n→∞.

1) Generalized Polya urn model: The evolution ofthe degree distribution in the PA model is equivalent toa generalized Polya urn model [14] U = {U0, . . . ,Un}as follows. Initially, U0 has two empty urns. Supposep(t) : [0, 1] → [0, 1] and β(t) : [0, 1] → [0,∞) are twogiven functions. At each time j + 1, we first add a newurn with no ball to the collection and then

1) with probability p(j/n), we place a new ball inthis new urn;

2) with probability 1 − p(j/n), we place a new ballin one of the other urns x ∈ Uj , selecting x withprobability proportional to w(bx) = bx + β(j/n),where bx is the number of balls in the urn x.

To make the connection between this model and a graph,associate an urn with k balls with a node of the graphwith degree k + 1. Initially, U0 corresponds to a graphwith two connected nodes, each with degree 1. At eachtime we add a new node and connect it to an existingnode. Placing a ball in an urn with no balls, is equivalentto connecting the new node to an existing node of degree1, ending up with one node of degree 2 and one nodewith degree 1. Placing a ball in an existing urn x withbx balls is equivalent to connecting a node of degree 1(an urn with no balls like the one introduced each time)with an existing node whose degree increases by 1.

Suppose µ = (µ0, . . . , µn) is the degree distributionwe observe at last (as t = 1). We consider two specialcases of the PA model whose degree distributions satisfya power-law asymptotically [13].

2) Offset Barabasi-Albert (BA) process: If p(t) ≡ 0,β(t) ≡ 1 + α, then the generalized urn model becomesa so-called “offset” BA process with selection functionw(d) = d + α. An “offset” BA process generates treeswith no cycles. In this model, the probability for a nodeto have degree k is PBA (k;α). It has been shown thatasymptotically (n→∞) [13]

PBA(k;α) � k−(α+3)/ζ(α+ 3),

where ζ(x) =∑∞k=1 k

−x is Riemann’s zeta-function.We can obtain that the empirical degree distribution of

the offset BA process satisfies a sample-path LDP withrate function

IBA(µ;α) =∑i≥0

(1−[µ]i) log1− [µ]i

(i+ 1 + α)µi/(2 + α)

+ (1−∑i≥0iµi) log(2 + α), (4)

where [µ]i =∑ij=0 µi (see Appendix A).

3) Chung-Handjani-Jungreis (CHJ) model: If p(t) ≡p and β(t) ≡ 1, the generalized urn model correspondsto an attachment scheme that can generate graphs insteadof trees. This model is a special case of the CHJmodel [15] with asymptotic degree distribution

PCHJ(k; p) � k−(1+(1−p)−1)/ζ(1 + (1− p)−1).



4

We can obtain (see Appendix A) that the empiricaldegree distribution satisfies a sample-path LDP with ratefunction

ICHJ(µ; p) = (1− µ0) log1− µ0

p+ (1− p)µ0/2

+∑∞i=1(1− [µ]i) log

1− [µ]i(1− p)(i+ 1)µi/2

+ (1−∑i≥0iµi) log

21− p

. (5)

III. ANOMALY DETECTION

In this section, we propose two approaches foranomaly detections, both based on the results presentedin the previous section. The first approach, which takesnetflow files as input, quantizes flows and treats thequantized flows as observations of a discrete randomvariable. The second approach, which takes pcap filesas input, aggregates packets to graphs and leverages theLDP for the graph degree distribution. A GeneralizedLikelihood Ratio Test (GLRT) is proposed to select themost appropriate random graph model.

A. Flow-based approach

NetFlow offers a concise representation of networktraffic and has become a de-facto industry standard. Eachflow describes a communication session, whose durationvaries from seconds to days.

The overall idea of our approach is to quan-tize flows and to treat them as independent obser-vations of a discrete random variable. We first clus-ter network addresses (part of the flow characteristics)into a manageable number of clusters. For simplic-ity of notation, we only consider IPv4 addresses. Ifxi = (xi1, x

i2, x

i3, x

i4) ∈ {0, . . . , 255}4 and xj =

(xj1, xj2, x

j3, x

j4) ∈ {0, . . . , 255}4 are two IPv4 addresses,

a distance metric between them is defined as d(xi,xj) =∑4k=1 256(4−k)|xik−x

jk|. Letting X be the set of unique

IP addresses, we apply typical K-means clustering on Xusing the distance metric mentioned above. A flow willbe represented by the following features: (1) the clusterlabel of the source IP address, (2) the cluster label ofthe destination IP address, (3) the source port number,(4) the destination port number, (5) the flow duration,(6) the data bytes sent from source to destination, and(7) the data bytes sent from destination to source.

Suppose the input is a sequence of flows F ={f1, . . . , fn}. For each flow f , we quantize each featureseparately and denote by σ(f) the “type” of the quantizedflow while Σ is the corresponding alphabet. For anyρ ∈ Σ, the empirical measure is

µF (ρ) = (1/n)∑ni=11(σ(f i) = ρ). (6)

We write µF = {µF (ρ) : ρ ∈ Σ} for the vectorempirical measure.

Let µref denote the probability vector calculated fromsome reference flows Fref and let Iflow(µ) = D(µ ‖µref ). Using the LDP presented in Section II-A, wepropose the following anomaly detector:

Iflow(F) = 1(Iflow(µF ) ≥ ξ), (7)

where ξ is a detection threshold. It was shown in [7]that (7) is asymptotically Neyman-Pearson optimal. Wegroup flows into windows based on their timestamps andapply the anomaly detector above for each window.

B. Graph-based approach

We also propose a graph-based approach that pro-cesses packet-level data. We treat each packet as aninteraction record between the source and the destina-tion. We first group packets into windows based on theirtimestamps. For all k, we denote by Wk the collectionof packets in window k. We define the interaction graphfor window k as follows.

Definition 3(Interaction Graph). Let Ek be an edge set such that(i, j) ∈ Ek if there exists at least one packet r ∈ Wk

whose source is node i and destination is node j orsource is node j and destination is node i. Then theinteraction graph Gk = (V, Ek) corresponding to Wk

is an undirected graph with vertex set V , the set of allnodes in the network, and edge set Ek.

1) Model selection: Since the LDP rate function inrandom graphs depends on the graph model that is used,we first present a method to select the appropriate graphmodel. We assume M independent observations of nodedegrees D = {d1, . . . , dM} under normal operation ofthe network. In practice, we could pick l nodes in thenetwork randomly as monitoring points and observe thedegrees of these nodes in K different interaction graphs.This will lead to M = Kl samples. Note that theobservations at those l nodes may not be independent.Yet, the dependence is negligible if l� n.

For the ER model with degree distribution PER(k;λ)(cf. (3)) the log-likelihood of D is:

LER(D ;λ) = (log λ)∑Mi=1di − λM + C, (8)

where C = −∑i log(di!).

For both the offset BA model and the CHJ model, theasymptotic degree distribution is a power-law with theform PPA(k; γ) = k−γ/ζ(γ), where γ is a parameter.The log-likelihood of D is:

LPA(D ; γ) = −γ∑Mi=1 log di −M log ζ(γ). (9)



5

In practice, we may observe some isolated nodes i whosedegree di = 0. In this case, log di = −∞ and the PAmodel is completely ruled out. However, these isolatednodes may be the result of observation error. Insteadof ruling out the PA model completely, we add a finitepenalty θ for each isolated node. Namely, we write

LPA(D ; γ) = L+PA(D ; γ)− θ

∑i1(di = 0), (10)

where L+PA is simply the same as in (9) with the

summation taken only over i with di > 0. An appropriatevalue of θ can be determined through experiments, as wewill elaborate in Sec. V-D.

We use the Generalized Likelihood Ratio Test (GLRT)(see [16] for related discussion) to select the referencemodel. LetH0 be the hypothesis that the ER model is themost appropriate and H1 be the alternative hypothesis;then the GLRT is

maxλ

LER(D ;λ)−maxγ

LPA(D ; γ)H0

≷H1

η, (11)

where η is a prescribed threshold.The Maximum Likelihood Estimate (MLE) for the

parameters of the ER and PA models can be calculatedeasily by maximizing (8) and (9), respectively. Settingthe derivative of LER(D ;λ) to zero, the MLE for theparameter of the ER model is λ = (

∑Mi=1 di)/M .

Similarly, letting φ(x) = ζ(x)/ζ(x), where dot denotesderivative, and φ−1(x) be the inverse function of φ(x),the MLE for the parameter of the PA model is γ =φ−1(−(

∑di>0 log di)/M).

If the ER model is selected, we use IER(µ; λ) as therate function. Let now α be the MLE of the offset BAprocess and p be the MLE of the CHJ model; then α =γ − 3 and p = 1− (1− γ)−1, respectively. We can usethe following combined rate function

IPA(µ; γ) = min(IBA(µ; α), ICHJ(µ; p)) (12)

if the PA model is selected. It can be shown thatIPA(µ; γ) is the rate function for a random graph modelthat is either the offset BA model or the CHJ model (seeAppendix B).

2) A formal anomaly detection test: Next, we con-sider the problem of evaluating whether a graph G isnormal, i.e., comes from either the ER model or thePA model with desirable parameters (H0). Let µG bethe empirical degree distribution of the graph G. I(µG)in Definition 1 provides a rigorous indicator of thenormality of µG . The rate function that should be useddepends on the result of the model selection procedure.If the ER model is selected, we use IER(µ; λ) as therate function; otherwise, we use IPA(µ; γ).

Let

I(µ) =

{IER(µ, λ), if ER is selected,IPA(µ, γ), if PA is selected.

The graph-based anomaly detector is:

Igraph(G) = 1(I(µG ≥ ξ), (13)

where ξ is a detection threshold. The test in (13) isa generalized Hoeffding test. Since I(µG) satisfies anLDP regardless of the selected model, the generalizedHoeffding test (13) satisfies the Generalized Neyman-Pearson (GNP) criterion [6].

IV. BOTNET DISCOVERY

The network anomaly detection technique in the pre-vious section can only report whether there is anomalyor not. In this section, we present a botnet discoverytechnique that can identify botnets. We apply a slidingwindow to network traffic, monitor windows contin-uously and store all anomalies. The botnet discoverytechnique can be applied when the number of detectedanomalies is large enough.

The flow-based anomaly detector reports abnormalwindows and the graph-based anomaly detector reportsabnormal interaction graphs. We first unify the output ofthe two methods by defining interaction records.

An interaction record (ts, i, j) is a tuple of timestamp,source IP and destination IP, and it represents that nodei and j interact at time ts (regardless of direction). Forthe flow-based detector, an anomalous window can berepresented as a set of interaction records because eachflow can be easily converted to an interaction record. Forthe graph-based detector, an interaction graph can be alsorepresented as a set of interaction records because eachedge corresponds to an interaction record. As a result,in both methods, an anomaly can be represented as a setof interaction records S = {r1, . . . , r|S|}. The input ofour botnet discovery method is a sequence of anomaliesA =

{S1, . . . ,S|A|

}.

A. Identification of pivotal nodes

Detecting bots directly is non-trivial. Instead, detect-ing the leaders (botmasters) or targets is simpler becausethey are more interactive than normal nodes. Botmastersneed to “command and control” their bots to maintain thebotnet, and bots actively interact with victims in a typicalDDoS attack, even before the attack while they probe thetargets. Both leaders and targets, henceforth referred toas pivotal nodes, are highly interactive. Suppose the totalnumber of nodes that appear in A is n. Let Gijk be thenumber of interactions between node i and node j inanomaly Sk. Then,

ei = (1/ |A|)∑|A|k=1

∑nj=1G

ijk , i = 1, . . . , n, (14)

represents the amount of interaction of node i with allother nodes in A and will be called the total interactionmeasure of node i. We next define pivotal nodes.



6

Definition 4(Pivotal nodes). The set of pivotal nodes is N ={i : ei > τ}, where τ is a threshold.

After identifying pivotal nodes, we turn to detecting thecommunity associated with pivotal nodes.

B. Botnet discovery

We now focus on interactions between pivotal nodesand the remaining nodes. Pivotal nodes are either bot-masters or attack targets; in either case their interactionswith the bots are likely to be correlated. This sectionpresents a technique that detects a community of highlycorrelated and highly interactive nodes.

Compared to similar approaches in community detec-tion, e.g., the leader-follower algorithm [17], our methodtakes advantage of not only temporal features (amountof interaction) but also correlation relationships. Theserelationships are characterized using a graph, whosedefinition is presented next.

1) Construction of the Social Correlation Graph: Fori = 1, . . . , n, let Xi represent the number of interactionsbetween node i and pivotal nodes. For each anomalySk ∈ A, we obtain a sample of Xi as xki =

∑j∈N G

ijk .

Let Xi = (1/|A|)∑|A|k=1 x

ki be the sample mean and

σ(Xi) =√

(1/(|A| − 1))∑|A|k=1(xki − Xi)2 be the sam-

ple standard deviation of Xi for all i. The samplecorrelation coefficient is defined as

ρ(Xi, Xj) =

∑|A|k=1[(xki − Xi)(xkj − Xj)](|A| − 1)σ(Xi)σ(Xj)

,

with the convention that ρ(Xi, Xj) = 0 if σ(Xi) = 0 orσ(Xj) = 0. By definition, ρ(Xi, Xj) ∈ [−1, 1].

Definition 5(Social Correlation Graph). The Social CorrelationGraph (SCG) C = (V, Ec) is an undirected graph withvertex set V and edge set Ec = {(i, j) : |ρ(Xi, Xj)| >τρ}, where τρ is a prescribed threshold.

Because the behaviors of bots are correlated, they aremore likely to be connected to each other in the SCG.Our problem is to find an appropriate division of theSCG to separate bots and normal nodes. Our criterionfor “appropriate” is related to the well-known conceptof modularity in community detection [18], [19].

2) Modularity-based community detection: The prob-lem of community detection in a graph amounts todividing its nodes into non-overlapping groups such thatconnections within groups are relatively dense whilethose between groups are sparse [18]. The modularity fora given subgraph is defined to be the fraction of edgeswithin the subgraph minus the expected fraction of suchedges in a randomized null model. This measure has

inspired a broad range of community detection methodsnamed as modularity-maximization methods.

We consider the simpler case when there is only onebotnet in the network. As a result, we want to divide thenodes into two groups, one for bots and one for normalnodes. Define si as:

si =

{1, node i is a bot,−1, otherwise.

Let dci be the degree of node i in SCG C = (V, Ec) fori = 1, . . . , n and mc = (1/2)

∑i dci the number of edges

in C. For a partition specified by s = (s1, . . . , sn), itsmodularity is defined as in [18]

Q(s) = (1/2mc)∑ni,j=1(Aij −Nij)δ(si, sj), (15)

where δ(si, sj) = (1/2)(sisj + 1). It is easy to observethat δ(si, sj) is an indicator of whether node i and nodej are of the same type. Aij = 1(|ρ(Xi, Xj)| > ερ) is anindicator of the adjacency of nodes i and j. Nij is theexpected number of edges between node i and node j in anull model. The selection of the null model is empirical;a common choice [19] is to assume that an arbitraryedge attaches to node i with probability κdci , whereκ is a normalization constant. Then, Nij = κ2dcid

cj .

Setting∑i,j Nij = 2mc, we can solve for κ and obtain

Nij = dcidcj/(2m

c). The optimal division of verticesshould maximize the modularity measure (15).

3) Refined modularity: We introduce two refinementsto the modularity measure to make it suitable for botnetdetection. First, intuitively, bots should have strong inter-actions with pivotal nodes and normal nodes should haveweaker interactions. We want to maximize the difference.To capture the amount of interaction between node i andpivotal nodes define

ri = (1/|A|)∑|A|k=1

∑j∈N ejG

ijk . (16)

We refer to ri as pivotal interaction measure of nodei. Then

∑irisi quantifies the difference between the

pivotal interaction measure of bots and that of normalnodes. A natural extension of the modularity measure isto include an additional term to maximize

∑irisi.

Second, the modularity measure is criticized to sufferfrom low resolution, i.e., it favors large communitiesand ignores small ones [20]. The botnet, however, couldpossibly be small. To address this issue, we introduce aregularization term for the size of botnets. Notice that∑i1(si = 1) =

∑i(si + 1)/2 is the number of detected

bots. Thus, our refined modularity measure is

Qd(s) =1

2mc

∑i,j∈V

(Aij −

dcidcj

2mc

)sisj

+ w1

∑irisi − w2

∑i

si + 12

, (17)



7

where w1 and w2 are appropriate weights.The two modifications also influence the results of

isolated nodes with degree 0, which possibly exist inSCGs. By Def. 5, node i becomes isolated if σ (Xi) = 0or its correlations with other nodes are small enough. Theplacement of isolated nodes, however, does not influencethe traditional modularity measure, resulting in arbitrarycommunity detection results [18]. This limitation is ad-dressed by the two additional terms. If node i is isolatedand ri = 0, then si = −1 in the solution because of theregularization term w2

∑i(si+ 1)/2. On the contrary, if

ri is large enough, si = 1 in the solution because of theterm w1

∑i risi.

4) Relaxation of the optimization problem: Themodularity-maximization problem has been shown tobe NP-complete [21]. The existing algorithms can bebroadly categorized into two types: (i) heuristic methodsthat solve this problem directly [22], and (ii) mathemat-ical programming methods that relax it into an easierproblem first [21]. We follow the second route becauseit is more rigorous.

We define the modularity matrix M = {Mij}ni,j=1,where Mij = Aij/(2mc) − dcid

cj/(2m

c)2. Let s =(s1, . . . , sn) and r = (r1, . . . , rn); then the modularity-maximization problem becomes

f∗ = max s′Ms +(w1r′ −

w2

21′)s (18)

s.t. s2i = 1,

where 1 is the vector of all ones and f∗ denotes theoptimal value. (18) is not necessarily convex. Letting

S = ss′, q0 = w1r− (w2/2)1,

noticing that s′Ms = Tr(SM), and by relaxing theconstraint S = ss′ into S − ss′ � 0, we can write thefollowing convex relaxation [23] of (18):

f∗relax = max Tr(SM) + q′0s (19)

s.t.[

S ss′ 1

]� 0,

Sii = 1, ∀i.

Let

X =[

S ss′ 1

], and W =

[M q0/2

q′0/2 0

];

then the problem can be compactly written as:

f∗relax = max Tr (XW) (20)s.t. X � 0,

Xii = 1, ∀i.

The problem above is a Semi-Definite Programming(SDP) problem and produces an upper bound on theoptimal value of the original problem (18). It is well

known that SDPs are polynomially solvable and manysolvers are available.

Theorem IV.1 Let χ−min be the smallest negative eigen-value of W, where χ−min = 0 when W � 0. Then,

2πf∗relax +

(1− 2

π

)χ−min(n+ 1) ≤ f∗ ≤ f∗relax.

Proof: See Appendix C.In the special case that W � 0, (18) becomes aMAXCUT problem and the lower bound degenerates tothe well-known bound for MAXCUT [11].

5) Randomization: Solving the SDP relaxation (19),we obtain an optimal solution (S∗, s∗) of (19) togetherwith bounds on f∗. However, this solution may not befeasible for (18). To generate feasible solutions we usea randomization technique.

Notice that by feasibility in (19), S∗ − s∗(s∗)′ � 0and it can be interpreted as a covariance matrix. Ifwe pick y as a Gaussian random vector with y ∼N(s∗,S∗ − s∗(s∗)′), then y will solve the non-convex(18) “on average” over this distribution. Thus, to gen-erate good feasible solutions of (18), we can sample yfrom N(s∗,S∗−s∗s∗

′) and project to the feasible set as

in y = sgn(y). Sampling a large number of points (e.g.,10,000) and keeping the one with the largest objectivevalue can produce near-optimal solutions.

V. EXPERIMENTAL RESULTS

Our experimental results include two parts. In the firstpart, we compare our method with the BotHunter method[5] on the CTU-13 botnet dataset [24]. In the second part,we apply our method to a dataset generated by mixing aDDoS attack traffic dataset with real-world backgroundtraffic. In this part, we mainly focus on the performanceof our botnet discovery approach.

A. Description of CTU-13 dataset

The CTU-13 is a dataset of botnet traffic that wascaptured in the Czech Technical University [24]. Itcontains 13 scenarios with various botnet types. Weconsider scenarios 1, 2, 6, 8, and 9 of the dataset.• Scenario 1 corresponds to an IRC-based botnet that

sent spam for almost 6.5 hours.• Scenario 2 (2.5-hours) is from the same botnet.• Scenario 6 is from a botnet that scanned SMTP

(Simple Mail Transfer Protocol) servers for 2 hoursand connected to several RDP (Remote DesktopProtocol) services. Different with Scenarios 1 and2, this botnet neither sent spam nor did it attack.It’s C&C server used a proprietary protocol.

• Scenario 8 is from a botnet that contacted a lotof Chinese C&C hosts and received large amounts



8

of encrypted data. The botnet also cracked thepasswords of machines during the 19-hour attack.

• Scenario 9 corresponds to a case where 10 localhosts were infected by a spamming botnet. Morethan 600 spams were sent over 5 hours.

For all the scenarios, the authors of the CTU-13dataset convert the captured pcap files to NetFlows andrelease the processed flows. The dataset contains ground-truth labels for flows as follows: flows from or to theinfected machines are labeled as “botnet”; flows fromor to non-infected machines are labeled as “normal”; allother flows are labeled as “background.”

B. Performance metrics for botnet detection

Typical performance metrics used in the literatureare mostly flow-based (e.g., detection and false alarmprobability for an anomalous flow) and do not usuallyaccount for the speed of detection. For botnets, we areinterested in performance metrics that account for falsealarms and miss-detections of bots but also recognizethat early detection is useful.

In this paper, we use the performance metrics pro-posed by [24], which are defined based on host IPsclassified either as bot or normal. Bot IPs are the IPsthat either send or receive at least one “botnet” flow.We divide the data into multiple time frames, indexedby t, which are only used to compute performancemetrics and are independent of the detection methods.For some metrics we will use a discount function (weuse α = 0.01)

c(t) = e−αt + 1, (21)

to account for the fact that early (small t) detections ormisses are more valuable than later ones. Let Nb(t) bethe number of bot IPs and Nn(t) the number of normalIPs in time frame t, respectively. We define:• TP (t) = TP (t)c(t)/Nb(t), where TP (t) is the

number of bot IPs correctly identified.• FN(t) = FN(t)c(t)/Nb(t), where FN(t) is the

number of bot IPs missed.• FP (t) = FP (t)/Nn(t), where FP (t) is the num-

ber of normal IPs reported as bot IPs.• TN(t) = TN(t)/Nn(t), where TN(t) is the num-

ber of normal IPs reported as normal,corresponding to true positives, false negatives, falsepositives, and true negatives, respectively.

Let now tTP =∑t TP (t), tFN =

∑t FN(t),

tFP =∑t FP (t), tTN =

∑t TN(t) and define

• FPR = tFPtTN+tFP ,

• Recall = tTPtTP+tFN ,

• Precision = tTPtTP+tFP ,

• F1-Measure = 2 Precision∗RecallPrecision+Recall ,• G-Measure =

√Precision ∗Recall.

The metrics above have similar semantic meanings withtheir classical counterparts [24].

C. Results on the CTU-13 dataset

The BotHunter method is based on a state-basedinfection sequence model and assumes the behavior ofa bot machine can be described by a state diagram. Itconsists of two stages. The first stage identifies warningsbased on anomaly and signature detection. In the secondstage, the warnings are tracked over a temporal window,each contributing to an overall infection sequence scorethat is maintained for each host.

For the evaluation, we used the GAD software pack-age we developed for evaluating anomaly detectionmethods [25], to which we added botnet evaluationfunctionalities. For the CTU-13 dataset we used ourflow-based anomaly detection coupled with our bot-net discovery approach as described in Sec. IV. Thetime frame size for evaluation is 5 minutes. For everyscenario, the first 25 minutes of the dataset have nobotnet traffic and are used for training. Each detectionwindow is 2 seconds. The threshold for constructing oursocial correlation graph is τρ = 10. The regularizationcoefficients for interactivity and community size (cf.(17)) are w1 = 1 and w2 = 2, respectively. The anomalydetection threshold ξ is 0.6 for Scenarios 1 and 6 and0.8 for the other scenarios.

Table I shows the comparison of our method withthe BotHunter method in the five scenarios. In general,our method has lower precision but higher recall. Therecall of BotHunter is very low, which is likely due totwo reasons: (1) the data was collected in the gatewayof an intranet and the bots inside the intranet mayhave been infected before the dataset was collected,and (2) information from external bots is not directlyobserved and is not sufficient for dialog analysis. Incontrast, our method tends to be more aggressive inreporting alarms since our botnet discovery is based onmodularity-based community detection, which is likelyto detect large communities. This is also the reason whywe add a regularization term for community size in (17)to alleviate this tendency, and we can increase the weightw2 to achieve higher granularity. In botnet detection,recall is more important than precision because missinginfected machines may cause serious security issueswhile false alarms are more tolerable.

There are several ways of combining precision and re-call to give an overall metric for algorithm performance.Table I lists two popular metrics: F1-measure and G-measure [24]. The F1-measure is the harmonic mean ofprecision and recall and the G-measure is the geometricmean of precision and recall. In Scenario 8, BotHunterfails to report any bot IP. For all other scenarios, our



9

Table ISUMMARY OF RESULTS ON THE CTU DATASET.

Scenario Method tTP tTN tFP tFN FPR Recall (TPR) Precision F1 Measure G Measure

1 Our Method 7.52 66.81 1.19 90.50 0.017 0.077 0.86 0.14 0.26BotHunter 1.59 73.80 0.18 109 0.0024 0.014 0.90 0.028 0.11

2 Our Method 3.77 44.03 0.97 77.64 0.022 0.046 0.80 0.088 0.19BotHunter 1.65 46.9 0.05 75 0.0011 0.022 0.97 0.042 0.14

6 Our Method 4.05 15.18 1.82 28.66 0.11 0.12 0.69 0.21 0.29BotHunter 2.53 20.9 0.02 37.3 0.00096 0.064 0.99 0.12 0.25

8 Our Method 13.42 219.0 9.010 292.00 0.040 0.044 0.60 0.082 0.16BotHunter 0 229 0.11 309 0.00048 0 0 - 0

9 Our Method 5.21 56.04 1.97 59.66 0.034 0.080 0.73 0.14 0.24BotHunter 1.76 57.9 0.06 86.9 0.001 0.02 0.97 0.039 0.14

method has better F1-measure and G-measure than theBotHunter method.

D. Random graph model selection in mixed models

This section presents results of our random graphmodel selection algorithm on a mixed model. It alsopresents a way to determine the penalty coefficient θ(cf. Eq. (10)) through experiments. We first define thefollowing random graph model. To generate a graphGmix = (Vmix, Emix) with this model, we partition Vmixinto two subsets VER and VPA. Then, graphs GER =(VER, EER) and GPA = (VPA, EPA) are generated bythe ER model and the PA model, respectively. The edgeset of Gmix is Emix = EER ∪ EPA.

We let |VER| = |VPA| and generate a sequenceof graphs using the random graph model describedabove. Taking some samples of node degrees D ={d1, . . . , d|D|} from the graphs, we can assume thatthe maximum likelihoods of D for the ER and thePA models are equal, namely, maxλ LER(D ;λ) =maxγ LPA(D ; γ). It follows from (10)

θ∑i

1(di = 0) = maxγ

L+PA(D ; γ)−max

λLER(D ;λ).

(22)We can use (22) to estimate θ approximately usingleast-squares. This approach generalizes to real-worlddatasets. We can ask experts to give estimates ofmaxγ LPA(D ; γ) − maxλ LER(D ;λ) for each D and∑i 1(di = 0) is available from the sample D . Again,

we can use least-squares to estimate θ.

E. Description of CAIDA dataset

Next we create a dataset by mixing real-world bot-net traffic with real-world background traffic. For thebotnet traffic, we use the CAIDA “DDoS Attack 2007”dataset [26]. It includes traces from a Distributed Denial-of-Service (DDoS) attack on August 4, 2007. The DDoSattack attempts to block access to the targeted server byconsuming computing resources on the server and all ofthe bandwidth connecting the server to the Internet.

The total size of the dataset is 21 GB covering aboutone hour (20:50:08 UTC to 21:56:16 UTC). This datasetonly contains attacking traffic to the victim; all othertraffic, including the C&C traffic, has been removed.The dataset consists of two parts. The first part is thetraffic when the botnet initiates the attack (between20:50 UTC and 21:13 UTC). In this stage, the botsprobe whether they can reach the victim. The amount oftraffic from the botnet during this period is small, thus,it is very challenging to detect it using only networkload. The second part is the attack traffic which startsaround 21:13 UTC when the network load increasesrapidly (within a few minutes) from about 100 Kb/s toabout 80 Mb/s. Although the DDoS attack itself (after21:13 UTC) is trivial to detect, we apply our approach toa 5-minute segment between 20:50 UTC and 21:13 UTC.This is traffic from the pre-attack stage when the botnetprobes the target. Botnet traffic during this period islow-intensity (about 100 Kb/s). A successful detectionduring this stage provides network administrators enoughtime (about 20 min) to take preventive actions, such as,blocking suspected bot IPs.

For the background traffic, we use trace 6 in theUniversity of Twente traffic traces data repository(simpleweb) [27]. This trace was measured in a 100Mb/s Ethernet link connecting a university unit to theInternet. This is a relatively small unit with around 35employees and a little over 100 students. There are100 workstations at this location which all have a 100Mb/s LAN connection. The core network consists of a1 Gb/s connection. The recordings took place betweenthe external optical fiber modem and the first firewall.The measured link was only mildly loaded during thisperiod. The background traffic we choose lasts for 3, 600seconds. The botnet traffic is mixed with backgroundtraffic between 2, 000 and 2, 300 seconds.

F. Results of network anomaly detection

We first divide the background traffic into 10-secondwindows and create a sequence Gbgrnd of 360 back-



10

ground interaction graphs. We apply the model selectiontechnique described in Section III-B1 to Gbgrnd.

We first randomly sample 50 interaction graphs fromGbgrnd and sample the degrees of 20 nodes in eachinteraction graph uniformly. As a result, our total sam-ple size is |D | = 1, 000. In our experiment, we setθ = 2.44. The MLE of the PA model is γ = 1.82 andits log likelihood LPA(D ; γ) = −3, 072.58. The MLEof the ER model is λ = 0.025 and its log-likelihoodLER(D ; λ) = −643.78. We use a threshold η = 0 inapplying the GLRT and assume no prior knowledge ofthe model. GLRT selects the ER model (cf. (11)).

Fig. 2-A shows the detection results. The blue “+”markers correspond to IER(µi; λ) for each windowi, i = 1, . . . , 360, where µi is the empirical degreedistribution of interaction graph i. The (red) dashed lineshows the threshold ξ = 0.18. There are 36 abnormalinteraction graphs, namely, |A| = 36. There are 30interaction graphs that have botnet traffic and 29 in-teraction graphs are correctly identified. The interactiongraph corresponding to the time range [2, 000, 2, 010]is missed. Being the start of the botnet traffic, thisrange has very low botnet activity, which may explainthe miss-detection. In addition, there are two groups offalse alarms—3 false alarms around 3,000 s and 4 falsealarms around 3,500 s. Fig. 2-B shows the ReceiverOperating Characteristic (ROC) curve of the networkanomaly detection stage.

0 500 1000 1500 2000 2500 3000 3500Time (s)

0.00.20.40.60.81.01.21.4

Div

erge

nce

A

0.0 0.2 0.4 0.6 0.8 1.0false positive rate.

0.00.20.40.60.81.0

true

pos

itive

rat

e.

B

Figure 2. (A): the rate function value IER(µi; λ) for each windowi. The x-axis shows the start time of each window. The backgroundtraffic lasts for 3,600 seconds and the botnet traffic is added between2,000 and 2,300 seconds. (B): the ROC curve. The x-axis plots thefalse positive rate and the y-axis plots the true positive rate.

G. Results of botnet discovery

The botnet discovery stage aims to identify bots basedon the information in A. The first step is to identify a set

of pivotal nodes N (cf. Def. 4 and Eq. (14)). Let emaxbe the maximum total interaction measure of all nodesand S Norm

e = {ei/emax : i = 1, . . . n} the normalizedset of total interaction measures. Fig. 3 plots S Norm

e

in descending order using log-scale for the y-axis. Eachblue “+” marker represents one node. The blue curve inFig. 3, being quite steep, clearly indicates the existenceof influential pivotal nodes. The red dashed line in Fig. 3plots the chosen threshold τ used in Def. 4, whichresults in 3 pivotal nodes. Only one pivotal node (theone with maximum total interaction measure) belongsto the botnet. The other two pivotal nodes are activenormal nodes. These two falsely detected pivotal nodescorrespond to the two false-alarm groups described inSection V-F.

0 50 100 150 200Sequence number

10-4

10-3

10-2

10-1

100

Amou

nt o

f int

erac

tion

(nor

mali

zed)

.

Figure 3. Sorted amount of interaction in A defined by (14). y-axisis in log-scale. The red dashed line plots the threshold τ we used.

Our dataset has 396 nodes, including 136 bots and260 normal nodes. Among the 396 nodes, only 213nodes have positive sample standard deviations σ(Xi).Let Vp = {i : σ(Xi) > 0} be the set of these nodes.Fig. 4 plots the correlation matrix of Xi, Xj for alli, j ∈ Vp. We can easily observe two correlation groups.

Figure 4. The correlation matrix of Xi, Xj for all i, j ∈ Vp (plottedby the pcolor command in the pylab python package).

We calculate the SCG C using Def. 5 and thresholdτp = 0.3. In C, there are 191 isolated nodes with degree



11

Figure 5. Comparison of different community detection techniques on the SCG. In A–C, red squares are bots and blue circles are normalnodes. In D–F, red squares indicate the group with the highest average pivotal interaction measure, while blue circles indicate the group withthe lowest one. (A): ground-truth communities of bots and normal nodes. (B): result of our botnet discovery approach. (C): result of the vectorprogramming method in [21]. (D): result of the walktrap method [28] with three communities. (E): result of the leading eigenvector method [19]with 3 communities. (F): result of the leading eigenvector method with five clusters.

zero. The subgraph formed by the remaining 205 nodeshas two connected components (Fig. 5-A). Fig. 5-A plotsnormal nodes as blue circles and bots as red squares.Although the bots and the normal nodes clearly belongto different communities, the two communities are notseparated in the narrowest part. Instead, the separatingline is closer to the bots.

We apply our botnet discovery method to C. The result(Fig. 5-B) is very close to the ground truth (Fig. 5-A). Ascomparison, we also apply other community-detectionmethods to the 205-node subgraph.

The first method is the vector programming methodproposed in [21], which is a special case of our methodusing w1 = 0 and w2 = 0 in (17). This approach,however, misses some bots (5-C).

The second method is the walktrap method in [28],which defines a distance metric between nodes basedon a random walk and applies hierarchical clustering.When the desirable number of communities, a requiredparameter, equals to two, the method reports the twoconnected components – a reasonable, yet uninformativeresult for botnet discovery. To make the results moremeaningful, we use walktrap to find three communitiesand ignore the smallest one that corresponds to thesmaller connected component (right triangles in Fig. 5-D). The community with higher mean of the pivotalinteraction measure is detected as the botnet, and the re-maining nodes are called normal. The walktrap methodseparates bots and normal nodes in the narrowest partof the graph, a reasonable result from the perspective of

community detection (Fig. 5-D). However, a comparisonwith the ground-truth reveals that a lot of normal nodesare falsely reported as bots.

The third method is Newman’s leading eigenvectormethod [19], a classical modularity-based communitydetection method. This method first calculates the eigen-vector corresponding to the second-largest eigenvalue ofthe modularity matrix M, namely the leading eigenvec-tor. The solution s = (s1, . . . , sn) is then constructed byletting si be the sign of the ith element of the leadingeigenvector. The method can be generalized for detect-ing multi-communities [19]. Similar to the walktrapmethod, the leading eigenvector method reports twoconnected components when the desirable communitynumber is two. We also use this method to find threecommunities and ignore the smallest one. Again, thecommunity with higher mean of pivotal interactionmeasure is detected as the botnet.

Different from the previous methods, the eigenvectormethod makes a completely wrong prediction of thebotnet. The community whose majority are bots (bluecircles in Fig. 5-E) is wrongly detected as the normalpart and the community formed by the remaining nodesis wrongly detected as the botnet. Despite being part ofthe real botnet, the community of blue circles in Fig. 5-Eactually has lower mean of pivotal interaction measure,i.e., less overall communication with pivotal nodes.

After dividing the SCG C into five communities usingthe leading vector approach for multi-communities [19],we observe that the botnet itself is heterogeneous and



12

divided into three groups. Both the group with thehighest mean of pivotal interaction measure (Group IIin Fig. 5-F) and the group with the lowest mean (GroupI in Fig. 5-F) are part of the botnet.

Because of this heterogeneity, some groups of thebotnet may be misclassified. The leading vector methodwrongly separates Group I from the rest as a singlecommunity, and merges Group II & IV with the normalnodes (Group III). Because Group I has the lowestpivotal interaction measure, it is wrongly detected asnormal, causing Groups II, III, IV to be detected asthe botnet. Similarly, the vector programming methodwrongly detects a lot of nodes in Group II, which shouldbe bots, as normal nodes.

By taking the pivotal interaction measure into con-sideration, the misclassification can be avoided. In ourformulation of the refined modularity (17), the termw1

∑i risi maximizes the difference of the pivotal in-

teraction measure of the botnet and that of the normalpart. Owing to this term, our method makes few mistakesfor nodes in Group II since they have high pivotalinteraction measures. Let S −r = {ri : si = −1} and

Table IISTATISTICS ON THE PIVOTAL INTERACTION MEASURES

Normal Mean Bot Mean DifferenceGround Truth 0.0021 0.024 0.0219Our Method 0.0017 0.024 0.0223

Walktrap 0.0019 0.021 0.01913-LEV 0.011 0.018 0.0078

S +r = {ri : si = 1} be the set of pivotal interaction

measures for normal nodes and bots. Table II shows themean of S −r (Col. 1) and S +

r (Col. 2) for groundtruth (Row 1), our method (Row 2), the walktrapmethod (Row 3), and the leading eigenvector methodwith 3 communities (3-LEV, Row 4), respectively. Thedifference between the mean of S +

r and S −r (Col. 3of Table II) of 3-LEV, whose result is unreasonable,is significantly smaller than the rest of the methods. Incomparison, the difference in our method is much largerand closer to the ground truth.

VI. CONCLUSIONS

In this paper, we propose a novel method of botnetdetection that consists of two stages. The first stageapplies a sliding window to network traffic and monitorsanomalies in the network. We propose two anomalydetection methods, both of which are based on largedeviations results, for flow and packet level data, respec-tively. For both anomaly detection methods, an anomalycan be represented as a set of interaction records.

Once instances of anomalies have been identified, weproposed a method for detecting the compromised nodes.

This is based on ideas from community detection in so-cial networks. However, we devised a refined modularitymeasure that is suitable for botnet detection. The refinedmodularity also addresses some limitations of modularityby adding regularization terms and combining informa-tion of pivotal interaction measure and SCGs.

APPENDIX ALDP FOR THE DEGREE DISTRIBUTION IN THE PA

MODEL

In the generalized Polya urn model, let hni (j) de-note the number of urns with i balls at time j.For 0 ≤ j ≤ n and d > 0, define hn,d(j) =(hn0 (j), . . . , hnd (j), hnd+1(j)) to be the d-truncated degreedistribution at time j, where hnd+1(j) =

∑k≥d+1 h

nk (j).

H n,d ={hn,d(0), . . . ,hn,d(n)

}is a Markov chain

with initial state hn,d(0) corresponding to the initialurn configuration U0. We interpolate H n,d/n into acontinuous process X n,d =

{xn,d(t) : 0 ≤ t ≤ 1

}as:

xn,d(t) = (1/n)hn,d(bntc)+ (t− bntc/n)

(hn,d(bntc+ 1)− hn,d(bntc)

).

We can extend this definition into an infinite-dimensionalprocess X n,∞ = {xn,∞(t) : 0 ≤ t ≤ 1} by removingthe truncation.

Suppose we can observe an empirical d-truncateddegree distribution ϕ(t) = (ϕ0(t), . . . , ϕd(t), ϕd+1(t)),where ϕd+1(t) = 1 −

∑dk=0 ϕk(t). Define ϕ(t) =

(ϕ0(t), . . . , ϕd(t), ˙ϕd+1(t)) as the time derivative ofϕ(t). For d ≥ 0, [13] establishes a sample-path LDP forX n,d with rate function (recall the [·]i notation; Sec. I)

Id(ϕ)=� 1

0

(1− [ϕ(t)]0) log1− [ϕ(t)]0

p(t) + (1− p(t)) β(t)ϕ0(t)(1+β(t))t

+d∑i=1

(1− [ϕ(t)]i) log1− [ϕ(t)]i

(1− p(t)) (i+β(t))ϕi(t)(1+β(t))t

+(

1−∑di=0(1− [ϕ(t)]i)

)×

log1−

∑di=0(1− [ϕ(t)]i)

(1− p(t))(

1−Pd

i=0(i+β(t))ϕi(t)

(1+β(t))t

) dt.

(23)

Similarly, X n,∞ satisfies a sample-path LDP with ratefunction

I∞(ϕ) = limd→∞

Id(ϕ). (24)

In applying this result, we assume that the observednetwork is generated by a process that evolves withconstant rate, i.e., ϕ(t) = µt. It follows ϕ(t) = µ. Therate functions in (4) and (5) are special cases of (24).



13

APPENDIX BRATE FUNCTION FOR A MODEL THAT SELECTS

BETWEEN THE OFFSET BA AND THE CHJ

We give an outline of the key argument. Supposewe build a random graph by selecting with probabil-ity πBA the offset BA process and with probabilityπCHJ = 1−πBA the CHJ process. Recall our definitionof the d-truncated degree distribution process X n,d ={xn,d(t) : 0 ≤ t ≤ 1

}and let X n,d

PA , X n,dBA , and X n,d

CHJ

be the corresponding processes for the generic PA model,the offset BA model, or the CHJ model, respectively. Forany closed set B of trajectories

P (X n,dPA ∈ B) ≤ P (X n,d

BA ∈ B) + P (X n,dCHJ ∈ B),

and since the two terms on the rhs satisfy LDPs withrate functions IBA(µ;α) and ICHJ(µ; p) respectively,we obtain

limn→∞

sup1n

logP (X n,dPA ∈ B) ≤ inf

{µ:X n,d(µ)∈B}IPA(µ, γ),

(25)where X n,d(µ) are trajectories that evolve at a constantrate µ.

To establish a lower bound note that

P (X n,dPA ∈ B) ≥ P (X n,d

BA ∈ B)

and P (X n,dPA ∈ B) ≥ P (X n,d

CHJ ∈ B).

It follows that the LDP lower bound for P (X n,dPA ∈ B)

holds with rate function IPA(µ, γ).

APPENDIX CPROOF OF THEOREM IV.1

Proof: Letting x = (s, 1), problem (18) becomes:

f∗ = max x′Wx (26)

s.t. x2i = 1, i = 1, . . . , n+ 1.

W is a (n+1)×(n+1) symmetric matrix, not necessarily� 0, yet, its smallest eigenvalue χ−min is real. Eq. (20) isan SDP relaxation of (26). Suppose now P = W + %Iis a regularized matrix. P � 0 if

% ≥ −χ−min. (27)

Consider the regularized problem

f∗reg = max x′Px (28)

s.t. x2i = 1, i = 1, . . . , n+ 1.

Note that the regularized term %x′Ix is a constant forany feasible solution because x′Ix =

∑i x

2i = n + 1.

As a result, (26) and (28) should have the same optimalsolution x∗ but different objective values. Then,

f∗reg = f∗ + %(n+ 1). (29)

By setting X = xx′ the SDP relaxation of (28) is:

max Tr(XP) (30)s.t. X � 0,

Xii = 1, i = 1, . . . , n+ 1.

Suppose X∗ is an optimal solution of (30); then

2π

Tr (X∗P) ≤ f∗reg ≤ Tr (X∗P) (31)

if (27) is satisfied. The upper bound is due to therelaxation and the lower bound follows from Nesterov’sextension to the MAXCUT bound [11]. Moreover,

Tr (X∗P) = Tr (X∗W) + %Tr (X∗)= f∗relax + % (n+ 1) . (32)

Combining (29), (31), and (32), we have

2π

(f∗relax + %(n+ 1)) ≤ f∗ + %(n+ 1) (33)

for % ≥ −χ−min. The lower bound in Th. IV.1 can beproved by letting % take its minimum value −χ−min andreorganizing the terms in (33). The upper bound inTh. IV.1 is a standard property of SDP relaxation.

REFERENCES

[1] “DDoS Protection Whitepaper,” 2012, http://www.neustar.biz/enterprise/resources/ddos-protection/ddos-attacks-survey-whitepaper#.UtwNR7Uo70o.

[2] W. T. Strayer, R. Walsh, C. Livadas, and D. Lapsley, “Detectingbotnets with tight command and control,” in Local ComputerNetworks, Proceedings 2006 31st IEEE Conference on. IEEE,2006, pp. 195–202.

[3] G. Gu, J. Zhang, and W. Lee, “BotSniffer: detecting botnet com-mand and control channels in network traffic,” in Proceedingsof the 15th Annual Network and Distributed System SecuritySymposium, 2008.

[4] G. Stringhini, T. Holz, B. Stone-Gross, C. Kruegel, and G. Vigna,“Botmagnifier: Locating spambots on the internet.” in USENIXSecurity Symposium, 2011.

[5] G. Gu, P. A. Porras, V. Yegneswaran, M. W. Fong, and W. Lee,“Bothunter: Detecting malware infection through ids-driven dia-log correlation.” in Usenix Security, vol. 7, 2007, pp. 1–16.

[6] A. Dembo and O. Zeitouni, Large Deviations Techniques andApplications, 2nd ed. NY: Springer-Verlag, 1998.

[7] I. C. Paschalidis and G. Smaragdakis, “Spatio-temporal networkanomaly detection by assessing deviations of empirical mea-sures,” IEEE/ACM Trans. Networking, vol. 17, no. 3, pp. 685–697, 2009.

[8] J. Wang and I. C. Paschalidis, “Statistical traffic anomaly detec-tion in time-varying communication networks,” IEEE Transac-tions on Control of Network Systems, vol. 2, no. 2, pp. 100–111,2015.

[9] J. Wang, D. Rossell, C. G. Cassandras, and I. C. Paschalidis,“Network anomaly detection: A survey and comparative analysisof stochastic and deterministic methods,” in Proceedings of the52nd IEEE Conference on Decision and Control, Florence, Italy,December 2013, pp. 182–187.

[10] J. Wang and I. C. Paschalidis, “Botnet detection using socialgraph analysis,” in 52nd Annual Allerton Conference on Commu-nication, Control, and Computing, Monticello, Illinois, October2014.

[11] A. Ben-Tal and A. Nemirovski, Lectures on Modern ConvexOptimization. SIAM, 2001.



14

[12] S. Mukherjee, “Large deviation for the empirical degree distribu-tion of an Erdos-Renyi graph,” arXiv preprint arXiv:1310.4160,pp. 1–23, 2013.

[13] J. Choi and S. Sethuraman, “Large deviations for the degreestructure in preferential attachment schemes,” The Annals ofApplied Probability, vol. 23, no. 2, pp. 722–763, 2013.

[14] A. Collevecchio, C. Cotar, and M. LiCalzi, “On a preferentialattachment and generalized Polya’s urn model,” The Annals ofApplied Probability, vol. 23, no. 3, pp. 1219–1253, Jun. 2013.

[15] F. Chung, S. Handjani, and D. Jungreis, “Generalizations ofPolya’s urn Problem,” Annals of Combinatorics, vol. 7, no. 2,pp. 141–153, Jul. 2003.

[16] I. C. Paschalidis and D. Guo, “Robust and distributed stochasticlocalization in sensor networks: Theory and experimental results,”ACM Trans. Sensor Networks, vol. 5, no. 4, pp. 34:1–34:22, 2009.

[17] D. Shah and T. Zaman, “Community detection in networks: Theleader-follower algorithm,” arXiv preprint arXiv:1011.0774, pp.1–13, 2010. [Online]. Available: http://arxiv.org/abs/1011.0774

[18] M. Newman, “Detecting community structure in networks,” TheEuropean Physical Journal B - Condensed Matter, vol. 38, no. 2,pp. 321–330, Mar. 2004.

[19] ——, “Finding community structure in networks using the eigen-vectors of matrices,” Physical review E, 2006.

[20] S. Fortunato and M. Barthelemy, “Resolution limit in communitydetection,” Proceedings of the National Academy of Sciences ofthe United States of America, pp. 1–8, 2007.

[21] G. Agarwal and D. Kempe, “Modularity-maximizing graph com-munities via mathematical programming,” The European Physi-cal Journal B, vol. 66, no. 3, pp. 409–418, Nov. 2008.

[22] J. Duch and A. Arenas, “Community detection in complexnetworks using extremal optimization,” Physical review E, no. 2,pp. 2–5, 2005.

[23] A. D’Aspremont and S. Boyd, “Relaxations and randomizedmethods for nonconvex QCQPs,” EE392o Class Notes, StanfordUniversity, no. 1, pp. 1–16, 2003.

[24] S. Garcia, M. Grill, J. Stiborek, and A. Zunino, “An empiricalcomparison of botnet detection methods,” Computers & Security,vol. 45, pp. 100–123, 2014.

[25] J. Wang, J. Zhang, and I. C. Paschalidis, “General AnomalyDetector (GAD),” https://github.com/hbhzwj/GAD, 2014.

[26] “The CAIDA UCSD ”DDoS Attack 2007” Dataset,” CAIDA,2013, http://www.caida.org/data/passive/ddos-20070804 dataset.xml.

[27] R. R. R. Barbosa, R. Sadre, A. Pras, and R. van de Meent,“Simpleweb/university of Twente traffic traces data reposi-tory,” http://eprints.eemcs.utwente.nl/17829/, Technical ReportTR-CTIT-10-19, April 2010.

[28] P. Pons and M. Latapy, “Computing communities in large net-works using random walks,” Computer and Information Sciences-ISCIS, 2005.

Jing (Conan) Wang received the Ph.D. de-gree in Systems Engineering from BostonUniversity, Boston, MA, USA in 2015, andthe B.E. degree in Electrical and InformationEngineering from Huazhong University ofScience and Technology, China, in 2010. Heis now a software engineer at Google Inc.His research interests include interest-basedadvertising, cyber security, and approximatedynamic programming.

Ioannis Ch. Paschalidis (M’96–SM’06–F’14) received the M.S. and Ph.D. degreesboth in electrical engineering and computerscience from the Massachusetts Institute ofTechnology (MIT), Cambridge, MA, USA,in 1993 and 1996, respectively. In September1996 he joined Boston University where hehas been ever since. He is a Professor andDistinguished Faculty Fellow at Boston Uni-versity with appointments in the Departmentof Electrical and Computer Engineering, the

Division of Systems Engineering, and the Department of BiomedicalEngineering. He is the Director of the Center for Information andSystems Engineering (CISE). He has held visiting appointments withMIT and Columbia University, New York, NY, USA. His currentresearch interests lie in the fields of systems and control, networking,applied probability, optimization, operations research, computationalbiology, and medical informatics.

Dr. Paschalidis is a recipient of the NSF CAREER award (2000),several best paper and best algorithmic performance awards, and a2014 IBM/IEEE Smarter Planet Challenge Award. He was an invitedparticipant at the 2002 Frontiers of Engineering Symposium, organizedby the U.S. National Academy of Engineering and the 2014 U.S.National Academies Keck Futures Initiative (NAFKI) Conference. Heis the inaugural Editor-in-Chief of the IEEE Transactions on Controlof Network Systems.



botnet detection based on anomaly and …sites.bu.edu/paschalidis/files/2016/07/botnets-tcns...1...

Documents