
DCNET 2011
OPTICS 2011

Proceedings of the
International Conference on Data Communication Networking
and
International Conference on Optical Communication System

Seville, Spain

18 - 21 July, 2011

Sponsored by
INSTICC – Institute for Systems and Technologies of Information, Control and Communication

In Collaboration with
University of Seville
ETSII – University of Seville

Technical Co-sponsored by
IEEE – Institute of Electrical and Electronics Engineers
IEEE System Council

In Cooperation with
IEICE SWIM – Institute of Electronics, Information and Communication Engineers / Special Group on Software Interprise Modelling
ATI – Asociación de Técnicos de Informática
CEPIS – Council of European Professional Informatics Societies
FIDETIA – Fundación para la Investigación y el Desarrollo de las Tecnologías de la Información en Andalucía
INES – Iniciativa Española de Software y Servicios

Copyright © 2011 SciTePress – Science and Technology Publications
All rights reserved

Edited by Mohammad S. Obaidat, Jose Luis Sevillano and Eusebi Calle Ortega

Printed in Portugal

ISBN: 978-989-8425-69-0

Depósito Legal: 330151/11

http://www.dcnet.icete.org/


http://www.optics.icete.org/


MULTIPLE VECTOR CLASSIFICATION FOR P2P TRAFFIC IDENTIFICATION

F. J. Salcedo-Campos, J. E. Díaz-Verdejo and P. García-Teodoro
CITIC, Dpt. of Signal Theory, Telematics and Communications, University of Granada, Granada, Spain

{fjsalc, jedv, pgteodor}@ugr.es

Keywords: P2P identification, Feature extraction, Flow parameterization, Multiple vector classification.

Abstract: The identification of P2P traffic has become a principal concern for the research community in recent years. Although several P2P traffic identification proposals can be found in the specialized literature, the problem still persists, mainly due to obfuscation and privacy matters. This paper presents a flow-based P2P traffic identification scheme based on a multiple classification procedure. First, every monitored traffic flow is parameterized by using three different groups of features: time related features, data transfer features and signalling features. After that, a flow identification process is performed for each group of features. Finally, a global identification procedure is carried out by combining the three individual classifications. Promising experimental results have been obtained by using a basic KNN scheme as the classifier. These results provide some insights into the relevance of the groups of features considered and demonstrate the validity of our approach to identify P2P traffic in a reliable way, while avoiding content inspection.

1 INTRODUCTION

The wide expansion and increasing popularity of P2P networks and applications raise relevant concerns in both traffic engineering and network security. On the one hand, the intensive use of network resources commonly associated with P2P activities is a challenge for ISPs, which must handle this high volume of traffic with minimal impact on the normal behaviour of the network while keeping costs under control. On the other hand, the ability to communicate and exchange any kind of information between the so-called peers, most of them anonymous, represents a security risk. This risk comes, first, from the content of the exchanged information or files, which is a hazard for users through the propagation and execution of viruses, worms, and malware in general. Second, from the perspective of the network infrastructure, P2P applications are an attack vector that can be used to support other harmful activities such as DoS attacks, botnets, and so on.

In this context, the need to differentiate P2P traffic from any other kind of traffic is clear. This is the so-called P2P traffic identification problem, which is a specific topic within the more general one of network traffic identification (Callado, 2009). Three main issues arise in this respect:

• Traffic Parameterization. Several features have been proposed in the literature to represent network traffic in order to subsequently classify the observed events as belonging to different classes. The data used range from the reports of SNMP routers concerning session (connection) statistics (Sen, 2004a) (lower granularity) to TCP headers including the signaling bits and the first bytes of the payload (higher granularity) (Madhukar, 2006). Most approaches just make use of the headers of RST, SYN and ACK TCP packets, under the assumption that the signaling phase is relevant for P2P protocols. However, current research is evolving towards the use of the overall transport headers for every communication. In some cases, a multiple characterization is proposed. For example, (Chen, 2010) considers the number of ARP packets, the average speed, the average packet size, the proportion of TCP/UDP traffic (IP mode) and the duration (IP-port mode) for a communication.

• Identification Level. Once the traffic is parameterized, three different levels are considered in the literature to carry out the identification process (Keralapura, 2010) (Callado, 2009): node-based identification, flow-based identification and packet-based identification. In the first case, the objective is to detect those nodes generating P2P traffic (Xuan-min, 2010). In the flow-based case, the goal is to classify each flow as P2P or otherwise. Finally, in the packet-based case, the objective is to classify each individual packet. These identification levels are often mixed in different ways. For example, nodes generating P2P traffic can be detected by finding at least one P2P packet generated by the node. Other approaches, like (Keralapura, 2010), use a layered methodology, first detecting nodes and then refining the results to classify the generated flows and, consequently, the associated packets.

• Identification Process. Finally, the schemes used to perform the identification itself cover a broad range of techniques, from simple heuristics or indicators (Liu, 2010) (Callado, 2009) (JinSong, 2007) to complex data mining or pattern learning algorithms (Soysal, 2010) (Keralapura, 2010) (Fontenelle, 2007) (Erman, 2007). Moreover, some papers propose combining several classifiers in a multilayer structure (Yiran, 2010) or in an independent way (Callado, 2010).

Regarding the aforementioned aspects, this paper presents a novel approach for P2P traffic identification, with the following characteristics:

1. A flow-based approach is considered at the identification level. For that, the source and destination ports as well as the source and destination IP addresses, together with the transport layer protocol, are used to group individual packets into flows.

2. After that, three main groups of features are obtained to represent every flow at the traffic parameterization stage: transfer related parameters (e.g., packet size), signalling parameters (e.g., number of SYN and ACK TCP packets), and time related parameters (e.g., packet inter-arrival time). This way, a flow is represented by a tuple of three vectors, each one corresponding to a group of features, and some additional parameters related to the flow, such as the ports or the direction of the first packet. It is important to highlight that none of the features are payload-based.

3. Finally, a triple classification procedure is performed, with one classification for each feature vector. By combining the results of the three classifiers, a final decision is taken to identify a flow as P2P or otherwise. This approach is named MVC, from Multiple Vector Classification.

Items 2) and 3) constitute the specific contributions of this work, which are presented throughout the paper as follows. Section 2 describes the general evaluation framework used in the experiments. Section 3 details the feature extraction process used to detect and represent each traffic flow. After that, a simple KNN-based classifier is used in Section 4 to implement the MVC identification process, from which some experimental results are obtained. Despite the simplicity of the detector used, the results obtained show that our approach identifies P2P traffic reliably and, above all, without requiring payload inspection. Finally, the main contributions of this work and some future research lines are discussed in Section 5.

2 TESTBED

The assessment of identification methods requires a database of properly classified traffic. This database is used as the reference to determine the correctness of the results obtained, thus being the "ground truth", and should contain enough data to be representative. Nevertheless, obtaining a big enough database of labeled traffic is not an easy task, as a manual labeling process is not affordable. Furthermore, the data should be captured in a real network without introducing any artifact, which rules out other approaches such as injecting known traffic.

Therefore, to evaluate the proposed system we have developed an experimental setup built from two main components: a database of real traffic captured in an academic network, and a tool that automatically classifies packets and flows according to their payloads by using Deep Packet Inspection (DPI). This way, the "ground truth" is built by analyzing and identifying each flow and packet with this tool, under the assumption that DPI is the best currently available method for this purpose and that the number of errors is negligible. This is a common approach in the traffic identification field; its major limitation is the number of packets and flows that DPI is not able to classify.

2.1 DPI Tool

The tool of choice for the classification of traffic is openDPI (OpenDPI, 2011), which is derived from the commercial PACE product from ipoque. The core of openDPI is a software library designed to classify internet traffic according to application protocols. In (Mochalski, 2009) the authors explain that the DPI-based protocol and application classification is achieved using a number of different techniques:

• Pattern matching, by scanning for strings or generic bit and byte patterns anywhere in the packet, including the payload portion. This way, DPI searches for signatures of known protocols.

• Behavioral analysis, by searching for known behavioral patterns of an application in the monitored traffic. The data used include absolute and relative packet sizes, per-flow data and packet rates, number of flows and new flow rate per application.

• Statistical analysis, by calculating statistical indicators that can be used to identify transmission types, such as the mean, median and variation of the values used in behavioral analysis, and the entropy of a flow.

Therefore, openDPI is not a pure DPI product, as it is not only signature-based but also incorporates information from other sources. This way, the classification accuracy is improved (no false classification, according to ipoque's claims), although some packets and flows still remain unclassified. This, together with the availability and quality of the signatures, led us to select openDPI instead of any other similar product.

For the purposes of our work, we have built a tool based on the openDPI library which is able not only to identify the application protocols but also to follow and differentiate the packets in each flow¹. This way, two classifications are provided: flow-based (each flow is labeled) and packet-based (each packet is also labeled). The tool operates in batch mode and, once the protocol of a flow is known, all the unknown packets in that flow are relabeled as belonging to the identified protocol.

2.2 Traffic Datasets and "Ground Truth"

The traffic database contains data captured during 3 working days for various nodes in a university network. The data acquisition was carried out at a border router in order to monitor all incoming and outgoing traffic for those nodes. Therefore, apart from the boundaries of the capture, flows are captured complete and in both directions. Two datasets, S1 and S2, with different groups of nodes are considered in order to test and validate the method. Table 1 highlights some figures of the database.

¹ To be able to handle UDP packets, we have generalized the concept of flow through the use of sessions. Sessions are considered as defined by the exchange of information associated to a tuple (IP addresses, ports and transport protocol). If the traffic is TCP, a session can be identified as a TCP flow under the assumption that the number of ephemeral ports used by a given IP entity is not greater than 65535 during the observed period. Nevertheless, throughout this paper, we will use the term flow to refer to a session.

[Figure 1: Relative number of instances (flows and packets) of protocols identified in the traffic database. Axis ticks: 1, 100, 10000, 1000000. Series: FLOWS, PACKETS_FL. Protocols shown: MMS, QuickTime, RealMedia, SNMP, RTP, Oscar, SSH, NTP, SIP, iMESH, IRC, Windowsmedia, Yahoo, MPEG, MySQL, STUN, SMB, DirectDownloadLi…, Flash, NETBIOS, FTP, MSN, Mail_POP, Gnutella, Mail_SMTP, SSL, ICMP, BitTorrent, DNS, HTTP, unknown.]

The results provided by the openDPI tool for the database show a total of 30 different protocols (plus 'unknown'), as shown in Fig. 1. HTTP is the most used protocol, while the relative proportion of P2P protocols is lower than expected (only 5-12% of flows). A more detailed analysis of the data shows that only a small number of nodes generate P2P traffic, and that video streaming is an important contributor to HTTP traffic (e.g. Youtube traffic).

Most of the P2P flows from these nodes are BitTorrent, while Gnutella and others are present in a lower proportion. The rest of the flows (non-P2P) mostly belong to common protocols such as HTTP, DNS, SSL, and mail protocols. The P2P/non-P2P traffic ratio is similar between both sets (see Table 1).

Table 1: Traffic used for classification experiments.

                 Flows
Set     Total    Labeled   P2P flows   Non-P2P flows   Unknown
S1      70797    33524     8091        25433           37273
S2      107860   22645     16005       6640            85215
Total   178657   56169     24096       76475           122488

The set S1 will be used to evaluate and tune the system, and S2 to validate the results. In order to increase the confidence of the testing stage, 10-fold stratified cross-validation is used, that is, the observed flows of S1 were partitioned into 10 random parts with the same number of flows, each presenting the same P2P/non-P2P ratio (Kohavi, 1995). A leave-one-out procedure is applied over the partitions, taking 9 partitions with correctly labeled examples to train the system and the remaining partition to evaluate it. Thus, 10 different configurations of the partitions are used for the experiments, and the results are averaged over the whole set of experiments. The flows were randomly assigned to partitions in order to enhance the confidence in the results.
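The partitioning procedure can be illustrated with a minimal Python sketch. It is not the authors' tooling: the function names `stratified_folds` and `train_test_splits`, the fixed seed and the toy label list are assumptions made only for illustration. Each class is shuffled and dealt round-robin over the folds, so every fold keeps roughly the same P2P/non-P2P ratio, and each fold is held out once.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds with roughly equal class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin over the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    for held_out in range(len(folds)):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


# Toy data (hypothetical): 30 P2P flows and 70 non-P2P flows.
labels = ["p2p"] * 30 + ["other"] * 70
folds = stratified_folds(labels, k=10)
splits = list(train_test_splits(folds))
```

With these toy labels every fold holds 3 P2P and 7 non-P2P flows, and each of the 10 splits trains on 90 flows and evaluates on the remaining 10.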

2.3 Performance Indicators

In order to compare the results, three well-known measures in the field of classification systems are considered (Gomez, 2002): percentage of true positives (TP), percentage of false positives (FP) and classification accuracy (CA).

Let $N_{p2p}$ and $N_{other}$ denote the total number of P2P and non-P2P flows, respectively. Consider $n_{X \to Y}$ as the number of flows in category X (non-P2P, other, or P2P, p2p) that are classified as belonging to category Y (other or p2p). The previous measures can be defined in our environment as follows:

1. TP, True Positives. The percentage of P2P flows correctly classified as P2P in relation to the total number of P2P flows.

$$TP = \frac{n_{p2p \to p2p}}{N_{p2p}} \cdot 100 \qquad (1)$$

2. FP, False Positives. The percentage of non-P2P flows mistakenly classified as P2P in relation to the total number of non-P2P flows.

$$FP = \frac{n_{other \to p2p}}{N_{other}} \cdot 100 \qquad (2)$$

3. CA, Classification Accuracy. The percentage of flows correctly classified in relation to the total number of flows.

$$CA = \frac{n_{p2p \to p2p} + n_{other \to other}}{N_{p2p} + N_{other}} \cdot 100 \qquad (3)$$

The ideal system should achieve 100% CA with 100% TP and 0% FP. In order to facilitate the data representation and analysis we use True Negatives (TN), a measure equivalent to FP. It represents the percentage of non-P2P flows correctly classified as non-P2P in relation to the total number of non-P2P flows, and it can be calculated directly from the FP rate as TN = (100 − FP).
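As a small worked illustration of measures (1) to (3) and of TN, the following Python sketch computes them from two label lists. The function name `classification_rates`, the label strings and the toy data are assumptions for illustration only; the sketch also assumes that both classes are present, otherwise a division by zero occurs.

```python
def classification_rates(true_labels, predicted_labels, positive="p2p"):
    """Return TP, FP, CA and TN as percentages for a two-class problem."""
    n_pos = n_neg = hits_pos = false_pos = hits_neg = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            n_pos += 1
            if predicted == positive:
                hits_pos += 1          # n_{p2p -> p2p}
        else:
            n_neg += 1
            if predicted == positive:
                false_pos += 1         # n_{other -> p2p}
            else:
                hits_neg += 1          # n_{other -> other}
    tp = 100.0 * hits_pos / n_pos
    fp = 100.0 * false_pos / n_neg
    ca = 100.0 * (hits_pos + hits_neg) / (n_pos + n_neg)
    return {"TP": tp, "FP": fp, "CA": ca, "TN": 100.0 - fp}


# Toy example (hypothetical): 4 P2P flows and 6 non-P2P flows.
truth = ["p2p"] * 4 + ["other"] * 6
predicted = ["p2p", "p2p", "p2p", "other",
             "p2p", "other", "other", "other", "other", "other"]
result = classification_rates(truth, predicted)
```

In this toy case 3 of 4 P2P flows are detected (TP = 75%), 1 of 6 non-P2P flows is misclassified (FP ≈ 16.7%, TN ≈ 83.3%), and 8 of 10 flows are correct (CA = 80%).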

3 FLOW PARAMETERIZATION

The output of the openDPI tool is a list of the flows found along with their classifications, a list of packets also with their classifications, and a pairing list relating flows and the packets in each flow. From this information, a parameterization process is applied to obtain a feature vector with 61 components for each flow, as shown in Table 2. The vectors contain all the information required for their further processing, including an identification label (FLOW_ID), the protocol as detected by openDPI and some basic information regarding the flow (flow tuple). The IP addresses in a flow are ordered by considering them as integers (using network representation) and, therefore, two directions for the packets are considered in the parameterization: UP, for packets traveling from IP_LOW to IP_UPPER, and DOWN for the opposite direction.
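The ordering of the two endpoints and the resulting packet direction can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `flow_key_and_direction` and the layout of the returned key are hypothetical, and using the port as a tie-breaker when both IP addresses are equal is an assumption that the text does not cover.

```python
import ipaddress


def flow_key_and_direction(src_ip, src_port, dst_ip, dst_port, proto):
    """Return ((ip_low, ip_upper, port1, port2, proto), direction) for a packet."""
    # Compare the addresses as integers, not as strings:
    # "10.0.0.10" sorts before "10.0.0.2" as text but after it numerically.
    src = (int(ipaddress.ip_address(src_ip)), src_port)
    dst = (int(ipaddress.ip_address(dst_ip)), dst_port)
    if src <= dst:
        # Packet travels from the lower to the upper address: UP.
        return (src_ip, dst_ip, src_port, dst_port, proto), "UP"
    # Packet travels from the upper to the lower address: DOWN.
    return (dst_ip, src_ip, dst_port, src_port, proto), "DOWN"


# Both directions of the same exchange map to one flow key.
key_a, dir_a = flow_key_and_direction("10.0.0.2", 5000, "10.0.0.10", 80, "TCP")
key_b, dir_b = flow_key_and_direction("10.0.0.10", 80, "10.0.0.2", 5000, "TCP")
```

Because the key is the same for both directions, packets of a bidirectional exchange can be collected in a single dictionary entry, while the direction string feeds the UP and DOWN counters.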

The values considered in a parameter vector are basic statistical measures and flow properties, most of them split into total, up and down contributions. Among the parameters are the usual ones included in most netflow-like flow analyses, such as average packet size, flow duration and number of packets; in addition, we have included a more detailed description at the temporal and signaling levels (e.g. interarrival times and number of URG packets).

By analyzing the nature of the parameters, we can consider a feature vector as composed of four parts:

• An identification vector (10 components), which includes all the information required to unambiguously differentiate each flow and its identification according to openDPI (used just to verify the correctness of the classification provided by the proposed system).

• A transfer related vector (24 components), which considers all the parameters related to the number of packets and their sizes.

• A time related vector (10 components), including parameters related to the temporal characteristics of the flow, such as duration and time between consecutive packets.

• A signaling vector (17 components), which accounts for the number of packets with signaling information and the associated signals.

The values for the parameters are obtained from the list of packets in a flow by analyzing just their sizes, timestamps, TCP flags (if any), and the direction of the packets. This way, no inspection of the payload beyond TCP/UDP headers is made, thus preserving the privacy of the users at the application layer. The complexity of the evaluation is low, as only maximum, minimum, count and average values for each parameter are considered.
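To make the low cost of this step concrete, the following Python sketch derives a handful of the parameters from a packet list using only counts, sums, minima and maxima. The `Packet` type, the function `flow_features` and the sample packets are hypothetical; the feature names follow Table 2 with underscores assumed between words, and only a subset of the 61 components is computed. The sketch assumes a non-empty flow.

```python
from typing import FrozenSet, NamedTuple


class Packet(NamedTuple):
    timestamp_us: int        # capture time in microseconds
    size: int                # packet size in bytes
    direction: str           # "UP" or "DOWN"
    flags: FrozenSet[str]    # TCP flags set in the packet, empty for UDP


def flow_features(packets):
    """Compute a subset of the per-flow parameters from header data only."""
    packets = sorted(packets, key=lambda p: p.timestamp_us)
    sizes = [p.size for p in packets]
    times = [p.timestamp_us for p in packets]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "N_PACKETS": len(packets),
        "N_PACKETS_UP": sum(p.direction == "UP" for p in packets),
        "N_PACKETS_DOWN": sum(p.direction == "DOWN" for p in packets),
        "PACKETS_SIZE": sum(sizes),
        "MEAN_PACK_SIZE": sum(sizes) / len(sizes),
        "MAXLEN": max(sizes),
        "MINLEN": min(sizes),
        "DURATION": times[-1] - times[0],
        "MEAN_INTERAR": sum(gaps) / len(gaps) if gaps else 0.0,
        "MAX_INTERAR": max(gaps, default=0),
        "MIN_INTERAR": min(gaps, default=0),
        "N_SYN": sum("SYN" in p.flags for p in packets),
        "N_ACKS": sum("ACK" in p.flags for p in packets),
    }


# Hypothetical four-packet TCP exchange.
sample = [
    Packet(0, 60, "UP", frozenset({"SYN"})),
    Packet(1000, 60, "DOWN", frozenset({"SYN", "ACK"})),
    Packet(1500, 52, "UP", frozenset({"ACK"})),
    Packet(4000, 1500, "DOWN", frozenset({"ACK", "PSH"})),
]
features = flow_features(sample)
```

For the sample flow this gives 4 packets (2 UP, 2 DOWN), a duration of 4000 µs, interarrival times between 500 and 2500 µs, 2 SYN packets and 3 ACK packets, all without touching any payload bytes.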

From the point of view of the classification problem addressed in this work, the flow identification parameters, except port numbers, are discarded, thus resulting in a parameter vector with, at most, 53 parameters.

4 EXPERIMENTAL RESULTS

The classification of the flows is made by using a KNN classifier and considering different groups of features, as previously stated. Therefore, some basics on the use of KNN are described next, along with a preliminary analysis of its applicability to the feature vectors considered in this work. After that, the experimental results obtained are described and analyzed.

4.1 KNN-based Classification

The K-Nearest Neighbors algorithm, or KNN, is a method for classifying objects based on the closest training examples in a feature space (Duda, 2001). It is among the simplest machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the most common class among its K nearest neighbors.

Let us suppose that we want to classify the gray triangle as a circle or a square in the space shown in Fig. 2. If K = 1, it will be classified as a circle because its closest object is a circle. However, it will be classified as a square if K = 3, as two of the three closest objects are squares. The best choice of K depends upon the data. Larger values of K generally reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good K can be selected by various heuristic techniques, for example, cross-validation. The special case where the class is predicted to be the class of the closest training sample (i.e. when K = 1) is called the "nearest neighbor" algorithm, or simply NN.

Table 2: Components of the parameter vector for each flow.

Name                   Description

Flow identification
FLOW_ID                Number of the flow (in the file)
ID_PROT                Detected protocol
IP_LOW                 Minor IP address in the session tuple
IP_UPPER               Greater IP address in the session tuple
PORT1                  Port associated to minor IP (IP_LOW)
PORT2                  Port associated to greater IP (IP_UPPER)
PROT                   Transport protocol (TCP/UDP)
DIR                    Direction of the first observed packet
FIRST_TIME             Timestamp for the first packet (µs)
LAST_TIME              Timestamp for the last packet (µs)

Transfer related
N_PACKETS              Number of packets in the flow
N_PACKETS_UP           Idem UP direction
N_PACKETS_DOWN         Idem DOWN direction
PACKETS_SIZE           Total size of the exchanged packets
PACKETS_SIZE_UP        Idem UP
PACKETS_SIZE_DOWN      Idem DOWN
PAYLOAD_SIZE           Total size of payloads
PAYLOAD_SIZE_UP        Idem UP
PAYLOAD_SIZE_DOWN      Idem DOWN
MEAN_PACK_SIZE         Mean size of the packets
MEAN_PACK_SIZE_UP      Idem UP
MEAN_PACK_SIZE_DOWN    Idem DOWN
SHORT_PACKETS          Number of short packets
SHORT_PACKETS_UP       Idem UP
SHORT_PACKETS_DOWN     Idem DOWN
LONG_PACKETS           Number of long packets
LONG_PACKETS_UP        Idem UP
LONG_PACKETS_DOWN      Idem DOWN
MAXLEN                 Maximum packet size
MAXLEN_UP              Idem UP
MAXLEN_DOWN            Idem DOWN
MINLEN                 Minimum packet size
MINLEN_UP              Idem UP
MINLEN_DOWN            Idem DOWN

Time related
DURATION               Duration of the flow (µs)
MEAN_INTERAR           Mean time among consecutive packets
MEAN_INTERAR_UP        Idem only for UP packets
MEAN_INTERAR_DOWN      Idem only for DOWN packets
MAX_INTERAR            Max. time among consecutive packets
MAX_INTERAR_UP         Idem only for UP packets
MAX_INTERAR_DOWN       Idem only for DOWN packets
MIN_INTERAR            Min. time among consecutive packets
MIN_INTERAR_UP         Idem only for UP packets
MIN_INTERAR_DOWN       Idem only for DOWN packets

Signaling
N_SIGNALING            Number of packets with flags
N_SIGNALING_UP         Idem UP
N_SIGNALING_DOWN       Idem DOWN
N_ACKS                 N. of packets with ACK flag active
N_FIN                  Idem FIN
N_SYN                  Idem SYN
N_RST                  Idem RST
N_PUSH                 Idem PSH
N_URG                  Idem URG
N_ECE                  Idem ECE
N_CWD                  Idem CWD
N_ACK_UP               N. of packets with ACK flag (UP)
N_ACK_DOWN             Idem DOWN
N_FIN_UP               Idem FIN & UP
N_FIN_DOWN             Idem FIN & DOWN
N_RST_UP               Idem RST & UP
N_RST_DOWN             Idem RST & DOWN

The training examples are vectors in a multidimensional feature space, each with a class label. Thus, the training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. For our purposes, each vector in the feature space is a data transfer related vector, a signaling vector, a time related vector or an overall vector.

Figure 2: Example of KNN classification: K = 1 (big circle in dashed line), K = 3 (biggest circle in solid line).

In the classification phase, K is a user-defined constant, and an observed vector is classified by assigning the most frequent label among the K training samples nearest to that query vector. Usually, the Euclidean distance is used as the metric to determine the proximity of two objects in the space. However, other distance measures can also be applied. In our case, four metrics are used: Euclidean, cityblock, cosine and correlation. Thus, the distance between two M-dimensional vectors $X = (x_1, x_2, \ldots, x_M)$ and $Y = (y_1, y_2, \ldots, y_M)$ is alternatively calculated as follows:

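• Euclidean:

$$d_{euc}(X,Y) = \sqrt{\sum_{i=1}^{M} (x_i - y_i)^2}$$

• Cityblock:

$$d_{cit}(X,Y) = \sum_{i=1}^{M} |x_i - y_i|$$

• Cosine:

$$d_{cos}(X,Y) = 1 - \frac{\sum_{i=1}^{M} x_i y_i}{\sqrt{\sum_{i=1}^{M} x_i^2}\,\sqrt{\sum_{i=1}^{M} y_i^2}}$$

• Correlation:

$$d_{cor}(X,Y) = 1 - \frac{1}{M} \sum_{i=1}^{M} \left(\frac{x_i - \bar{x}}{\sigma_x}\right)\left(\frac{y_i - \bar{y}}{\sigma_y}\right)$$

where $\bar{x}$ and $\bar{y}$ are the averages of the components of X and Y, and $\sigma_x$ and $\sigma_y$ their standard deviations. The latter can be calculated through the following expression:

$$\sigma_x = \sqrt{\frac{1}{M-1} \sum_{i=1}^{M} (x_i - \bar{x})^2}$$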
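The four metrics and the majority vote can be put together in a short Python sketch. It is a plain illustration, not the implementation used in the experiments: the function names, the toy training set (arranged to mimic the situation of Fig. 2) and the tie-breaking behaviour are assumptions. The Euclidean distance is written in its standard form, and the correlation distance follows the 1/M factor and the (M−1)-normalized standard deviation as stated in the text.

```python
import math
from collections import Counter


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cityblock(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def correlation(x, y):
    m = len(x)
    mean_x, mean_y = sum(x) / m, sum(y) / m
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / (m - 1))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / (m - 1))
    # With the 1/M factor, perfectly correlated vectors give 1/M, not 0.
    total = sum(((a - mean_x) / sd_x) * ((b - mean_y) / sd_y) for a, b in zip(x, y))
    return 1.0 - total / m


def knn_classify(query, training, k=3, distance=euclidean):
    """training is a list of (vector, label); returns the majority label."""
    nearest = sorted(training, key=lambda item: distance(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    # Ties go to the label seen first among the nearest samples (an assumption).
    return votes.most_common(1)[0][0]


# Toy layout (hypothetical): the closest sample is a circle, the next two are squares.
training = [
    ((1.0, 0.0), "circle"),
    ((0.0, 1.5), "square"),
    ((-1.6, 0.0), "square"),
    ((3.0, 3.0), "circle"),
]
label_k1 = knn_classify((0.0, 0.0), training, k=1)
label_k3 = knn_classify((0.0, 0.0), training, k=3)
```

With this layout the query is labeled as a circle for K = 1 and as a square for K = 3, which is the behaviour described for the gray triangle. Passing `distance=cityblock` (or any of the other functions) switches the metric without changing the voting logic.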

4.2 Preliminary Analysis

The parameter list in Table 2 includes different measures over flows: ports used, duration, time between packets, number and length of packets, etc. This variety implies many statistical differences in terms of average, range or distribution between them, especially between parameters of a different nature. An extreme example is the comparison of signaling parameters with timing ones. As can be seen in Table 3, the mean values of DURATION and N_SYN differ by eight or nine orders of magnitude, depending on the traffic type (P2P or non-P2P). Similar differences, O(6) to O(8), can be observed between the statistics of the number of packets per flow (N_PACKETS) and DURATION. At the same time, the parameter ranges are also very different, from O(9) between DURATION and N_SYN to O(3) between N_PACKETS and N_SYN. These wide differences in ranges and means between parameter types are a drawback for classification algorithms, especially those based on distance measures, as in the KNN case.
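The orders of magnitude quoted for the mean values can be checked directly from the figures in Table 3. The short Python sketch below does so; the dictionary layout and the helper name `orders_of_magnitude` are assumptions for illustration, while the numbers are the S1 means reported in the table.

```python
import math

# Mean values for S1 as reported in Table 3.
means = {
    "p2p": {"DURATION": 5.58e8, "N_PACKETS": 124.85, "N_SYN": 2.38},
    "non_p2p": {"DURATION": 5.04e9, "N_PACKETS": 46.80, "N_SYN": 1.67},
}


def orders_of_magnitude(larger, smaller):
    """Base-10 logarithm of the ratio between two positive values."""
    return math.log10(larger / smaller)


duration_vs_syn = {
    kind: orders_of_magnitude(m["DURATION"], m["N_SYN"]) for kind, m in means.items()
}
duration_vs_packets = {
    kind: orders_of_magnitude(m["DURATION"], m["N_PACKETS"]) for kind, m in means.items()
}
```

The ratio of DURATION to N_SYN spans about 8.4 orders of magnitude for P2P and about 9.5 for non-P2P, and the ratio of DURATION to N_PACKETS about 6.7 and 8.0 respectively, in line with the ranges stated in the text.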

One solution could be to re-scale the parameters using a logarithm function (Yuan, 2010), so that all of them are distributed in the same value range between 0 and 1. This option has the disadvantage that all the parameters are treated in a similar way, which is not realistic. We cannot ignore that some parameters have a continuous nature, for example all those related to time, while others have an evidently discrete nature, like the signaling parameters. This suggests treating them separately in order to strengthen the classification techniques.
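For reference, one possible form of such a logarithmic re-scaling is sketched below in Python. The exact function used in (Yuan, 2010) is not given here, so the choice of log(1 + v) divided by its value at the maximum is an assumption, as is the function name `log_rescale`; the sketch also assumes non-negative values.

```python
import math


def log_rescale(values):
    """Map non-negative values to [0, 1] on a logarithmic scale."""
    top = max(values)
    if top <= 0:
        # All values are zero: nothing to scale.
        return [0.0 for _ in values]
    # log1p(v) = log(1 + v) keeps zero at zero and avoids log(0).
    return [math.log1p(v) / math.log1p(top) for v in values]


# DURATION-like values (µs) taken from the P2P columns of Table 3: min, mean, max.
scaled = log_rescale([0, 5.58e8, 1.69e11])
```

The three values map to 0, roughly 0.78 and 1, which shows how the transformation compresses a range of eleven orders of magnitude into the unit interval, while applying the same treatment to continuous and discrete parameters alike.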

On the other hand, it is likely that some parameters provide more information than others, and even that some parameters do not provide any evidence at all for classification purposes. If all the parameters are treated in a homogeneous way, that is, if all the ranges are somehow normalized, the classification accuracy can be diminished (features without discriminative information will behave as noise for the classifier). An approach in which only selected features are evaluated, in order to determine their discriminative capabilities when compared against the others, therefore becomes very interesting. Consequently, the classifier to evaluate will be composed of a set of classifiers, one per group of features, and the final decision will be made by combining the outputs provided by each individual classifier. This approach, named Multiple Vector Classification, is similar to that of MVQ, which has been successfully applied in speech recognition (Segura, 1994).

The experiments are designed to assess the improvement in classification that can be achieved by using this method when compared to a single classifier whose inputs are vectors composed of all the features in Table 2. Therefore, a preliminary experiment with the full feature vector over S1 has been carried out to serve as the reference system. The results are shown in Fig. 3, where a KNN classifier is used as indicated in Section 4.1 with four different distance metrics: Euclidean, cityblock, cosine and correlation. The number of nearest neighbors used in the classification test ranged from K = 1 to K = 10. The best results were obtained for K = 4, and they are similar to others reported in the literature. The results do not differ significantly between the considered distances in terms of classification accuracy (CA), because all of them achieve the same true negatives (TN) results. Thus, the differences are mainly based on the TP measure.

4.3 MVC: Identification Results and Analysis

As previously explained, the 53 parameters obtained for representing flows have been split into three sets according to their nature (Table 2):

1. Time Related. There are 10 parameters related to time in the feature vector, like the duration of the flows, arrival intervals between packets, etc.

2. Data Transfer Related. This set includes all the parameters indicating volume, like the number of packets of the flow, number of bytes of the packets, etc. There are 24 parameters belonging to this category.

3. Signaling Related. There are 19 parameters related to signaling, like the total number of signaling packets exchanged in the flow, and the number of each kind of signaling packet (ACK, FIN, SYN, URG, etc.).

According to some works in traffic identification, e.g. (Sen, 2004b), and despite the use of ephemeral ports in many protocols, port numbers can still be informative to differentiate some protocols. However, port numbers do not naturally belong to any of the considered sets of features. Given their lack of relationship with time or volume parameters, and taking into account that they can be regarded as signaling at the application layer, we have included port numbers in the signaling category.

[Figure 3: KNN classification results over S1 flows using all the features in a single vector. Series: % CA, % TP, % TN. Categories: Euclidean, Cityblock, Cosine, Correlation. Axis from 88% to 98%. Data labels in extraction order: 0.97, 0.97, 0.95, 0.95, 0.93, 0.93, 0.90, 0.90, 0.98, 0.98, 0.98, 0.98.]

Three different MVC schemes have been evaluated with the same distance metrics and values for K as in the reference system in Fig. 4. They all rely on a first triple KNN classification, one for each group of features or vector, and are as follows:

• Voting. The flow is assigned to the class chosen by the majority of the three KNN classifiers, one per group of features. This way, a flow will be identified as P2P if at least two of the KNNs classify it as P2P.

• Nearest KNN. The flow is assigned to the category given by the KNN providing the lowest distance, regardless of whether it corresponds to the time related vector, the signaling vector or the data vector.

• Lowest Aggregate Distance. The flow is assigned to the category for which the sum of the distances for each vector, provided by the corresponding KNN, is the lowest.
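The three combination rules can be sketched in Python as follows. The text does not specify exactly which distance each KNN reports, so the data layout is an assumption: each per-group classifier is taken to return its chosen label together with one distance per class (for instance, the distance to the nearest training sample of that class). The function names and the sample numbers are hypothetical, and the sketch ignores the fact that distances from different metrics may need to be brought to a common scale.

```python
from collections import Counter


def mvc_voting(outputs):
    """Majority label among the per-group KNN decisions."""
    votes = Counter(out["label"] for out in outputs.values())
    return votes.most_common(1)[0][0]


def mvc_nearest_knn(outputs):
    """Label of the group whose KNN reports the lowest distance for its own decision."""
    best = min(outputs.values(), key=lambda out: out["distance"][out["label"]])
    return best["label"]


def mvc_lowest_aggregate(outputs, classes=("p2p", "other")):
    """Class with the lowest sum of per-group distances."""
    totals = {c: sum(out["distance"][c] for out in outputs.values()) for c in classes}
    return min(totals, key=totals.get)


# Hypothetical outputs of the three per-group KNN classifiers for one flow.
outputs = {
    "time": {"label": "p2p", "distance": {"p2p": 0.2, "other": 0.5}},
    "transfer": {"label": "p2p", "distance": {"p2p": 0.3, "other": 0.4}},
    "signaling": {"label": "other", "distance": {"p2p": 0.9, "other": 0.1}},
}
by_voting = mvc_voting(outputs)
by_nearest = mvc_nearest_knn(outputs)
by_aggregate = mvc_lowest_aggregate(outputs)
```

In this made-up case voting yields P2P (two of three groups), whereas the nearest KNN and lowest aggregate distance rules both yield non-P2P because of the very small signaling distance, which illustrates how the schemes can disagree on the same flow.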

Prior to the evaluation of the above MVC schemes, it is necessary to determine again the best K value and distance for classification purposes, as we are now considering the features grouped into vectors of different nature. As before, CA, TP or TN can be used to estimate the best K and distance. Taking into account Figure 3, it seems reasonable to select the distance that provides the best TP results, because TP is the weakest aspect in the four cases. Thus, cityblock is the best distance for the time related and signaling vectors, while the correlation distance is the best one for the data transfer vector. The number of nearest neighbors is K = 4, the same value that was obtained in Section 4.2.

Taking into account these results for the individual KNNs, the three alternative MVC schemes for definitively identifying an observed flow as P2P or otherwise are evaluated. Figure 4 shows the identification results obtained in comparison with those of the reference system, in which a single 53-dimensional vector-based KNN is used.

Table 3: Basic statistics of some different flow parameters for S1.

                          P2P                              non-P2P
Parameter    Type         Mean        Min   Max            Mean        Min   Max
DURATION     time         5.58×10^8   0     1.69×10^11     5.04×10^9   0     1.72×10^11
N_PACKETS    volume       124.85      1     229507         46.80       1     243444
N_SYN        signaling    2.38        0     673            1.67        0     524

[Figure 4: KNN classification results over S1 node flows using different MVC schemes. Series: % CA, % TP, % TN. Categories: KNN all parameters, MVC voting, MVC nearest KNN, MVC lowest aggregate distance. Axis from 90% to 100%. Data labels in extraction order: 0.97, 0.99, 0.98, 0.93, 0.93, 0.96, 0.96, 0.98, 0.98, 1.00, 0.99, 0.92.]

The TP improvement achieved with the three MVC schemes is remarkable. Furthermore, both the MVC voting and MVC nearest KNN schemes improve CA and TN. These figures demonstrate that it is possible to detect more than 96% of P2P traffic with a very low false positive (FP) rate, less than 0.25%. In summary, the MVC schemes outperform the results obtained by the single KNN classifier with mixed parameters in all the considered measures (CA, TP and TN), with the only exception of MVC with lowest aggregate distance, which just improves the true positive rate.

In order to validate the method, we have tested the best MVC scheme over dataset S2, comparing it with the results obtained by the single KNN classifier with mixed parameters (cityblock distance). Figure 5 shows that the MVC method still performs better than a single KNN with all the parameters together. Therefore, the results over the S1 and S2 datasets show that the MVC schemes, combined with a grouping of parameters by similarity, are an appropriate method to discriminate between P2P and non-P2P flows.
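For reference, the measures used in these comparisons can be computed as in the short Python sketch below. It assumes the usual definitions, with P2P as the positive class: CA as the fraction of correctly classified flows, TP and TN as the true positive and true negative rates, and the FP rate as 1 minus TN. The function name and the ten labelled flows are hypothetical and do not reproduce the S1 or S2 results.

```python
def rates(true_labels, predicted, positive="P2P"):
    """Classification accuracy (CA) and TP, TN and FP rates for two classes."""
    tp = fn = tn = fp = 0
    for truth, guess in zip(true_labels, predicted):
        if truth == positive:
            if guess == positive:
                tp += 1   # P2P flow identified as P2P
            else:
                fn += 1   # P2P flow missed
        else:
            if guess == positive:
                fp += 1   # non-P2P flow wrongly flagged as P2P
            else:
                tn += 1   # non-P2P flow correctly rejected
    return {
        "CA": (tp + tn) / (tp + tn + fp + fn),
        "TP": tp / (tp + fn),
        "TN": tn / (tn + fp),
        "FP": fp / (tn + fp),
    }


# Hypothetical ground truth and classifier output for ten flows.
truth = ["P2P"] * 4 + ["non-P2P"] * 6
guess = ["P2P", "P2P", "P2P", "non-P2P"] + ["non-P2P"] * 5 + ["P2P"]
result = rates(truth, guess)
print(result)
```

Under these definitions, an FP rate below 0.25% corresponds to a TN rate above 99.75%.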

Figure 5: KNN classification results over dataset S2.
(Chart: % CA, % TP and % TN for KNN all parameters and MVC voting; vertical axis from 30% to 100%.)

5 CONCLUSIONS

A multiple vector classification approach to identify P2P traffic is presented in this paper. It is flow-based, each flow being represented by a tuple of three vectors regarding, respectively, data transfer features, signalling features and time features. A triple classification is subsequently made per flow, one for each feature vector. From this multiple vector classification, a global decision is taken to identify the flow as P2P traffic or not. Although a simple KNN-based classifier is used to implement the system, the experimental results achieved show the promising nature of our approach in reliably identifying P2P traffic. Furthermore, the identification scheme does not require access to sensitive information in packet payloads.

The proposed approach can be improved in several ways, of which we mention three. First, a better classifier can be used; for example, SVM has been shown to perform very well in this kind of task. Second, each feature vector could be analyzed in more detail in order to reduce its dimensionality to its most representative characteristics. Third, alternative combination schemes can be considered in the global identification process.

ACKNOWLEDGEMENTS

This work has been partially supported by Spanish MICINN under project TEC2008-06663-C03-02.

REFERENCES

Callado, A., Kamienski, C., Szabo, G., Gero, B.P., Kelner, J., 2009. "A Survey on Internet Traffic Identification". In IEEE Communications Surveys & Tutorials, vol. 11, n. 3, pp. 37-52.

Callado, A., Kelner, J., Sadok, D., Kamienski, C.A., Fernandes, S., 2010. "Better network traffic identification through the independent combination of techniques". In Journal of Network and Computer Applications, vol. 33, pp. 433-446.

Chen, H., Zhou, X., You, F., Wang, C., 2010. "Study of Double-Characteristics-Based SVM Method for P2P Traffic Identification". In Int. Conference on Networks Security Wireless Communications and Trusted Computing, pp. 202-205.

Duda, R.O., Hart, P.E., Stork, D.G., 2001. "Pattern Classification". John Wiley & Sons.

Erman, J., Mahanti, A., Arlitt, M., Cohen, I., Williamson, C., 2007. "Offline/Realtime Traffic Classification Using Semi-Supervised Learning". In Performance Evaluation, vol. 64, pp. 1194-1213.

Fontenelle, M., Bessa, J., Siqueira, G., Holanda, R., Sousa, J., 2007. "Using Statistical Discriminators and Cluster Analysis to P2P and Attack Traffic Monitoring". In Latin American Network Operations and Management Symposium, pp. 67-76.

Gomez, J.M., Puertas, E., Maña, M.J., 2002. "Evaluating cost-sensitive unsolicited bulk email categorization". In Proc. of the ACM Symposium on Applied Computing, ACM Press, pp. 615-620.

JinSong, W., Yan, Z., Qing, W., Gong, W., 2007. "Connection Pattern-based P2P Application Identification Characteristic". In Proc. of Int. Conference on Network and Parallel Computing Workshops, pp. 437-441.

Karagiannis, T., Papagiannaki, K., Faloutsos, M., 2005. "BLINC: Multilevel Traffic Classification in the Dark". In Proc. of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, pp. 229-240.

Keralapura, R., Nucci, A., Chuah, C., 2010. "A Novel Self-Learning Architecture for P2P Traffic Classification in High Speed Networks". In Computer Networks, vol. 54, pp. 1055-1068.

Kohavi, R., 1995. "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection". In Proc. of the 14th International Joint Conference on Artificial Intelligence, Montreal, Canada.

Li, X., Liu, Y., 2010. "A P2P Network Traffic Identification Model Based on Heuristic Rules". In Int. Conference on Computer Application and System Modeling, vol. 5, pp. 177-179.

Madhukar, A., Williamson, C., 2006. "A Longitudinal Study of P2P Traffic Classification". In Proc. of Int. Symposium on Modeling, Analysis and Simulation, pp. 179-188.

Mochalski, K., Schulze, H., 2009. "Deep Packet Inspection. Technology, applications & net neutrality". White Paper. Available at http://www.ipoque.com/resources/white-papers.

OpenDPI, 2011. http://www.opendpi.org

Segura, J.C., Rubio, A.J., Peinado, A.M., García, P., Roman, R., 1994. "Multiple VQ Hidden Markov Modelling for Speech Recognition". In Speech Communication, vol. 14, no. 2, pp. 163-170.

Sen, S., Spatscheck, O., Wang, D., 2004. "Accurate, Scalable In-Network Identification of P2P Traffic Using Application Signatures". In Proc. of the Int. Conference on World Wide Web, pp. 512-521.

Sen, S., Wang, J., 2004. "Analyzing Peer-to-Peer Traffic Across Large Networks". In IEEE/ACM Transactions on Networking, vol. 12, n. 2, pp. 219-232.

Soysal, M., Schmidt, E.G., 2010. "Machine Learning Algorithms for Accurate Flow-Based Network Traffic Classification: Evaluation and Comparison". In Performance Evaluation, vol. 67, n. 6, pp. 451-467.

Xuan-min, L., Jiang, P., Ya-jian, Z., 2010. "A New P2P Traffic Identification Model Based on Node Status". In Int. Conference on Management and Service Science, pp. 1-4.

Yiran, G., Suoping, W., 2010. "Traffic Identification Method for Specific P2P Based on Multilayer Tree Combination Classification by BP-LVQ Neural-Network". In Int. Forum on Information Technology and Applications, pp. 34-38.

Yuan, R., Li, Z., Guan, X., Xu, L., 2010. "An SVM-based machine learning method for accurate internet traffic classification". In Information Systems Frontiers, Springer-Verlag, vol. 12, n. 2, pp. 149-156.

