CSR 01/13 - OVA-based Multi-Class Classification for Data Stream Anomaly Detection

Faculty of Mathematics, Natural Sciences and Computer Science
Institute of Computer Science
COMPUTER SCIENCE REPORTS
Report 01/13, April 2013
OVA-BASED MULTI-CLASS CLASSIFICATION FOR DATA STREAM ANOMALY DETECTION
TINO NOACK, INGO SCHMITT, SASCHA SARETZ



Computer Science Reports
Brandenburg University of Technology Cottbus
ISSN: 1437-7969
Send requests to: BTU Cottbus, Institut für Informatik, Postfach 10 13 44, D-03013 Cottbus

Computer Science Reports 01/13, April 2013
Brandenburg University of Technology Cottbus
Faculty of Mathematics, Natural Sciences and Computer Science
Institute of Computer Science

Tino Noack, Ingo Schmitt, Sascha Saretz
[email protected], [email protected], [email protected]
http://dbis.informatik.tu-cottbus.de

OVA-based Multi-Class Classification for Data Stream Anomaly Detection

Computer Science Reports
Brandenburg University of Technology Cottbus
Institute of Computer Science
Head of Institute: Prof. Dr. Ingo Schmitt ([email protected])
BTU Cottbus, Institut für Informatik, Postfach 10 13 44, D-03013 Cottbus

Research groups and their heads:
Computer Engineering: Prof. Dr. H. Th. Vierhaus
Computer Network and Communication Systems: Prof. Dr. H. König
Data Structures and Software Dependability: Prof. Dr. M. Heiner
Database and Information Systems: Prof. Dr. I. Schmitt
Programming Languages and Compiler Construction: Prof. Dr. P. Hofstedt
Software and Systems Engineering: Prof. Dr. C. Lewerentz
Theoretical Computer Science: Prof. Dr. K. Meer
Graphics Systems: Prof. Dr. D. Cunningham
Systems: Prof. Dr. R. Kraemer
Distributed Systems and Operating Systems: Prof. Dr. J. Nolte
Internet-Technology: Prof. Dr. G. Wagner

CR Subject Classification (1998): H, H.2.8
Printing and Binding: BTU Cottbus
ISSN: 1437-7969

OVA-based Multi-Class Classification for Data Stream Anomaly Detection

Tino Noack, Ingo Schmitt, Sascha Saretz
Brandenburg University of Technology Cottbus, Germany
Institute of Computer Science, Information and Media Technology
{Tino.Noack, Ingo.Schmitt, Sascha.Saretz}@tu-cottbus.de

Abstract. Mobile cyber-physical systems (MCPSs), such as the International Space Station, are equipped with sensors which produce sensor data streams.
Continuous changes like wear and tear influence the system states of a MCPS continually during runtime. Hence, monitoring is necessary to provide reliability and to avoid critical damage. However, the monitoring process is limited by resource restrictions. Therefore, the focal point of the present paper is on time-efficient multi-class data stream anomaly detection. Our contribution is twofold. First, we use a one-versus-all classification model to combine a set of heterogeneous one-class classifiers consecutively. Such a chain of one-class classifiers provides a very flexible structure while the administrative overhead is reasonably low. Second, based on the classifier chain, we introduce classifier pre-selection.

1 Introduction

Mobile cyber-physical systems (MCPSs), such as the International Space Station (ISS), are location independent and embedded into a physical environment. Mechanical influences (e.g. continual friction) as well as external impacts of a harsh and uncertain physical environment (e.g. geophysical effects) can cause wear and tear. Mostly, wear and tear leads to system state changes, which can cause sudden changes such as crashes. On that score, monitoring MCPSs is indispensable to ensure reliability and to avoid critical damage. The monitoring process, which takes place on a MCPS, is often limited by resource restrictions (e.g. processing capacity, memory or power consumption). To compensate for resource restrictions, an external wireless network often connects a MCPS with external information systems. External information systems are usually stationary parts of a MCPS. MCPSs are equipped with sensors which produce sensor data streams [1, 12]. These data streams have to be processed in an appropriate manner for monitoring MCPSs.
Further information about real-time monitoring is provided by [33]. Data stream processing [7, 28], which involves knowledge discovery from data streams (KDDS) [11] as well as data stream mining [4], has received much attention in recent years. Substantial components of data stream processing are data stream clustering, data stream classification and data stream anomaly detection. Time-efficiency is very important for data stream processing. However, only little work exists at the moment that considers time-efficient data stream anomaly detection. Applying anomaly detection techniques in a data stream context is necessary to monitor MCPSs and, simultaneously, to identify a large number of system states during runtime. For that reason, the focal point of the present paper is on time-efficient multi-class data stream anomaly detection.

1.1 Basic Notation

As depicted in Figure 1, a multi-class anomaly detection problem is based on a set of classes (bounded regions) $\Omega = \{\omega_1, \omega_2, \ldots, \omega_k\}$. These classes represent expert knowledge about the system states of a MCPS. A vector space $S$ [2] and the cluster assumption [25] are the basic foundations. A multi-class anomaly detection problem corresponds to the anomaly class $C = S \setminus \bigcup_i \omega_i$. The anomaly class represents the unawareness about the system states. A data stream is considered as a continuous and almost infinite stream of unlabelled instances [1]. A vector space $S$ is spanned by a set of $n$ mutually independent (orthogonal) attributes $A_1, \ldots, A_n$. Attribute values are functions over time $T$, i.e. values of $A_i$ with $i = 1 \ldots n$ are values of $a_i\colon T \to \mathbb{R}$. We denote the time as an index. Hence, an unlabelled instance with index $t$ is represented as $s_t = (a_{t,1}, a_{t,2}, \ldots, a_{t,n})^T$, $s_t \in S$.

1.2 Problem Description

The shape of the present classes can diverge widely. Hence, an approximation of the present multi-class anomaly detection problem is required. We suggest to use a distinct and optimized classifier for each class. Therefore, a multi-class anomaly detection problem can be comprehended as a set of binary decisions. Each binary decision corresponds to a dichotomous classifier. A dichotomous classifier provides two mutually exclusive decisions. It returns true if an unlabelled instance was accepted and false if an unlabelled instance was not accepted. At most one of a set of dichotomous classifiers yields true. This implies a distinct classification result. A set of dichotomous classifiers entails two advantages in contrast to homogeneous multi-class classification. On the one hand, such a set of classifiers usually provides a higher classification accuracy. On the other hand, a multi-class anomaly detection problem falls apart into a set of simple binary classification problems.

The set of dichotomous classifiers has to be combined in a sufficient manner. A combination is required to solve a multi-class anomaly detection problem under timing constraints. We suggest a consecutive application of these dichotomous classifiers. The resulting classifier chain provides a very flexible structure while the administrative overhead is reasonably low. Moreover, we suggest to use a one-versus-all (OVA) [13, 14, 31] multi-class classifier model. Each classifier of an OVA-based multi-class classification model is trained between the target class and the other classes including the anomaly class. Therefore, we suggest the application of a set of one-class classifiers [30]. Versatile performance studies of one-class classifiers are contributed by [5, 15].

Fig. 1. Example of a multi-class anomaly detection problem (based on [6])

As stated in [30], each one-class classifier algorithm comprises two distinct elements. First, a distance or a resemblance of an instance to a class.
Second, a threshold on this distance or resemblance. This threshold is used to define a one-class classifier as a dichotomous classifier. Further on, a chain of disjoint one-class classifiers provides a distinct classification result for a multi-class anomaly detection problem.

To the best of our knowledge, no substantial research has been conducted to study OVA-based multi-class classification models for data stream anomaly detection. Due to the consecutive application of one-class classifiers, an OVA-based multi-class classification model becomes a linear or sequential search problem. Further information about sequential searching is provided by [16]. The main disadvantage of such a chain of one-class classifiers is its time-inefficiency. However, the reduction of processing time can also help to reduce power consumption. Therefore, our contribution is a minimization of the average processing time. A one-class classifier entails a probability of occurrence and a processing time. The absolute processing time depends on a specific target machine. First, we minimize the expected value over the probability of occurrence in conjunction with the processing time of the component one-class classifiers. This minimization is intended to decrease the processing time of the classifier chain for the majority of unlabelled instances. Second, we minimize the processing time in the worst case scenario, when an unlabelled instance has to be processed by all of the component classifiers. Therefore, we suggest classifier pre-selection. We approximate each one-class classifier by means of a minimal bounding hypersphere. Afterwards, the distances of an unlabelled instance to all of the hypersphere centres can be calculated. On that score, it is possible to preselect a reduced set of candidate classifiers. Moreover, we use the ISS Columbus Air Loop [19, 20] in a real world case study to evaluate and assess the aforementioned suggestions.

The remaining sections of the paper are structured as follows. We discuss
WediscussrelatedworkinSection2andSection3introducestheISSColumbusAirLoop.Moreover, we describe the minimizationof the average processingtime of a3classierchaininSection4andclassierpre-selectionisdelineatedinSection5.WeexplaintheexperimentalsetupandperformtherealworldcasestudyinSection6.Finally,wediscusstheresultsandconcludeourworkinSection7.2 RelatedWorkCommonly, classication-basedanomalydetectionencompasses three phases:training, assessing andclassifying. Trainingisusedtolearnaclassiermodelfrom a set of labelled training data AtrS. Assessing is used to evaluate and renethelearnedmodel bytraininginstances(e.g. crossvalidation[10]). Classifyingis usedto classifyunlabelledinstances bythe learnedmodel. Classication-basedanomalydetectionisgroupedintotwocategories: multi-class andone-class classicationtechniques. Asstatedin[6], multi-classclassication-basedanomalydetectionworksmostlyinasupervisedmanner. Itpresumesthatthetraining data Atrinvolve multiple classes which include the anomaly class. One-class classication-based anomaly detection works in a semi-supervised manner.Itpresumesthatthetrainingdatainvolveonlyoneclasslabel[6, 30].Normally, monitoringMCPSsrequirestheinteractionwithhumanexpertsduringthetrainingandassessingphases.Forexample,theISSColumbusFail-ureManagementSystem[19, 20] dispatchestheoccurringdatastreamsalmostcompletely(down-sampled)toanexternalinformationsystem(thegroundsta-tion)toprovidehistorical data. Thiscontraststhewidelyadoptedassumptionthatdatastreamscannotbestoredalmostcompletely. Indeed, suchanalmostcompletestorageisveryexpensive. But amongotherreasons, humanexpertsareresponsibleforrelatedandconsequentdecisions. 
Therefore, historical dataisaprerequisiteforlong-termfailureanalysisandadequateanomalydetection.Consideringtheexistenceofanexternalinformationsystem,wedistinguishbetween onlineand oinetraining methods for data stream anomaly detection.Onlinetrainingmethodsneglecttheexistenceof external informationsystemsandtheyassumethatadatastreamcannotbestoredcompletely.Furthermore,theyprovidetheabilityfortrainingthemodel;andsimultaneously,forclassify-ingunlabelledinstancesinreal-timeornearreal-time.Forthispurpose,onlinetraining methods are applied by means of one-pass algorithms while only a smallwindow-based set of training instances is available. In the context of online datastream anomaly detection, user interaction and fast algorithms are mutually ex-clusive. Therefore, theassessmentof theresultingmodel isoftenneglectedbyonlinetrainingmethods. Onereasonistheabsenceof areasonablenumberoftraining instances due to windowing. Hence, the accuracy is proved insuciently.Besides,thevericationofclassifyingresultsbyhumanexpertsisverydicult.Some online trainingmethods require aninput of labelledtraininginstancesduringruntimeandatcertaintimeintervalsformodel trainingorretraining.TheOcVFDT[18, p. 80] approach, forexample, prerequisites20%of labelledtraining data during runtime. Such an online training method entails three moredrawbacks.First,theprovisionoflabelledtrainingdataduringruntimecannotbe always guaranteed and it contradicts the functioning of most of the real world4Online subcycleMobile cyber-physical systemOffline subcycleExternal informationsystemWireless networkFig. 2.KDC-Acyclicmonitoringprocess(basedon[21, 22])applications. Second, model training or retraining during runtime expends com-putational resources which are actually envisaged for system monitoring. Third,trainingsuchanunassessedmodelduringruntimecancauseunforeseeableandcriticalsideeectsfortheentiremonitoringprocess.Duetotheaforementioneddrawbacks, ourworkfocusesonoinetrainingmethods. 
As depicted in Figure 2, offline training methods require an offline subcycle and an online subcycle of a Knowledge Discovery Cycle (KDC) [21, 22]. The offline subcycle is used for long-term analysis, for model training and for model assessing. Model training and model assessing are performed on an external information system which provides many system resources. Thus, heavyweight algorithms can be applied while historical training instances are available. Therefore, training the model can be very time consuming while the online resources are not adversely affected, and the resulting model can be easily assessed by human experts. The trained model needs to be transferred from the offline subcycle to the online subcycle from time to time. The online subcycle is used to apply the trained model for data stream anomaly detection. Computational resources of the online subcycle are exclusively used for classifying unlabelled instances by means of the previously created model. Therefore, classifying can be very fast. Following this, we discuss present algorithms for data stream anomaly detection.

OLINDDA [27] implements a cluster-based approach for novelty and concept drift detection. By default, OLINDDA works in an unsupervised manner, and it uses standard clustering algorithms to identify unknown clusters. Afterwards, the similarity to known clusters is assessed. As stated in [29, p. 1512], this method detects new emerging concepts rather than anomalies. Thus, OLINDDA is not generally used for the purpose of data stream anomaly detection, which constitutes its main disadvantage.

FRAHST [24] is a rank-adaptive algorithm for fast principal subspace tracking. It works in an unsupervised manner, and it is used to identify anomalies in streaming data of low dimensions. Therefore, the subspaces are built using dimensionality reduction. Observed data that cannot be sufficiently explained by the current model is considered as anomalous.
One obvious disadvantage of this method is the loss of information during dimensionality reduction. Moreover, it does not provide the ability to distinguish between heterogeneous classes.

A method which uses ensembles of streaming half-space trees (HS-Trees) is described in [29]. A HS-Tree is a binary tree where each node is used to capture a number of data elements within a particular subspace of the data stream. The HS-Trees method is a fast one-class anomaly detector for evolving data streams. As stated in [29, p. 1511], the HS-Trees approach requires only normal data, which excludes the anomaly class, to retrain the model. The model is retrained continuously at the end of each window. The main disadvantage of the proposed method is the paraxial separation of the subspaces. Hence, the anomaly detection task becomes ambiguous when the subspaces are not paraxially separable.

A very fast decision tree for one-class classification of data streams (OcVFDT) is described in [18]. The OcVFDT approach is an extension of the very fast decision tree (VFDT) [9]. During the training phase, the OcVFDT algorithm constructs a tree forest, and then the best tree is chosen as the final output classifier. As stated in [18, p. 80], OcVFDT presupposes approximately 20% of labelled training instances during runtime to retrain the classification model. The proposed method contains two disadvantages. First, it uses discrete attribute values. Hence, it is not widely applicable. Second, the computational effort of this method depends on the selected one-class classification algorithm. Thus, training a tree forest as well as the classification of unlabelled instances can be very time-consuming when a complex one-class classifier has been chosen.

We analysed four data stream anomaly detection methods. The assessment and the evaluation of the resulting model by human experts is neglected by all of the analysed methods.
Therefore, training a solid model, which is based on long-term historical data, is not sufficiently considered by the analysed methods. Considering these disadvantages, our contribution is the adjustment of an offline training method, in particular OVA-based multi-class classification, for anomaly detection to the demands of data streams under timing constraints.

3 Example - ISS Columbus Air Loop

The ISS Columbus Module [19, 20] is a MCPS, and it comprises an air loop. The ISS Columbus Air Loop is part of the life support system. The ISS Columbus Failure Management System is responsible for crew health and for detecting time critical failures. The ISS Columbus Air Loop is monitored by the failure management system. The air loop consists of fan assemblies which provide air circulation in the ISS Columbus Module. Moreover, the fan assemblies are responsible for air exchange between the ISS Columbus Module and the ISS. Amongst others, air circulation is required for air revitalization, for air conditioning, for smoke or fire detection and for avoiding dead air pockets. The ISS Columbus Air Loop contains an Inter Module Ventilation Return Fan Assembly (IRFA). The IRFA is monitored by different sensors. For the sake of simplicity, we focus on three sensor attributes: speed, current and pressure. The speed relates to the rotating speed of the fan assembly, and the unit of measurement is 1/min. The current relates to the electrical input current of the IRFA, and the unit of measurement is A. The pressure relates to the pressure head that is generated by the IRFA, and the unit of measurement is kPa.

Fig. 3. IRFA training data and seven classes (axes: Pressure [kPa], Speed [1/min], Current [A]): Class 1 (168), Class 2 (43), Class 3 (144), Class 4 (136), Class 5 (3478), Class 6 (743), Class 7 (1779)

Figure 3 depicts the training data, which was previously clustered by human experts. The training data comprise seven classes $\Omega = \{\omega_1, \omega_2, \ldots, \omega_7\}$. Each class refers to a specific system state of the IRFA.
The class $\omega_1$, for example, relates to a faulty system state where the speed of the IRFA is unusually increased. The class $\omega_5$, for example, refers to a default system state where the IRFA works as expected. The class $\omega_7$, for example, describes a system state where the IRFA has been switched off. The values in the parentheses denote the number of instances.

4 Minimizing the Time Consumption in Average

Figure 4 depicts a consecutive chain of a set of one-class classifiers $occ_{(i)}$. Each one-class classifier is a dichotomous classifier. Moreover, each one-class classifier provides a binary decision (true or false) due to the provided threshold. True implies that an unlabelled instance was accepted by a classifier, and false implies the opposite. We define the applied one-class classifiers as disjoint to avoid ambiguous classification results. Therefore, the combination of these mutually exclusive classifiers leads to a distinct classification result. The consecutive chain of one-class classifiers refers to OVA-based multi-class classification, where the classifiers are combined by means of binary decisions. The applied one-class classifiers can be rearranged into $k!$ many different permutations.

The termination condition of such an OVA-based multi-class classifier model entails two possibilities. First, the algorithm terminates when an unlabelled instance was accepted by a classifier. Second, the algorithm terminates when all of the component one-class classifiers fail. The worst case scenario relates to a case when an unlabelled instance has to be processed by all of the component one-class classifiers. Each one-class classifier entails a processing time $t_{(i)}$. The bracketed indexes refer to a permutation. Amongst others, Bifet et al. established a requirement for data stream environments which states: "Process an example [an unlabelled instance] at a time, and inspect it only once (at most)" [3, p. 1601]. In conformity with this statement, the processing time of a one-class classifier refers to the classification of one unlabelled instance at a time. The main purpose is the minimization of the overall processing time which is consumed by an OVA-based multi-class classifier model to solve a multi-class anomaly detection problem.
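The consecutive chain and its two termination conditions can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation; the interval-based predicates standing in for trained one-class classifiers are purely hypothetical:

```python
from typing import Callable, List, Optional, Sequence

# A one-class classifier is modelled as a predicate: it accepts (True)
# or rejects (False) an unlabelled instance s_t.
OneClassClassifier = Callable[[Sequence[float]], bool]

def classify(chain: List[OneClassClassifier], s_t: Sequence[float]) -> Optional[int]:
    """Apply the one-class classifiers consecutively (OVA chain).

    Returns the index of the first classifier that accepts s_t, or None
    if all component classifiers fail (worst case: anomaly class)."""
    for i, occ in enumerate(chain):
        if occ(s_t):      # termination condition 1: instance accepted
            return i
    return None           # termination condition 2: all classifiers fail

# Toy example: three disjoint interval "classifiers" on one attribute.
chain = [
    lambda s: 0.0 <= s[0] < 1.0,
    lambda s: 1.0 <= s[0] < 2.0,
    lambda s: 2.0 <= s[0] < 3.0,
]
print(classify(chain, [1.5]))   # accepted by the second classifier -> 1
print(classify(chain, [7.0]))   # rejected by all -> None (anomaly)
```

Because the classifiers are disjoint, the first acceptance already yields the distinct classification result; rearranging `chain` into any of the $k!$ permutations changes only the processing time, not the result.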
Hence, it is intended to reach the termination condition as fast as possible for the majority of the unlabelled instances, i.e. in average.

Under consideration of a set of training data $A_{tr}$, it is possible to estimate the probabilities of occurrence for each class $p_i$, where $p_1 + p_2 + \cdots + p_k = 1$. These probabilities can be assigned to the corresponding classifiers.

$$p_i = \frac{|\{s_t \in A_{tr} \mid s_t \in \omega_i\}|}{|A_{tr}|} \qquad (1)$$

Based on these estimates, it is possible to select a permutation from the set of all permutations such that the termination condition is reached as early as possible for the majority of the unlabelled instances. According to [16, p. 399], this can be achieved when the one-class classifiers are sorted descending by the estimated probabilities.

$$p_{(1)} \geq p_{(2)} \geq \cdots \geq p_{(k)} \qquad (2)$$

Fig. 4. One-versus-all multi-class classifier chain

The main disadvantage is that the processing time was not taken into account. Therefore, we suggest to consider the probability of occurrence as a product with the processing time. Because of the consecutive chain of the one-class classifiers, the cumulated processing time $\bar{t}_{(i)}$ of a classifier is the sum of the processing times of the previously executed classifiers. This leads to the expected value $l$ of a permutation.

$$\bar{t}_{(i)} = \sum_{m=1}^{i} t_{(m)}, \qquad l = \sum_{m=1}^{k} p_{(m)} \cdot \bar{t}_{(m)} \qquad (3)$$

The permutation with the minimal expected value provides the minimal average processing time. The empirical calculation of the minimal expected value represents a feasible solution. However, this solution is very expensive due to the calculation of $k!$ many permutations. Therefore, we introduce the following theorem, which is based on [26] and [16, p. 404].

Theorem 1. Let $p_{(i)}$, $t_{(i)}$ and $\bar{t}_{(i)}$ be as defined above. The arrangement of the one-class classifiers in an OVA-based multi-class classification model is optimal if and only if

$$\frac{p_{(1)}}{t_{(1)}} \geq \frac{p_{(2)}}{t_{(2)}} \geq \cdots \geq \frac{p_{(k)}}{t_{(k)}}. \qquad (4)$$

In other words, the minimal expected value over all permutations provides the minimal average processing time if and only if (4) holds.

Proof. Suppose that $\frac{p_{(i)}}{t_{(i)}}$ and $\frac{p_{(i+1)}}{t_{(i+1)}}$ are interchanged. On that score, a permutation changes from

$$\cdots + p_{(i)} \cdot \bar{t}_{(i)} + p_{(i+1)} \cdot \bar{t}_{(i+1)} + \cdots \qquad (5)$$

to

$$\cdots + p_{(i+1)} \cdot (\bar{t}_{(i+1)} - t_{(i)}) + p_{(i)} \cdot \bar{t}_{(i+1)} + \cdots. \qquad (6)$$

This results in a change of the expected processing time by $p_{(i)} \cdot t_{(i+1)} - p_{(i+1)} \cdot t_{(i)}$, which is non-negative if (4) holds. Therefore, under the given assumptions it follows that the change from (5) to (6) increases the processing time. Consequently, the permutation (6) is not optimal, and (4) holds for any optimal permutation.

According to [16, p. 404], we showed that the permutation which provides the minimal expected value is locally optimal and that adjacent interchanges lead to no further improvements. Moreover, it is necessary to show that the permutation is globally optimal. As stated in [16, p. 404], we consider two proofs. The first proof uses computer science, and the second proof uses a mathematical trick. Finally, we consider three special cases.

First proof. Assume that (4) holds, and consider that the one-class classifiers are sorted as follows: $occ_{(1)}, occ_{(2)}, \ldots, occ_{(k)}$. Such an arrangement can be achieved by using a sequence of interchanges such that each interchange replaces $\ldots, occ_{(j)}, occ_{(i)}, \ldots$ by $\ldots, occ_{(i)}, occ_{(j)}, \ldots$ for some $i < j$. This decreases the overall processing time in average by the non-negative amount $p_{(i)} t_{(j)} - p_{(j)} t_{(i)}$. Thus, the permutation which provides the minimal expected value also provides the minimal average processing time.

Second proof. In accordance with [16, p. 404], replace each probability $p_{(i)}$ by

$$p_{(i)}(\varepsilon) = p_{(i)} + i\varepsilon - (1 + 2 + \cdots + k)\varepsilon/k, \qquad (7)$$

where $\varepsilon$ is a really small positive number. In the case that $\varepsilon$ is sufficiently small, we can exclude $x_1 p_{x_1}(\varepsilon) + \cdots + x_k p_{x_k}(\varepsilon) = y_1 p_{y_1}(\varepsilon) + \cdots + y_k p_{y_k}(\varepsilon)$ unless $x_1 = y_1, \ldots, x_k = y_k$ and $p_{x_1}(\varepsilon) = p_{y_1}(\varepsilon), \ldots, p_{x_k}(\varepsilon) = p_{y_k}(\varepsilon)$. Therefore, (4) will not hold if the processing times of all one-class classifiers are equal and the probabilities of occurrence of all one-class classifiers are equally distributed as well. This contrasts with the proof of [16, p. 404]. The proof of [16, p. 404] demands the inequality of only one parameter, $x_1 \neq y_1, \ldots, x_k \neq y_k$. On that score, the proof of [16, p. 404] neglects the interchanging of the probabilities of each permutation. This contradicts the fundamental assumption [16, p. 399] while (2) holds.

Under consideration of all $k!$
permutations of the component one-class classifiers, we know that there exists at least one permutation which satisfies (4) due to the exclusion of equality of both parameters (processing time and probability of occurrence). Hence, (4) uniquely determines the permutation with the minimal expected value for the probabilities $p_i(\varepsilon)$ if $\varepsilon$ is sufficiently small. By continuity, this also holds if $\varepsilon$ is set equal to zero. Following, we consider three special cases.

(a) Special case one. The first special case takes into account that the processing times of all component one-class classifiers are equal: $t_{(1)} = t_{(2)} = \cdots = t_{(k)}$. Therefore, the processing time can be reduced while (2) and (4) hold. Hence, the one-class classifiers are arranged in descending order by the probabilities. This refers to the basic assumption of [16, p. 399] while (2) holds.

(b) Special case two. The second special case takes into account that the probabilities of occurrence are equally distributed: $p_{(1)} = p_{(2)} = \cdots = p_{(k)}$. Therefore, the probabilities can be reduced and (4) holds. Thus, the one-class classifiers are arranged in ascending order by the processing times.

(c) Special case three. The third special case refers to both previous special cases, where the processing times are equal and the probabilities of occurrence are equally distributed as well. Hence, the expected values of all permutations are equal, and there exists no permutation with minimal expected value as claimed by the theorem. Hence, the theorem does not hold for equality of both parameters.

5 Classifier Pre-Selection

The application of an OVA-based multi-class classification model for data stream anomaly detection leads to a chain of one-class classifiers. The previous section discusses the minimization of the processing time for the majority of unlabelled instances. However, the worst case scenario, when all component one-class classifiers have to be processed, still remains. The chain of one-class classifiers provides a very flexible structure, and it can be used to take further advantage of it. The main idea of the suggested classifier pre-selection is the extraction of a set of candidate classifiers such that the processing time can be further decreased.
Therefore, we suggest to approximate each one-class classifier with an additional classifier. This leads to an ensemble of two classifiers for each class. Ensembles of classifiers are aimed at creating more accurate classification decisions by combining classifiers for a given classification problem at the expense of increased complexity [8, 11, 17]. However, we use an ensemble of two classifiers to reduce the processing time by classifier pre-selection.

As depicted in Figure 5, we suggest to use a minimal bounding hypersphere $h_i$ to represent this additional classifier. Each hypersphere bounds the corresponding one-class classifier minimally. Moreover, each hypersphere provides a centre $c_i$ and a radius $r_i$. In contrast to all other geometrical descriptions, a hypersphere provides a material advantage: the distance of an unlabelled instance to the centre of a hypersphere is independent from the position of the unlabelled instance relative to the hypersphere. Contrary to the one-class classifiers, the hyperspheres need not be disjoint. If an unlabelled instance $s_t$ needs to be classified, the distances to all of the centres of the hyperspheres $d_i(s_t, c_i)$ have to be calculated. A one-class classifier is a candidate if the distance of an unlabelled instance to the centre is below the radius of the bounding hypersphere, $d_i(s_t, c_i) \leq r_i$. The resulting set of candidate classifiers is still in an optimal arrangement, while the remaining classifiers can be excluded.

We assume that the calculation of the distances is less time consuming than the calculation of all component one-class classifiers. This assumption holds for use cases when the number of classes is relatively low and the processing time of the one-class classifiers is relatively high. However, when the number of one-class classifiers increases, the calculation of the distances increases as well.
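The pre-selection step can be sketched as follows. This is a minimal illustration assuming Euclidean distances; the toy sphere centres and radii are hypothetical, not taken from the case study:

```python
import math
from typing import List, Sequence, Tuple

# Each one-class classifier occ_i is approximated by a minimal bounding
# hypersphere h_i with centre c_i and radius r_i (notation as in the text).
Hypersphere = Tuple[Sequence[float], float]  # (centre c_i, radius r_i)

def euclidean(s_t: Sequence[float], c: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s_t, c)))

def preselect(spheres: List[Hypersphere], s_t: Sequence[float]) -> List[int]:
    """Return the indices of candidate classifiers, i.e. those whose
    bounding hypersphere contains s_t: d_i(s_t, c_i) <= r_i.
    The relative order (and thus an optimal arrangement) is preserved."""
    return [i for i, (c, r) in enumerate(spheres) if euclidean(s_t, c) <= r]

# Toy example with three (possibly overlapping) hyperspheres in 2-D.
spheres = [((0.0, 0.0), 1.0), ((2.0, 0.0), 1.5), ((5.0, 5.0), 0.5)]
print(preselect(spheres, (1.0, 0.0)))  # inside h1 and h2 -> [0, 1]
print(preselect(spheres, (9.0, 9.0)))  # outside all spheres -> []
```

Only the preselected candidates are then evaluated by the (more expensive) one-class classifiers; an empty candidate set immediately assigns the instance to the anomaly class.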
On that score, we suggest to use the triangle inequality to approximate the distance of an unlabelled instance to a class. This decreases the processing time additionally.

6 Experiments

At the beginning, this section describes the experimental setup. Further on, we present the minimization of the time consumption in average and classifier pre-selection. Finally, we compare our suggestions with the HS-Tree approach [29].

Fig. 5. Classifier pre-selection - Ensemble of two classifiers for each class

6.1 Experimental Setup

As depicted in Figure 6, our implementation provides an offline subcycle and an online subcycle which are based on the KDC [21, 22]. The offline subcycle is represented by a client, and the online subcycle is represented by a server. The client and the server are weakly coupled, whereas the communication takes place by means of an external network. The offline subcycle is used to train the classification model. The training instances are stored in a database (PostgreSQL). Preprocessing, clustering and model training can be realized by means of data mining tools such as Matlab or specific implementations (e.g. Java-based implementations). Thereafter, the trained models are stored into a database as well. The communication between the client and the server is implemented by means of a protocol. The client provides a Java-based implementation to retrieve the trained models from the database and to register these models onto the server by means of the protocol. Moreover, the client provides the functionality to generate a data stream. Therefore, unlabelled instances are retrieved from the database and sent to the server using the implemented protocol as well. We used an off-the-shelf computer system (Intel Core i3 with 2.26 GHz and 4 GB memory) for the client.

The server provides the counterpart of the protocol implementation. When the server is started, it awaits the handshake with a client. If the communication has been established, the server waits for the registration of a data stream, a scheduler and anomaly detection algorithms. The scheduler is used to manage the registered anomaly detection algorithms.
Finally, when all necessary parts have been registered, the server awaits the initialization of the data stream. The anomaly detection algorithms produce a result for each incoming unlabelled instance. These results are stored into result sets. These result sets are sent to the client by means of the protocol. The retrieved result sets are stored into the database as well. The server is implemented by means of Java, and we used the Raspberry Pi [23] as target machine, which provides a low budget ARM processor (700 MHz with 512 MB memory).

Fig. 6. Prototypical implementation (offline subcycle: client on an off-the-shelf computer system with model training, Matlab tools and PostgreSQL; online subcycle: server with anomaly detection on a Raspberry Pi; both connected via an external network and a protocol)

6.2 Average Time Consumption and Classifier Pre-Selection

According to the aforementioned example, we trained a one-class classifier per class. This includes two Gaussian one-class classifiers, two nearest neighbour (NN) one-class classifiers, a k-centres one-class classifier and two support vector domain description (SVDD) one-class classifiers. Table 1 summarizes the classifier labels and the processing time of each one-class classifier. The one-class classifiers were trained by means of Matlab and the DDtools toolbox [32]. The resulting one-class classifiers were also evaluated by means of a 10-fold cross validation. We determined the average processing times of the component one-class classifiers empirically (wall-clock time). As presented in Table 1, the processing times of the Gaussian one-class classifiers are almost the same. The processing times of the NN one-class classifiers and the SVDD one-class classifiers differ significantly. The processing time of a NN one-class classifier increases with
Theprocessingtimeof theSVDDone-classclassierincreaseswiththenumberofselectedsupportvectors.Table1.ClassierlabelsandprocessingtimesLabel occ1occ2occ3occ4occ5occ6occ7Classier Gaussian NN k-centres NN SVDD SVDD Gaussiant(i)inms 0.411 3.773 1.703 10.618 3.198 2.113 0.436We selected two permutations. Permutation 1represents the minimized av-erageprocessingtimewhilepermutation2israndomlyselected.Moreover,weused two data sets. The rst data set Atest1represents the aforementioned classeswithapercentagedistributionofthetraininginstanceswithoutanomalies.Theseconddataset Atest2extendstherstdatasetwithanomalies.AspresentedinTable 2, the empirical values are higher than the calculated values due to the ad-ministrative overhead. Furthermore, the values are higher when the second datasetwasused. Thereasonforthisisthattheworstcasescenariohasoccurredmoreoftendue tothe existenceofanomalies.Table 2shows thatthe processingtimeof permutation1isundertheprocessingtimeof permutation2. Thisunderscorestheaforementionedtheorem(Section4)empirically.Table2.AveragetimeconsumptionPermutation Arrangement linms Xtest1inms Xtest2inms17, 5, 1, 6, 3, 4, 23.58 6.14 7.8824, 5, 7, 6, 1, 3, 214.39 17.31 18.26Further on, we used the classier pre-selection to further decrease the process-ingtimeintheworstcasescenario.Therefore,wecalculatedthedistancesfromtheunlabelledinstancestothecentresof thehyperspheresdirectly. However,thetriangleinequalitycouldbeusedtodecreasetheprocessingtimeaddition-ally.Table3summarizestheresultsunderconsiderationoftheaforementionedpermutations 1and 2as well as the data sets Atest1and Atest2. As can be seen,the classier pre-selectiondecreases the processingtime ineachof the caseswhile the empirically calculated values are approaching each other. We assumedamajor dierencebetweenbothvalues of 1and2for theseconddatasetduetotheminimizationof theaverageprocessingtime. 
We expected that the classifier pre-selection, as an extension of the OVA-based multi-class classification model, decreases the processing time additionally. However, in the given data set the hyperspheres do not intersect. On that score, the arrangement of the one-class classifiers is not evident. We assume that the optimal arrangement becomes more effective if the hyperspheres intersect.

Table 3. Classifier pre-selection

Permutation  Arrangement          l in ms  Atest1 in ms  Atest2 in ms
1            7, 5, 1, 6, 3, 4, 2  3.58     6.09          6.02
2            4, 5, 7, 6, 1, 3, 2  14.39    6.37          6.22

6.3 Comparison with the HS-Tree Approach

A comparison between both approaches is very difficult and we are obliged to provide a fair comparison. Therefore, we initially summarize some differences between both approaches. Our approach is an offline training method and it takes advantage of the existence of external information systems. The classification model can be retrained during runtime while the online resources are not adversely affected. Therefore, the retrained model has to be transmitted from time to time to the online subcycle. The OVA-based multi-class classification model provides a very flexible structure while each class can be bounded by means of an optimized classifier. Hence, the time consumption depends on the selected one-class classifiers. Conversely, the HS-Tree approach is an online training method without any further transmission of the model. The HS-Tree approach intersects the state space by means of hyperplanes. Thus, classification results could be ambiguous if the classes are not paraxially separable.

However, the publication [29] contains two contrary statements. First [29, p. 1511], the HS-Tree "requires only normal data for training [...] The model features an ensemble of random HS-Trees, and the tree structure is constructed without any data." Second [29, p. 1513], "Once HS-Trees are constructed, mass profile of normal data must be recorded in the trees before they can be employed for anomaly detection." Therefore, the HS-Tree approach is initially an offline training method while historical data can be used for training.
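For reference, the two statements quoted above can be made concrete with a strongly simplified Python sketch of a single half-space tree: the tree structure is built without any data, then the mass profile of a reference window of normal data is recorded, and only then can instances be scored. This is our own illustration of the idea in [29], not the original implementation (which uses randomly perturbed work ranges, an ensemble of trees and windowed mass updates):

```python
import random

class HSNode:
    def __init__(self, mins, maxs, depth, max_depth):
        self.mass = 0                     # mass profile is recorded later
        self.left = self.right = None
        self.depth = depth
        if depth < max_depth:
            # Structure built without data: pick a random dimension
            # and split its work range at the midpoint.
            self.dim = random.randrange(len(mins))
            self.split = (mins[self.dim] + maxs[self.dim]) / 2.0
            lmax, rmin = list(maxs), list(mins)
            lmax[self.dim] = self.split
            rmin[self.dim] = self.split
            self.left = HSNode(mins, lmax, depth + 1, max_depth)
            self.right = HSNode(rmin, maxs, depth + 1, max_depth)

    def leaf_for(self, x):
        if self.left is None:
            return self
        child = self.left if x[self.dim] < self.split else self.right
        return child.leaf_for(x)

def record_mass(tree, window):
    """Record the mass profile of (normal) reference data."""
    for x in window:
        tree.leaf_for(x).mass += 1

def score(tree, x):
    """Low mass in the reached leaf indicates an anomaly; scaling by
    2**depth compensates for the finer partitions at deeper leaves."""
    leaf = tree.leaf_for(x)
    return leaf.mass * (2 ** leaf.depth)

random.seed(0)
tree = HSNode([0.0, 0.0], [1.0, 1.0], 0, max_depth=4)
normal = [(random.uniform(0.4, 0.6), random.uniform(0.4, 0.6))
          for _ in range(200)]
record_mass(tree, normal)
# A point inside the recorded mass profile scores higher than a
# point far away from it.
print(score(tree, (0.5, 0.5)), score(tree, (0.05, 0.95)))
```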
It would be very negligent in some application domains to apply such an untrained and unassessed anomaly detection model. On that score, we used the HS-Tree implementation as an offline training method while the model is online adaptive. This indicates another disadvantage of the HS-Tree. The HS-Tree is rather large and a lot of data needs to be transferred to the online subcycle, which expends a lot of system resources (e.g. data path and time). Table 4 summarizes the processing times of the HS-Tree. The ensemble uses three trees for training. As can be seen, the processing time is approximately one millisecond smaller than that of our approach.

Table 4. HS-Tree classification time

Atest1 in ms  Atest2 in ms
5.08          5.34

7 Conclusion

Time-efficiency is very important for data stream processing and especially for data stream anomaly detection under resource restrictions. We suggest an OVA-based multi-class classification model for data stream anomaly detection. Moreover, we introduced classifier pre-selection as an extension of the OVA-based multi-class classification. We performed a real world case study and the results are very promising. The results show that the average processing time can be reduced due to the selection of an optimized arrangement of the component one-class classifiers. The processing time can be further decreased by the use of classifier pre-selection. We compared our suggestions with the HS-Tree approach.

Acknowledgements

We wish to thank and acknowledge DLR, ESA and ASTRIUM Space Transportation for their insights and support, with special thanks to Enrico Noack. Moreover, we would like to thank Swee Chuan Tan for the provision of the original HS-Tree source code. This work was supported by the Brandenburg Ministry of Science, Research and Culture as part of the International Graduate School at Brandenburg University of Technology Cottbus.

References

1. Babu, S., Widom, J.: Continuous Queries over Data Streams. SIGMOD Record 30(3), 109-120 (September 2001)
2. Bellman, R.: Adaptive Control Processes. Princeton University Press (1961)
3. Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: Massive Online Analysis. Journal of Machine Learning Research (JMLR) 11, 1601-1604 (2010)
4.
Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: Data Stream Mining - A Practical Approach. Tech. rep., Centre for Open Software Innovation (COSI), Waikato University (2011), http://heanet.dl.sourceforge.net/project/moa-datastream/documentation/StreamMining.pdf
5. Brereton, R.G.: One-Class Classifiers. Journal of Chemometrics 25(5), 225-246 (2011)
6. Chandola, V., Banerjee, A., Kumar, V.: Anomaly Detection: A Survey. ACM Comput. Surv. 41, 15:1-15:58 (2009)
7. Cugola, G., Margara, A.: Processing Flows of Information: From Data Stream to Complex Event Processing. ACM Comput. Surv. 44(3), 15:1-15:62 (2012)
8. Dietterich, T.G.: Machine-Learning Research: Four Current Directions. AI Magazine 18, 97-136 (1997)
9. Domingos, P., Hulten, G.: Mining High-Speed Data Streams. In: Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 71-80. KDD '00, ACM (2000)
10. Efron, B.: Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. Journal of the American Statistical Association 78(382), 316-331 (1983)
11. Gama, J.: Knowledge Discovery from Data Streams. Chapman & Hall (2010)
12. Golab, L., Özsu, M.T.: Data Stream Management. Morgan & Claypool Publishers (2010)
13. Hashemi, S., Yang, Y., Mirzamomen, Z., Kangavari, M.: Adapted One-versus-All Decision Trees for Data Stream Classification. IEEE Transactions on Knowledge and Data Engineering 21(5), 624-637 (2009)
14. Hsu, C.W., Lin, C.J.: A Comparison of Methods for Multiclass Support Vector Machines. Neural Networks, IEEE Transactions on 13(2), 415-425 (2002)
15. Janssens, J.H.M., Flesch, I., Postma, E.O.: Outlier Detection with One-Class Classifiers from ML and KDD. In: Proceedings of the 2009 International Conference on Machine Learning and Applications. pp. 147-153. ICMLA '09, IEEE Computer Society (2009)
16. Knuth, D.E.: The Art of Computer Programming, Volume 3: Sorting and Searching. Addison Wesley Longman Publishing Co., Inc., 2nd edn. (1998)
17. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience (2004)
18. Li, C., Zhang, Y., Li, X.: OcVFDT: One-class Very Fast Decision Tree for One-class Classification of Data Streams. In: Proceedings of the Third International Workshop on Knowledge Discovery from Sensor Data. pp. 79-86. SensorKDD '09, ACM (2009)
19. Noack, E., Belau, W., Wohlgemuth, R., Müller, R., Palumberi, S., Parodi, P., Burzagli, F.: Efficiency of the Columbus Failure Management System. In: AIAA 40th International Conference on Environmental Systems (2010)
20. Noack, E., Noack, T., Patel, V., Schmitt, I., Richters, M., Stamminger, J., Sievi, S.: Failure Management for Cost-Effective and Efficient Spacecraft Operation. In: Proceedings of the 2011 NASA/ESA Conference on Adaptive Hardware and Systems (AHS). IEEE Computer Society (2011)
21. Noack, T., Schmitt, I.: Monitoring Mobile Cyber-Physical Systems by Means of a Knowledge Discovery Cycle - A Case Study. In: Workshop on Knowledge Discovery, Data Mining and Machine Learning (KDML) (2012)
22. Noack, T., Schmitt, I.: Monitoring Mobile Cyber-Physical Systems by Means of a Knowledge Discovery Cycle. In: Seventh IEEE International Conference on Research Challenges in Information Science (RCIS) (2013)
23. Raspberry Pi: Raspberry Pi (2013), http://www.raspberrypi.org/, online 2013-02-27
24. dos Santos Teixeira, P.H., Milidiú, R.L.: Data Stream Anomaly Detection through Principal Subspace Tracking. In: Proceedings of the 2010 ACM Symposium on Applied Computing. pp. 1609-1616. SAC '10, ACM (2010)
25. Seeger, M.: Learning with Labeled and Unlabeled Data. Tech. rep., University of Edinburgh (2000), http://lapmal.epfl.ch/papers/review.pdf
26. Smith, W.E.: Various Optimizers for Single-Stage Production. Naval Research Logistics Quarterly 3(1-2), 59-66 (1956)
27. Spinosa, E.J., de Leon F. de Carvalho, A.P., Gama, J.: Novelty Detection with Application to Data Streams. Intell. Data Anal. 13(3), 405-422 (2009)
28. Stonebraker, M., Çetintemel, U., Zdonik, S.: The 8 Requirements of Real-Time Stream Processing. SIGMOD Rec. 34(4), 42-47 (2005)
29. Tan, S.C., Ting, K.M., Liu, T.F.: Fast Anomaly Detection for Streaming Data. In: Proceedings of the Twenty-Second international joint conference on Artificial Intelligence - Volume Two. pp. 1511-1516. IJCAI '11, AAAI Press (2011)
30. Tax, D.M.J.: One-Class Classification: Concept-learning in the Absence of Counter-Examples. Ph.D. thesis, TU Delft (2001), http://homepage.tudelft.nl/n9d04/thesis.pdf
31. Tax, D.M.J., Duin, R.P.W.: Using Two-class Classifiers for Multiclass Classification. In: Pattern Recognition, 2002. Proceedings. 16th International Conference on. vol. 2, pp. 124-127 (2002)
32. Tax, D.: DDtools, the Data Description Toolbox for Matlab (May 2012), http://prlab.tudelft.nl/david-tax/ddtools.html, version 1.9.1
33. Tsai, J.J.P., Yang, S.J.H.: Monitoring and Debugging of Distributed Real-Time Systems. IEEE Computer Society Press (1995)