

Top 10 Algorithms in Data Mining

Xindong Wu, Vipin Kumar, J. Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J. McLachlan, Angus Ng, Bing Liu, Philip S. Yu, Zhi-Hua Zhou, Michael Steinbach, David J. Hand, Dan Steinberg

October 8, 2007

Abstract

This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART. These top 10 algorithms are among the most influential data mining algorithms in the research community. With each algorithm, we provide a description of the algorithm, discuss the impact of the algorithm, and review current and further research on the algorithm. These 10 algorithms cover classification, clustering, statistical learning, association analysis, and link mining, which are all among the most important topics in data mining research and development.

Introduction

In an effort to identify some of the most influential algorithms that have been widely used in the data mining community, the IEEE International Conference on Data Mining (ICDM, http://www.cs.uvm.edu/~icdm/) identified the top 10 algorithms in data mining for presentation at ICDM '06 in Hong Kong.

As the first step in the identification process, we invited the ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners in September 2006 to each nominate up to 10 best-known algorithms in data mining. All except one in this distinguished set of award winners responded to our invitation. We asked each nomination to come with the following information: (a) the algorithm name, (b) a brief justification, and (c) a representative publication reference. We also advised that each nominated algorithm should have been widely cited and used by other researchers in the field, and the nominations from each nominator as a group should have a reasonable representation of the different areas in data mining.

After the nominations in Step 1, we verified each nomination for its citations on Google Scholar in late October 2006, and removed those nominations that did not have at least 50 citations. All remaining (18) nominations were then organized in 10 topics: association analysis, classification, clustering, statistical learning, bagging and boosting, sequential patterns, integrated mining, rough sets, link mining, and graph mining. For some of these 18 algorithms such as k-means, the representative publication was not necessarily the original paper that introduced the algorithm, but a recent paper that highlights the importance of the technique. These representative publications are available at the ICDM website (http://www.cs.uvm.edu/~icdm/algorithms/CandidateList.shtml).

In the third step of the identification process, we had a wider involvement of the research community. We invited the Program Committee members of KDD-06 (the 2006 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining), ICDM '06 (the 2006 IEEE International Conference on Data Mining), and SDM '06 (the 2006 SIAM International Conference on Data Mining), as well as the ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners to each vote for up to 10 well-known algorithms from the 18-algorithm candidate list. The voting results of this step were presented at the ICDM '06 panel on Top 10 Algorithms in Data Mining.

At the ICDM '06 panel of December 21, 2006, we also took an open vote with all 145 attendees on the top 10 algorithms from the above 18-algorithm candidate list, and the top 10 algorithms from this open vote were the same as the voting results from the above third step. The 3-hour panel was organized as the last session of the ICDM '06 conference, in parallel with 7 paper presentation sessions of the Web Intelligence (WI '06) and Intelligent Agent Technology (IAT '06) conferences at the same location, and attracting 145 participants to this panel clearly showed that the panel was a great success.

1 C4.5 and Beyond

by J. R. Quinlan

1.1 Introduction

Systems that construct classifiers are one of the commonly-used tools in data mining. Such systems take as input a collection of cases, each belonging to one of a small number of classes and described by its values for a fixed set of attributes, and output a classifier that can accurately predict the class to which a new case belongs.

These notes describe C4.5 [50], a descendant of CLS [34] and ID3 [48]. Like CLS and ID3, C4.5 generates classifiers expressed as decision trees, but it can also construct classifiers in more comprehensible ruleset form. I will outline the algorithms employed in C4.5, highlight some changes in its successor See5/C5.0, and conclude with a couple of open research issues.

1.2 Decision Trees

Given a set $S$ of cases, C4.5 first grows an initial tree using the divide-and-conquer algorithm as follows:

- If all the cases in $S$ belong to the same class or $S$ is small, the tree is a leaf labeled with the most frequent class in $S$.
- Otherwise, choose a test based on a single attribute with two or more outcomes. Make this test the root of the tree with one branch for each outcome of the test, partition $S$ into corresponding subsets $S_1, S_2, \dots$ according to the outcome for each case, and apply the same procedure recursively to each subset.

There are usually many tests that could be chosen in this last step. C4.5 uses two heuristic criteria to rank possible tests: information gain, which minimizes the total entropy of the subsets $\{S_i\}$ (but is heavily biased towards tests with numerous outcomes), and the default gain ratio that divides information gain by the information provided by the test outcomes.
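To make the two criteria concrete, the following is a minimal sketch (not C4.5's own code) that computes entropy, information gain, and gain ratio for a candidate test, given the class labels of $S$ and of the subsets the test induces; all function names and the toy data are illustrative only.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_and_gain_ratio(labels, subsets):
    """Information gain and gain ratio of a test that splits `labels`
    into the given `subsets` (each a list of class labels)."""
    n = len(labels)
    remainder = sum(len(s) / n * entropy(s) for s in subsets)
    gain = entropy(labels) - remainder
    # Split information: entropy of the partition induced by the test outcomes.
    split_info = -sum(len(s) / n * log2(len(s) / n) for s in subsets if s)
    return gain, (gain / split_info if split_info > 0 else 0.0)

# Example: a binary test that splits 8 cases into two subsets.
labels = ['yes'] * 5 + ['no'] * 3
subsets = [['yes', 'yes', 'yes', 'no'], ['yes', 'yes', 'no', 'no']]
print(gain_and_gain_ratio(labels, subsets))
```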

Attributes can be either numeric or nominal and this determines the format of the test outcomes. For a numeric attribute $A$ they are $\{A \le h,\ A > h\}$ where the threshold $h$ is found by sorting $S$ on the values of $A$ and choosing the split between successive values that maximizes the criterion above.


An attribute $A$ with discrete values has by default one outcome for each value, but an option allows the values to be grouped into two or more subsets with one outcome for each subset.

The initial tree is then pruned to avoid overfitting. The pruning algorithm is based on a pessimistic estimate of the error rate associated with a set of $N$ cases, $E$ of which do not belong to the most frequent class. Instead of $E/N$, C4.5 determines the upper limit of the binomial probability when $E$ events have been observed in $N$ trials, using a user-specified confidence whose default value is 0.25.

Pruning is carried out from the leaves to the root. The estimated error at a leaf with $N$ cases and $E$ errors is $N$ times the pessimistic error rate as above. For a subtree, C4.5 adds the estimated errors of the branches and compares this to the estimated error if the subtree is replaced by a leaf; if the latter is no higher than the former, the subtree is pruned. Similarly, C4.5 checks the estimated error if the subtree is replaced by one of its branches and when this appears beneficial the tree is modified accordingly. The pruning process is completed in one pass through the tree.
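As a rough illustration of the pessimistic estimate, the sketch below finds an upper confidence limit on the binomial error probability for $E$ errors in $N$ cases by bisection on the binomial CDF. C4.5 itself uses its own closed-form approximation, so this is only meant to convey the idea; the default confidence of 0.25 follows the text, and the function names are illustrative.

```python
from math import comb

def binom_cdf(e, n, p):
    """P(X <= e) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(e + 1))

def pessimistic_error(e, n, cf=0.25):
    """Upper limit p such that observing <= e errors in n trials has
    probability cf under Binomial(n, p); found by bisection."""
    lo, hi = e / n, 1.0
    for _ in range(60):                     # bisection to high precision
        mid = (lo + hi) / 2
        if binom_cdf(e, n, mid) > cf:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# A leaf with 6 cases, 1 of which is misclassified:
print(pessimistic_error(1, 6))   # noticeably larger than 1/6
```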

C4.5's tree-construction algorithm differs in several respects from CART [7], for instance:

- Tests in CART are always binary, but C4.5 allows two or more outcomes.
- CART uses the Gini diversity index to rank tests, whereas C4.5 uses information-based criteria.
- CART prunes trees using a cost-complexity model whose parameters are estimated by cross-validation; C4.5 uses a single-pass algorithm derived from binomial confidence limits.
- This brief discussion has not mentioned what happens when some of a case's values are unknown. CART looks for surrogate tests that approximate the outcomes when the tested attribute has an unknown value, but C4.5 apportions the case probabilistically among the outcomes.

1.3 Ruleset Classifiers

Complex decision trees can be difficult to understand, for instance because information about one class is usually distributed throughout the tree. C4.5 introduced an alternative formalism consisting of a list of rules of the form "if A and B and C and ... then class X", where rules for each class are grouped together. A case is classified by finding the first rule whose conditions are satisfied by the case; if no rule is satisfied, the case is assigned to a default class.

C4.5 rulesets are formed from the initial (unpruned) decision tree. Each path from the root of the tree to a leaf becomes a prototype rule whose conditions are the outcomes along the path and whose class is the label of the leaf. This rule is then simplified by determining the effect of discarding each condition in turn. Dropping a condition may increase the number $N$ of cases covered by the rule, and also the number $E$ of cases that do not belong to the class nominated by the rule, and may lower the pessimistic error rate determined as above. A hill-climbing algorithm is used to drop conditions until the lowest pessimistic error rate is found.

To complete the process, a subset of simplified rules is selected for each class in turn. These class subsets are ordered to minimize the error on the training cases and a default class is chosen. The final ruleset usually has far fewer rules than the number of leaves on the pruned decision tree.

The principal disadvantage of C4.5's rulesets is the amount of CPU time and memory that they require. In one experiment, samples ranging from 10,000 to 100,000 cases were drawn from a large dataset. For decision trees, moving from 10K to 100K cases increased CPU time on a PC from 1.4 seconds to 61 seconds, a factor of 44. The time required for rulesets, however, increased from 32 seconds to 9,715 seconds, a factor of 300.


1.4 See5/C5.0

C4.5 was superseded in 1997 by a commercial system See5/C5.0 (or C5.0 for short). The changes encompass new capabilities as well as much-improved efficiency, and include:

- A variant of boosting [20], which constructs an ensemble of classifiers that are then voted to give a final classification. Boosting often leads to a dramatic improvement in predictive accuracy.
- New data types (e.g., dates), "not applicable" values, variable misclassification costs, and mechanisms to pre-filter attributes.
- Unordered rulesets: when a case is classified, all applicable rules are found and voted. This improves both the interpretability of rulesets and their predictive accuracy.
- Greatly improved scalability of both decision trees and (particularly) rulesets. Scalability is enhanced by multi-threading; C5.0 can take advantage of computers with multiple CPUs and/or cores.

More details are available from http://rulequest.com/see5-comparison.html.

1.5 Research Issues

I have frequently heard colleagues express the view that decision trees are a "solved problem." I do not agree with this proposition and will close with a couple of open research problems.

Stable trees: It is well known that the error rate of a tree on the cases from which it was constructed (the resubstitution error rate) is much lower than the error rate on unseen cases (the predictive error rate). For example, on a well-known letter recognition dataset with 20,000 cases, the resubstitution error rate for C4.5 is 4%, but the error rate from a leave-one-out (20,000-fold) cross-validation is 11.7%. As this demonstrates, leaving out a single case from 20,000 often affects the tree that is constructed!

Suppose now that we could develop a non-trivial tree-construction algorithm that was hardly ever affected by omitting a single case. For such stable trees, the resubstitution error rate should approximate the leave-one-out cross-validated error rate, suggesting that the tree is of the "right" size.

Decomposing complex trees: Ensemble classifiers, whether generated by boosting, bagging, weight randomization, or other techniques, usually offer improved predictive accuracy. Now, given a small number of decision trees, it is possible to generate a single (very complex) tree that is exactly equivalent to voting the original trees, but can we go the other way? That is, can a complex tree be broken down to a small collection of simple trees that, when voted together, give the same result as the complex tree? Such decomposition would be of great help in producing comprehensible decision trees.

C4.5 Acknowledgments

Research on C4.5 was funded for many years by the Australian Research Council. C4.5 is freely available for research and teaching, and source can be downloaded from http://rulequest.com/Personal/c4.5r8.tar.gz.

2 The k-Means Algorithm

by Joydeep Ghosh

[email protected]

4

Page 5: Top 10 Algorithms in Data Mining

2.1 The Algorithm

The k-means algorithm is a simple iterative method to partition a given dataset into a user-specified number of clusters, $k$. This algorithm has been discovered by several researchers across different disciplines, most notably Lloyd (1957, 1982) [40], Forgey (1965), Friedman and Rubin (1967), and McQueen (1967). A detailed history of k-means along with descriptions of several variations are given in Jain and Dubes [36]. Gray and Neuhoff [28] provide a nice historical background for k-means placed in the larger context of hill-climbing algorithms.

The algorithm operates on a set of $d$-dimensional vectors, $D = \{x_i \mid i = 1, \dots, N\}$, where $x_i \in \mathbb{R}^d$ denotes the $i$th data point. The algorithm is initialized by picking $k$ points in $\mathbb{R}^d$ as the initial $k$ cluster representatives or "centroids". Techniques for selecting these initial seeds include sampling at random from the dataset, setting them as the solution of clustering a small subset of the data or perturbing the global mean of the data $k$ times. Then the algorithm iterates between two steps till convergence:

Step 1: Data Assignment. Each data point is assigned to its closest centroid, with ties broken arbitrarily. This results in a partitioning of the data.

Step 2: Relocation of "means". Each cluster representative is relocated to the center (mean) of all data points assigned to it. If the data points come with a probability measure (weights), then the relocation is to the expectations (weighted mean) of the data partitions.

The algorithm converges when the assignments (and hence the $c_j$ values) no longer change. The algorithm execution is visually depicted in Fig. 1. Note that each iteration needs $N \times k$ comparisons, which determines the time complexity of one iteration. The number of iterations required for convergence varies and may depend on $N$, but as a first cut, this algorithm can be considered linear in the dataset size.

One issue to resolve is how to quantify "closest" in the assignment step. The default measure of closeness is the Euclidean distance, in which case one can readily show that the non-negative cost function

$\sum_{i=1}^{N} \left( \arg\min_j \| x_i - c_j \|_2^2 \right) \qquad (1)$

will decrease whenever there is a change in the assignment or the relocation steps, and hence convergence is guaranteed in a finite number of iterations. The greedy-descent nature of k-means on a non-convex cost also implies that the convergence is only to a local optimum, and indeed the algorithm is typically quite sensitive to the initial centroid locations. Fig. 2 illustrates how a poorer result is obtained for the same dataset as in Fig. 1 for a different choice of the three initial centroids. The local minima problem can be countered to some extent by running the algorithm multiple times with different initial centroids, or by doing limited local search about the converged solution.
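The two steps above translate almost directly into code. Below is a minimal NumPy sketch of k-means under the default Euclidean distance, with random initialization from the data; it is a bare illustration of the assignment and relocation steps, not an optimized implementation, and the toy data are made up.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means: X is an (N, d) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random seeds
    for _ in range(n_iter):
        # Step 1: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: relocate each centroid to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # assignments stable
            break
        centroids = new_centroids
    return centroids, labels

# Example: three Gaussian blobs in the plane.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in ([0, 0], [3, 3], [0, 3])])
centroids, labels = kmeans(X, k=3)
print(centroids)
```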

2.2 Limitations

In addition to being sensitive to initialization, the k-means algorithm suffers from several other problems. First, observe that k-means is a limiting case of fitting data by a mixture of $k$ Gaussians with identical, isotropic covariance matrices ($\Sigma = \sigma^2 I$), when the soft assignments of data points to mixture components are hardened to allocate each data point solely to the most likely component. So, it will falter whenever the data is not well described by reasonably separated spherical balls, for example, if there are non-convex shaped clusters in the data. This problem may be alleviated by rescaling the data to "whiten" it before clustering, or by using a different distance measure that is more appropriate for the dataset. For example, information-theoretic clustering uses the KL-divergence to measure the distance between two data points representing two discrete probability distributions. It has been recently shown that if one measures distance by selecting any member of a very large class of divergences called Bregman divergences during the assignment step and makes no other changes, the essential properties of k-means, including guaranteed convergence, linear separation boundaries and scalability, are retained [2]. This result makes k-means effective for a much larger class of datasets so long as an appropriate divergence is used.

Figure 1: Changes in cluster representative locations (indicated by '+' signs) and data assignments (indicated by color) during an execution of the k-means algorithm. (Figures 1 and 2 are taken from the slides for the book Introduction to Data Mining, Tan, Kumar, Steinbach, 2006.)

k-means can be paired with another algorithm to describe non-convex clusters. One first clusters the data into a large number of groups using k-means. These groups are then agglomerated into larger clusters using single link hierarchical clustering, which can detect complex shapes. This approach also makes the solution less sensitive to initialization, and since the hierarchical method provides results at multiple resolutions, one does not need to pre-specify $k$ either.

Second, the cost of the optimal solution decreases with increasing $k$ till it hits zero when the number of clusters equals the number of distinct data-points. This makes it more difficult to (a) directly compare solutions with different numbers of clusters and (b) to find the optimum value of $k$. If the desired $k$ is not known in advance, one will typically run k-means with different values of $k$, and then use a suitable criterion to select one of the results. For example, SAS uses the cube-clustering-criterion, while X-means adds a complexity term (which increases with $k$) to the original cost function (Eq. 1) and then identifies the $k$ which minimizes this adjusted cost. Alternatively, one can progressively increase the number of clusters, in conjunction with a suitable stopping criterion. Bisecting k-means [59] achieves this by first putting all the data into a single cluster, and then recursively splitting the least compact cluster into two using 2-means. The celebrated LBG algorithm [28] used for vector quantization doubles the number of clusters till a suitable code-book size is obtained. Both these approaches thus alleviate the need to know $k$ beforehand.

The algorithm is also sensitive to the presence of outliers, since "mean" is not a robust statistic. A preprocessing step to remove outliers can be helpful. Post-processing the results, for example to eliminate small clusters, or to merge close clusters into a large cluster, is also desirable. Ball and Hall's ISODATA algorithm from 1967 effectively used both pre- and post-processing on k-means.

Figure 2: Effect of an inferior initialization on the k-means results.

2.3 Generalizations and Connections

As mentioned earlier, k-means is closely related to fitting a mixture of $k$ isotropic Gaussians to the data. Moreover, the generalization of the distance measure to all Bregman divergences is related to fitting the data with a mixture of $k$ components from the exponential family of distributions. Another broad generalization is to view the "means" as probabilistic models instead of points in $\mathbb{R}^d$. Here, in the assignment step, each data point is assigned to the most likely model to have generated it. In the "relocation" step, the model parameters are updated to best fit the assigned datasets. Such model-based k-means allow one to cater to more complex data, e.g. sequences described by Hidden Markov models.

One can also "kernelize" k-means [15]. Though boundaries between clusters are still linear in the implicit high-dimensional space, they can become non-linear when projected back to the original space, thus allowing kernel k-means to deal with more complex clusters. Dhillon et al [15] have shown a close connection between kernel k-means and spectral clustering. The K-medoid algorithm is similar to k-means except that the centroids have to belong to the data set being clustered. Fuzzy c-means is also similar, except that it computes fuzzy membership functions for each cluster rather than a hard one.

Despite its drawbacks, k-means remains the most widely used partitional clustering algorithm in practice. The algorithm is simple, easily understandable and reasonably scalable, and can be easily modified to deal with streaming data. To deal with very large datasets, substantial effort has also gone into further speeding up k-means, most notably by using kd-trees or exploiting the triangular inequality to avoid comparing each data point with all the centroids during the assignment step. Continual improvements and generalizations of the basic algorithm have ensured its continued relevance and gradually increased its effectiveness as well.

3 Support Vector Machines

by Qiang Yang

In today's machine learning applications, support vector machines (SVM) [66] are considered a must try: SVM offers one of the most robust and accurate methods among all well-known algorithms. It has a sound theoretical foundation, requires only a dozen examples for training, and is insensitive to the number of dimensions. In addition, efficient methods for training SVM are also being developed at a fast pace.

In a two-class learning task, the aim of SVM is to find the best classification function to distinguish between members of the two classes in the training data. The metric for the concept of the "best" classification function can be realized geometrically. For a linearly separable dataset, a linear classification function corresponds to a separating hyperplane $f(x)$ that passes through the middle of the two classes, separating the two. Once this function is determined, a new data instance $x_n$ can be classified by simply testing the sign of the function $f(x_n)$; $x_n$ belongs to the positive class if $f(x_n) > 0$.

Because there are many such linear hyperplanes, what SVM additionally guarantees is that the best such function is found by maximizing the margin between the two classes. Intuitively, the margin is defined as the amount of space, or separation, between the two classes as defined by the hyperplane. Geometrically, the margin corresponds to the shortest distance between the closest data points to a point on the hyperplane. Having this geometric definition allows us to explore how to maximize the margin, so that even though there are an infinite number of hyperplanes, only a few qualify as the solution to SVM.

The reason why SVM insists on finding the maximum margin hyperplanes is that it offers the best generalization ability. It allows not only the best classification performance (e.g., accuracy) on the training data, but also leaves much room for the correct classification of the future data. To ensure that the maximum margin hyperplanes are actually found, an SVM classifier attempts to maximize the following function with respect to $\vec{w}$ and $b$:

$L_P = \frac{1}{2} \|\vec{w}\| - \sum_{i=1}^{t} \alpha_i y_i (\vec{w} \cdot \vec{x}_i + b) + \sum_{i=1}^{t} \alpha_i \qquad (2)$

where $t$ is the number of training examples, and $\alpha_i$, $i = 1, \dots, t$, are non-negative numbers such that the derivatives of $L_P$ with respect to $\alpha_i$ are zero. The $\alpha_i$ are the Lagrange multipliers and $L_P$ is called the Lagrangian. In this equation, the vectors $\vec{w}$ and constant $b$ define the hyperplane.

There are several important questions and related extensions on the above basic formulation of support vector machines. We list these questions and extensions below.

1. Can we understand the meaning of the SVM through a solid theoretical foundation?

2. Can we extend the SVM formulation to handle cases where we allow errors to exist, when even the best hyperplane must admit some errors on the training data?

3. Can we extend the SVM formulation so that it works in situations where the training data are not linearly separable?

[email protected]

8

Page 9: Top 10 Algorithms in Data Mining

4. Can we extend the SVM formulation so that the task is to predict numerical values or to rank the instances in the likelihood of being a positive class member, rather than classification?

5. Can we scale up the algorithm for finding the maximum margin hyperplanes to thousands and millions of instances?

Question 1: Can we understand the meaning of the SVM through a solid theoretical foundation?

Several important theoretical results exist to answer this question.

A learning machine, such as the SVM, can be modeled as a function class based on some parameters $\alpha$. Different function classes can have different capacity in learning, which is represented by a parameter $h$ known as the VC dimension [66]. The VC dimension measures the maximum number of training examples where the function class can still be used to learn perfectly, by obtaining zero error rates on the training data, for any assignment of class labels on these points. It can be proven that the actual error on the future data is bounded by a sum of two terms. The first term is the training error, and the second term is proportional to the square root of the VC dimension $h$. Thus, if we can minimize $h$, we can minimize the future error, as long as we also minimize the training error. In fact, the above maximum margin function learned by SVM learning algorithms is one such function. Thus, theoretically, the SVM algorithm is well founded.

Question 2: Can we extend the SVM formulation to handle cases where we allow errors to exist, when even the best hyperplane must admit some errors on the training data?

To answer this question, imagine that there are a few points of the opposite classes that cross the middle. These points represent the training error that exists even for the maximum margin hyperplanes. The "soft margin" idea is aimed at extending the SVM algorithm [66] so that the hyperplane allows a few of such noisy data to exist. In particular, introduce a slack variable $\xi_i$ to account for the amount of a violation of classification by the function $f(x_i)$; $\xi_i$ has a direct geometric explanation through the distance from a mistakenly classified data instance to the hyperplane $f(x)$. Then, the total cost introduced by the slack variables can be used to revise the original objective minimization function.

Question3: Canwe extendtheSVM formulationso that it works in situationswherethe trainingdataarenot linearlyseparable?

The answerto this questiondependson an observation on the objective function wherethe onlyappearancesof ^�! is in the form of a dot product. Thus,if we extendthedot product ^�! k ^�l8 througha functional mapping s @ ^�! O of each ^�! to a different spacet of larger and even possibly infinitedimensions,thentheequationsstill hold. In eachequation,wherewe hadthedot product ^�! k ^�l8 , wenow have thedotproductof thetransformedvectorss @ ^�! O k s @ ^�l8 O , which is calledakernelfunction.

The kernel function can be used to define a variety of nonlinear relationships between its inputs. For example, besides linear kernel functions, you can define quadratic or exponential kernel functions. Much study in recent years has gone into the study of different kernels for SVM classification [56] and for many other statistical tests. We can also extend the above descriptions of the SVM classifiers from binary classifiers to problems that involve more than two classes. This can be done by repeatedly using one of the classes as a positive class, and the rest as the negative classes (thus, this method is known as the one-against-all method).
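As a small illustration of the kernel idea (not a full SVM trainer), the sketch below evaluates a kernelized decision function $f(x) = \sum_i \alpha_i y_i K(x_i, x) + b$ with an RBF kernel, assuming the multipliers $\alpha_i$ and bias $b$ have already been obtained from some training procedure; the data, multipliers and names are illustrative only.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two vectors."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def svm_decision(x, support_vectors, labels, alphas, bias, gamma=1.0):
    """Kernelized decision value f(x); the sign gives the predicted class."""
    value = sum(a * y * rbf_kernel(sv, x, gamma)
                for sv, y, a in zip(support_vectors, labels, alphas))
    return value + bias

# Toy example with hand-picked (not trained) support vectors and multipliers.
support_vectors = np.array([[1.0, 1.0], [-1.0, -1.0]])
labels = np.array([+1, -1])
alphas = np.array([0.5, 0.5])       # assumed to come from training
bias = 0.0
x_new = np.array([0.8, 0.9])
print(np.sign(svm_decision(x_new, support_vectors, labels, alphas, bias)))
```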


Question 4: Can we extend the SVM formulation so that the task is to learn to approximate data using a linear function, or to rank the instances in the likelihood of being a positive class member, rather than classification?

SVM can be easily extended to perform numerical calculations. Here we discuss two such extensions. The first is to extend SVM to perform regression analysis, where the goal is to produce a linear function that can approximate the target function. Careful consideration goes into the choice of the error models; in support vector regression, or SVR, the error is defined to be zero when the difference between actual and predicted values is within an epsilon amount. Otherwise, the epsilon-insensitive error will grow linearly. The support vectors can then be learned through the minimization of the Lagrangian. An advantage of support vector regression is reported to be its insensitivity to outliers.

Another extension is to learn to rank elements rather than producing a classification for individual elements [33]. Ranking can be reduced to comparing pairs of instances and producing a $+1$ estimate if the pair is in the correct ranking order, and $-1$ otherwise. Thus, a way to reduce this task to SVM learning is to construct new instances for each pair of ranked instances in the training data, and to learn a hyperplane on this new training data, as sketched below.

This method can be applied to many areas where ranking is important, such as document ranking in information retrieval.
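The following is a minimal sketch of that reduction: for every pair of training items with different relevance, it creates a difference vector labeled $+1$ if the pair is in the correct order and $-1$ otherwise; the resulting set can then be fed to any binary linear SVM trainer. The data and function names are illustrative.

```python
import numpy as np

def make_pairwise_instances(X, relevance):
    """Build (x_i - x_j, +1) and (x_j - x_i, -1) instances from ranked data."""
    features, targets = [], []
    for i in range(len(X)):
        for j in range(len(X)):
            if relevance[i] > relevance[j]:        # i should be ranked above j
                features.append(X[i] - X[j]); targets.append(+1)
                features.append(X[j] - X[i]); targets.append(-1)
    return np.array(features), np.array(targets)

# Toy example: three documents with relevance grades 2 > 1 > 0.
X = np.array([[0.9, 0.1], [0.5, 0.4], [0.2, 0.8]])
relevance = [2, 1, 0]
pairs, labels = make_pairwise_instances(X, relevance)
print(pairs.shape, labels)
```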

Question 5: Can we scale up the algorithm for finding the maximum margin hyperplanes to thousands and millions of instances?

One of the initial drawbacks of SVM is its computational inefficiency. However, this problem is being solved with great success. One approach is to break a large optimization problem into a series of smaller problems, where each problem only involves a couple of carefully chosen variables so that the optimization can be done efficiently. The process iterates until all the decomposed optimization problems are solved successfully. A more recent approach is to consider the problem of learning an SVM as that of finding an approximate minimum enclosing ball of a set of instances.

These instances, when mapped to an N-dimensional space, represent a core set that can be used to construct an approximation to the minimum enclosing ball. Solving the SVM learning problem on these core sets can produce a good approximation solution at very high speed. For example, the core-vector machine [64] thus produced can learn an SVM for millions of data in seconds.

4 The Apriori Algorithm

by Hiroshi Motoda

4.1 Description of the Algorithm

One of the most popular data mining approaches is to find frequent itemsets from a transaction dataset and derive association rules. Finding frequent itemsets (itemsets with frequency larger than or equal to a user-specified minimum support) is not trivial because of its combinatorial explosion. Once frequent itemsets are obtained, it is straightforward to generate association rules with confidence larger than or equal to a user-specified minimum confidence.

Apriori is a seminal algorithm for finding frequent itemsets using candidate generation [1]. It is characterized as a level-wise complete search algorithm using anti-monotonicity of itemsets: "if an itemset is not frequent, any of its supersets is never frequent". By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. Let the set of frequent itemsets of size $k$ be $F_k$ and their candidates be $C_k$. Apriori first scans the database and searches for frequent itemsets of size 1 by accumulating the count for each item and collecting those that satisfy the minimum support requirement. It then iterates on the following three steps and extracts all the frequent itemsets.

1. Generate $C_{k+1}$, candidates of frequent itemsets of size $k+1$, from the frequent itemsets of size $k$.

2. Scan the database and calculate the support of each candidate of frequent itemsets.

3. Add those itemsets that satisfy the minimum support requirement to $F_{k+1}$.

The Apriori algorithm is shown in Figure 3. Function apriori-gen in line 3 generates $C_{k+1}$ from $F_k$ in the following two-step process.

1. Join step: Generate $R_{k+1}$, the initial candidates of frequent itemsets of size $k+1$, by taking the union of the two frequent itemsets of size $k$, $P_k$ and $Q_k$, that have the first $k-1$ elements in common.

$R_{k+1} = P_k \cup Q_k = \{\mathit{item}_1, \dots, \mathit{item}_{k-1}, \mathit{item}_k, \mathit{item}_{k'}\}$
$P_k = \{\mathit{item}_1, \mathit{item}_2, \dots, \mathit{item}_{k-1}, \mathit{item}_k\}$
$Q_k = \{\mathit{item}_1, \mathit{item}_2, \dots, \mathit{item}_{k-1}, \mathit{item}_{k'}\}$

where $\mathit{item}_1 < \mathit{item}_2 < \dots < \mathit{item}_k < \mathit{item}_{k'}$.

2. Prune step: Check if all the itemsets of size $k$ in $R_{k+1}$ are frequent and generate $C_{k+1}$ by removing those that do not pass this requirement from $R_{k+1}$. This is because any subset of size $k$ of $C_{k+1}$ that is not frequent cannot be a subset of a frequent itemset of size $k+1$.

Function subset in line 5 finds all the candidates of the frequent itemsets included in transaction $t$. Apriori, then, calculates frequency only for those candidates generated this way by scanning the database.

It is evident that Apriori scans the database at most $l_{\max} + 1$ times when the maximum size of frequent itemsets is set at $l_{\max}$.

The Apriori achieves good performance by reducing the size of candidate sets. However, in situations with very many frequent itemsets, large itemsets, or very low minimum support, it still suffers from the cost of generating a huge number of candidate sets and scanning the database repeatedly to check a large set of candidate itemsets. In fact, it is necessary to generate $2^{100}$ candidate itemsets to obtain frequent itemsets of size 100.


Algorithm 1 Apriori

F_1 = {frequent itemsets of cardinality 1};
for (k = 1; F_k != empty; k++) do begin
    C_{k+1} = apriori-gen(F_k);            // New candidates
    for all transactions t in Database do begin
        C_t = subset(C_{k+1}, t);          // Candidates contained in t
        for all candidates c in C_t do
            c.count++;
    end
    F_{k+1} = {c in C_{k+1} | c.count >= minimum support}
end
Answer = union over k of F_k;

Figure 3: Apriori Algorithm
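For concreteness, here is a compact, unoptimized Python sketch of the same level-wise scheme as Figure 3 (candidate generation via join and prune, then support counting over the transactions); it is only an illustration, not the reference implementation, and the example database is made up.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Frequent itemsets of size 1.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Join step: union pairs of frequent k-itemsets into (k+1)-candidates.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune step: keep only candidates whose size-k subsets are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # Count support by scanning the database.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent

# Example with a minimum support count of 2.
db = [['a', 'b', 'c'], ['a', 'b'], ['a', 'c'], ['b', 'c', 'd']]
print(apriori(db, min_support=2))
```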

4.2 The Impact of the Algorithm

Many of the pattern finding algorithms such as decision tree, classification rules and clustering techniques that are frequently used in data mining have been developed in the machine learning research community. Frequent pattern and association rule mining is one of the few exceptions to this tradition. The introduction of this technique boosted data mining research and its impact is tremendous. The algorithm is quite simple and easy to implement. Experimenting with Apriori-like algorithms is the first thing that data miners try to do.

4.3 Current and Further Research

Since the Apriori algorithm was first introduced and as experience has accumulated, there have been many attempts to devise more efficient algorithms of frequent itemset mining. Many of them share the same idea with Apriori in that they generate candidates. These include hash-based technique, partitioning, sampling and using vertical data format. Hash-based technique can reduce the size of candidate itemsets. Each itemset is hashed into a corresponding bucket by using an appropriate hash function. Since a bucket can contain different itemsets, if its count is less than a minimum support, these itemsets in the bucket can be removed from the candidate sets. A partitioning can be used to divide the entire mining problem into $n$ smaller problems. The dataset is divided into $n$ non-overlapping partitions such that each partition fits into main memory and each partition is mined separately. Since any itemset that is potentially frequent with respect to the entire dataset must occur as a frequent itemset in at least one of the partitions, all the frequent itemsets found this way are candidates, which can be checked by accessing the entire dataset only once. Sampling is simply to mine a random sampled small subset of the entire data. Since there is no guarantee that we can find all the frequent itemsets, normal practice is to use a lower support threshold. A tradeoff has to be made between accuracy and efficiency. Apriori uses a horizontal data format, i.e. frequent itemsets are associated with each transaction. Using vertical data format is to use a different format in which transaction IDs (TIDs) are associated with each itemset. With this format, mining can be performed by taking the intersection of TIDs. The support count is simply the length of the TID set for the itemset. There is no need to scan the database because the TID set carries the complete information required for computing support.
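As a tiny illustration of the vertical-format idea, the snippet below stores a TID set per item and computes the support of an itemset by intersecting those sets; the items, TIDs and function name are purely illustrative.

```python
# Vertical data format: each item maps to the set of transaction IDs containing it.
tid_sets = {
    'a': {1, 2, 3},
    'b': {1, 2, 4},
    'c': {1, 3, 4},
}

def support(itemset, tid_sets):
    """Support count of an itemset = size of the intersection of its TID sets."""
    tids = set.intersection(*(tid_sets[item] for item in itemset))
    return len(tids)

print(support({'a', 'b'}, tid_sets))   # -> 2 (transactions 1 and 2)
```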

The most outstanding improvement over Apriori would be a method called FP-growth (frequent pattern growth) that succeeded in eliminating candidate generation [30]. It adopts a divide and conquer strategy by 1) compressing the database representing frequent items into a structure called FP-tree (frequent pattern tree) that retains all the essential information and 2) dividing the compressed database into a set of conditional databases, each associated with one frequent itemset and mining each one separately. It scans the database only twice. In the first scan, all the frequent items and their support counts (frequencies) are derived and they are sorted in the order of descending support count in each transaction. In the second scan, items in each transaction are merged into a prefix tree and items (nodes) that appear in common in different transactions are counted. Each node is associated with an item and its count. Nodes with the same label are linked by a pointer called node-link. Since items are sorted in the descending order of frequency, nodes closer to the root of the prefix tree are shared by more transactions, thus resulting in a very compact representation that stores all the necessary information. Pattern growth algorithm works on FP-tree by choosing an item in the order of increasing frequency and extracting frequent itemsets that contain the chosen item by recursively calling itself on the conditional FP-tree. FP-growth is an order of magnitude faster than the original Apriori algorithm.

There are several other dimensions regarding the extensions of frequent pattern mining. The major ones include the following. 1) Incorporating taxonomy in items [58]: Use of taxonomy makes it possible to extract frequent itemsets that are expressed by higher concepts even when use of the base level concepts produces only infrequent itemsets. 2) Incremental mining: In this setting, it is assumed that the database is not stationary and new transaction instances keep being added. The algorithm in [9] updates the frequent itemsets without restarting from scratch. 3) Using numeric values for items: When the item corresponds to a continuous numeric value, current frequent itemset mining algorithms are not applicable unless the values are discretized. A method of subspace clustering can be used to obtain an optimal value interval for each item in each itemset [68]. 4) Using other measures than frequency, such as information gain or $\chi^2$ value: These measures are useful in finding discriminative patterns but unfortunately do not satisfy the anti-monotonicity property. However, these measures have a nice property of being convex with respect to their arguments and it is possible to estimate their upper bound for supersets of a pattern and thus prune unpromising patterns efficiently. AprioriSMP uses this principle [45]. 5) Using richer expressions than itemsets: Many algorithms have been proposed for sequences, trees and graphs to enable mining from more complex data structures [73, 35]. 6) Closed itemsets: A frequent itemset is closed if it is not included in any other frequent itemsets. Thus, once the closed itemsets are found, all the frequent itemsets can be derived from them. LCM is the most efficient algorithm to find the closed itemsets [65].

5 The EM Algorithm

by Geoffrey J. McLachlan and Angus Ng

Finite mixture distributions provide a flexible and mathematical-based approach to the modeling and clustering of data observed on random phenomena. We focus here on the use of normal mixture models, which can be used to cluster continuous data and to estimate the underlying density function. These mixture models can be fitted by maximum likelihood via the EM (Expectation-Maximization) algorithm.

[email protected]

13

Page 14: Top 10 Algorithms in Data Mining

5.1 Introduction

Finite mixture models are being increasingly used to model the distributions of a wide variety of random phenomena and to cluster data sets [43]. Here we consider their application in the context of cluster analysis.

We let the $p$-dimensional vector $y = (y_1, \dots, y_p)^T$ contain the values of $p$ variables measured on each of $n$ (independent) entities to be clustered, and we let $y_j$ denote the value of $y$ corresponding to the $j$th entity ($j = 1, \dots, n$). With the mixture approach to clustering, $y_1, \dots, y_n$ are assumed to be an observed random sample from a mixture of a finite number, say $g$, of groups in some unknown proportions $\pi_1, \dots, \pi_g$.

The mixture density of $y_j$ is expressed as

$f(y_j; \Psi) = \sum_{i=1}^{g} \pi_i f_i(y_j; \theta_i) \qquad (j = 1, \dots, n) \qquad (3)$

where the mixing proportions $\pi_1, \dots, \pi_g$ sum to one and the group-conditional density $f_i(y_j; \theta_i)$ is specified up to a vector $\theta_i$ of unknown parameters ($i = 1, \dots, g$). The vector of all the unknown parameters is given by

$\Psi = (\pi_1, \dots, \pi_{g-1}, \theta_1^T, \dots, \theta_g^T)^T$

where the superscript $T$ denotes vector transpose. Using an estimate of $\Psi$, this approach gives a probabilistic clustering of the data into $g$ clusters in terms of estimates of the posterior probabilities of component membership,

$\tau_i(y_j; \Psi) = \frac{\pi_i f_i(y_j; \theta_i)}{f(y_j; \Psi)}, \qquad (4)$

where $\tau_i(y_j)$ is the posterior probability that $y_j$ (really the entity with observation $y_j$) belongs to the $i$th component of the mixture ($i = 1, \dots, g$; $j = 1, \dots, n$).

The parameter vector $\Psi$ can be estimated by maximum likelihood. The maximum likelihood estimate (MLE) of $\Psi$, $\hat{\Psi}$, is given by an appropriate root of the likelihood equation,

$\partial \log L(\Psi) / \partial \Psi = 0, \qquad (5)$

where

$\log L(\Psi) = \sum_{j=1}^{n} \log f(y_j; \Psi) \qquad (6)$

is the log likelihood function for $\Psi$. Solutions of (6) corresponding to local maximizers can be obtained via the expectation-maximization (EM) algorithm [13].

For the modeling of continuous data, the component-conditional densities are usually taken to belong to the same parametric family, for example, the normal. In this case,

$f_i(y_j; \theta_i) = \phi(y_j; \mu_i, \Sigma_i), \qquad (7)$


where $\phi(y_j; \mu, \Sigma)$ denotes the $p$-dimensional multivariate normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$.

One attractive feature of adopting mixture models with elliptically symmetric components such as the normal or $t$ densities is that the implied clustering is invariant under affine transformations of the data (that is, under operations relating to changes in location, scale, and rotation of the data). Thus the clustering process does not depend on irrelevant factors such as the units of measurement or the orientation of the clusters in space.

5.2 Maximum Likelihood Estimation of Normal Mixtures

McLachlan and Peel ([43], Chapter 3) have described the E- and M-steps of the EM algorithm for the maximum likelihood (ML) estimation of multivariate normal components; see also [42]. In the EM framework for this problem, the unobservable component labels $z_{ij}$ are treated as being the "missing" data, where $z_{ij}$ is defined to be one or zero according as $y_j$ belongs or does not belong to the $i$th component of the mixture ($i = 1, \dots, g$; $j = 1, \dots, n$).

On the $(k+1)$th iteration of the EM algorithm, the E-step requires taking the expectation of the complete-data log likelihood $\log L_c(\Psi)$, given the current estimate $\Psi^{(k)}$ for $\Psi$. As it is linear in the unobservable $z_{ij}$, this E-step is effected by replacing the $z_{ij}$ by their conditional expectation given the observed data $y_j$, using $\Psi^{(k)}$. That is, $z_{ij}$ is replaced by $\tau_{ij}^{(k)}$, which is the posterior probability that $y_j$ belongs to the $i$th component of the mixture, using the current fit $\Psi^{(k)}$ for $\Psi$ ($i = 1, \dots, g$; $j = 1, \dots, n$). It can be expressed as

$\tau_{ij}^{(k)} = \frac{\pi_i^{(k)} \phi(y_j; \mu_i^{(k)}, \Sigma_i^{(k)})}{f(y_j; \Psi^{(k)})}. \qquad (8)$

On the M-step, the updated estimates of the mixing proportion $\pi_i$, the mean vector $\mu_i$, and the covariance matrix $\Sigma_i$ for the $i$th component are given by

$\pi_i^{(k+1)} = \sum_{j=1}^{n} \tau_{ij}^{(k)} / n, \qquad (9)$

$\mu_i^{(k+1)} = \frac{\sum_{j=1}^{n} \tau_{ij}^{(k)} y_j}{\sum_{j=1}^{n} \tau_{ij}^{(k)}}, \qquad (10)$

and

$\Sigma_i^{(k+1)} = \frac{\sum_{j=1}^{n} \tau_{ij}^{(k)} (y_j - \mu_i^{(k+1)})(y_j - \mu_i^{(k+1)})^T}{\sum_{j=1}^{n} \tau_{ij}^{(k)}}. \qquad (11)$

It can be seen that the M-step exists in closed form. These E- and M-steps are alternated until the changes in the estimated parameters or the log likelihood are less than some specified threshold.
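The E- and M-steps above can be written down almost verbatim. Below is a minimal NumPy/SciPy sketch of EM for a mixture of multivariate normals following Equations (8)-(11); it omits the safeguards (careful covariance regularization, multiple restarts) that a practical implementation would need, and the toy data are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(Y, g, n_iter=100, tol=1e-6, seed=0):
    """Fit a g-component normal mixture to the (n, p) data matrix Y by EM."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    pi = np.full(g, 1.0 / g)                          # mixing proportions
    mu = Y[rng.choice(n, size=g, replace=False)]      # initial means
    sigma = np.array([np.cov(Y.T) + 1e-6 * np.eye(p) for _ in range(g)])
    prev_loglik = -np.inf
    for _ in range(n_iter):
        # E-step: posterior probabilities tau_ij, Eq. (8).
        dens = np.column_stack([
            pi[i] * multivariate_normal.pdf(Y, mean=mu[i], cov=sigma[i])
            for i in range(g)
        ])
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update pi, mu, sigma, Eqs. (9)-(11).
        nk = tau.sum(axis=0)
        pi = nk / n
        mu = (tau.T @ Y) / nk[:, None]
        for i in range(g):
            diff = Y - mu[i]
            sigma[i] = (tau[:, i, None] * diff).T @ diff / nk[i] + 1e-6 * np.eye(p)
        loglik = np.log(dens.sum(axis=1)).sum()
        if loglik - prev_loglik < tol:                # stop when improvement is tiny
            break
        prev_loglik = loglik
    return pi, mu, sigma

# Example: two well-separated 2-D Gaussian clusters.
rng = np.random.default_rng(1)
Y = np.vstack([rng.normal([0, 0], 0.5, (100, 2)), rng.normal([4, 4], 0.5, (100, 2))])
pi, mu, sigma = em_gaussian_mixture(Y, g=2)
print(pi, mu, sep="\n")
```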


5.3 Number of Clusters

We can make a choice as to an appropriate value of $g$ by consideration of the likelihood function. In the absence of any prior information as to the number of clusters present in the data, we monitor the increase in the log likelihood function as the value of $g$ increases.

At any stage, the choice of $g = g_0$ versus $g = g_1$, for instance $g_1 = g_0 + 1$, can be made by either performing the likelihood ratio test or by using some information-based criterion, such as BIC (Bayesian information criterion). Unfortunately, regularity conditions do not hold for the likelihood ratio test statistic $\lambda$ to have its usual null distribution of chi-squared with degrees of freedom equal to the difference $d$ in the number of parameters for $g = g_1$ and $g = g_0$ components in the mixture models. One way to proceed is to use a resampling approach as in [41]. Alternatively, one can apply BIC, which leads to the selection of $g = g_1$ over $g = g_0$ if $-2 \log \lambda$ is greater than $d \log(n)$.

6 PageRank

by Bing Liu and Philip S. Yu

6.1 Overview

PageRank [8] was presented and published by Sergey Brin and Larry Page at the Seventh International World Wide Web Conference (WWW7) in April, 1998. It is a search ranking algorithm using hyperlinks on the Web. Based on the algorithm, they built the search engine Google, which has been a huge success. Now, every search engine has its own hyperlink based ranking method.

PageRank produces a static ranking of Web pages in the sense that a PageRank value is computed for each page off-line and it does not depend on search queries. The algorithm relies on the democratic nature of the Web by using its vast link structure as an indicator of an individual page's quality. In essence, PageRank interprets a hyperlink from page x to page y as a vote, by page x, for page y. However, PageRank looks at more than just the sheer number of votes, or links, that a page receives. It also analyzes the page that casts the vote. Votes cast by pages that are themselves "important" weigh more heavily and help to make other pages more "important". This is exactly the idea of rank prestige in social networks [69].

6.2 The Algorithm

We now introduce the PageRank formula. Let us first state some main concepts in the Web context.

In-links of page i: These are the hyperlinks that point to page i from other pages. Usually, hyperlinks from the same site are not considered.

Out-links of page i: These are the hyperlinks that point out to other pages from page i. Usually, links to pages of the same site are not considered.

The following ideas based on rank prestige [69] are used to derive the PageRank algorithm.

1. A hyperlink from a page pointing to another page is an implicit conveyance of authority to the target page. Thus, the more in-links that a page i receives, the more prestige the page i has.

2. Pages that point to page i also have their own prestige scores. A page with a higher prestige score pointing to i is more important than a page with a lower prestige score pointing to i. In other words, a page is important if it is pointed to by other important pages.

[email protected]

16

Page 17: Top 10 Algorithms in Data Mining

According to rank prestige in social networks, the importance of page $i$ ($i$'s PageRank score) is determined by summing up the PageRank scores of all pages that point to $i$. Since a page may point to many other pages, its prestige score should be shared among all the pages that it points to.

To formulate the above ideas, we treat the Web as a directed graph $G = (V, E)$, where $V$ is the set of vertices or nodes, i.e., the set of all pages, and $E$ is the set of directed edges in the graph, i.e., hyperlinks. Let the total number of pages on the Web be $n$ (i.e., $n = |V|$). The PageRank score of the page $i$ (denoted by $P(i)$) is defined by:

$P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}, \qquad (12)$

where $O_j$ is the number of out-links of page $j$. Mathematically, we have a system of $n$ linear equations (12) with $n$ unknowns. We can use a matrix to represent all the equations. Let $P$ be an $n$-dimensional column vector of PageRank values, i.e.,

$P = (P(1), P(2), \dots, P(n))^T.$

Let $A$ be the adjacency matrix of our graph with

$A_{ij} = \begin{cases} \frac{1}{O_i} & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases} \qquad (13)$

We can write the system of $n$ equations with

$P = A^T P. \qquad (14)$

This is the characteristic equation of the eigensystem, where the solution to $P$ is an eigenvector with the corresponding eigenvalue of 1. Since this is a circular definition, an iterative algorithm is used to solve it. It turns out that if some conditions are satisfied, 1 is the largest eigenvalue and the PageRank vector $P$ is the principal eigenvector. A well known mathematical technique called power iteration [25] can be used to find $P$.

However, the problem is that Equation (14) does not quite suffice because the Web graph does not meet the conditions. In fact, Equation (14) can also be derived based on the Markov chain. Then some theoretical results from Markov chains can be applied. After augmenting the Web graph to satisfy the conditions, the following PageRank equation is produced:

$P = (1 - d)e + d A^T P, \qquad (15)$

where $e$ is a column vector of all 1's. This gives us the PageRank formula for each page $i$:

$P(i) = (1 - d) + d \sum_{j=1}^{n} A_{ji} P(j), \qquad (16)$

which is equivalent to the formula given in the original PageRank papers [8, 47]:

$P(i) = (1 - d) + d \sum_{(j,i) \in E} \frac{P(j)}{O_j}. \qquad (17)$

The parameter $d$ is called the damping factor which can be set to a value between 0 and 1. $d = 0.85$ is used in [8, 39].


PageRank-Iterate(G)
    P_0 <- e/n
    k <- 1
    repeat
        P_k <- (1 - d)e + d A^T P_{k-1};
        k <- k + 1;
    until ||P_k - P_{k-1}||_1 < epsilon
    return P_k

Figure 4: The power iteration method for PageRank

The computation of PageRank values of the Web pages can be done using the power iteration method [25], which produces the principal eigenvector with the eigenvalue of 1. The algorithm is simple, and is given in Fig. 4. One can start with any initial assignments of PageRank values. The iteration ends when the PageRank values do not change much or converge. In Fig. 4, the iteration ends after the 1-norm of the residual vector is less than a pre-specified threshold epsilon.

Since in Web search, we are only interested in the ranking of the pages, the actual convergence may not be necessary. Thus, fewer iterations are needed. In [8], it is reported that on a database of 322 million links the algorithm converges to an acceptable tolerance in roughly 52 iterations.
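A compact NumPy sketch of the power iteration of Fig. 4 is shown below, with $A$ defined as in Eq. (13); the tiny four-page graph and the stopping threshold are illustrative, and the dangling-page and other Web-graph adjustments discussed above are ignored.

```python
import numpy as np

def pagerank(A, d=0.85, eps=1e-8, max_iter=100):
    """Power iteration for Eq. (15): P = (1 - d)e + d A^T P.
    A[i, j] = 1/O_i if page i links to page j, else 0."""
    n = A.shape[0]
    e = np.ones(n)
    P = e / n
    for _ in range(max_iter):
        P_new = (1 - d) * e + d * A.T @ P
        if np.abs(P_new - P).sum() < eps:   # 1-norm of the residual
            return P_new
        P = P_new
    return P

# Illustrative 4-page graph: 0 -> 1,2 ; 1 -> 2 ; 2 -> 0 ; 3 -> 2.
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
n = 4
A = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        A[i, j] = 1.0 / len(outs)
print(pagerank(A))
```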

6.3 Further References on PageRank

Since PageRank was presented in [8, 47], researchers have proposed many enhancements to the model, alternative models, improvements for its computation, adding the temporal dimension [74], etc. The books by Liu [39] and by Langville and Meyer [38] contain in-depth analyses of PageRank and several other link-based algorithms.

7 AdaBoost

by Zhi-Hua Zhou

7.1 Description of the Algorithm

Ensemble learning [16] deals with methods which employ multiple learners to solve a problem. The generalization ability of an ensemble is usually significantly better than that of a single learner, so ensemble methods are very attractive. The AdaBoost algorithm [20] proposed by Yoav Freund and Robert Schapire is one of the most important ensemble methods, since it has solid theoretical foundation, very accurate prediction, great simplicity (Schapire said it needs only "just 10 lines of code"), and wide and successful applications.

Let $X$ denote the instance space and $Y$ the set of class labels. Assume $Y = \{-1, +1\}$. Given a weak or base learning algorithm and a training set $\{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$ where $x_i \in X$ and $y_i \in Y$ ($i = 1, \dots, m$), the AdaBoost algorithm works as follows. First, it assigns equal weights to all the training examples $(x_i, y_i)$ ($i \in \{1, \dots, m\}$). Denote the distribution of the weights at the $t$-th learning round as $D_t$. From the training set and $D_t$ the algorithm generates a weak or base learner $h_t : X \to Y$ by calling the base learning algorithm. Then, it uses the training examples to test $h_t$, and the weights of the incorrectly classified examples will be increased. Thus, an updated weight distribution $D_{t+1}$ is obtained. From the training set and $D_{t+1}$ AdaBoost generates another weak learner by calling the base learning algorithm again. Such a process is repeated for $T$ rounds, and the final model is derived by weighted majority voting of the $T$ weak learners, where the weights of the learners are determined during the training process. In practice, the base learning algorithm may be a learning algorithm which can use weighted training examples directly; otherwise the weights can be exploited by sampling the training examples according to the weight distribution $D_t$. The pseudo-code of AdaBoost is shown in Fig. 5.

Input: Data set $D = \{(x_1, y_1), (x_2, y_2), \dots, (x_m, y_m)\}$;
    Base learning algorithm $\mathcal{L}$;
    Number of learning rounds $T$.
Process:
    $D_1(i) = 1/m$.  % Initialize the weight distribution
    for $t = 1, \dots, T$:
        $h_t = \mathcal{L}(D, D_t)$;  % Train a weak learner $h_t$ from $D$ using distribution $D_t$
        $\epsilon_t = \Pr_{i \sim D_t}[h_t(x_i) \neq y_i]$;  % Measure the error of $h_t$
        $\alpha_t = \frac{1}{2} \ln \frac{1 - \epsilon_t}{\epsilon_t}$;  % Determine the weight of $h_t$
        $D_{t+1}(i) = \frac{D_t(i)}{Z_t} \times \{\exp(-\alpha_t) \text{ if } h_t(x_i) = y_i;\ \exp(\alpha_t) \text{ if } h_t(x_i) \neq y_i\} = \frac{D_t(i)\exp(-\alpha_t y_i h_t(x_i))}{Z_t}$
            % Update the distribution, where $Z_t$ is a normalization factor which enables $D_{t+1}$ to be a distribution
    end.
Output: $H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} \alpha_t h_t(x)\right)$

Figure 5: The AdaBoost algorithm
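To make the pseudo-code concrete, here is a small NumPy sketch of AdaBoost using one-dimensional decision stumps as the base learners; the stump learner is a simplistic stand-in chosen for brevity (the original algorithm does not prescribe a particular base learner), and the toy data are illustrative.

```python
import numpy as np

def train_stump(X, y, w):
    """Weighted best decision stump: returns ((feature, threshold, polarity), error)."""
    best, best_err = None, np.inf
    for f in range(X.shape[1]):
        for thr in np.unique(X[:, f]):
            for pol in (+1, -1):
                pred = np.where(X[:, f] <= thr, pol, -pol)
                err = w[pred != y].sum()
                if err < best_err:
                    best_err, best = err, (f, thr, pol)
    return best, best_err

def stump_predict(stump, X):
    f, thr, pol = stump
    return np.where(X[:, f] <= thr, pol, -pol)

def adaboost(X, y, T=10):
    """AdaBoost: y must be in {-1, +1}. Returns a list of (alpha, stump)."""
    m = len(y)
    D = np.full(m, 1.0 / m)                      # initial weight distribution D_1
    ensemble = []
    for _ in range(T):
        stump, eps = train_stump(X, y, D)        # weak learner on distribution D_t
        eps = max(eps, 1e-10)                    # avoid division by zero
        alpha = 0.5 * np.log((1 - eps) / eps)    # weight of this learner
        pred = stump_predict(stump, X)
        D *= np.exp(-alpha * y * pred)           # update the distribution
        D /= D.sum()                             # normalize (Z_t)
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    return np.sign(sum(a * stump_predict(s, X) for a, s in ensemble))

# Toy example: 1-D points, positive class on the right.
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y = np.array([-1, -1, -1, +1, +1, +1])
model = adaboost(X, y, T=5)
print(predict(model, X))
```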

In order to deal with multi-class problems, Freund and Schapire presented the AdaBoost.M1 algorithm [20] which requires that the weak learners are strong enough even on hard distributions generated during the AdaBoost process. Another popular multi-class version of AdaBoost is AdaBoost.MH [55] which works by decomposing the multi-class task to a series of binary tasks. AdaBoost algorithms for dealing with regression problems have also been studied. Since many variants of AdaBoost have been developed during the past decade, Boosting has become the most important "family" of ensemble methods.

7.2 Impact of the Algorithm

As mentioned in Section 7.1, AdaBoost is one of the most important ensemble methods, so it is not strange that its high impact can be observed here and there. In this short article we only briefly introduce two issues, one theoretical and the other applied.

In 1988, Kearns and Valiant posed an interesting question, i.e., whether a weak learning algorithm that performs just slightly better than random guess could be "boosted" into an arbitrarily accurate strong learning algorithm. In other words, whether two complexity classes, weakly learnable and strongly learnable problems, are equal. Schapire [53] found that the answer to the question is "yes", and the proof he gave is a construction, which is the first Boosting algorithm. So, it is evident that AdaBoost was born with theoretical significance. AdaBoost has given rise to abundant research on theoretical aspects of ensemble methods, which can be easily found in machine learning and statistics literature. It is worth mentioning that for their AdaBoost paper [20], Schapire and Freund won the Godel Prize, which is one of the most prestigious awards in theoretical computer science, in the year of 2003.

AdaBoost and its variants have been applied to diverse domains with great success. For example, Viola and Jones [67] combined AdaBoost with a cascade process for face detection. They regarded rectangular features as weak learners, and by using AdaBoost to weight the weak learners, they got very intuitive features for face detection. In order to get high accuracy as well as high efficiency, they used a cascade process (which is beyond the scope of this article). As the result, they reported a very strong face detector: On a 466MHz machine, face detection on a 384 x 288 image cost only 0.067 seconds, which is 15 times faster than state-of-the-art face detectors at that time but with comparable accuracy. This face detector has been recognized as one of the most exciting breakthroughs in computer vision (in particular, face detection) during the past decade. It is not strange that "Boosting" has become a buzzword in computer vision and many other application areas.

7.3 Further Research

Many interestingtopicsworth furtherstudying.Herewe only discusson onetheoreticaltopic andoneappliedtopic.

Many empiricalstudyshow that AdaBoostoften doesnot overfit, i.e., the testerror of AdaBoostoften tendsto decreaseeven after the training error is zero. Many researchershave studiedthis andseveral theoreticalexplanationshave beengiven, e.g. [32]. Schapireet al. [54] presenteda margin-basedexplanation.They arguedthatAdaBoostis ableto increasethemargins evenafter the trainingerroris zero,andthusit doesnotoverfit evenaftera largenumberof rounds.However, Breiman[6] in-dicatedthatlargermargin doesnot necessarilymeanbettergeneralization,whichseriouslychallengedthemargin-basedexplanation.Recently, Reyzin andSchapire[51] foundthatBreimanconsideredmin-imum margin insteadof averageor medianmargin, which suggeststhat themargin-basedexplanationstill haschanceto survive. If this explanationsucceeds,a strongconnectionbetweenAdaBoostandSVM couldbefound. It is obviousthatthis topic is well worthstudying.

Many real-world applications are born with high dimensionality, i.e., with a large number of input features. There are two paradigms that can help us deal with such data: dimension reduction and feature selection. Dimension reduction methods are usually based on mathematical projections, which attempt to transform the original features into an appropriate feature space. After dimension reduction, the original meaning of the features is usually lost. Feature selection methods directly select some of the original features to use, and therefore they can preserve the original meaning of the features, which is very desirable in many applications. However, feature selection methods are usually based on heuristics, lacking a solid theoretical foundation. Inspired by Viola and Jones's work [67], we think AdaBoost could be very useful in feature selection, especially considering that it has a solid theoretical foundation. Current research focuses mainly on images, yet we think general AdaBoost-based feature selection techniques are well worth studying.
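One way to make this idea concrete is the sketch below: a minimal, from-scratch AdaBoost in Python whose weak learners are single-feature decision stumps, so the features picked in successive rounds form a ranked subset. This is our own illustration of the general idea, not the Viola-Jones procedure or any published feature-selection algorithm; all function names and the toy data are ours.

import numpy as np

def adaboost_feature_selection(X, y, n_rounds=10):
    """Sketch of AdaBoost with one-feature decision stumps.

    X: (n_samples, n_features) array; y: labels in {-1, +1}.
    Each round picks the single (feature, threshold, polarity) stump that
    minimizes weighted error, so the chosen features act as a ranked subset.
    """
    n, d = X.shape
    w = np.full(n, 1.0 / n)          # instance weights
    selected = []                    # (feature, threshold, polarity, alpha)

    for _ in range(n_rounds):
        best = None                  # (error, feature, threshold, polarity)
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for polarity in (1, -1):
                    pred = np.where(polarity * (X[:, j] - thr) > 0, 1, -1)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, polarity)
        err, j, thr, polarity = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)        # stump weight
        pred = np.where(polarity * (X[:, j] - thr) > 0, 1, -1)
        w *= np.exp(-alpha * y * pred)               # reweight instances
        w /= w.sum()
        selected.append((j, thr, polarity, alpha))
    return selected

# Toy usage: feature 0 carries the signal, feature 1 is noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, 1, -1)
for j, thr, pol, a in adaboost_feature_selection(X, y, n_rounds=3):
    print(f"picked feature {j} (threshold {thr:.2f}, weight {a:.2f})")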

8 kNN: k-Nearest Neighbor Classification

by Michael Steinbach

[email protected]


8.1 Description of the Algorithm

One of the simplest, and rather trivial, classifiers is the Rote classifier, which memorizes the entire training data and performs classification only if the attributes of the test object match one of the training examples exactly. An obvious drawback of this approach is that many test records will not be classified because they do not exactly match any of the training records. A more sophisticated approach, k-nearest neighbor (kNN) classification [19, 60], finds a group of k objects in the training set that are closest to the test object, and bases the assignment of a label on the predominance of a particular class in this neighborhood. There are three key elements of this approach: a set of labeled objects, e.g., a set of stored records, a distance or similarity metric to compute distance between objects, and the value of k, the number of nearest neighbors. To classify an unlabeled object, the distance of this object to the labeled objects is computed, its k-nearest neighbors are identified, and the class labels of these nearest neighbors are then used to determine the class label of the object.

Figure 6 provides a high-level summary of the nearest-neighbor classification method. Given a training set $D$ and a test object $z = (x', y')$, the algorithm computes the distance (or similarity) between $z$ and all the training objects $(x, y) \in D$ to determine its nearest-neighbor list, $D_z$. ($x$ is the data of a training object, while $y$ is its class. Likewise, $x'$ is the data of the test object and $y'$ is its class.)

Input: $D$, the set of training objects, and test object $z = (x', y')$
Process: Compute $d(x', x)$, the distance between $z$ and every object $(x, y) \in D$. Select $D_z \subseteq D$, the set of the $k$ closest training objects to $z$.
Output: $y' = \operatorname{argmax}_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i)$

Figure 6: The $k$-nearest neighbor classification algorithm.

Once the nearest-neighbor list is obtained, the test object is classified based on the majority class of its nearest neighbors:

Majority Voting:
$$y' = \operatorname{argmax}_{v} \sum_{(x_i, y_i) \in D_z} I(v = y_i), \qquad (18)$$

where $v$ is a class label, $y_i$ is the class label for the $i$-th nearest neighbor, and $I(\cdot)$ is an indicator function that returns the value 1 if its argument is true and 0 otherwise.

8.2 Issues

There are several key issues that affect the performance of kNN. One is the choice of k. If k is too small, then the result can be sensitive to noise points. On the other hand, if k is too large, then the neighborhood may include too many points from other classes.

Another issue is the approach to combining the class labels. The simplest method is to take a majority vote, but this can be a problem if the nearest neighbors vary widely in their distance and the closer neighbors more reliably indicate the class of the object. A more sophisticated approach, which is usually much less sensitive to the choice of k, weights each object's vote by its distance, where the weight factor is often taken to be the reciprocal of the squared distance: $w_i = 1/d(x', x_i)^2$. This amounts to replacing the output step of the algorithm in Figure 6 with the following:

Distance-Weighted Voting:
$$y' = \operatorname{argmax}_{v} \sum_{(x_i, y_i) \in D_z} w_i \times I(v = y_i). \qquad (19)$$
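As a concrete illustration of Equations (18) and (19), the following short Python sketch (our own, using NumPy; not taken from the references above) classifies a test object by majority voting and by distance-weighted voting over its k nearest neighbors.

import numpy as np

def knn_predict(X_train, y_train, x_test, k=3, weighted=False):
    """Classify x_test from its k nearest neighbors in X_train.

    weighted=False implements majority voting (Eq. 18);
    weighted=True implements distance-weighted voting (Eq. 19)
    with w_i = 1 / d(x', x_i)^2.
    """
    d = np.linalg.norm(X_train - x_test, axis=1)      # Euclidean distances
    nn = np.argsort(d)[:k]                            # indices of the k closest
    votes = {}
    for i in nn:
        w = 1.0 / (d[i] ** 2 + 1e-12) if weighted else 1.0
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)                  # argmax over class labels

# Toy usage
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array(['a', 'a', 'b', 'b'])
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3))                  # majority vote
print(knn_predict(X, y, np.array([0.2, 0.1]), k=3, weighted=True))   # weighted vote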


The choice of the distance measure is another important consideration. Although various measures can be used to compute the distance between two points, the most desirable distance measure is one for which a smaller distance between two objects implies a greater likelihood of their having the same class. Thus, for example, if kNN is being applied to classify documents, then it may be better to use the cosine measure rather than Euclidean distance. Some distance measures can also be affected by the high dimensionality of the data. In particular, it is well known that the Euclidean distance measure becomes less discriminating as the number of attributes increases. Also, attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. For example, consider a data set where the height of a person varies from 1.5m to 1.8m, the weight of a person varies from 90lb to 300lb, and the income of a person varies from $10,000 to $1,000,000. If a distance measure is used without scaling, the income attribute will dominate the computation of distance and thus the assignment of class labels. A number of schemes have been developed that try to compute the weights of each individual attribute based upon a training set [26].
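The height/weight/income example can be reproduced in a few lines. The sketch below (ours) applies simple min-max scaling so that each attribute contributes comparably to the Euclidean distance; the particular numbers are illustrative only.

import numpy as np

# Columns: height (m), weight (lb), income ($); rows are three people.
X = np.array([[1.5,  90.0,    10_000.0],
              [1.8, 300.0,    12_000.0],
              [1.6, 100.0, 1_000_000.0]])

# Min-max scaling maps each attribute to [0, 1] so no single attribute
# dominates the distance computation.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

def dist(A, i, j):
    return np.linalg.norm(A[i] - A[j])

print("raw:    d(0,1) =", dist(X, 0, 1), " d(0,2) =", dist(X, 0, 2))         # income dominates
print("scaled: d(0,1) =", dist(X_scaled, 0, 1), " d(0,2) =", dist(X_scaled, 0, 2))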

In addition, weights can be assigned to the training objects themselves. This can give more weight to highly reliable training objects, while reducing the impact of unreliable objects. The PEBLS system by Cost and Salzberg [10] is a well known example of such an approach.

KNN classifiers are lazy learners, that is, models are not built explicitly, unlike eager learners (e.g., decision trees, SVM, etc.). Thus, building the model is cheap, but classifying unknown objects is relatively expensive since it requires the computation of the k-nearest neighbors of the object to be labeled. This, in general, requires computing the distance of the unlabeled object to all the objects in the labeled set, which can be expensive particularly for large training sets. A number of techniques have been developed for efficient computation of k-nearest neighbor distances that make use of the structure in the data to avoid having to compute the distance to all objects in the training set. These techniques, which are particularly applicable for low dimensional data, can help reduce the computational cost without affecting classification accuracy.
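As one illustration of such a technique, the sketch below uses a k-d tree, a spatial index that works well for low-dimensional data. We use SciPy's cKDTree purely as a convenient example of the idea; the specific library is our choice and is not prescribed by the kNN literature cited here.

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.random((100_000, 3))              # 100k low-dimensional training objects
y_train = rng.integers(0, 2, size=100_000)

tree = cKDTree(X_train)                         # build the k-d tree once

x_test = rng.random(3)
dist, idx = tree.query(x_test, k=5)             # 5 nearest neighbors without a full scan
print("neighbor labels:", y_train[idx],
      "predicted:", np.bincount(y_train[idx]).argmax())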

8.3 Impact

KNN classification is an easy to understand and easy to implement classification technique. Despite its simplicity, it can perform well in many situations. In particular, a well known result by Cover and Hart [11] shows that the error of the nearest neighbor rule is bounded above by twice the Bayes error under certain reasonable assumptions. Also, the error of the general kNN method asymptotically approaches that of the Bayes error and can be used to approximate it.

KNN is particularly well suited for multi-modal classes as well as applications in which an object can have many class labels. For example, for the assignment of functions to genes based on expression profiles, some researchers found that kNN outperformed SVM, which is a much more sophisticated classification scheme [37].

8.4 Current and Future Research

Although the basic kNN algorithm and some of its variations, such as weighted kNN and assigning weights to objects, are relatively well known, some of the more advanced techniques for kNN are much less known. For example, it is typically possible to eliminate many of the stored data objects, but still retain the classification accuracy of the kNN classifier. This is known as 'condensing' and can greatly speed up the classification of new objects [29]. In addition, data objects can be removed to improve classification accuracy, a process known as 'editing' [71]. There has also been a considerable amount of work on the application of proximity graphs (nearest neighbor graphs, minimum spanning trees, relative neighborhood graphs, Delaunay triangulations, and Gabriel graphs) to the kNN problem. Recent papers by Toussaint [62, 63], which emphasize a proximity graph viewpoint, provide an overview of work addressing these three areas and indicate some remaining open problems. Other important resources include the collection of papers by Dasarathy [12] and the book by Devroye, Gyorfi and Lugosi [14]. Finally, a fuzzy approach to kNN can be found in the work of Bezdek [3].

9 Naive Bayes

by David J. Hand

9.1 Introduction

Given a set of objects, each of which belongs to a known class, and each of which has a known vector of variables, our aim is to construct a rule which will allow us to assign future objects to a class, given only the vectors of variables describing the future objects. Problems of this kind, called problems of supervised classification, are ubiquitous, and many methods for constructing such rules have been developed. One very important one is the naive Bayes method - also called idiot's Bayes, simple Bayes, and independence Bayes. This method is important for several reasons. It is very easy to construct, not needing any complicated iterative parameter estimation schemes. This means it may be readily applied to huge data sets. It is easy to interpret, so users unskilled in classifier technology can understand why it makes the classifications it makes. And finally, it often does surprisingly well: it may not be the best possible classifier in any particular application, but it can usually be relied on to be robust and to do quite well. General discussions of the naive Bayes method and its merits are given in Domingos and Pazzani (1997) [18] and Hand and Yu (2001) [27].

9.2 The Basic Principle

For convenience of exposition here, we will assume just two classes, labeled $i = 0, 1$. Our aim is to use the initial set of objects with known class memberships (the training set) to construct a score such that larger scores are associated with class 1 objects (say) and smaller scores with class 0 objects. Classification is then achieved by comparing this score with a threshold, $t$. If we define $P(i \mid x)$ to be the probability that an object with measurement vector $x = (x_1, \ldots, x_p)$ belongs to class $i$, then any monotonic function of $P(i \mid x)$ would make a suitable score. In particular, the ratio $P(1 \mid x) / P(0 \mid x)$ would be suitable. Elementary probability tells us that we can decompose $P(i \mid x)$ as proportional to $f(x \mid i) P(i)$, where $f(x \mid i)$ is the conditional distribution of $x$ for class $i$ objects, and $P(i)$ is the probability that an object will belong to class $i$ if we know nothing further about it (the 'prior' probability of class $i$). This means that the ratio becomes
$$\frac{P(1 \mid x)}{P(0 \mid x)} = \frac{f(x \mid 1) P(1)}{f(x \mid 0) P(0)}. \qquad (20)$$

To use this to produce classifications, we need to estimate the $f(x \mid i)$ and the $P(i)$. If the training set was a random sample from the overall population, the $P(i)$ can be estimated directly from the proportion of class $i$ objects in the training set. To estimate the $f(x \mid i)$, the naive Bayes method assumes that the components of $x$ are independent, $f(x \mid i) = \prod_{j=1}^{p} f(x_j \mid i)$, and then estimates each of the univariate distributions $f(x_j \mid i)$, $j = 1, \ldots, p$, $i = 0, 1$, separately. Thus the $p$-dimensional multivariate problem has been reduced to $p$ univariate estimation problems. Univariate estimation is familiar, simple, and requires smaller training set sizes to obtain accurate estimates. This is one of the particular, indeed unique, attractions of the naive Bayes method: estimation is simple, very quick, and does not require complicated iterative estimation schemes.

If the marginal distributions $f(x_j \mid i)$ are discrete, with each $x_j$ taking only a few values, then the estimate $\hat{f}(x_j \mid i)$ is a multinomial histogram type estimator (see below) - simply counting the proportion of class $i$ objects which fall into each cell. If the $f(x_j \mid i)$ are continuous, then a common strategy is to segment each of them into a small number of intervals and again use the multinomial estimator, but more elaborate versions based on continuous estimates (e.g. kernel estimates) are also used.

Given the independence assumption, the ratio in (20) becomes
$$\frac{P(1 \mid x)}{P(0 \mid x)} = \frac{\prod_{j=1}^{p} f(x_j \mid 1) \, P(1)}{\prod_{j=1}^{p} f(x_j \mid 0) \, P(0)} = \frac{P(1)}{P(0)} \prod_{j=1}^{p} \frac{f(x_j \mid 1)}{f(x_j \mid 0)}. \qquad (21)$$

Now, recalling that our aim was merely to produce a score which was monotonically related to $P(i \mid x)$, we can take logs of (21) - log is a monotonic increasing function. This gives an alternative score

$$\ln \frac{P(1 \mid x)}{P(0 \mid x)} = \ln \frac{P(1)}{P(0)} + \sum_{j=1}^{p} \ln \frac{f(x_j \mid 1)}{f(x_j \mid 0)}. \qquad (22)$$

If we define $w_j = \ln \left( f(x_j \mid 1) / f(x_j \mid 0) \right)$ and a constant $k = \ln \left( P(1) / P(0) \right)$ we see that (22) takes the form of a simple sum

$$\ln \frac{P(1 \mid x)}{P(0 \mid x)} = k + \sum_{j=1}^{p} w_j, \qquad (23)$$

so that the classifier has a particularly simple structure.

The assumption of independence of the $x_j$ within each class implicit in the naive Bayes model might seem unduly restrictive. In fact, however, various factors may come into play which mean that the assumption is not as detrimental as it might seem. Firstly, a prior variable selection step has often taken place, in which highly correlated variables have been eliminated on the grounds that they are likely to contribute in a similar way to the separation between classes. This means that the relationships between the remaining variables might well be approximated by independence. Secondly, assuming the interactions to be zero provides an implicit regularization step, reducing the variance of the model and leading to more accurate classifications. Thirdly, in some cases when the variables are correlated the optimal decision surface coincides with that produced under the independence assumption, so that making the assumption is not at all detrimental to performance. Fourthly, of course, the decision surface produced by the naive Bayes model can in fact have a complicated nonlinear shape: the surface is linear in the $w_j$ but highly nonlinear in the original variables $x_j$, so that it can fit quite elaborate surfaces.
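For discrete predictors the whole training procedure reduces to counting, as the following sketch (ours) illustrates; it estimates the priors and the univariate conditionals from data and then evaluates the log-score of Equation (23), with a tiny offset added to empty cells so the logarithm stays finite.

import numpy as np

def train_naive_bayes(X, y, eps=1e-9):
    """Estimate P(i) and f(x_j | i) for discrete predictors by counting."""
    model = {"priors": {}, "cond": {}}
    for c in np.unique(y):
        Xc = X[y == c]
        model["priors"][c] = len(Xc) / len(X)
        # cond[c][j][value] = proportion of class-c objects with x_j = value
        model["cond"][c] = [
            {v: np.mean(Xc[:, j] == v) + eps for v in np.unique(X[:, j])}
            for j in range(X.shape[1])
        ]
    return model

def log_score(model, x):
    """ln P(1|x)/P(0|x) = ln P(1)/P(0) + sum_j ln f(x_j|1)/f(x_j|0)  (Eq. 23)."""
    p, c = model["priors"], model["cond"]
    s = np.log(p[1] / p[0])
    for j, v in enumerate(x):
        s += np.log(c[1][j].get(v, 1e-9) / c[0][j].get(v, 1e-9))
    return s

# Toy usage: two binary predictors.
X = np.array([[0, 0], [0, 1], [1, 1], [1, 1], [1, 0], [0, 0]])
y = np.array([0, 0, 1, 1, 1, 0])
m = train_naive_bayes(X, y)
print(log_score(m, [1, 1]))   # positive score -> classify as class 1
print(log_score(m, [0, 0]))   # negative score -> classify as class 0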

9.3 Some Extensions

Despite the above, a large number of authors have proposed modifications to the naive Bayes method in an attempt to improve its predictive accuracy.

One early proposed modification was to shrink the simplistic multinomial estimate of the proportions of objects falling into each category of each discrete predictor variable. So, if the $j$-th discrete predictor variable, $x_j$, has $c_j$ categories, and if $n_{jr}$ of the total of $n$ objects fall into the $r$-th category of this variable, the usual multinomial estimator of the probability that a future object will fall into this category, $n_{jr}/n$, is replaced by $(n_{jr} + c_j^{-1})/(n + 1)$. This shrinkage also has a direct Bayesian interpretation. It leads to estimates which have lower variance.
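A minimal numerical illustration of this shrinkage (our sketch, using the estimate $(n_{jr} + c_j^{-1})/(n+1)$ as reconstructed above):

import numpy as np

n_jr = np.array([0, 2, 8])          # counts of the c_j = 3 categories among n objects
n, c_j = n_jr.sum(), len(n_jr)

raw = n_jr / n                                   # usual multinomial estimate n_jr / n
shrunk = (n_jr + 1.0 / c_j) / (n + 1.0)          # shrinkage estimate (n_jr + 1/c_j) / (n + 1)

print(raw)      # [0.0, 0.2, 0.8] -- the empty category gets probability zero
print(shrunk)   # the empty category now gets a small positive probability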

Perhaps the obvious way of easing the independence assumption is by introducing extra terms in the models of the distributions of $x$ in each class, to allow for interactions. This has been attempted in a large number of ways, but we must recognize that doing so necessarily introduces complications, and so sacrifices the basic simplicity and elegance of the naive Bayes model. Within either (or any, more generally) class, the joint distribution of $x$ is
$$f(x) = f(x_1) \, f(x_2 \mid x_1) \, f(x_3 \mid x_1, x_2) \cdots f(x_p \mid x_1, x_2, \ldots, x_{p-1}), \qquad (24)$$

and this can be approximated by simplifying the conditional probabilities. The extreme arises with $f(x_j \mid x_1, \ldots, x_{j-1}) = f(x_j)$ for all $j$, and this is the naive Bayes method. Obviously, however, models between these two extremes can be used. For example, one could use the Markov model
$$f(x) = f(x_1) \, f(x_2 \mid x_1) \, f(x_3 \mid x_2) \cdots f(x_p \mid x_{p-1}). \qquad (25)$$

This is equivalent to using a subset of two-way marginal distributions instead of the univariate marginal distributions in the naive Bayes model.

Another extension to the naive Bayes model was developed entirely independently of it. This is the logistic regression model. In the above we obtained the decomposition (21) by adopting the naive Bayes independence assumption. However, exactly the same structure for the ratio results if we model $f(x \mid 1)$ by $g(x) \prod_{j=1}^{p} h_1(x_j)$ and $f(x \mid 0)$ by $g(x) \prod_{j=1}^{p} h_0(x_j)$, where the function $g(x)$ is the same in each model. The ratio is thus

$$\frac{P(1 \mid x)}{P(0 \mid x)} = \frac{P(1) \, g(x) \prod_{j=1}^{p} h_1(x_j)}{P(0) \, g(x) \prod_{j=1}^{p} h_0(x_j)} = \frac{P(1)}{P(0)} \cdot \frac{\prod_{j=1}^{p} h_1(x_j)}{\prod_{j=1}^{p} h_0(x_j)}. \qquad (26)$$

Here, the $h_i(x_j)$ do not even have to be probability density functions - it is sufficient that the $g(x) \prod_{j=1}^{p} h_i(x_j)$ are densities. The model in (26) is just as simple as the naive Bayes model, and takes exactly the same form - take logs and we have a sum as in (23) - but it is much more flexible because it does not assume independence of the $x_j$ in each class. In fact, it permits arbitrary dependence structures, via the $g(x)$ function, which can take any form. The point is, however, that this dependence is the same in the two classes, so that it cancels out in the ratio in (26). Of course, this considerable extra flexibility of the logistic regression model is not obtained without cost. Although the resulting model form is identical to the naive Bayes model form (with different parameter values, of course), it cannot be estimated by looking at the univariate marginals separately: an iterative procedure has to be used.

9.4 Concluding Remarks on Naive Bayes

The naive Bayes model is tremendously appealing because of its simplicity, elegance, and robustness. It is one of the oldest formal classification algorithms, and yet even in its simplest form it is often surprisingly effective. It is widely used in areas such as text classification and spam filtering. A large number of modifications have been introduced, by the statistical, data mining, machine learning, and pattern recognition communities, in an attempt to make it more flexible, but one has to recognize that such modifications are necessarily complications, which detract from its basic simplicity. Some such modifications are described in Ridgeway et al. (1998) [52] and Friedman et al. (1997) [23].


10 CART

by Dan Steinberg

The 1984 monograph, "CART: Classification and Regression Trees," co-authored by Leo Breiman, Jerome Friedman, Richard Olshen, and Charles Stone [7], represents a major milestone in the evolution of Artificial Intelligence, Machine Learning, non-parametric statistics, and data mining. The work is important for the comprehensiveness of its study of decision trees, the technical innovations it introduces, its sophisticated discussion of tree-structured data analysis, and its authoritative treatment of large sample theory for trees. While CART citations can be found in almost any domain, far more appear in fields such as electrical engineering, biology, medical research and financial topics than, for example, in marketing research or sociology where other tree methods are more popular. This section is intended to highlight key themes treated in the CART monograph so as to encourage readers to return to the original source for more detail.

10.1 Overview

The CART decision tree is a binary recursive partitioning procedure capable of processing continuous and nominal attributes both as targets and predictors. Data are handled in their raw form; no binning is required or recommended. Trees are grown to a maximal size without the use of a stopping rule and then pruned back (essentially split by split) to the root via cost-complexity pruning. The next split to be pruned is the one contributing least to the overall performance of the tree on training data (and more than one split may be removed at a time). The procedure produces trees that are invariant under any order-preserving transformation of the predictor attributes. The CART mechanism is intended to produce not one, but a sequence of nested pruned trees, all of which are candidate optimal trees. The "right sized" or "honest" tree is identified by evaluating the predictive performance of every tree in the pruning sequence. CART offers no internal performance measures for tree selection based on the training data, as such measures are deemed suspect. Instead, tree performance is always measured on independent test data (or via cross validation) and tree selection proceeds only after test-data-based evaluation. If no test data exist and cross validation has not been performed, CART will remain agnostic regarding which tree in the sequence is best. This is in sharp contrast to methods such as C4.5 that generate preferred models on the basis of training data measures.

The CART mechanism includes automatic (optional) class balancing, automatic missing value handling, and allows for cost-sensitive learning, dynamic feature construction, and probability tree estimation. The final reports include a novel attribute importance ranking. The CART authors also broke new ground in showing how cross validation can be used to assess performance for every tree in the pruning sequence given that trees in different CV folds may not align on the number of terminal nodes. Each of these major features is discussed below.

10.2 Splitting Rules

CART splitting rules are always couched in the form

An instance goes left if CONDITION, and goes right otherwise,

where the CONDITION is expressed as "attribute $X_i \le C$" for continuous attributes. For nominal attributes the CONDITION is expressed as membership in an explicit list of values. The CART authors argue that binary splits are to be preferred because (1) they fragment the data more slowly than multi-way splits, and (2) repeated splits on the same attribute are allowed and, if selected, will eventually generate as many partitions for an attribute as required. Any loss of ease in reading the tree is expected to be offset by improved performance. A third implicit reason is that the large sample theory developed by the authors was restricted to binary partitioning.

The CART monograph focuses most of its discussion on the Gini rule, which is similar to the better known entropy or information-gain criterion. For a binary (0/1) target the "Gini measure of impurity" of a node $t$ is
$$G(t) = 1 - p(t)^2 - (1 - p(t))^2, \qquad (27)$$
where $p(t)$ is the (possibly weighted) relative frequency of class 1 in the node, and the improvement (gain) generated by a split of the parent node $P$ into left and right children $L$ and $R$ is
$$I(P) = G(P) - q \, G(L) - (1 - q) \, G(R). \qquad (28)$$

Here, $q$ is the (possibly weighted) fraction of instances going left. The CART authors favor the Gini criterion over information gain because the Gini can be readily extended to include symmetrized costs (see below) and is computed more rapidly than information gain. (Later versions of CART have added information gain as an optional splitting rule.) They introduce the modified twoing rule, which is based on a direct comparison of the target attribute distribution in two child nodes:
$$I(\text{split}) = 0.25 \, (q(1 - q))^{u} \left[ \sum_{k} \left| p_L(k) - p_R(k) \right| \right]^2, \qquad (29)$$
where $k$ indexes the target classes, $p_L(\cdot)$ and $p_R(\cdot)$ are the probability distributions of the target in the left and right child nodes respectively, and the power term $u$ embeds a user-controllable penalty on splits generating unequal-sized child nodes. (This splitter is a modified version of Messenger and Mandell, 1972 [44].) They also introduce a variant of the twoing split criterion that treats the classes of the target as ordered; ordered twoing attempts to ensure that target classes represented on the left of a split are ranked below those represented on the right. In our experience the twoing criterion is often a superior performer on multi-class targets as well as on inherently difficult-to-predict (e.g. noisy) binary targets. For regression (continuous targets), CART offers a choice of Least Squares (LS) and Least Absolute Deviation (LAD) criteria as the basis for measuring the improvement of a split. Three other splitting rules for cost-sensitive learning and probability trees are discussed separately below.
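A small sketch (ours) of Equations (27) and (28): it evaluates the Gini improvement of every candidate threshold on a single continuous attribute and returns the best binary split. It is meant only to illustrate the arithmetic, not to reproduce the CART software.

import numpy as np

def gini(y):
    """Gini impurity of a binary (0/1) target: 1 - p^2 - (1-p)^2  (Eq. 27)."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return 1.0 - p**2 - (1.0 - p)**2

def best_split(x, y):
    """Best 'x <= c' split by Gini improvement G(P) - q*G(L) - (1-q)*G(R)  (Eq. 28)."""
    parent = gini(y)
    best_c, best_gain = None, 0.0
    for c in np.unique(x)[:-1]:                  # candidate thresholds
        left, right = y[x <= c], y[x > c]
        q = len(left) / len(y)
        gain = parent - q * gini(left) - (1 - q) * gini(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0,   0,   0,   1,    1,    1])
print(best_split(x, y))    # threshold 3.0 separates the classes perfectly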

10.3 Prior Probabilities and Class Balancing

In its default classification mode CART always calculates class frequencies in any node relative to the class frequencies in the root. This is equivalent to automatically reweighting the data to balance the classes, and ensures that the tree selected as optimal minimizes balanced class error. The reweighting is implicit in the calculation of all probabilities and improvements and requires no user intervention; the reported sample counts in each node thus reflect the unweighted data. For a binary (0/1) target any node is classified as class 1 if, and only if,
$$N_1(\text{node}) / N_1(\text{root}) \ge N_0(\text{node}) / N_0(\text{root}). \qquad (30)$$

This default mode is referred to as "priors equal" in the monograph. It has allowed CART users to work readily with any unbalanced data, requiring no special measures regarding class rebalancing or the introduction of manually constructed weights. To work effectively with unbalanced data it is sufficient to run CART using its default settings. Implicit reweighting can be turned off by selecting the "priors data" option, and the modeler can also elect to specify an arbitrary set of priors to reflect costs, or potential differences between training data and future data target class distributions.

10.4 Missing Value Handling

Missing values appear frequently in real-world, and especially business-related, databases, and the need to deal with them is a vexing challenge for all modelers. One of the major contributions of CART was to include a fully automated and highly effective mechanism for handling missing values. Decision trees require a missing value-handling mechanism at three levels: (a) during splitter evaluation, (b) when moving the training data through a node, and (c) when moving test data through a node for final class assignment. (See Quinlan, 1989 for a clear discussion of these points.) Regarding (a), the first version of CART evaluated each splitter strictly on its performance on the subset of data for which the splitter is available. Later versions offer a family of penalties that reduce the split improvement measure as a function of the degree of missingness. For (b) and (c), the CART mechanism discovers "surrogate" or substitute splitters for every node of the tree, whether missing values occur in the training data or not. The surrogates are thus available should the tree be applied to new data that does include missing values. This is in contrast to machines that can only learn about missing value handling from training data that include missing values. Friedman (1975) [21] suggests moving instances with missing splitter attributes into both left and right child nodes and making a final class assignment by pooling all nodes in which an instance appears. Quinlan (1989) [49] opts for a weighted variant of Friedman's approach in his study of alternative missing value-handling methods. Our own assessments of the effectiveness of CART surrogate performance in the presence of missing data are largely favorable, while Quinlan remains agnostic on the basis of the approximate surrogates he implements for test purposes [49]. In Friedman, Kohavi, and Yun (1996) [22], Friedman notes that 50% of the CART code was devoted to missing value handling; it is thus unlikely that Quinlan's experimental version properly replicated the entire CART surrogate mechanism.

In CART the missing value handling mechanism is fully automatic and locally adaptive at every node. At each node in the tree the chosen splitter induces a binary partition of the data (e.g., $X_1 \le c_1$ and $X_1 > c_1$). A surrogate splitter is a single attribute $Z$ that can predict this partition where the surrogate itself is in the form of a binary splitter (e.g., $Z \le d$ and $Z > d$). In other words, every splitter becomes a new target which is to be predicted with a single-split binary tree. Surrogates are ranked by an association score that measures the advantage of the surrogate over the default rule predicting that all cases go to the larger child node. To qualify as a surrogate, the variable must outperform this default rule (and thus it may not always be possible to find surrogates). When a missing value is encountered in a CART tree the instance is moved to the left or the right according to the top-ranked surrogate. If this surrogate is also missing then the second-ranked surrogate is used instead (and so on). If all surrogates are missing the default rule assigns the instance to the larger child node (possibly adjusting node sizes for priors). Ties are broken by moving an instance to the left.
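The sketch below (ours) conveys the flavor of ranking a surrogate: a candidate split on Z is scored by how much better it agrees with the primary split than the default rule of sending every case to the larger child. The scoring is a simplified stand-in for CART's association measure, not a reproduction of it.

import numpy as np

def surrogate_score(primary_goes_left, z, d):
    """Agreement of the surrogate 'Z <= d' with the primary split,
    relative to the default rule 'send everything to the larger child'."""
    surrogate_goes_left = z <= d
    agreement = np.mean(surrogate_goes_left == primary_goes_left)
    default = max(np.mean(primary_goes_left), 1 - np.mean(primary_goes_left))
    return agreement - default       # must be > 0 to qualify as a surrogate

rng = np.random.default_rng(1)
x1 = rng.normal(size=500)
z = x1 + 0.3 * rng.normal(size=500)   # correlated attribute, candidate surrogate
primary = x1 <= 0.0                   # primary split X1 <= c1
print(surrogate_score(primary, z, d=0.0))                      # clearly positive: useful surrogate
print(surrogate_score(primary, rng.normal(size=500), d=0.0))   # about zero: useless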

10.5 Attribute Importance

The importance of an attribute is based on the sum of the improvements in all nodes in which the attribute appears as a splitter (weighted by the fraction of the training data in each node split). Surrogates are also included in the importance calculations, which means that even a variable that never splits a node may be assigned a large importance score. This allows the variable importance rankings to reveal variable masking and nonlinear correlation among the attributes. Importance scores may optionally be confined to splitters; comparing the splitters-only and the full importance rankings is a useful diagnostic.

10.6 Dynamic Feature Construction

Friedman (1975) [21] discusses the automatic construction of new features within each node and, for the binary target, recommends adding the single feature $x'w$, where $x$ is the original attribute vector and $w$ is a scaled difference-of-means vector across the two classes (the direction of the Fisher linear discriminant). This is similar to running a logistic regression on all available attributes in the node and using the estimated logit as a predictor. In the CART monograph, the authors discuss the automatic construction of linear combinations that include feature selection; this capability has been available from the first release of the CART software. BFOS (Breiman, Friedman, Olshen, and Stone) also present a method for constructing Boolean combinations of splitters within each node, a capability that has not been included in the released software.

10.7 Cost-Sensitive Learning

Costs are central to statistical decision theory but cost-sensitive learning received only modest attention before Domingos (1999) [17]. Since then, several conferences have been devoted exclusively to this topic and a large number of research papers have appeared in the subsequent scientific literature. It is therefore useful to note that the CART monograph introduced two strategies for cost-sensitive learning and that the entire mathematical machinery describing CART is cast in terms of the costs of misclassification. The cost of misclassifying an instance of class $i$ as class $j$ is $C(i, j)$ and is assumed to be equal to 1 unless specified otherwise; $C(i, i) = 0$ for all $i$. The complete set of costs is represented in the matrix $C$ containing a row and a column for each target class. Any classification tree can have a total cost computed for its terminal node assignments by summing costs over all misclassifications. The issue in cost-sensitive learning is to induce a tree that takes the costs into account during its growing and pruning phases.

The first and most straightforward method for handling costs makes use of weighting: instances belonging to classes that are costly to misclassify are weighted upwards, with a common weight applying to all instances of a given class, a method recently rediscovered by Ting (2002) [61]. As implemented in CART, the weighting is accomplished transparently so that all node counts are reported in their raw unweighted form. For multi-class problems BFOS suggested that the entries in the misclassification cost matrix be summed across each row to obtain relative class weights that approximately reflect costs. This technique ignores the detail within the matrix but has now been widely adopted due to its simplicity. For the Gini splitting rule the CART authors show that it is possible to embed the entire cost matrix into the splitting rule, but only after it has been symmetrized. The "symGini" splitting rule generates trees sensitive to the difference in costs $C(i, j)$ and $C(i, k)$, and is most useful when the symmetrized cost matrix is an acceptable representation of the decision maker's problem. By contrast, the instance weighting approach assigns a single cost to all misclassifications of objects of class $i$. BFOS report that pruning the tree using the full cost matrix is essential to successful cost-sensitive learning.
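The row-sum heuristic can be stated in a couple of lines (our sketch; the cost matrix is hypothetical):

import numpy as np

# Misclassification cost matrix C: C[i, j] = cost of predicting class j
# for a true class-i instance; the diagonal is zero.
C = np.array([[0.0, 1.0, 1.0],
              [5.0, 0.0, 5.0],    # class 1 is costly to misclassify
              [1.0, 1.0, 0.0]])

row_sums = C.sum(axis=1)                      # aggregate cost per true class
class_weights = row_sums / row_sums.sum()     # relative class weights
print(class_weights)    # class 1 receives roughly five times the weight of the others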


10.8 Stopping Rules, Pruning, Tree Sequences, and Tree Selection

The earliest work on decision trees did not allow for pruning. Instead, trees were grown until they encountered some stopping condition and the resulting tree was considered final. In the CART monograph the authors argued that no rule intended to stop tree growth can guarantee that it will not miss important data structure (e.g., consider the two-dimensional XOR problem). They therefore elected to grow trees without stopping. The resulting overly large tree provides the raw material from which a final optimal model is extracted.

The pruning mechanism is based strictly on the training data and begins with a cost-complexity measure defined as
$$R_\alpha(T) = R(T) + \alpha |T|, \qquad (31)$$
where $R(T)$ is the training sample cost of the tree, $|T|$ is the number of terminal nodes in the tree and $\alpha$ is a penalty imposed on each node. If $\alpha = 0$ then the minimum cost-complexity tree is clearly the largest possible. If $\alpha$ is allowed to progressively increase, the minimum cost-complexity tree will become smaller, since the splits at the bottom of the tree that reduce $R(T)$ the least will be cut away. The parameter $\alpha$ is progressively increased from 0 to a value sufficient to prune away all splits. BFOS prove that any tree of size $Q$ extracted in this way will exhibit a cost $R(Q)$ that is minimum within the class of all trees with $Q$ terminal nodes.

The optimal tree is defined as that tree in the pruned sequence that achieves minimum cost on test data. Because test misclassification cost measurement is subject to sampling error, uncertainty always remains regarding which tree in the pruning sequence is optimal. BFOS recommend selecting the "1 SE" tree, that is, the smallest tree with an estimated cost within one standard error of the minimum cost (or "0 SE") tree.
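Modern implementations expose the same cost-complexity machinery. The sketch below (ours, using scikit-learn rather than the original CART software) enumerates the pruning sequence of a maximal tree and selects the pruned tree with the lowest error on held-out test data, in the spirit of the procedure described above; the 1 SE rule could be applied to the same sequence.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Enumerate the nested cost-complexity pruning sequence of a maximal tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Refit one tree per alpha and select the pruned tree with the lowest test error.
scores = []
for alpha in path.ccp_alphas:
    t = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_tr, y_tr)
    scores.append((1.0 - t.score(X_te, y_te), alpha, t.get_n_leaves()))

err, alpha, leaves = min(scores)
print(f"best alpha={alpha:.5f}, test error={err:.3f}, terminal nodes={leaves}")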

10.9 Probability Trees

Probability trees have been recently discussed in a series of insightful articles elucidating their properties and seeking to improve their performance (see Provost and Domingos, 2000). The CART monograph includes what appears to be the first detailed discussion of probability trees and the CART software offers a dedicated splitting rule for the growing of "class probability trees." A key difference between classification trees and probability trees is that the latter want to keep splits that generate terminal node children assigned to the same class whereas the former will not (such a split accomplishes nothing so far as classification accuracy is concerned). A probability tree will also be pruned differently than its counterpart classification tree; therefore, the final structure of the two optimal trees can be somewhat different (although the differences are usually modest). The primary drawback of probability trees is that the probability estimates based on training data in the terminal nodes tend to be biased (e.g., towards 0 or 1 in the case of the binary target) with the bias increasing with the depth of the node. In the recent ML literature the use of the Laplace adjustment has been recommended to reduce this bias (Provost and Domingos, 2002). The CART monograph offers a somewhat more complex method to adjust the terminal node estimates that has rarely been discussed in the literature. Dubbed the "Breiman adjustment", it adjusts the estimated misclassification rate $r^*(t)$ of any terminal node upwards by
$$r^*(t) = r(t) + e / (q(t) + S), \qquad (32)$$
where $r(t)$ is the train sample estimate within the node, $q(t)$ is the fraction of the training sample in the node, and $S$ and $e$ are parameters that are solved for as a function of the difference between the train and test error rates for a given tree. In contrast to the Laplace method, the Breiman adjustment does not depend on the raw predicted probability in the node and the adjustment can be very small if the test data show that the tree is not overfit. Bloch, Olshen, and Walker (2002) [4] report very good performance for the Breiman adjustment in a series of empirical experiments.

10.10 Theoretical Foundations

The earliest work on decision trees was entirely atheoretical. Trees were proposed as methods that appeared to be useful, and conclusions regarding their properties were based on observing tree performance on a handful of empirical examples. While this approach remains popular in Machine Learning, the recent tendency in the discipline has been to reach for stronger theoretical foundations. The CART monograph tackles theory with sophistication, offering important technical insights and proofs for several key results. For example, the authors derive the expected misclassification rate for the maximal (largest possible) tree, showing that it is bounded from above by twice the Bayes rate. The authors also discuss the bias-variance tradeoff in trees and show how the bias is affected by the number of attributes. Based largely on the prior work of CART co-authors Richard Olshen and Charles Stone, the final three chapters of the monograph relate CART to theoretical work on nearest neighbors and show that as the sample size tends to infinity the following hold: (1) the estimates of the regression function converge to the true function, and (2) the risks of the terminal nodes converge to the risks of the corresponding Bayes rules. In other words, speaking informally, with large enough samples the CART tree will converge to the true function relating the target to its predictors and achieve the smallest cost possible (the Bayes rate). Practically speaking, such results may only be realized with sample sizes far larger than in common use today.

10.11 Selected Biographical Details

CART is often thought to have originated from the field of Statistics but this is only partially correct. Jerome Friedman completed his PhD in Physics at UC Berkeley and became leader of the Numerical Methods Group at the Stanford Linear Accelerator Center in 1972, where he focused on problems in computation. One of his most influential papers from 1975 presents a state-of-the-art algorithm for high speed searches for nearest neighbors in a database. Richard Olshen earned his BA at UC Berkeley and PhD in Statistics at Yale and focused his earliest work on large sample theory for recursive partitioning. He began his collaboration with Friedman after joining the Stanford Linear Accelerator Center in 1974. Leo Breiman earned his B.A. in Physics at the California Institute of Technology, his PhD in Mathematics at UC Berkeley, and made notable contributions to pure probability theory (Breiman, 1968) [5] while a Professor at UCLA. In 1967 he left academia for 13 years to work as an industrial consultant; during this time he encountered the military data analysis problems that inspired his contributions to CART. An interview with Leo Breiman discussing his career and personal life appears in Olshen (2001) [46].

Charles Stone earned his BA in mathematics at the California Institute of Technology, and his PhD in Statistics at Stanford. He pursued probability theory in his early years as an academic and is the author of several celebrated papers in probability theory and nonparametric regression. He worked with Breiman at UCLA and was drawn by Breiman into the research leading to CART in the early 1970s. Breiman and Friedman first met at an Interface conference in 1976, which shortly led to collaboration involving all four co-authors. The first outline of their book was produced in a memo dated 1978 and the completed CART monograph was published in 1984.

The four co-authors have each been distinguished for their work outside of CART. Stone, Breiman, and Friedman were each elected to the American Academy of Sciences (in 1993, 2001, and 2005, respectively) and the specific work for which they were honored can also be found on the academy's website. Olshen is a Fellow of the Institute of Mathematical Statistics, a Fellow of the IEEE, and a Fellow of the American Association for the Advancement of Science.

11 Conclusions

Data mining is a broad area that integrates techniques from several fields including machine learning, statistics, pattern recognition, artificial intelligence, and database systems, for the analysis of large volumes of data. There have been a large number of data mining algorithms rooted in these fields to perform different data analysis tasks. The 10 algorithms identified by the IEEE International Conference on Data Mining (ICDM) and presented in this article are among the most influential algorithms for classification, clustering, statistical learning, association analysis, and link mining. We hope this paper can inspire more researchers in data mining to further explore these algorithms, including their impact and new research issues.

Acknowledgements

The initiative of identifying the top 10 data mining algorithms started in May 2006 out of a discussion between Dr. Jiannong Cao in the Department of Computing at the Hong Kong Polytechnic University (PolyU) and Dr. Xindong Wu, when Dr. Wu was giving a seminar on 10 Challenging Problems in Data Mining Research [72] at PolyU. Dr. Wu and Dr. Kumar continued this discussion at KDD-06 in August 2006 with various people, and received very enthusiastic support.

Naila Elliott in the Department of Computer Science and Engineering at the University of Minnesota collected and compiled the algorithm nominations and voting results in the 3-step identification process. Yan Zhang in the Department of Computer Science at the University of Vermont converted the 10 section submissions in different formats into the same LaTeX format, which was a time-consuming process.


References

[1] Agrawal, R. and Srikant, R. "Fast Algorithms for Mining Association Rules", Proceedings of the 20th VLDB Conference, pages 487-499, 1994.

[2] A. Banerjee, S. Merugu, I. Dhillon and J. Ghosh. "Clustering with Bregman Divergences," Journal of Machine Learning Research (JMLR), Vol. 6, 1705-1749, 2005.

[3] Bezdek, J. C., Chuah, S. K., and Leep, D. 1986. "Generalized k-nearest neighbor rules". Fuzzy Sets Syst. 18, 3 (Apr. 1986), 237-256. DOI= http://dx.doi.org/10.1016/0165-0114(86)90004-7

[4] Bloch, D. A., Olshen, R. A., and Walker, M. G. (2002) "Risk Estimation for Classification Trees". Journal of Computational & Graphical Statistics, Vol. 11, 263-288.

[5] Breiman, L. (1968). Probability Theory. Addison-Wesley, Reading, MA. Republished (1991) in Classics of Mathematics. SIAM, Philadelphia.

[6] L. Breiman, "Prediction games and arcing classifiers", Neural Computation, 11(7):1493-1517, 1999.

[7] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). "Classification and Regression Trees." Belmont, CA: Wadsworth.

[8] S. Brin and L. Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks, 30(1-7), pp. 107-117, 1998.

[9] Cheung, D. W., Han, J., Ng, V., and Wong, C. Y., "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique", Proc. of the ACM SIGMOD International Conference on Management of Data, pages 13-23, 1996.

[10] Cost, S. and Salzberg, S.: "A weighted nearest neighbor algorithm for learning with symbolic features". Machine Learning 10 (1993) 57-78. (PEBLS: Parallel Exemplar-Based Learning System)

[11] Cover, T. and Hart, P. "Nearest neighbor pattern classification". IEEE Transactions on Information Theory, Vol. 13, Iss. 1, Jan 1967, Pages: 21-27.

[12] B. V. Dasarathy (editor). "Nearest neighbor (NN) norms: NN pattern classification techniques". IEEE Computer Society Press, 1991.

[13] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). "Maximum likelihood from incomplete data via the EM algorithm (with discussion)". Journal of the Royal Statistical Society B 39, 1-38.

[14] L. Devroye, L. Gyorfi and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York, 1996. ISBN 0-387-94618-7.

[15] I. S. Dhillon, Y. Guan, and B. Kulis, Kernel k-means: spectral clustering and normalized cuts. KDD 2004, pp. 551-556, 2004.

[16] T. G. Dietterich, "Machine learning: Four current directions", AI Magazine, 18(4): 97-136, 1997.

[17] Domingos, P. (1999). MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pp. 155-164.


[18] Domingos, P. and Pazzani, M. (1997) On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103-130.

[19] Fix, E. and Hodges, J. L., Jr. "Discriminatory analysis, nonparametric discrimination". USAF School of Aviation Medicine, Randolph Field, Tex., Project 21-49-004, Rept. 4, Contract AF41(128)-31, February 1951.

[20] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting", Journal of Computer and System Sciences, 55(1):119-139, 1997.

[21] Friedman, J. H., Bentley, J. L., and Finkel, R. A. (1977) "An algorithm for finding best matches in logarithmic time," ACM Trans. Math. Software 3, 209. Also available as Stanford Linear Accelerator Center Rep. SLAC-PUB-1549, Feb. 1975.

[22] Friedman, J. H., Kohavi, R. and Yun, Y. (1996) Lazy Decision Trees. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 717-724, San Francisco, CA. AAAI Press/MIT Press.

[23] Friedman, N., Geiger, D., and Goldszmidt, M. (1997) Bayesian network classifiers. Machine Learning, 29, 131-163.

[24] Gates, G. W. (1972). "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory 18: 431-433.

[25] G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, 1983.

[26] Han, E. Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification. PhD thesis, University of Minnesota, October 1999.

[27] Hand, D. J. and Yu, K. (2001) Idiot's Bayes - not so stupid after all? International Statistical Review, 69, 385-398.

[28] R. M. Gray and D. L. Neuhoff, "Quantization," IEEE Transactions on Information Theory, Vol. 44, No. 6, pp. 2325-2384, 1998.

[29] Hart, P. (1968). "The condensed nearest neighbor rule". IEEE Trans. on Inform. Th., 14, 515-516.

[30] Han, J., Pei, J. and Yin, Y. "Mining Frequent Patterns without Candidate Generation", Proc. of ACM SIGMOD International Conference on Management of Data, pages 1-12, 2000.

[31] Hastie, T. and Tibshirani, R. 1996. "Discriminant Adaptive Nearest Neighbor Classification". IEEE Trans. Pattern Anal. Mach. Intell. 18, 6 (Jun. 1996), 607-616.

[32] J. Friedman, T. Hastie and R. Tibshirani, "Additive logistic regression: A statistical view of boosting (with discussions)", The Annals of Statistics, 28(2):337-407, 2000.

[33] R. Herbrich, T. Graepel and K. Obermayer, "Rank Boundaries for Ordinal Regression", Advances in Margin Classifiers, pages 115-132, 2000.

[34] Hunt, E. B., Marin, J., and Stone, P. J. (1966). "Experiments in Induction". New York, NY: Academic Press.


[35] Inokuchi, A., Washio, T. and Motoda, H., "General Framework for Mining Frequent Subgraphs from Labeled Graphs", Fundamenta Informaticae, 66(1-2): 53-82, 2005.

[36] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice Hall, 1988.

[37] Michihiro Kuramochi and George Karypis, "Gene Classification using Expression Profiles: A Feasibility Study", International Journal of Artificial Intelligence Tools (IJAIT), 14(4):641-660, 2005.

[38] A. N. Langville and C. D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006.

[39] B. Liu. Web Data Mining: Exploring Hyperlinks, Contents and Usage Data. Springer, 2007.

[40] S. P. Lloyd, "Least squares quantization in PCM," unpublished Bell Lab. Tech. Note, portions presented at the Institute of Mathematical Statistics Meet., Atlantic City, NJ, Sept. 1957. Also, IEEE Trans. Inform. Theory (Special Issue on Quantization), vol. IT-28, pp. 129-137, Mar. 1982.

[41] McLachlan, G. J. (1987). "On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture". Applied Statistics 36, 318-324.

[42] McLachlan, G. J. and Krishnan, T. (1997). "The EM Algorithm and Extensions". New York: Wiley.

[43] McLachlan, G. J. and Peel, D. (2000). "Finite Mixture Models". Wiley, New York.

[44] Messenger, R. C. and Mandell, M. L. (1972) A model search technique for predictive nominal scale multivariate analysis. Journal of the American Statistical Association, 67, 768-772.

[45] Morishita, S. and Sese, J., "Traversing Itemset Lattices with Statistical Metric Pruning", Proc. of PODS'00, pages 226-236, 2000.

[46] Olshen, R. (2001) "A Conversation with Leo Breiman", Statistical Science, Vol. 16, No. 2, 184-198.

[47] L. Page, S. Brin, R. Motwani and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-0120, Computer Science Department, Stanford University, 1999.

[48] Quinlan, J. R. (1979). "Discovering rules by induction from large collections of examples." In D. Michie (Ed.), Expert Systems in the Micro Electronic Age. Edinburgh, UK: Edinburgh University Press.

[49] Quinlan, R. (1989) Unknown attribute values in induction. In Proceedings of the Sixth International Workshop on Machine Learning, 164-168.

[50] Quinlan, J. R. (1993). "C4.5: Programs for Machine Learning." San Mateo, CA: Morgan Kaufmann Publishers.

[51] L. Reyzin and R. E. Schapire, "How boosting the margin can also boost classifier complexity", Proceedings of the 23rd International Conference on Machine Learning, pages 753-760, Pittsburgh, PA, 2006.


[52] Ridgeway, G., Madigan, D., and Richardson, T. (1998) Interpretable boosted naive Bayes classification. Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, ed. R. Agrawal, P. Stolorz, and G. Piatetsky-Shapiro, AAAI Press, Menlo Park, California, 101-104.

[53] R. E. Schapire, "The strength of weak learnability", Machine Learning, 5(2):197-227, 1990.

[54] R. E. Schapire, Y. Freund, P. Bartlett and W. S. Lee, "Boosting the margin: A new explanation for the effectiveness of voting methods", The Annals of Statistics, 26(5):1651-1686, 1998.

[55] R. E. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions", Machine Learning, 37(3):297-336, 1999.

[56] B. Scholkopf and A. J. Smola. "Learning with Kernels", MIT Press, 2002.

[57] Seidl, T. and Kriegel, H. 1998. "Optimal multi-step k-nearest neighbor search". In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (Seattle, Washington, United States, June 01-04, 1998). A. Tiwary and M. Franklin, Eds. SIGMOD '98. ACM Press, New York, NY, 154-165.

[58] Srikant, R. and Agrawal, R. "Mining Generalized Association Rules", Proceedings of the 21st VLDB Conference, pages 407-419, 1995.

[59] M. Steinbach, G. Karypis and V. Kumar. "A comparison of document clustering techniques", Proc. KDD Workshop on Text Mining, 2000.

[60] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Pearson Addison-Wesley, 2006.

[61] Ting, K. M. (2002) An instance-weighting method to induce cost-sensitive trees. IEEE Trans. Knowledge and Data Engineering, 14, 659-665.

[62] Godfried T. Toussaint, "Proximity graphs for nearest neighbor decision rules: recent progress," Interface-2002, 34th Symposium on Computing and Statistics (theme: Geoscience and Remote Sensing), Ritz-Carlton Hotel, Montreal, Canada, April 17-20, 2002.

[63] Godfried T. Toussaint, "Open Problems in Geometric Methods for Instance-Based Learning". JCDCG 2002: 273-283.

[64] Ivor W. Tsang, James T. Kwok and Pak-Ming Cheung, "Core vector machines: Fast SVM training on very large data sets", Journal of Machine Learning Research, 6:363-392, 2005.

[65] Uno, T., Asai, T., Uchida, Y. and Arimura, H., "An Efficient Algorithm for Enumerating Frequent Closed Patterns in Transaction Databases", Proc. of the 7th International Conference on Discovery Science, LNAI 3245, Springer, pages 16-30, 2004.

[66] V. Vapnik, "The Nature of Statistical Learning Theory", Springer-Verlag, New York, 1995.

[67] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features", Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 511-518, Kauai, HI, 2001.


[68] T. Washio, K. Nakanishi and H. Motoda, "Association Rules Based on Levelwise Subspace Clustering", Proceedings of the 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 692-700, LNAI 3721, Springer, 2005.

[69] S. Wasserman and K. Faust. Social Network Analysis. Cambridge University Press, 1994.

[70] D. Wettschereck, D. Aha, and T. Mohri. "A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms". Artificial Intelligence Review, 11:273-314, 1997.

[71] Wilson, D. L. "Asymptotic Properties of Nearest Neighbor Rules Using Edited Data". IEEE Transactions on Systems, Man, and Cybernetics, 2:408-420, 1972.

[72] Qiang Yang and Xindong Wu, "10 Challenging Problems in Data Mining Research", International Journal of Information Technology & Decision Making, Vol. 5, No. 4, 2006, 597-604.

[73] Yan, X. and Han, J. "gSpan: Graph-based Substructure Pattern Mining", Proceedings of ICDM'02, pages 721-724, 2002.

[74] P. S. Yu, X. Li and B. Liu. Adding the Temporal Dimension to Search - A Case Study in Publication Search. In Proc. of Web Intelligence (WI'05), 2005.
