3. Classification Methods


  • 3. Classification Methods
    Patterns and Models
    Regression, NBC
    k-Nearest Neighbors
    Decision Trees and Rules
    Large-size data

  • Models and Patterns
    A model is a global description of data, or an abstract representation of a real-world process.
    Estimating parameters of a model; data-driven model building.
    Examples: regression, graphical models (BN), HMM.
    A pattern describes some local aspect of the data.
    Patterns in data matrices: predicates such as (age < 40) ^ (income < 10).
    Patterns for strings (ASCII characters, DNA alphabet).
    Pattern discovery: rules.

  • Performance Measures
    Generality: how many instances are covered?
    Applicability: is it useful? ("All husbands are male" is general and accurate but not useful.)
    Accuracy: is it always correct? If not, how often?
    Comprehensibility: is it easy to understand? (a subjective measure)

  • Forms of Knowledge
    Concepts: probabilistic, logical (propositional/predicate), functional
    Rules
    Taxonomies and hierarchies: dendrograms, decision trees
    Clusters
    Structures and weights/probabilities: ANN, BN

  • Induction from Data
    Inferring knowledge from data - generalization.
    Supervised vs. unsupervised learning.
    Some graphical illustrations of learning tasks (regression, classification, clustering).
    Any other types of learning?
    Compare: the task of deduction - inferring information/facts that are a logical consequence of facts in a database.
    Who is John's grandpa? (deduced from, e.g., "Mary is John's mother" and "Joe is Mary's father")
    Deductive databases: extending the RDBMS.

  • The Classification Problem
    From a set of labeled training data, build a system (a classifier) for predicting the class of future data instances (tuples).
    A related problem is to build a system from training data to predict the value of an attribute (feature) of future data instances.

  • What is a bad classifier?
    Some of the simplest classifiers: table lookup.
    What if x cannot be found in the training data? We give up!? Or, we can ...
    A simple classifier Cs can be built as a reference: if x can be found in the table (the training data), return its class; otherwise, what should it return?
    A bad classifier is one that does worse than Cs.
    Do we need to learn a classifier for data of one class?
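    As a concrete illustration, a minimal Python sketch of such a reference classifier Cs (the names and the majority-class fallback for unseen instances are illustrative choices, not prescribed by the slides):

    from collections import Counter

    def build_cs(training):
        """training: list of (x, y) pairs with hashable x; returns a classify function."""
        table = {x: y for x, y in training}                       # plain table lookup
        majority = Counter(y for _, y in training).most_common(1)[0][0]
        def cs(x):
            return table.get(x, majority)                         # unseen x -> majority class
        return cs

    clf = build_cs([((1, 0), "yes"), ((0, 1), "no"), ((1, 1), "no")])
    print(clf((1, 0)), clf((0, 0)))                               # 'yes' (lookup), 'no' (fallback)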

  • Many Techniques
    Decision trees
    Linear regression
    Neural networks
    k-nearest neighbour
    Naïve Bayes classifiers
    Support Vector Machines
    and many more ...

  • Regression for Numeric Prediction
    Linear regression is a statistical technique applicable when the class and all the attributes are numeric.
    y = α + βx, where α and β are regression coefficients.
    We use the training instances to find α and β by minimizing the sum of squared errors (least squares):
    SSE = Σ (y_i − ŷ_i)² = Σ (y_i − α − β·x_i)²
    Extensions: multiple regression, piecewise linear regression, polynomial regression.
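    A small sketch of fitting α and β by least squares (closed-form solution; the toy numbers are made up for illustration):

    def fit_line(xs, ys):
        """Return (alpha, beta) minimizing SSE = sum (y_i - alpha - beta*x_i)^2."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                / sum((x - mx) ** 2 for x in xs))
        alpha = my - beta * mx
        return alpha, beta

    alpha, beta = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
    print(alpha, beta)          # about 0.15 and 1.94, i.e. y ≈ 0.15 + 1.94 x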

  • Nearest Neighbor
    Also called instance-based learning.
    Algorithm: given a new instance x, find its nearest neighbor (x', y') in the training data; return y' as the class of x.
    Distance measures; normalization?!
    Some interesting questions: What is its time complexity? Does it learn?

  • Nearest Neighbor (2)
    Dealing with noise: k-nearest neighbor - use more than 1 neighbor. How many neighbors?
    Weighted nearest neighbors.
    How to speed up? Huge storage.
    Use representatives (a problem of instance selection): sampling, grid, clustering.
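    A compact k-nearest-neighbor sketch (all names and the toy data are illustrative): min-max normalization of each attribute, Euclidean distance, and a majority vote among the k nearest training instances:

    from collections import Counter
    import math

    def knn_classify(training, x, k=3):
        """training: list of (numeric vector, label); x: a vector of the same length."""
        cols = list(zip(*[v for v, _ in training]))
        lo, hi = [min(c) for c in cols], [max(c) for c in cols]
        def norm(v):                                  # min-max normalization per attribute
            return [(a - l) / (h - l) if h > l else 0.0
                    for a, l, h in zip(v, lo, hi)]
        nx = norm(x)
        dists = sorted((math.dist(norm(v), nx), y) for v, y in training)
        votes = Counter(y for _, y in dists[:k])      # vote among the k nearest
        return votes.most_common(1)[0][0]

    data = [([170, 60], "A"), ([180, 80], "B"), ([165, 55], "A"), ([178, 77], "B")]
    print(knn_classify(data, [172, 65], k=3))         # -> 'A'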

  • Naïve Bayes Classification
    This is a direct application of Bayes' rule: P(C|x) = P(x|C)P(C) / P(x), where x is a vector (x1, x2, ..., xn).
    That's the best classifier you can ever build - you don't even need to select features; it takes care of that automatically.
    But there are problems: there are only a limited number of instances, so how do we estimate P(x|C)?

  • NBC (2)
    Assume conditional independence between the xi's. We then have
    P(C|x) ∝ P(x1|C) · P(x2|C) · ... · P(xn|C) · P(C)
    How good is it in reality?
    Let's build one NBC for a very simple data set: estimate the priors and conditional probabilities from the training data.
    P(C=1) = ? P(C=2) = ? P(x1=1|C=1) = ? P(x1=2|C=1) = ? ...
    What is the class for x = (1,2,1)? Compare P(1|x) ∝ P(x1=1|1) P(x2=2|1) P(x3=1|1) P(1) with P(2|x).
    What is the class for (1,2,2)?
    (A coded sketch of this computation, on the Golf data, follows the table below.)

  • Example of NBC

  • Golf Data

    Outlook    Temp  Humidity  Windy  Class
    Sunny      Hot   High      No     Yes
    Sunny      Hot   High      Yes    Yes
    Overcast   Hot   High      No     No
    Rain       Mild  Normal    No     No
    Rain       Cool  Normal    No     No
    Rain       Cool  Normal    Yes    Yes
    Overcast   Cool  Normal    Yes    No
    Sunny      Mild  High      No     Yes
    Sunny      Cool  Normal    No     No
    Rain       Mild  Normal    No     No
    Sunny      Mild  Normal    Yes    No
    Overcast   Mild  High      Yes    No
    Overcast   Hot   Normal    No     No
    Rain       Mild  High      Yes    Yes
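    A minimal naïve Bayes sketch on the Golf data above (the helper names and the query day are illustrative; unsmoothed counts are used): estimate P(C) and each P(x_i | C) by counting, then pick the class with the largest product.

    from collections import Counter, defaultdict

    golf = [  # ((Outlook, Temp, Humidity, Windy), Class), copied from the table above
        (("Sunny", "Hot", "High", "No"), "Yes"),
        (("Sunny", "Hot", "High", "Yes"), "Yes"),
        (("Overcast", "Hot", "High", "No"), "No"),
        (("Rain", "Mild", "Normal", "No"), "No"),
        (("Rain", "Cool", "Normal", "No"), "No"),
        (("Rain", "Cool", "Normal", "Yes"), "Yes"),
        (("Overcast", "Cool", "Normal", "Yes"), "No"),
        (("Sunny", "Mild", "High", "No"), "Yes"),
        (("Sunny", "Cool", "Normal", "No"), "No"),
        (("Rain", "Mild", "Normal", "No"), "No"),
        (("Sunny", "Mild", "Normal", "Yes"), "No"),
        (("Overcast", "Mild", "High", "Yes"), "No"),
        (("Overcast", "Hot", "Normal", "No"), "No"),
        (("Rain", "Mild", "High", "Yes"), "Yes"),
    ]

    prior = Counter(c for _, c in golf)               # class counts for P(C)
    cond = defaultdict(Counter)                       # (attribute index, class) -> value counts
    for x, c in golf:
        for i, v in enumerate(x):
            cond[(i, c)][v] += 1

    def classify(x):
        scores = {}
        for c, n_c in prior.items():
            p = n_c / len(golf)                       # P(C)
            for i, v in enumerate(x):
                p *= cond[(i, c)][v] / n_c            # P(x_i | C), zero if value unseen for C
            scores[c] = p
        return max(scores, key=scores.get), scores

    print(classify(("Sunny", "Cool", "High", "Yes")))  # a hypothetical new day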

  • Decision Trees
    A decision tree:

    Outlook?
      sunny    -> Humidity?  (high -> NO, normal -> YES)
      overcast -> YES
      rain     -> Wind?      (strong -> NO, weak -> YES)

  • How to 'grow' a tree?
    Randomly? Random Forests (Breiman, 2001).
    What are the criteria for building a tree? Accurate; compact.
    A straightforward way to grow one is:
    Pick an attribute.
    Split the data according to its values.
    Recursively repeat the first two steps until no data are left, or no features are left.
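    A skeleton of that recursive procedure (illustrative; `pick` stands for whatever attribute-selection criterion is plugged in, e.g. information gain):

    from collections import Counter

    def grow(data, attrs, pick):
        """data: list of (dict of attribute values, class); attrs: attribute names."""
        if not data:
            return None                                   # no data left
        classes = [c for _, c in data]
        if len(set(classes)) == 1 or not attrs:           # pure node, or no feature left
            return Counter(classes).most_common(1)[0][0]  # leaf: (majority) class
        a = pick(data, attrs)                             # 1. pick an attribute
        children = {}
        for v in {x[a] for x, _ in data}:                 # 2. split on its values
            subset = [(x, c) for x, c in data if x[a] == v]
            children[v] = grow(subset, [b for b in attrs if b != a], pick)   # 3. recurse
        return (a, children)                              # internal node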

  • Discussion
    There are many possible trees - let's try it on the golf data.
    How do we find the most compact one that is consistent with the data?
    Why the most compact? Occam's razor principle.
    Issue of efficiency w.r.t. optimality: one attribute at a time, or ...

  • Growing a good tree efficiently
    The heuristic: find commonality in feature values associated with class values, in order to build a compact tree generalized from the data.
    In other words, we look for features and splits that lead to pure leaf nodes.
    Is it a good heuristic? What do you think? How would you judge it?
    Is it really efficient? How would you implement it?

  • Let's grow one
    Measuring the purity of a data set: entropy.
    Information gain (see the brief review).
    Choose the feature with the maximum gain.

  • Different numbers of values
    Different attributes can have different numbers of values.
    Some treatments: removing useless attributes before learning; binarization; discretization.
    Gain ratio is another practical solution:
    Gain(A) = Info(root) − Info_A(T)
    Split-Info(A) = −Σ_i (|T_i|/|T|) · log2(|T_i|/|T|)
    Gain-ratio(A) = Gain(A) / Split-Info(A)
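    A sketch of these measures in Python (function names are illustrative), for examples given as (vector, class) pairs and a split on attribute index i:

    from collections import Counter
    from math import log2

    def entropy(examples):
        n = len(examples)
        return -sum(k / n * log2(k / n)
                    for k in Counter(c for _, c in examples).values())

    def partitions(examples, i):                  # the subsets T_i produced by splitting on i
        groups = {}
        for x, c in examples:
            groups.setdefault(x[i], []).append((x, c))
        return groups.values()

    def gain(examples, i):                        # root info minus weighted info after the split
        n = len(examples)
        return entropy(examples) - sum(len(t) / n * entropy(t)
                                       for t in partitions(examples, i))

    def split_info(examples, i):                  # -sum (|T_i|/|T|) log2(|T_i|/|T|)
        n = len(examples)
        return -sum(len(t) / n * log2(len(t) / n) for t in partitions(examples, i))

    def gain_ratio(examples, i):
        si = split_info(examples, i)
        return gain(examples, i) / si if si > 0 else 0.0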

  • Another kind of problem
    A difficult problem - why is it difficult? Similar ones are the parity and majority problems.
    The XOR problem:

    x1  x2  class
    0   0   0
    0   1   1
    1   0   1
    1   1   0
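    A quick check of why this is hard for one-attribute-at-a-time splitting (the purity measures are re-defined here so the snippet stands alone): on the XOR data, the information gain of either attribute is exactly zero, so neither single split looks useful.

    from collections import Counter
    from math import log2

    xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    def entropy(examples):
        n = len(examples)
        return -sum(k / n * log2(k / n)
                    for k in Counter(c for _, c in examples).values())

    def gain(examples, i):
        n, groups = len(examples), {}
        for x, c in examples:
            groups.setdefault(x[i], []).append((x, c))
        return entropy(examples) - sum(len(t) / n * entropy(t) for t in groups.values())

    print(gain(xor_data, 0), gain(xor_data, 1))   # 0.0 0.0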

  • Tree Pruning
    Overfitting: the model fits the training data too well, but won't work well on unseen data.
    Pruning is an effective approach to avoid overfitting and to obtain a more compact tree (easier to understand).
    Two general ways to prune:
    Pre-pruning: stop splitting further when a division makes no significant difference in classification accuracy.
    Post-pruning: grow the full tree, then trim it back.

  • Rules from Decision Trees
    Two types of rule sets: order sensitive (more compact, less efficient) and order insensitive.
    The most straightforward way is the class-based method: group rules according to classes, then select the most general rules (or remove redundant ones).
    Data-based method: select one rule at a time (keep the most general one), then work on the remaining data until all data are covered.

  • Variants of Decision Trees and Rules
    Tree stumps.
    Holte's 1R rules (1992): for each attribute A, sort according to its values v; find the most frequent class value c for each v (breaking ties by coin flipping); output the most accurate rule set of the form "if A = v then c".
    An example (the Golf data).
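    A sketch of 1R-style rule building (names illustrative; ties are broken arbitrarily here rather than by coin flipping):

    from collections import Counter, defaultdict

    def one_r(data, i):
        """data: list of (vector, class); builds rules 'if A_i = v then c' for attribute i."""
        by_value = defaultdict(Counter)
        for x, c in data:
            by_value[x[i]][c] += 1                     # class counts per attribute value
        rules = {v: cnt.most_common(1)[0][0] for v, cnt in by_value.items()}
        correct = sum(cnt[rules[v]] for v, cnt in by_value.items())
        return rules, correct / len(data)              # rule set and its training accuracy

    def best_one_r(data, n_attrs):
        """Output the rule set of the single most accurate attribute."""
        results = [(acc, i, rules) for i in range(n_attrs)
                   for rules, acc in [one_r(data, i)]]
        acc, i, rules = max(results)
        return i, rules, acc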

  • Handling Large-Size Data
    When data simply cannot fit in memory - is that a big problem?
    Three representative approaches:
    Smart data structures to avoid unnecessary recalculation: hash trees; SPRINT.
    Sufficient statistics: the AVC-set (Attribute-Value, Class label) summarizes the class distribution for each attribute. Example: RainForest.
    Parallel processing: make the data parallelizable.
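    A sketch of building AVC-sets in one pass (names illustrative): for every attribute, record the class distribution of each of its values, which is enough to evaluate candidate splits without keeping the tuples themselves in memory.

    from collections import Counter, defaultdict

    def build_avc_sets(rows, n_attrs):
        """rows: iterable of (vector, class), e.g. streamed from disk."""
        avc = {i: defaultdict(Counter) for i in range(n_attrs)}
        for x, c in rows:
            for i in range(n_attrs):
                avc[i][x[i]][c] += 1                 # (attribute, value, class) count
        return avc

    avc = build_avc_sets([(("Sunny", "Hot"), "Yes"), (("Rain", "Mild"), "No"),
                          (("Sunny", "Mild"), "No")], 2)
    print(dict(avc[0]))                              # class distribution per value of attribute 0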

  • Ensemble Methods
    A group of classifiers: hybrid (stacking) or of a single type.
    Strong vs. weak learners.
    A good ensemble needs accuracy and diversity.
    Some major approaches to forming ensembles: bagging, boosting.
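    A minimal bagging sketch (illustrative): train one classifier per bootstrap sample of the training data and predict by majority vote. `learner` is assumed to be any function that takes a training list and returns a classify(x) function.

    import random
    from collections import Counter

    def bagging(training, learner, n_models=11, seed=0):
        rng = random.Random(seed)
        models = []
        for _ in range(n_models):
            sample = [rng.choice(training) for _ in training]    # bootstrap: sample with replacement
            models.append(learner(sample))
        def vote(x):                                             # majority vote over the ensemble
            return Counter(m(x) for m in models).most_common(1)[0][0]
        return vote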

  • Bibliography
    I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
    M. Kantardzic. Data Mining: Concepts, Models, Methods, and Algorithms. IEEE, 2003.
    J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
    D. Hand, H. Mannila, P. Smyth. Principles of Data Mining. MIT Press, 2001.
    T. G. Dietterich. Ensemble Methods in Machine Learning. In J. Kittler and F. Roli (eds.), 1st Int'l Workshop on Multiple Classifier Systems, pp. 1-15, Springer-Verlag, 2000.

    Notes: SSE = sum of squared errors. Input variable = explanatory variable / independent variable; output variable = outcome variable / dependent variable. Nearest neighbor: an extension of table lookup?

    Weighted nearest neighbor: give nearer neighbors more weight.
    http://craig.nevill-manning.com/~nevill/publications/IDA95.pdf