3. Classification Methods


  • 3. Classification Methods
    Patterns and Models
    Regression, NBC
    k-Nearest Neighbors
    Decision Trees and Rules
    Large-size data

  • Models and Patterns
    A model is a global description of data, or an abstract representation of a real-world process.
    Estimating parameters of a model; data-driven model building.
    Examples: regression, graphical models (BN), HMM.
    A pattern describes some local aspect of the data.
    Patterns in data matrices: predicates such as (age < 40) ^ (income < 10).
    Patterns for strings (ASCII characters, DNA alphabet).
    Pattern discovery: rules.

  • Performance Measures
    Generality: how many instances are covered?
    Applicability: is it useful? ("All husbands are male" is general and accurate but not useful.)
    Accuracy: is it always correct? If not, how often?
    Comprehensibility: is it easy to understand? (a subjective measure)

  • Forms of Knowledge
    Concepts: probabilistic, logical (propositional/predicate), functional
    Rules
    Taxonomies and hierarchies: dendrograms, decision trees
    Clusters
    Structures and weights/probabilities: ANN, BN

  • Induction from Data
    Inferring knowledge from data - generalization.
    Supervised vs. unsupervised learning.
    Some graphical illustrations of learning tasks (regression, classification, clustering).
    Any other types of learning?
    Compare: the task of deduction - inferring information/facts that are a logical consequence of facts in a database.
    Who is John's grandpa? (deduced from, e.g., "Mary is John's mother" and "Joe is Mary's father")
    Deductive databases: extending the RDBMS.

  • The Classification Problem
    From a set of labeled training data, build a system (a classifier) for predicting the class of future data instances (tuples).
    A related problem is to build a system from training data to predict the value of an attribute (feature) of future data instances.

  • What is a bad classifier?
    Some of the simplest classifiers: table lookup.
    What if x cannot be found in the training data? We give up!? Or, we can ...
    A simple classifier Cs can be built as a reference: if x can be found in the table (the training data), return its class; otherwise, what should it return?
    A bad classifier is one that does worse than Cs.
    Do we need to learn a classifier for data of one class?
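    As a concrete illustration, a minimal Python sketch of such a reference classifier Cs (the names and the majority-class fallback for unseen instances are illustrative choices, not prescribed by the slides):

    from collections import Counter

    def build_cs(training):
        """training: list of (x, y) pairs with hashable x; returns a classify function."""
        table = {x: y for x, y in training}                       # plain table lookup
        majority = Counter(y for _, y in training).most_common(1)[0][0]
        def cs(x):
            return table.get(x, majority)                         # unseen x -> majority class
        return cs

    clf = build_cs([((1, 0), "yes"), ((0, 1), "no"), ((1, 1), "no")])
    print(clf((1, 0)), clf((0, 0)))                               # 'yes' (lookup), 'no' (fallback)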

  • Many Techniques
    Decision trees
    Linear regression
    Neural networks
    k-nearest neighbour
    Naïve Bayes classifiers
    Support Vector Machines
    and many more ...

  • Regression for Numeric Prediction
    Linear regression is a statistical technique applicable when the class and all the attributes are numeric.
    y = α + βx, where α and β are regression coefficients.
    We use the training instances to find α and β by minimizing the sum of squared errors (least squares):
    SSE = Σ (y_i − ŷ_i)² = Σ (y_i − α − β·x_i)²
    Extensions: multiple regression, piecewise linear regression, polynomial regression.
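    A small sketch of fitting α and β by least squares (closed-form solution; the toy numbers are made up for illustration):

    def fit_line(xs, ys):
        """Return (alpha, beta) minimizing SSE = sum (y_i - alpha - beta*x_i)^2."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        beta = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                / sum((x - mx) ** 2 for x in xs))
        alpha = my - beta * mx
        return alpha, beta

    alpha, beta = fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
    print(alpha, beta)          # about 0.15 and 1.94, i.e. y ≈ 0.15 + 1.94 x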

  • Nearest Neighbor
    Also called instance-based learning.
    Algorithm: given a new instance x, find its nearest neighbor (x', y') in the training data; return y' as the class of x.
    Distance measures; normalization?!
    Some interesting questions: What is its time complexity? Does it learn?

  • Nearest Neighbor (2)
    Dealing with noise: k-nearest neighbor - use more than 1 neighbor. How many neighbors?
    Weighted nearest neighbors.
    How to speed up? Huge storage.
    Use representatives (a problem of instance selection): sampling, grid, clustering.
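    A compact k-nearest-neighbor sketch (all names and the toy data are illustrative): min-max normalization of each attribute, Euclidean distance, and a majority vote among the k nearest training instances:

    from collections import Counter
    import math

    def knn_classify(training, x, k=3):
        """training: list of (numeric vector, label); x: a vector of the same length."""
        cols = list(zip(*[v for v, _ in training]))
        lo, hi = [min(c) for c in cols], [max(c) for c in cols]
        def norm(v):                                  # min-max normalization per attribute
            return [(a - l) / (h - l) if h > l else 0.0
                    for a, l, h in zip(v, lo, hi)]
        nx = norm(x)
        dists = sorted((math.dist(norm(v), nx), y) for v, y in training)
        votes = Counter(y for _, y in dists[:k])      # vote among the k nearest
        return votes.most_common(1)[0][0]

    data = [([170, 60], "A"), ([180, 80], "B"), ([165, 55], "A"), ([178, 77], "B")]
    print(knn_classify(data, [172, 65], k=3))         # -> 'A'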

  • Naïve Bayes Classification
    This is a direct application of Bayes' rule: P(C|x) = P(x|C)P(C) / P(x), where x is a vector (x1, x2, ..., xn).
    That's the best classifier you can ever build - you don't even need to select features; it takes care of that automatically.
    But there are problems: there are only a limited number of instances, so how do we estimate P(x|C)?

  • NBC (2)
    Assume conditional independence between the xi's. We then have
    P(C|x) ∝ P(x1|C) · P(x2|C) · ... · P(xn|C) · P(C)
    How good is it in reality?
    Let's build one NBC for a very simple data set: estimate the priors and conditional probabilities from the training data.
    P(C=1) = ? P(C=2) = ? P(x1=1|C=1) = ? P(x1=2|C=1) = ? ...
    What is the class for x = (1,2,1)? Compare P(1|x) ∝ P(x1=1|1) P(x2=2|1) P(x3=1|1) P(1) with P(2|x).
    What is the class for (1,2,2)?
    (A coded sketch of this computation, on the Golf data, follows the table below.)

  • Example of NBC

  • Golf Data

    Outlook    Temp  Humidity  Windy  Class
    Sunny      Hot   High      No     Yes
    Sunny      Hot   High      Yes    Yes
    Overcast   Hot   High      No     No
    Rain       Mild  Normal    No     No
    Rain       Cool  Normal    No     No
    Rain       Cool  Normal    Yes    Yes
    Overcast   Cool  Normal    Yes    No
    Sunny      Mild  High      No     Yes
    Sunny      Cool  Normal    No     No
    Rain       Mild  Normal    No     No
    Sunny      Mild  Normal    Yes    No
    Overcast   Mild  High      Yes    No
    Overcast   Hot   Normal    No     No
    Rain       Mild  High      Yes    Yes
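    A minimal naïve Bayes sketch on the Golf data above (the helper names and the query day are illustrative; unsmoothed counts are used): estimate P(C) and each P(x_i | C) by counting, then pick the class with the largest product.

    from collections import Counter, defaultdict

    golf = [  # ((Outlook, Temp, Humidity, Windy), Class), copied from the table above
        (("Sunny", "Hot", "High", "No"), "Yes"),
        (("Sunny", "Hot", "High", "Yes"), "Yes"),
        (("Overcast", "Hot", "High", "No"), "No"),
        (("Rain", "Mild", "Normal", "No"), "No"),
        (("Rain", "Cool", "Normal", "No"), "No"),
        (("Rain", "Cool", "Normal", "Yes"), "Yes"),
        (("Overcast", "Cool", "Normal", "Yes"), "No"),
        (("Sunny", "Mild", "High", "No"), "Yes"),
        (("Sunny", "Cool", "Normal", "No"), "No"),
        (("Rain", "Mild", "Normal", "No"), "No"),
        (("Sunny", "Mild", "Normal", "Yes"), "No"),
        (("Overcast", "Mild", "High", "Yes"), "No"),
        (("Overcast", "Hot", "Normal", "No"), "No"),
        (("Rain", "Mild", "High", "Yes"), "Yes"),
    ]

    prior = Counter(c for _, c in golf)               # class counts for P(C)
    cond = defaultdict(Counter)                       # (attribute index, class) -> value counts
    for x, c in golf:
        for i, v in enumerate(x):
            cond[(i, c)][v] += 1

    def classify(x):
        scores = {}
        for c, n_c in prior.items():
            p = n_c / len(golf)                       # P(C)
            for i, v in enumerate(x):
                p *= cond[(i, c)][v] / n_c            # P(x_i | C), zero if value unseen for C
            scores[c] = p
        return max(scores, key=scores.get), scores

    print(classify(("Sunny", "Cool", "High", "Yes")))  # a hypothetical new day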

  • Decision Trees
    A decision tree:

    Outlook?
      sunny    -> Humidity?  (high -> NO, normal -> YES)
      overcast -> YES
      rain     -> Wind?      (strong -> NO, weak -> YES)

  • How to 'grow' a tree?
    Randomly? Random Forests (Breiman, 2001).
    What are the criteria for building a tree? Accurate; compact.
    A straightforward way to grow one is:
    Pick an attribute.
    Split the data according to its values.
    Recursively repeat the first two steps until no data are left, or no features are left.
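    A skeleton of that recursive procedure (illustrative; `pick` stands for whatever attribute-selection criterion is plugged in, e.g. information gain):

    from collections import Counter

    def grow(data, attrs, pick):
        """data: list of (dict of attribute values, class); attrs: attribute names."""
        if not data:
            return None                                   # no data left
        classes = [c for _, c in data]
        if len(set(classes)) == 1 or not attrs:           # pure node, or no feature left
            return Counter(classes).most_common(1)[0][0]  # leaf: (majority) class
        a = pick(data, attrs)                             # 1. pick an attribute
        children = {}
        for v in {x[a] for x, _ in data}:                 # 2. split on its values
            subset = [(x, c) for x, c in data if x[a] == v]
            children[v] = grow(subset, [b for b in attrs if b != a], pick)   # 3. recurse
        return (a, children)                              # internal node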

  • Discussion
    There are many possible trees - let's try it on the golf data.
    How do we find the most compact one that is consistent with the data?
    Why the most compact? Occam's razor principle.
    Issue of efficiency w.r.t. optimality: one attribute at a time, or ...

  • Growing a good tree efficiently
    The heuristic: find commonality in feature values associated with class values, in order to build a compact tree generalized from the data.
    In other words, we look for features and splits that lead to pure leaf nodes.
    Is it a good heuristic? What do you think? How would you judge it?
    Is it really efficient? How would you implement it?

  • Let's grow one
    Measuring the purity of a data set: entropy.
    Information gain (see the brief review).
    Choose the feature with the maximum gain.

  • Different numbers of values
    Different attributes can have different numbers of values.
    Some treatments: removing useless attributes before learning; binarization; discretization.
    Gain ratio is another practical solution:
    Gain(A) = Info(root) − Info_A(T)
    Split-Info(A) = −Σ_i (|T_i|/|T|) · log2(|T_i|/|T|)
    Gain-ratio(A) = Gain(A) / Split-Info(A)
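    A sketch of these measures in Python (function names are illustrative), for examples given as (vector, class) pairs and a split on attribute index i:

    from collections import Counter
    from math import log2

    def entropy(examples):
        n = len(examples)
        return -sum(k / n * log2(k / n)
                    for k in Counter(c for _, c in examples).values())

    def partitions(examples, i):                  # the subsets T_i produced by splitting on i
        groups = {}
        for x, c in examples:
            groups.setdefault(x[i], []).append((x, c))
        return groups.values()

    def gain(examples, i):                        # root info minus weighted info after the split
        n = len(examples)
        return entropy(examples) - sum(len(t) / n * entropy(t)
                                       for t in partitions(examples, i))

    def split_info(examples, i):                  # -sum (|T_i|/|T|) log2(|T_i|/|T|)
        n = len(examples)
        return -sum(len(t) / n * log2(len(t) / n) for t in partitions(examples, i))

    def gain_ratio(examples, i):
        si = split_info(examples, i)
        return gain(examples, i) / si if si > 0 else 0.0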

  • Another kind of problem
    A difficult problem - why is it difficult? Similar ones are the parity and majority problems.
    The XOR problem:

    x1  x2  class
    0   0   0
    0   1   1
    1   0   1
    1   1   0
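    A quick check of why this is hard for one-attribute-at-a-time splitting (the purity measures are re-defined here so the snippet stands alone): on the XOR data, the information gain of either attribute is exactly zero, so neither single split looks useful.

    from collections import Counter
    from math import log2

    xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    def entropy(examples):
        n = len(examples)
        return -sum(k / n * log2(k / n)
                    for k in Counter(c for _, c in examples).values())

    def gain(examples, i):
        n, groups = len(examples), {}
        for x, c in examples:
            groups.setdefault(x[i], []).append((x, c))
        return entropy(examples) - sum(len(t) / n * entropy(t) for t in groups.values())

    print(gain(xor_data, 0), gain(xor_data, 1))   # 0.0 0.0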

  • Tree Pruning
    Overfitting: the model fits the training data too well, but won't work well on unseen data.
    Pruning is an effective approach to avoid overfitting and to obtain a more compact tree (easier to understand).
    Two general ways to prune:
    Pre-pruning: stop splitting further when a division makes no significant difference in classification accuracy.
    Post-pruning: grow the full tree, then trim it back.

  • Rules from Decision Trees
    Two types of rule sets: order sensitive (more compact, less efficient) and order insensitive.
    The most straightforward way is the class-based method: group rules according to classes, then select the most general rules (or remove redundant ones).
    Data-based method: select one rule at a time (keep the most general one), then work on the remaining data until all data are covered.

  • Variants of Decision Trees and Rules
    Tree stumps.
    Holte's 1R rules (1992): for each attribute A, sort according to its values v; find the most frequent class value c for each v (breaking ties by coin flipping); output the most accurate rule set of the form "if A = v then c".
    An example (the Golf data).
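    A sketch of 1R-style rule building (names illustrative; ties are broken arbitrarily here rather than by coin flipping):

    from collections import Counter, defaultdict

    def one_r(data, i):
        """data: list of (vector, class); builds rules 'if A_i = v then c' for attribute i."""
        by_value = defaultdict(Counter)
        for x, c in data:
            by_value[x[i]][c] += 1                     # class counts per attribute value
        rules = {v: cnt.most_common(1)[0][0] for v, cnt in by_value.items()}
        correct = sum(cnt[rules[v]] for v, cnt in by_value.items())
        return rules, correct / len(data)              # rule set and its training accuracy

    def best_one_r(data, n_attrs):
        """Output the rule set of the single most accurate attribute."""
        results = [(acc, i, rules) for i in range(n_attrs)
                   for rules, acc in [one_r(data, i)]]
        acc, i, rules = max(results)
        return i, rules, acc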

  • Handling Large-Size Data
    When data simply cannot fit in memory - is that a big problem?
    Three representative approaches:
    Smart data structures to avoid unnecessary recalculation: hash trees; SPRINT.
    Sufficient statistics: the AVC-set (Attribute-Value, Class label) summarizes the class distribution for each attribute. Example: RainForest.
    Parallel processing: make the data parallelizable.
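    A sketch of building AVC-sets in one pass (names illustrative): for every attribute, record the class distribution of each of its values, which is enough to evaluate candidate splits without keeping the tuples themselves in memory.

    from collections import Counter, defaultdict

    def build_avc_sets(rows, n_attrs):
        """rows: iterable of (vector, class), e.g. streamed from disk."""
        avc = {i: defaultdict(Counter) for i in range(n_attrs)}
        for x, c in rows:
            for i in range(n_attrs):
                avc[i][x[i]][c] += 1                 # (attribute, value, class) count
        return avc

    avc = build_avc_sets([(("Sunny", "Hot"), "Yes"), (("Rain", "Mild"), "No"),
                          (("Sunny", "Mild"), "No")], 2)
    print(dict(avc[0]))                              # class distribution per value of attribute 0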

  • Ensemble Methods
    A group of classifiers: hybrid (stacking) or of a single type.
    Strong vs. weak learners.
    A good ensemble needs accuracy and diversity.
    Some major approaches to forming ensembles: bagging, boosting.
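    A minimal bagging sketch (illustrative): train one classifier per bootstrap sample of the training data and predict by majority vote. `learner` is assumed to be any function that takes a training list and returns a classify(x) function.

    import random
    from collections import Counter

    def bagging(training, learner, n_models=11, seed=0):
        rng = random.Random(seed)
        models = []
        for _ in range(n_models):
            sample = [rng.choice(training) for _ in training]    # bootstrap: sample with replacement
            models.append(learner(sample))
        def vote(x):                                             # majority vote over the ensemble
            return Counter(m(x) for m in models).most_common(1)[0][0]
        return vote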

  • Bibliography
    I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, 2000.
    M. Kantardzic. Data Mining: Concepts, Models, Methods, and Algorithms. IEEE, 2003.
    J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
    D. Hand, H. Mannila, P. Smyth. Principles of Data Mining. MIT Press, 2001.
    T. G. Dietterich. Ensemble Methods in Machine Learning. In J. Kittler and F. Roli (eds.), 1st Int'l Workshop on Multiple Classifier Systems, pp. 1-15, Springer-Verlag, 2000.

    Notes: SSE = sum of squared errors. Input variable = explanatory variable / independent variable; output variable = outcome variable / dependent variable. Nearest neighbor: an extension of table lookup?

    Weighted nearest neighbor: give nearer neighbors more weight.
    http://craig.nevill-manning.com/~nevill/publications/IDA95.pdf