01 data mining-classification basic

HAN 15-ch08-327-392-9780123814791 2011/6/1 3:21 Page 327 #1

8 Classification: Basic Concepts

Classification is a form of data analysis that extracts models describing important data classes. Such models, called classifiers, predict categorical (discrete, unordered) class labels. For example, we can build a classification model to categorize bank loan applications as either safe or risky. Such analysis can help provide us with a better understanding of the data at large. Many classification methods have been proposed by researchers in machine learning, pattern recognition, and statistics. Most algorithms are memory resident, typically assuming a small data size. Recent data mining research has built on such work, developing scalable classification and prediction techniques capable of handling large amounts of disk-resident data. Classification has numerous applications, including fraud detection, target marketing, performance prediction, manufacturing, and medical diagnosis.

We start off by introducing the main ideas of classification in Section 8.1. In the rest of this chapter, you will learn the basic techniques for data classification such as how to build decision tree classifiers (Section 8.2), Bayesian classifiers (Section 8.3), and rule-based classifiers (Section 8.4). Section 8.5 discusses how to evaluate and compare different classifiers. Various measures of accuracy are given as well as techniques for obtaining reliable accuracy estimates. Methods for increasing classifier accuracy are presented in Section 8.6, including cases for when the data set is class imbalanced (i.e., where the main class of interest is rare).

8.1 Basic Concepts

We introduce the concept of classification in Section 8.1.1. Section 8.1.2 describes the general approach to classification as a two-step process. In the first step, we build a classification model based on previous data. In the second step, we determine if the model's accuracy is acceptable, and if so, we use the model to classify new data.

8.1.1 What Is Classification?

A bank loans officer needs analysis of her data to learn which loan applicants are safe and which are risky for the bank. A marketing manager at AllElectronics needs data analysis to help guess whether a customer with a given profile will buy a new computer. A medical researcher wants to analyze breast cancer data to predict which one of three specific treatments a patient should receive. In each of these examples, the data analysis task is classification, where a model or classifier is constructed to predict class (categorical) labels, such as safe or risky for the loan application data; yes or no for the marketing data; or treatment A, treatment B, or treatment C for the medical data. These categories can be represented by discrete values, where the ordering among values has no meaning. For example, the values 1, 2, and 3 may be used to represent treatments A, B, and C, where there is no ordering implied among this group of treatment regimes.

© 2012 Elsevier Inc. All rights reserved. Data Mining: Concepts and Techniques

Suppose that the marketing manager wants to predict how much a given customer will spend during a sale at AllElectronics. This data analysis task is an example of numeric prediction, where the model constructed predicts a continuous-valued function, or ordered value, as opposed to a class label. This model is a predictor. Regression analysis is a statistical methodology that is most often used for numeric prediction; hence the two terms tend to be used synonymously, although other methods for numeric prediction exist. Classification and numeric prediction are the two major types of prediction problems. This chapter focuses on classification.

8.1.2 General Approach to Classification

How does classification work? Data classification is a two-step process, consisting of a learning step (where a classification model is constructed) and a classification step (where the model is used to predict class labels for given data). The process is shown for the loan application data of Figure 8.1. (The data are simplified for illustrative purposes. In reality, we may expect many more attributes to be considered.)

In the first step, a classifier is built describing a predetermined set of data classes or concepts. This is the learning step (or training phase), where a classification algorithm builds the classifier by analyzing or learning from a training set made up of database tuples and their associated class labels. A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes, respectively, A1, A2, ..., An.¹ Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute. The class label attribute is discrete-valued and unordered. It is categorical (or nominal) in that each value serves as a category or class. The individual tuples making up the training set are referred to as training tuples and are randomly sampled from the database under analysis. In the context of classification, data tuples can be referred to as samples, examples, instances, data points, or objects.²
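The tuple representation described above can be sketched in a few lines of Python. This is only an illustration: the attribute values are taken from the loan application data of Figure 8.1, while the names ATTRIBUTES, database, and sample_training_set are hypothetical, and the applicant's name is left out on the assumption that it identifies the tuple rather than describing it.

```python
import random

# Attribute names A1, ..., An (here n = 2). loan_decision is the class label
# attribute and is stored separately from the attribute vector.
ATTRIBUTES = ("age", "income")

# Each database tuple is a pair (attribute vector X, class label).
# The values follow the training data of Figure 8.1.
database = [
    (("youth", "low"), "risky"),
    (("youth", "low"), "risky"),
    (("middle_aged", "high"), "safe"),
    (("middle_aged", "low"), "risky"),
    (("senior", "low"), "safe"),
    (("senior", "medium"), "safe"),
    (("middle_aged", "high"), "safe"),
]


def sample_training_set(db, k, seed=0):
    """Randomly sample k training tuples from db, without replacement."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return rng.sample(db, k)


training_set = sample_training_set(database, k=5)
for X, label in training_set:
    # X = (x1, ..., xn) holds one measurement per attribute.
    print(dict(zip(ATTRIBUTES, X)), "->", label)
```

Keeping the class label outside the attribute vector mirrors the text: X holds the n measurements, and the class label attribute is a separate, categorical value.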

¹Each attribute represents a feature of X. Hence, the pattern recognition literature uses the term feature vector rather than attribute vector. In our discussion, we use the term attribute vector, and in our notation, any variable representing a vector is shown in bold italic font; measurements depicting the vector are shown in italic font (e.g., X = (x1, x2, x3)).

²In the machine learning literature, training tuples are commonly referred to as training samples. Throughout this text, we prefer to use the term tuples instead of samples.


(a) Learning

Training data

name            age            income    loan_decision
Sandy Jones     youth          low       risky
Bill Lee        youth          low       risky
Caroline Fox    middle_aged    high      safe
Rick Field      middle_aged    low       risky
Susan Lake      senior         low       safe
Claire Phips    senior         medium    safe
Joe Smith       middle_aged    high      safe
...

Classification rules (produced by the classification algorithm)

IF age = youth THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle_aged AND income = low THEN loan_decision = risky
...

(b) Classification

Test data

name            age            income    loan_decision
Juan Bello      senior         low       safe
Sylvia Crest    middle_aged    low       risky
Anne Yee        middle_aged    high      safe
...

New data

(John Henry, middle_aged, low)    Loan decision? risky

Figure 8.1 The data classification process: (a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules. (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.
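The classification step of Figure 8.1(b) can be sketched as follows. The sketch does not implement a learning algorithm: it takes the three rules shown in the figure as given, checks them in order, and estimates their accuracy on the test data. Because the figure elides the remaining rules, a default class is needed for tuples that no listed rule covers; using the majority class of the training data is an assumption made here, as is the acceptance threshold of 0.8. All function and variable names are hypothetical.

```python
from collections import Counter

# Training tuples from Figure 8.1(a): (attribute values, loan_decision).
training = [
    ({"age": "youth", "income": "low"}, "risky"),
    ({"age": "youth", "income": "low"}, "risky"),
    ({"age": "middle_aged", "income": "high"}, "safe"),
    ({"age": "middle_aged", "income": "low"}, "risky"),
    ({"age": "senior", "income": "low"}, "safe"),
    ({"age": "senior", "income": "medium"}, "safe"),
    ({"age": "middle_aged", "income": "high"}, "safe"),
]

# The three rules listed in Figure 8.1(a), as (condition, class label) pairs.
RULES = [
    (lambda x: x["age"] == "youth", "risky"),
    (lambda x: x["income"] == "high", "safe"),
    (lambda x: x["age"] == "middle_aged" and x["income"] == "low", "risky"),
]

# Test tuples from Figure 8.1(b), with their known class labels.
test = [
    ({"age": "senior", "income": "low"}, "safe"),
    ({"age": "middle_aged", "income": "low"}, "risky"),
    ({"age": "middle_aged", "income": "high"}, "safe"),
]


def majority_class(tuples):
    """Most frequent class label; used as the assumed default class."""
    return Counter(label for _, label in tuples).most_common(1)[0][0]


def classify(x, default):
    """Return the label of the first rule that fires, else the default."""
    for condition, label in RULES:
        if condition(x):
            return label
    return default


def accuracy(labeled_tuples, default):
    """Fraction of tuples whose predicted label equals the known label."""
    correct = sum(classify(x, default) == label for x, label in labeled_tuples)
    return correct / len(labeled_tuples)


default = majority_class(training)
test_accuracy = accuracy(test, default)

ACCEPTABLE = 0.8  # hypothetical threshold; the text leaves "acceptable" open
if test_accuracy >= ACCEPTABLE:
    new_tuple = {"age": "middle_aged", "income": "low"}  # John Henry
    print("Loan decision:", classify(new_tuple, default))
```

Accuracy is estimated on the test tuples rather than the training tuples, in line with the figure, where the test data are held apart from the data used for learning.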

8.2 Decision Tree Induction

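[Figure 8.2: decision tree. The root node tests age?, with branches youth, middle_aged, and senior. The youth branch leads to the test student? (branches no and yes), the middle_aged branch leads to the leaf yes, and the senior branch leads to the test credit_rating? (branches fair and excellent). Each leaf is labeled yes or no.]

Figure 8.2 A decision tree for the concept buys_computer, indicating whether an AllElectronics customer is likely to purchase a computer. Each internal (nonleaf) node represents a test on an attribute. Each leaf node represents a class (either buys_computer = yes or buys_computer = no).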

likely to purchase a computer. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. Some decision tree algorithms produce only binary trees (where each internal node branches to exactly two other nodes), whereas others can produce nonbinary trees.

How are decision trees used for classification? Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple. Decision trees can easily be converted to classification rules.
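A minimal sketch of both ideas, tracing a path from the root to a leaf and converting the tree to rules, is given here for the tree of Figure 8.2. The nested-tuple encoding and the names TREE, classify, and extract_rules are hypothetical. The class labels at the leaves under student? and credit_rating? are assumed for illustration, because the figure's layout is not fully recoverable from the text.

```python
# An internal node is (attribute, {branch value: child}); a leaf is a class
# label string. Leaf labels under student? and credit_rating? are assumptions.
TREE = ("age", {
    "youth": ("student", {"no": "no", "yes": "yes"}),
    "middle_aged": "yes",
    "senior": ("credit_rating", {"fair": "no", "excellent": "yes"}),
})


def classify(node, x):
    """Trace a path from the root to a leaf and return the leaf's class."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[x[attribute]]  # follow the branch matching x's value
    return node


def extract_rules(node, conditions=()):
    """Create one IF-THEN rule for each root-to-leaf path."""
    if isinstance(node, str):
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions)
        return [f"IF {antecedent} THEN buys_computer = {node}"]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(extract_rules(child, conditions + ((attribute, value),)))
    return rules


customer = {"age": "youth", "student": "yes", "credit_rating": "fair"}
print(classify(TREE, customer))
for rule in extract_rules(TREE):
    print(rule)
```

Each rule's antecedent is the conjunction of the attribute tests along one path, and its consequent is the class held by the leaf, so the tree yields as many rules as it has leaves.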

Why are decision tree classifiers so popular? The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery. Decision trees can handle multidimensional data. Their representation of acquired knowledge in tree form is intuitive and generally easy to assimilate by humans. The learning and classification steps of decision tree induction are simple and fast. In general, decision tree classifiers have good accuracy. However, successful use may depend on the data at hand. Decision tree induction algorithms have been used for classification in many application areas such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology. Decision trees are the basis of several commercial rule induction systems.

In Section 8.2.1, we describe a basic algorithm for learning decision trees. During tree construction, attribute selection measures are used to select the attribute that best partitions the tuples into distinct classes. Popular measures of attribute selection are given in Section 8.2.2. When decision trees are built, many of the branches may reflect noise or outliers in the training data. Tree pruning attempts to identify and remove such branches, with the goal of improving classification accuracy on unseen data. Tree pruning is described in Section 8.2.3. Scalability issues for the induction of decision trees
