Data Mining: Bayesian Classification



  • HAN 15-ch08-327-392-9780123814791 2011/6/1 3:21 Page 350 #24

    350 Chapter 8 Classification: Basic Concepts

A tree is interactively constructed as follows. The user visualizes the multidimensional data in the Data Interaction window and selects a splitting attribute and one or more split-points. The current decision tree in the Knowledge Interaction window is expanded. The user selects a node of the decision tree. The user may either assign a class label to the node (which makes the node a leaf) or request the visualization of the training data corresponding to the node. This leads to a new visualization of every attribute except the ones used for splitting criteria on the same path from the root. The interactive process continues until a class has been assigned to each leaf of the decision tree.

The trees constructed with PBC were compared with trees generated by the CART, C4.5, and SPRINT algorithms from various data sets. The trees created with PBC were of comparable accuracy with the trees from the algorithmic approaches, yet were significantly smaller and, thus, easier to understand. Users can use their domain knowledge in building a decision tree, but also gain a deeper understanding of their data during the construction process.

8.3 Bayes Classification Methods

"What are Bayesian classifiers?" Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class.

Bayesian classification is based on Bayes' theorem, described next. Studies comparing classification algorithms have found a simple Bayesian classifier known as the naïve Bayesian classifier to be comparable in performance with decision tree and selected neural network classifiers. Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.

Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence. It is made to simplify the computations involved and, in this sense, is considered "naïve."

Section 8.3.1 reviews basic probability notation and Bayes' theorem. In Section 8.3.2 you will learn how to do naïve Bayesian classification.

8.3.1 Bayes' Theorem

Bayes' theorem is named after Thomas Bayes, a nonconformist English clergyman who did early work in probability and decision theory during the 18th century. Let X be a data tuple. In Bayesian terms, X is considered "evidence." As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to class C, given that we know the attribute description of X.


P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income, respectively, and that X is a 35-year-old customer with an income of $40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income.

In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter. The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X.

Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we know the customer will buy a computer.

P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000.

"How are these probabilities estimated?" P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see next. Bayes' theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes' theorem is

P(H|X) = P(X|H) P(H) / P(X).    (8.10)
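Eq. (8.10) is plain arithmetic once the three input probabilities are known. A minimal sketch, where the numeric values are illustrative assumptions (not taken from the text):

```python
def posterior(p_x_given_h, p_h, p_x):
    """Bayes' theorem, Eq. (8.10): P(H|X) = P(X|H) * P(H) / P(X)."""
    return p_x_given_h * p_h / p_x

# Hypothetical numbers: suppose half of all customers buy a computer
# (P(H) = 0.5), 20% of buyers match X's age/income profile (P(X|H) = 0.2),
# and 25% of all customers match that profile (P(X) = 0.25).
print(posterior(0.2, 0.5, 0.25))  # 0.4
```

With these assumed inputs, knowing X's profile raises the buying probability from the prior 0.5 down to a posterior of 0.4, illustrating how the evidence reweights the hypothesis.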

Now that we have that out of the way, in the next section, we will look at how Bayes' theorem is used in the naïve Bayesian classifier.

8.3.2 Naïve Bayesian Classification

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:

1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.

2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.

Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem (Eq. 8.10),

P(Ci|X) = P(X|Ci) P(Ci) / P(X).    (8.11)


3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ··· = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = |Ci,D|/|D|, where |Ci,D| is the number of training tuples of class Ci in D.
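Estimating the priors P(Ci) = |Ci,D|/|D| is just relative-frequency counting over the class-label column. A small sketch, using the 9 yes / 5 no split of the book's 14-tuple training set (the list construction itself is an assumption of this sketch):

```python
from collections import Counter

# Class labels of the 14 training tuples (9 "yes", 5 "no", as in Table 8.1).
labels = ["yes"] * 9 + ["no"] * 5

# P(Ci) = |Ci,D| / |D|: count each class, divide by the total tuple count.
counts = Counter(labels)
priors = {c: n / len(labels) for c, n in counts.items()}
print(priors)  # {'yes': 0.642857..., 'no': 0.357142...}
```

These are exactly the 9/14 ≈ 0.643 and 5/14 ≈ 0.357 values used later in Example 8.4.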

4. Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). To reduce computation in evaluating P(X|Ci), the naïve assumption of class-conditional independence is made. This presumes that the attributes' values are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,

P(X|Ci) = ∏(k = 1 to n) P(xk|Ci)    (8.12)
        = P(x1|Ci) × P(x2|Ci) × ··· × P(xn|Ci).

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) from the training tuples. Recall that here xk refers to the value of attribute Ak for tuple X. For each attribute, we look at whether the attribute is categorical or continuous-valued. For instance, to compute P(X|Ci), we consider the following:

(a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class Ci in D having the value xk for Ak, divided by |Ci,D|, the number of tuples of class Ci in D.

(b) If Ak is continuous-valued, then we need to do a bit more work, but the calculation is pretty straightforward. A continuous-valued attribute is typically assumed to have a Gaussian distribution with a mean μ and standard deviation σ, defined by

g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x − μ)² / (2σ²)),    (8.13)

so that

P(xk|Ci) = g(xk, μCi, σCi).    (8.14)

These equations may appear daunting, but hold on! We need to compute μCi and σCi, which are the mean (i.e., average) and standard deviation, respectively, of the values of attribute Ak for training tuples of class Ci. We then plug these two quantities into Eq. (8.13), together with xk, to estimate P(xk|Ci).
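Eqs. (8.13) and (8.14) translate directly into a few lines of code. A sketch of the Gaussian density, evaluated with the μ = 38, σ = 12, x1 = 35 values from the book's running age example:

```python
import math

def gaussian(x, mu, sigma):
    """Eq. (8.13): Gaussian density g(x, mu, sigma)."""
    coeff = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

# Eq. (8.14): P(age = 35 | buys computer = yes) with mu = 38, sigma = 12.
print(gaussian(35, 38, 12))  # ≈ 0.0322
```

In a classifier, mu and sigma would be the sample mean and standard deviation of attribute Ak over the training tuples of class Ci.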

For example, let X = (35, $40,000), where A1 and A2 are the attributes age and income, respectively. Let the class label attribute be buys computer. The associated class label for X is yes (i.e., buys computer = yes). Let's suppose that age has not been discretized and therefore exists as a continuous-valued attribute. Suppose that from the training set, we find that customers in D who buy a computer are


38 ± 12 years of age. In other words, for attribute age and this class, we have μ = 38 years and σ = 12. We can plug these quantities, along with x1 = 35 for our tuple X, into Eq. (8.13) to estimate P(age = 35 | buys computer = yes). For a quick review of mean and standard deviation calculations, please see Section 2.2.

5. To predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if

P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 ≤ j ≤ m, j ≠ i.    (8.15)

In other words, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the maximum.
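Steps 1 through 5 can be sketched end-to-end for categorical attributes. The function names, data layout, and the tiny four-tuple training set below are assumptions of this sketch, not the book's code; only the counting and argmax logic follow the text:

```python
from collections import defaultdict

def train(D):
    """D: list of (attribute-dict, class-label) pairs.
    Returns priors P(Ci), attribute-value counts, and per-class counts."""
    class_counts, cond = defaultdict(int), defaultdict(int)
    for x, c in D:
        class_counts[c] += 1
        for a, v in x.items():
            cond[(a, v, c)] += 1          # count of Ak = xk within class c
    n = len(D)
    priors = {c: k / n for c, k in class_counts.items()}
    return priors, cond, dict(class_counts)

def predict(x, priors, cond, class_counts):
    """Step 5: pick the class Ci maximizing P(X|Ci) P(Ci)."""
    best, best_score = None, -1.0
    for c, score in priors.items():
        for a, v in x.items():
            score *= cond.get((a, v, c), 0) / class_counts[c]  # P(xk|Ci)
        if score > best_score:
            best, best_score = c, score
    return best

# Toy training set (hypothetical):
D = [
    ({"student": "yes", "credit": "fair"}, "yes"),
    ({"student": "yes", "credit": "fair"}, "yes"),
    ({"student": "no",  "credit": "fair"}, "no"),
    ({"student": "no",  "credit": "excellent"}, "no"),
]
priors, cond, class_counts = train(D)
print(predict({"student": "yes", "credit": "fair"}, priors, cond, class_counts))  # yes
```

Note this sketch uses raw relative frequencies for P(xk|Ci), so it inherits the zero-probability problem discussed after Example 8.4; the Laplacian correction would fix that.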

"How effective are Bayesian classifiers?" Various empirical studies of this classifier in comparison to decision tree and neural network classifiers have found it to be comparable in some domains. In theory, Bayesian classifiers have the minimum error rate in comparison to all other classifiers. However, in practice this is not always the case, owing to inaccuracies in the assumptions made for its use, such as class-conditional independence, and the lack of available probability data.

Bayesian classifiers are also useful in that they provide a theoretical justification for other classifiers that do not explicitly use Bayes' theorem. For example, under certain assumptions, it can be shown that many neural network and curve-fitting algorithms output the maximum posteriori hypothesis, as does the naïve Bayesian classifier.

Example 8.4 Predicting a class label using naïve Bayesian classification. We wish to predict the class label of a tuple using naïve Bayesian classification, given the same training data as in Example 8.3 for decision tree induction. The training data were shown earlier in Table 8.1. The data tuples are described by the attributes age, income, student, and credit rating. The class label attribute, buys computer, has two distinct values (namely, {yes, no}). Let C1 correspond to the class buys computer = yes and C2 correspond to buys computer = no. The tuple we wish to classify is

X = (age = youth, income = medium, student = yes, credit rating = fair)

We need to maximize P(X|Ci)P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can be computed based on the training tuples:

P(buys computer = yes) = 9/14 = 0.643
P(buys computer = no) = 5/14 = 0.357

To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:

P(age = youth | buys computer = yes) = 2/9 = 0.222
P(age = youth | buys computer = no) = 3/5 = 0.600
P(income = medium | buys computer = yes) = 4/9 = 0.444
P(income = medium | buys computer = no) = 2/5 = 0.400
P(student = yes | buys computer = yes) = 6/9 = 0.667


P(student = yes | buys computer = no) = 1/5 = 0.200
P(credit rating = fair | buys computer = yes) = 6/9 = 0.667
P(credit rating = fair | buys computer = no) = 2/5 = 0.400

    Using these probabilities, we obtain

P(X | buys computer = yes) = P(age = youth | buys computer = yes)
    × P(income = medium | buys computer = yes)
    × P(student = yes | buys computer = yes)
    × P(credit rating = fair | buys computer = yes)
    = 0.222 × 0.444 × 0.667 × 0.667 = 0.044.

    Similarly,

P(X | buys computer = no) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019.

To find the class, Ci, that maximizes P(X|Ci)P(Ci), we compute

P(X | buys computer = yes) P(buys computer = yes) = 0.044 × 0.643 = 0.028
P(X | buys computer = no) P(buys computer = no) = 0.019 × 0.357 = 0.007

Therefore, the naïve Bayesian classifier predicts buys computer = yes for tuple X.
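Example 8.4's arithmetic is easy to re-check by multiplying the same fractions directly (a verification sketch, not new method):

```python
# Conditional probabilities from Example 8.4, as exact fractions.
p_x_yes = (2/9) * (4/9) * (6/9) * (6/9)   # P(X | buys computer = yes) ≈ 0.044
p_x_no = (3/5) * (2/5) * (1/5) * (2/5)    # P(X | buys computer = no)  ≈ 0.019

score_yes = p_x_yes * (9/14)              # ≈ 0.028
score_no = p_x_no * (5/14)                # ≈ 0.007

print("yes" if score_yes > score_no else "no")  # yes
```

The yes score dominates by roughly a factor of four, so the prediction is not a close call on this tuple.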

"What if I encounter probability values of zero?" Recall that in Eq. (8.12), we estimate P(X|Ci) as the product of the probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci), based on the assumption of class-conditional independence. These probabilities can be estimated from the training tuples (step 4). We need to compute P(X|Ci) for each class (i = 1, 2, ..., m) to find the class Ci for which P(X|Ci)P(Ci) is the maximum (step 5). Let's consider this calculation. For each attribute-value pair (i.e., Ak = xk, for k = 1, 2, ..., n) in tuple X, we need to count the number of tuples having that attribute-value pair, per class (i.e., per Ci, for i = 1, ..., m). In Example 8.4, we have two classes (m = 2), namely buys computer = yes and buys computer = no. Therefore, for the attribute-value pair student = yes of X, say, we need two counts: the number of customers who are students and for which buys computer = yes (which contributes to P(X | buys computer = yes)) and the number of customers who are students and for which buys computer = no (which contributes to P(X | buys computer = no)).

But what if, say, there are no training tuples representing students for the class buys computer = no, resulting in P(student = yes | buys computer = no) = 0? In other words, what happens if we should end up with a probability value of zero for some P(xk|Ci)? Plugging this zero value into Eq. (8.12) would return a zero probability for P(X|Ci), even though, without the zero probability, we may have ended up with a high probability, suggesting that X belonged to class Ci! A zero probability cancels the effects of all the other (posteriori) probabilities (on Ci) involved in the product.

There is a simple trick to avoid this problem. We can assume that our training database, D, is so large that adding one to each count that we need would only make a negligible difference in the estimated probability value, yet would conveniently avoid the


case of probability values of zero. This technique for probability estimation is known as the Laplacian correction or Laplace estimator, named after Pierre Laplace, a French mathematician who lived from 1749 to 1827. If we have, say, q counts to which we each add one, then we must remember to add q to the corresponding denominator used in the probability calculation. We illustrate this technique in Example 8.5.

Example 8.5 Using the Laplacian correction to avoid computing probability values of zero. Suppose that for the class buys computer = yes in some training database, D, containing 1000 tuples, we have 0 tuples with income = low, 990 tuples with income = medium, and 10 tuples with income = high. The probabilities of these events, without the Laplacian correction, are 0, 0.990 (from 990/1000), and 0.010 (from 10/1000), respectively. Using the Laplacian correction for the three quantities, we pretend that we have 1 more tuple for each income-value pair. In this way, we instead obtain the following probabilities (rounded up to three decimal places):

1/1003 = 0.001,   991/1003 = 0.988,   and   11/1003 = 0.011,

respectively. The corrected probability estimates are close to their uncorrected counterparts, yet the zero probability value is avoided.
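Example 8.5's correction is a one-line adjustment in code: add 1 to each of the q counts and q to the shared denominator. A sketch using the example's income counts:

```python
# Counts of income values within class buys computer = yes (Example 8.5).
counts = {"low": 0, "medium": 990, "high": 10}   # 1000 tuples total

# Laplacian correction: +1 per count, +q on the denominator.
q = len(counts)                                  # q = 3 income values
total = sum(counts.values()) + q                 # 1000 + 3 = 1003
smoothed = {v: (n + 1) / total for v, n in counts.items()}

print(round(smoothed["low"], 3),
      round(smoothed["medium"], 3),
      round(smoothed["high"], 3))   # 0.001 0.988 0.011
```

The zero count becomes a small nonzero probability, so it can no longer wipe out the whole product in Eq. (8.12).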

8.4 Rule-Based Classification

In this section, we look at rule-based classifiers, where the learned model is represented as a set of IF-THEN rules. We first examine how such rules are used for classification (Section 8.4.1). We then study ways in which they can be generated, either from a decision tree (Section 8.4.2) or directly from the training data using a sequential covering algorithm (Section 8.4.3).

8.4.1 Using IF-THEN Rules for Classification

Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form

    IF condition THEN conclusion.

    An example is rule R1,

R1: IF age = youth AND student = yes THEN buys computer = yes.

The IF part (or left side) of a rule is known as the rule antecedent or precondition. The THEN part (or right side) is the rule consequent. In the rule antecedent, the condition consists of one or more attribute tests (e.g., age = youth and student = yes)

    Front Cover Data Mining: Concepts and TechniquesCopyrightDedicationTable of ContentsForewordForeword to Second EditionPrefaceAcknowledgmentsAbout the AuthorsChapter 1. Introduction1.1 Why Data Mining?1.2 What Is Data Mining?1.3 What Kinds of Data Can Be Mined?1.4 What Kinds of Patterns Can Be Mined?1.5 Which Technologies Are Used?1.6 Which Kinds of Applications Are Targeted?1.7 Major Issues in Data Mining1.8 Summary1.9 Exercises1.10 Bibliographic Notes

    Chapter 2. Getting to Know Your Data2.1 Data Objects and Attribute Types2.2 Basic Statistical Descriptions of Data2.3 Data Visualization2.4 Measuring Data Similarity and Dissimilarity2.5 Summary2.6 Exercises2.7 Bibliographic Notes

    Chapter 3. Data Preprocessing3.1 Data Preprocessing: An Overview3.2 Data Cleaning3.3 Data Integration3.4 Data Reduction3.5 Data Transformation and Data Discretization3.6 Summary3.7 Exercises3.8 Bibliographic Notes

    Chapter 4. Data Warehousing and Online Analytical Processing4.1 Data Warehouse: Basic Concepts4.2 Data Warehouse Modeling: Data Cube and OLAP4.3 Data Warehouse Design and Usage4.4 Data Warehouse Implementation4.5 Data Generalization by Attribute-Oriented Induction4.6 Summary4.7 Exercises4.8 Bibliographic Notes

    Chapter 5. Data Cube Technology5.1 Data Cube Computation: Preliminary Concepts5.2 Data Cube Computation Methods5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology5.4 Multidimensional Data Analysis in Cube Space5.5 Summary5.6 Exercises5.7 Bibliographic Notes

    Chapter 6. Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods6.1 Basic Concepts6.2 Frequent Itemset Mining Methods6.3 Which Patterns Are Interesting?Pattern Evaluation Methods6.4 Summary6.5 Exercises6.6 Bibliographic Notes

    Chapter 7. Advanced Pattern Mining7.1 Pattern Mining: A Road Map7.2 Pattern Mining in Multilevel, Multidimensional Space7.3 Constraint-Based Frequent Pattern Mining7.4 Mining High-Dimensional Data and Colossal Patterns7.5 Mining Compressed or Approximate Patterns7.6 Pattern Exploration and Application7.7 Summary7.8 Exercises7.9 Bibliographic Notes

    Chapter 8. Classification: Basic Concepts8.1 Basic Concepts8.2 Decision Tree Induction8.3 Bayes Classification Methods8.4 Rule-Based Classification8.5 Model Evaluation and Selection8.6 Techniques to Improve Classification Accuracy8.7 Summary8.8 Exercises8.9 Bibliographic Notes

    Chapter 9. Classification: Advanced Methods9.1 Bayesian Belief Networks9.2 Classification by Backpropagation9.3 Support Vector Machines9.4 Classification Using Frequent Patterns9.5 Lazy Learners (or Learning from Your Neighbors)9.6 Other Classification Methods9.7 Additional Topics Regarding Classification9.8 Summary9.9 Exercises9.10 Bibliographic Notes

    Chapter 10. Cluster Analysis: Basic Concepts and Methods10.1 Cluster Analysis10.2 Partitioning Methods10.3 Hierarchical Methods10.4 Density-Based Methods10.5 Grid-Based Methods10.6 Evaluation of Clustering10.7 Summary10.8 Exercises10.9 Bibliographic Notes

    Chapter 11. Advanced Cluster Analysis11.1 Probabilistic Model-Based Clustering11.2 Clustering High-Dimensional Data11.3 Clustering Graph and Network Data11.4 Clustering with Constraints11.5 Summary11.6 Exercises11.7 Bibliographic Notes

    Chapter 12. Outlier Detection12.1 Outliers and Outlier Analysis12.2 Outlier Detection Methods12.3 Statistical Approaches12.4 Proximity-Based Approaches12.5 Clustering-Based Approaches12.6 Classification-Based Approaches12.7 Mining Contextual and Collective Outliers12.8 Outlier Detection in High-Dimensional Data12.9 Summary12.10 Exercises12.11 Bibliographic Notes

    Chapter 13. Data Mining Trends and Research Frontiers13.1 Mining Complex Data Types13.2 Other Methodologies of Data Mining13.3 Data Mining Applications13.4 Data Mining and Society13.5 Data Mining Trends13.6 Summary13.7 Exercises13.8 Bibliographic Notes

    BibliographyIndexFront Cover Data Mining: Concepts and TechniquesCopyrightDedicationTable of ContentsForewordForeword to Second EditionPrefaceAcknowledgmentsAbout the AuthorsChapter 1. Introduction1.1 Why Data Mining?1.2 What Is Data Mining?1.3 What Kinds of Data Can Be Mined?1.4 What Kinds of Patterns Can Be Mined?1.5 Which Technologies Are Used?1.6 Which Kinds of Applications Are Targeted?1.7 Major Issues in Data Mining1.8 Summary1.9 Exercises1.10 Bibliographic Notes

    Chapter 2. Getting to Know Your Data2.1 Data Objects and Attribute Types2.2 Basic Statistical Descriptions of Data2.3 Data Visualization2.4 Measuring Data Similarity and Dissimilarity2.5 Summary2.6 Exercises2.7 Bibliographic Notes

    Chapter 3. Data Preprocessing3.1 Data Preprocessing: An Overview3.2 Data Cleaning3.3 Data Integration3.4 Data Reduction3.5 Data Transformation and Data Discretization3.6 Summary3.7 Exercises3.8 Bibliographic Notes

    Chapter 4. Data Warehousing and Online Analytical Processing4.1 Data Warehouse: Basic Concepts4.2 Data Warehouse Modeling: Data Cube and OLAP4.3 Data Warehouse Design and Usage4.4 Data Warehouse Implementation4.5 Data Generalization by Attribute-Oriented Induction4.6 Summary4.7 Exercises4.8 Bibliographic Notes

    Chapter 5. Data Cube Technology5.1 Data Cube Computation: Preliminary Concepts5.2 Data Cube Computation Methods5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology5.4 Multidimensional Data Analysis in Cube Space5.5 Summary5.6 Exercises5.7 Bibliographic Notes

    Chapter 6. Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods6.1 Basic Concepts6.2 Frequent Itemset Mining Methods6.3 Which Patterns Are Interesting?Pattern Evaluation Methods6.4 Summary6.5 Exercises6.6 Bibliographic Notes

    Chapter 7. Advanced Pattern Mining7.1 Pattern Mining: A Road Map7.2 Pattern Mining in Multilevel, Multidimensional Space7.3 Constraint-Based Frequent Pattern Mining7.4 Mining High-Dimensional Data and Colossal Patterns7.5 Mining Compressed or Approximate Patterns7.6 Pattern Exploration and Application7.7 Summary7.8 Exercises7.9 Bibliographic Notes

    Chapter 8. Classification: Basic Concepts8.1 Basic Concepts8.2 Decision Tree Induction8.3 Bayes Classification Methods8.4 Rule-Based Classification8.5 Model Evaluation and Selection8.6 Techniques to Improve Classification Accuracy8.7 Summary8.8 Exercises8.9 Bibliographic Notes

    Chapter 9. Classification: Advanced Methods9.1 Bayesian Belief Networks9.2 Classification by Backpropagation9.3 Support Vector Machines9.4 Classification Using Frequent Patterns9.5 Lazy Learners (or Learning from Your Neighbors)9.6 Other Classification Methods9.7 Additional Topics Regarding Classification9.8 Summary9.9 Exercises9.10 Bibliographic Notes

    Chapter 10. Cluster Analysis: Basic Concepts and Methods10.1 Cluster Analysis10.2 Partitioning Methods10.3 Hierarchical Methods10.4 Density-Based Methods10.5 Grid-Based Methods10.6 Evaluation of Clustering10.7 Summary10.8 Exercises10.9 Bibliographic Notes

    Chapter 11. Advanced Cluster Analysis11.1 Probabilistic Model-Based Clustering11.2 Clustering High-Dimensional Data11.3 Clustering Graph and Network Data11.4 Clustering with Constraints11.5 Summary11.6 Exercises11.7 Bibliographic Notes

    Chapter 12. Outlier Detection12.1 Outliers and Outlier Analysis12.2 Outlier Detection Methods12.3 Statistical Approaches12.4 Proximity-Based Approaches12.5 Clustering-Based Approaches12.6 Classification-Based Approaches12.7 Mining Contextual and Collective Outliers12.8 Outlier Detection in High-Dimensional Data12.9 Summary12.10 Exercises12.11 Bibliographic Notes

    Chapter 13. Data Mining Trends and Research Frontiers13.1 Mining Complex Data Types13.2 Other Methodologies of Data Mining13.3 Data Mining Applications13.4 Data Mining and Society13.5 Data Mining Trends13.6 Summary13.7 Exercises13.8 Bibliographic Notes

    BibliographyIndexFront Cover Data Mining: Concepts and TechniquesCopyrightDedicationTable of ContentsForewordForeword to Second EditionPrefaceAcknowledgmentsAbout the AuthorsChapter 1. Introduction1.1 Why Data Mining?1.2 What Is Data Mining?1.3 What Kinds of Data Can Be Mined?1.4 What Kinds of Patterns Can Be Mined?1.5 Which Technologies Are Used?1.6 Which Kinds of Applications Are Targeted?1.7 Major Issues in Data Mining1.8 Summary1.9 Exercises1.10 Bibliographic Notes

    Chapter 2. Getting to Know Your Data2.1 Data Objects and Attribute Types2.2 Basic Statistical Descriptions of Data2.3 Data Visualization2.4 Measuring Data Similarity and Dissimilarity2.5 Summary2.6 Exercises2.7 Bibliographic Notes

    Chapter 3. Data Preprocessing3.1 Data Preprocessing: An Overview3.2 Data Cleaning3.3 Data Integration3.4 Data Reduction3.5 Data Transformation and Data Discretization3.6 Summary3.7 Exercises3.8 Bibliographic Notes

    Chapter 4. Data Warehousing and Online Analytical Processing4.1 Data Warehouse: Basic Concepts4.2 Data Warehouse Modeling: Data Cube and OLAP4.3 Data Warehouse Design and Usage4.4 Data Warehouse Implementation4.5 Data Generalization by Attribute-Oriented Induction4.6 Summary4.7 Exercises4.8 Bibliographic Notes

    Chapter 5. Data Cube Technology5.1 Data Cube Computation: Preliminary Concepts5.2 Data Cube Computation Methods5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology5.4 Multidimensional Data Analysis in Cube Space5.5 Summary5.6 Exercises5.7 Bibliographic Notes

    Chapter 6. Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods6.1 Basic Concepts6.2 Frequent Itemset Mining Methods6.3 Which Patterns Are Interesting?Pattern Evaluation Methods6.4 Summary6.5 Exercises6.6 Bibliographic Notes

    Chapter 7. Advanced Pattern Mining7.1 Pattern Mining: A Road Map7.2 Pattern Mining in Multilevel, Multidimensional Space7.3 Constraint-Based Frequent Pattern Mining7.4 Mining High-Dimensional Data and Colossal Patterns7.5 Mining Compressed or Approximate Patterns7.6 Pattern Exploration and Application7.7 Summary7.8 Exercises7.9 Bibliographic Notes

    Chapter 8. Classification: Basic Concepts8.1 Basic Concepts8.2 Decision Tree Induction8.3 Bayes Classification Methods8.4 Rule-Based Classification8.5 Model Evaluation and Selection8.6 Techniques to Improve Classification Accuracy8.7 Summary8.8 Exercises8.9 Bibliographic Notes

    Chapter 9. Classification: Advanced Methods9.1 Bayesian Belief Networks9.2 Classification by Backpropagation9.3 Support Vector Machines9.4 Classification Using Frequent Patterns9.5 Lazy Learners (or Learning from Your Neighbors)9.6 Other Classification Methods9.7 Additional Topics Regarding Classification9.8 Summary9.9 Exercises9.10 Bibliographic Notes

    Chapter 10. Cluster Analysis: Basic Concepts and Methods10.1 Cluster Analysis10.2 Partitioning Methods10.3 Hierarchical Methods10.4 Density-Based Methods10.5 Grid-Based Methods10.6 Evaluation of Clustering10.7 Summary10.8 Exercises10.9 Bibliographic Notes

    Chapter 11. Advanced Cluster Analysis11.1 Probabilistic Model-Based Clustering11.2 Clustering High-Dimensional Data11.3 Clustering Graph and Network Data11.4 Clustering with Constraints11.5 Summary11.6 Exercises11.7 Bibliographic Notes

    Chapter 12. Outlier Detection12.1 Outliers and Outlier Analysis12.2 Outlier Detection Methods12.3 Statistical Approaches12.4 Proximity-Based Approaches12.5 Clustering-Based Approaches12.6 Classification-Based Approaches12.7 Mining Contextual and Collective Outliers12.8 Outlier Detection in High-Dimensional Data12.9 Summary12.10 Exercises12.11 Bibliographic Notes

    Chapter 13. Data Mining Trends and Research Frontiers13.1 Mining Complex Data Types13.2 Other Methodologies of Data Mining13.3 Data Mining Applications13.4 Data Mining and Society13.5 Data Mining Trends13.6 Summary13.7 Exercises13.8 Bibliographic Notes

Front Cover
Data Mining: Concepts and Techniques
Copyright
Dedication
Table of Contents
Foreword
Foreword to Second Edition
Preface
Acknowledgments
About the Authors

Chapter 1. Introduction
  1.1 Why Data Mining?
  1.2 What Is Data Mining?
  1.3 What Kinds of Data Can Be Mined?
  1.4 What Kinds of Patterns Can Be Mined?
  1.5 Which Technologies Are Used?
  1.6 Which Kinds of Applications Are Targeted?
  1.7 Major Issues in Data Mining
  1.8 Summary
  1.9 Exercises
  1.10 Bibliographic Notes

Chapter 2. Getting to Know Your Data
  2.1 Data Objects and Attribute Types
  2.2 Basic Statistical Descriptions of Data
  2.3 Data Visualization
  2.4 Measuring Data Similarity and Dissimilarity
  2.5 Summary
  2.6 Exercises
  2.7 Bibliographic Notes

Chapter 3. Data Preprocessing
  3.1 Data Preprocessing: An Overview
  3.2 Data Cleaning
  3.3 Data Integration
  3.4 Data Reduction
  3.5 Data Transformation and Data Discretization
  3.6 Summary
  3.7 Exercises
  3.8 Bibliographic Notes

Chapter 4. Data Warehousing and Online Analytical Processing
  4.1 Data Warehouse: Basic Concepts
  4.2 Data Warehouse Modeling: Data Cube and OLAP
  4.3 Data Warehouse Design and Usage
  4.4 Data Warehouse Implementation
  4.5 Data Generalization by Attribute-Oriented Induction
  4.6 Summary
  4.7 Exercises
  4.8 Bibliographic Notes

Chapter 5. Data Cube Technology
  5.1 Data Cube Computation: Preliminary Concepts
  5.2 Data Cube Computation Methods
  5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology
  5.4 Multidimensional Data Analysis in Cube Space
  5.5 Summary
  5.6 Exercises
  5.7 Bibliographic Notes

Chapter 6. Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods
  6.1 Basic Concepts
  6.2 Frequent Itemset Mining Methods
  6.3 Which Patterns Are Interesting? Pattern Evaluation Methods
  6.4 Summary
  6.5 Exercises
  6.6 Bibliographic Notes

Chapter 7. Advanced Pattern Mining
  7.1 Pattern Mining: A Road Map
  7.2 Pattern Mining in Multilevel, Multidimensional Space
  7.3 Constraint-Based Frequent Pattern Mining
  7.4 Mining High-Dimensional Data and Colossal Patterns
  7.5 Mining Compressed or Approximate Patterns
  7.6 Pattern Exploration and Application
  7.7 Summary
  7.8 Exercises
  7.9 Bibliographic Notes

Chapter 8. Classification: Basic Concepts
  8.1 Basic Concepts
  8.2 Decision Tree Induction
  8.3 Bayes Classification Methods
  8.4 Rule-Based Classification
  8.5 Model Evaluation and Selection
  8.6 Techniques to Improve Classification Accuracy
  8.7 Summary
  8.8 Exercises
  8.9 Bibliographic Notes

Chapter 9. Classification: Advanced Methods
  9.1 Bayesian Belief Networks
  9.2 Classification by Backpropagation
  9.3 Support Vector Machines
  9.4 Classification Using Frequent Patterns
  9.5 Lazy Learners (or Learning from Your Neighbors)
  9.6 Other Classification Methods
  9.7 Additional Topics Regarding Classification
  9.8 Summary
  9.9 Exercises
  9.10 Bibliographic Notes

Chapter 10. Cluster Analysis: Basic Concepts and Methods
  10.1 Cluster Analysis
  10.2 Partitioning Methods
  10.3 Hierarchical Methods
  10.4 Density-Based Methods
  10.5 Grid-Based Methods
  10.6 Evaluation of Clustering
  10.7 Summary
  10.8 Exercises
  10.9 Bibliographic Notes

Chapter 11. Advanced Cluster Analysis
  11.1 Probabilistic Model-Based Clustering
  11.2 Clustering High-Dimensional Data
  11.3 Clustering Graph and Network Data
  11.4 Clustering with Constraints
  11.5 Summary
  11.6 Exercises
  11.7 Bibliographic Notes

Chapter 12. Outlier Detection
  12.1 Outliers and Outlier Analysis
  12.2 Outlier Detection Methods
  12.3 Statistical Approaches
  12.4 Proximity-Based Approaches
  12.5 Clustering-Based Approaches
  12.6 Classification-Based Approaches
  12.7 Mining Contextual and Collective Outliers
  12.8 Outlier Detection in High-Dimensional Data
  12.9 Summary
  12.10 Exercises
  12.11 Bibliographic Notes

Chapter 13. Data Mining Trends and Research Frontiers
  13.1 Mining Complex Data Types
  13.2 Other Methodologies of Data Mining
  13.3 Data Mining Applications
  13.4 Data Mining and Society
  13.5 Data Mining Trends
  13.6 Summary
  13.7 Exercises
  13.8 Bibliographic Notes

Bibliography
Index
