
Page 1

Classification and Prediction

by Yen-Hsien Lee

Department of Information Management
College of Management

National Sun Yat-Sen University

March 4, 2003

Page 2

Outline

• Introduction to Classification
• Decision Tree – ID3
• Neural Network – Backpropagation
• Bayesian Network

Page 3

Classification

• Purpose: Classification is the process that establishes classes with attributes from a set of instances in a database. The class of an object must be one of a finite set of possible, pre-determined class values, while the attributes of the object are descriptions of the object that potentially affect its class.

• Techniques: ID3 and its descendants, backpropagation neural networks, Bayesian networks, CN2, the AQ family, etc.

Page 4

ID3 Approach

• ID3 uses an iterative method to build up decision trees, preferring simple trees over complex ones, on the theory that simple trees are more accurate classifiers of future inputs.

• ID3 builds a small tree using an information-theoretic approach: at each node it determines the amount of information gained by testing each candidate attribute and selects the attribute with the largest information gain to split on.

Page 5

Sample Training Set

    No.  Outlook   Temperature  Humidity  Windy  Class
     1   Sunny     Hot          High      False  N
     2   Sunny     Hot          High      True   N
     3   Overcast  Hot          High      False  P
     4   Rain      Mild         High      False  P
     5   Rain      Cool         Normal    False  P
     6   Rain      Cool         Normal    True   N
     7   Overcast  Cool         Normal    True   P
     8   Sunny     Mild         High      False  N
     9   Sunny     Cool         Normal    False  P
    10   Rain      Mild         Normal    False  P
    11   Sunny     Mild         Normal    True   P
    12   Overcast  Mild         High      True   P
    13   Overcast  Hot          Normal    False  P
    14   Rain      Mild         High      True   N

Page 6

Example: Complex Decision Tree

(Figure: a deeper decision tree for the same training set, testing Temperature (Cool / Mild / Hot) at the root and Outlook, Humidity, and Windy at lower levels; it reaches the same P and N classifications but needs many more nodes, including one null leaf, than the simple tree on the next page.)

Page 7

Example: Simple Decision Tree

    Outlook
      Sunny    -> Humidity
        High   -> N
        Normal -> P
      Overcast -> P
      Rain     -> Windy
        True   -> N
        False  -> P

Page 8

Entropy Function

• Entropy of a set C of objects (examples):

E(C) = - Σj Pj log2(Pj)

where j is an output class, Pj = (# of objects in class j) / (total objects in the set C), and a term with Pj = 0 is taken to be 0.

Example: set C (total objects n = n1+n2+n3+n4) contains Class 1 (n1), Class 2 (n2), Class 3 (n3), and Class 4 (n4):

E(C) = - (n1/n)*log2(n1/n) - (n2/n)*log2(n2/n) - (n3/n)*log2(n3/n) - (n4/n)*log2(n4/n)
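To make the formula concrete, here is a small Python sketch (my own illustration, not part of the original slides) that computes E(C) from a list of class counts, using the convention that a class with zero objects contributes nothing:

    from math import log2

    def entropy(class_counts):
        # E(C) = -sum_j Pj * log2(Pj), Pj = (# objects in class j) / (total objects);
        # classes with zero objects contribute 0 by convention
        total = sum(class_counts)
        return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

    # the Page 5 training set has 9 objects of class P and 5 of class N
    print(entropy([9, 5]))  # ~0.940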

Page 9

Entropy Function (Cont’d)

• Entropy of a partial tree of C if a particular attribute is chosen for partitioning C:

E A (nk / n)E Ci kk

whereCk disjoint set k which is apartition of C by the

theattributeA i according to A i 's values.E(Ck entropy of theset Ckn total number of objects in theset Cnk total number of objects in thesubset Ck

( ) =

=

( )

)

Page 10

Entropy Function (Cont’d)

Set C (total objects n = n1+n2+n3+n4) contains Class 1 (n1), Class 2 (n2), Class 3 (n3), and Class 4 (n4):

E(C) = - (n1/n)*log2(n1/n) - (n2/n)*log2(n2/n) - (n3/n)*log2(n3/n) - (n4/n)*log2(n4/n)

Set C is partitioned into subsets C1, C2, ... by attribute Ai.

Subset C1 (m = m1+m2+m3+m4) contains Class 1 (m1), Class 2 (m2), Class 3 (m3), and Class 4 (m4):

E(C1) = - (m1/m)*log2(m1/m) - (m2/m)*log2(m2/m) - (m3/m)*log2(m3/m) - (m4/m)*log2(m4/m)

Subset C2 (p = p1+p2+p3+p4) contains Class 1 (p1), Class 2 (p2), Class 3 (p3), and Class 4 (p4):

E(C2) = - (p1/p)*log2(p1/p) - (p2/p)*log2(p2/p) - (p3/p)*log2(p3/p) - (p4/p)*log2(p4/p)

E(Ai) = (m/n)*E(C1) + (p/n)*E(C2) + ...
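Continuing the illustrative Python sketch from the entropy slide (again my own addition, not from the slides), E(Ai) can be computed by partitioning the objects on an attribute's values and weighting each subset's entropy by its relative size; here `objects` is assumed to be a list of dictionaries, one per row of the training set:

    from collections import Counter

    def partition_entropy(objects, attribute, class_key="Class"):
        # E(Ai): split `objects` into disjoint subsets Ck by the values of `attribute`,
        # then return sum_k (nk/n) * E(Ck)
        n = len(objects)
        subsets = {}
        for obj in objects:
            subsets.setdefault(obj[attribute], []).append(obj)
        return sum((len(subset) / n) *
                   entropy(list(Counter(o[class_key] for o in subset).values()))
                   for subset in subsets.values())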

Page 11

Information Gain Due to Attribute Partition

Gi = E(C) - E(Ai)

Set C (total objects n = n1+n2+n3+n4) has entropy E(C). Set C is partitioned into subsets C1 (m = m1+m2+m3+m4), C2 (p = p1+p2+p3+p4), ... by attribute Ai; the entropy of the resulting partial tree of C (based on attribute Ai) is E(Ai).

Thus, the information gain due to the partition by the attribute Ai is Gi = E(C) - E(Ai).

Page 12

ID3 Algorithm

1. Start from the root node and assign the root node as the current node C.
2. If all objects in the current node C belong to the same class, then stop (the termination condition for the current node C); otherwise go to step 3.
3. Calculate the entropy E(C) for the node C.
4. Calculate the entropy E(Ai) of the partial tree partitioned by an attribute Ai which has not been used as a classifying attribute of the node C.
5. Compute the information gain Gi for the partial tree (i.e., Gi = E(C) - E(Ai)).

Page 13

ID3 Algorithm (Cont’d)

6. Repeat steps 4 and 5 for each attribute which has not been used as a classifying attribute of the node C.
7. Select the attribute with the maximum information gain (max Gi) as the classifying attribute for the node C.
8. Create child nodes C1, C2, ..., Cn (assuming the selected attribute has n values) for the node C, and assign objects in the node C to the appropriate child nodes according to the values of the classifying attribute.
9. Mark the selected attribute as a classifying attribute of each node Ci. For each node Ci, assign it as the current node and go to step 2.
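The nine steps can be collapsed into a short recursive routine. The following sketch is my own condensed rendering (the name `build_id3` and the nested-dictionary tree format are assumptions, and the degenerate cases of an empty subset or of running out of attributes are omitted); it reuses `entropy`, `partition_entropy`, and `Counter` from the earlier sketches:

    def build_id3(objects, attributes, class_key="Class"):
        # returns a nested-dict decision tree; leaves are class labels
        classes = {o[class_key] for o in objects}
        if len(classes) == 1:                        # step 2: termination condition
            return classes.pop()
        base = entropy(list(Counter(o[class_key] for o in objects).values()))
        gains = {a: base - partition_entropy(objects, a, class_key)   # steps 3-6
                 for a in attributes}
        best = max(gains, key=gains.get)             # step 7: maximum information gain
        tree = {best: {}}
        remaining = [a for a in attributes if a != best]
        for value in {o[best] for o in objects}:     # steps 8-9: split and recurse
            subset = [o for o in objects if o[best] == value]
            tree[best][value] = build_id3(subset, remaining, class_key)
        return tree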

Page 14

Example (See Slide 5)

• Current node C = root node of the tree.
• Entropy of the node C = E(C) = -(9/14)log2(9/14) - (5/14)log2(5/14) = 0.940

Class P: objects 3, 4, 5, 7, 9, 10, 11, 12, 13
Class N: objects 1, 2, 6, 8, 14

Page 15

Example (Cont’d)

• Entropy of the partial tree based on the Outlook attribute:

E(Outlook=Sunny) = -(3/5)log2(3/5) - (2/5)log2(2/5) = 0.971
E(Outlook=Overcast) = -(0/4)log2(0/4) - (4/4)log2(4/4) = 0
E(Outlook=Rain) = -(2/5)log2(2/5) - (3/5)log2(3/5) = 0.971

E(Outlook) = (5/14)*E(Outlook=Sunny) + (4/14)*E(Outlook=Overcast) + (5/14)*E(Outlook=Rain) = 0.694

Page 16

Example (Cont’d)

• Information gain due to the partition by the Outlook attribute:
G(Outlook) = E(C) - E(Outlook) = 0.246

• Similarly, the information gains due to the partition by the Temperature, Humidity, and Windy attributes, respectively, are:
G(Temperature) = 0.029
G(Humidity) = 0.151
G(Windy) = 0.048

• Thus, the Outlook attribute is selected as the classifying attribute for the current node C, since its information gain is the largest among all of the attributes.
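These figures can be checked with the helper functions sketched earlier. The snippet below (my own verification, not part of the slides) encodes the Page 5 training set and prints the information gain of each attribute; note that it prints 0.247 and 0.152 for Outlook and Humidity where the slides round to 0.246 and 0.151:

    rows = [  # Page 5 training set: (Outlook, Temperature, Humidity, Windy, Class)
        ("Sunny", "Hot", "High", False, "N"),      ("Sunny", "Hot", "High", True, "N"),
        ("Overcast", "Hot", "High", False, "P"),   ("Rain", "Mild", "High", False, "P"),
        ("Rain", "Cool", "Normal", False, "P"),    ("Rain", "Cool", "Normal", True, "N"),
        ("Overcast", "Cool", "Normal", True, "P"), ("Sunny", "Mild", "High", False, "N"),
        ("Sunny", "Cool", "Normal", False, "P"),   ("Rain", "Mild", "Normal", False, "P"),
        ("Sunny", "Mild", "Normal", True, "P"),    ("Overcast", "Mild", "High", True, "P"),
        ("Overcast", "Hot", "Normal", False, "P"), ("Rain", "Mild", "High", True, "N"),
    ]
    keys = ("Outlook", "Temperature", "Humidity", "Windy", "Class")
    objects = [dict(zip(keys, row)) for row in rows]

    base = entropy([9, 5])                       # E(C) = 0.940
    for a in ("Outlook", "Temperature", "Humidity", "Windy"):
        print(a, round(base - partition_entropy(objects, a), 3))
    # prints roughly: Outlook 0.247, Temperature 0.029, Humidity 0.152, Windy 0.048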

Page 17

Example (Cont’d)

• The resulting partial decision tree is:

    Outlook
      Sunny    -> objects 1, 2, 8, 9, 11
      Overcast -> P (objects 3, 7, 12, 13)
      Rain     -> objects 4, 5, 6, 10, 14

• The analysis continues for the remaining nodes C1 and C2 (the Sunny and Rain branches) until all of the leaf nodes are associated with objects of the same class.

Page 18

Example (Cont’d)

• The resulting final decision tree is:

    Outlook
      Sunny    -> Humidity
        High   -> N  (objects 1, 2, 8)
        Normal -> P  (objects 9, 11)
      Overcast -> P  (objects 3, 7, 12, 13)
      Rain     -> Windy
        True   -> N  (objects 6, 14)
        False  -> P  (objects 4, 5, 10)

Page 19

Issues of Decision Tree

• How to deal with continuous attributes.
• Pruning the tree so that it is not overly sensitive to the training cases.
• A better metric than information gain to evaluate tree expansion: information gain tends to prefer attributes with more attribute values (one such metric is sketched below).
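One widely used alternative metric is the gain ratio of C4.5 (ID3's successor), which divides the information gain by the entropy of the attribute-value partition itself and thereby penalizes attributes with many values. A minimal sketch, assuming the `entropy` and `partition_entropy` helpers from the earlier slides:

    def gain_ratio(objects, attribute, class_key="Class"):
        # C4.5-style gain ratio = information gain / split information,
        # where split information is the entropy of the attribute-value partition itself
        base = entropy(list(Counter(o[class_key] for o in objects).values()))
        gain = base - partition_entropy(objects, attribute, class_key)
        split_info = entropy(list(Counter(o[attribute] for o in objects).values()))
        return gain / split_info if split_info > 0 else 0.0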

Page 20

Characteristics of Neural Network (“Connectionist”)

Architecture

• A neural network consists of many simple interconnected processing elements.
• The processing elements are often grouped together into linear arrays called “layers”.
• A neural network always has an input layer and an output layer, and may or may not have “hidden” layers.
• Each processing element has a number of inputs xi, each carrying a weight wji. The processing element sums the weighted inputs wji*xi and computes a single output signal yj that is a function f of that weighted sum.

Page 21

Characteristics of Neural Network (“Connectionist”)

Architecture (Cont’d)

• The function f, called the transfer function, is fixed for the life of the processing element. A typical transfer function is the sigmoid function.

• The function f is a design decision and cannot be changed dynamically. The weights wji, on the other hand, are variables and can be adjusted dynamically to produce a given output. This dynamic modification of weights is what allows a neural network to memorize information, to adapt, and to learn.

Page 22

Neural Network Processing Element

(Figure: a single processing element with inputs x1, x2, ..., xi entering through weights wj1, wj2, ..., wji, and a transfer function f producing the output yj.)

Page 23

Sigmoid Function

y = f(x) = 1 / (1 + e^(-x))

(Plot: the sigmoid curve, rising from 0 toward 1 as x goes from -15 to 15.)
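Putting the processing-element definition and the sigmoid together, the output of a single unit can be computed directly; this is a small illustrative sketch of yj = f(Σi wji*xi), not taken from the slides:

    from math import exp

    def sigmoid(x):
        # transfer function f(x) = 1 / (1 + e^(-x))
        return 1.0 / (1.0 + exp(-x))

    def unit_output(weights, inputs):
        # y_j = f( sum_i w_ji * x_i ) for a single processing element
        return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

    print(unit_output([0.5, -0.3, 0.8], [1.0, 2.0, 0.5]))  # ~0.574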

Page 24

Architecture of Three-Layer Neural Network

(Figure: an input layer, a hidden layer, and an output layer, each drawn as a row of units, with every unit in one layer connected to every unit in the next.)

Page 25

Backpropagation Network

• A fully connected, layered, feedforward neural network trained by propagating errors backward.
• Each unit (processing element) in one layer is connected in the forward direction to every unit in the next layer.
• A backpropagation network typically starts out with a random set of weights.
• The network adjusts its weights each time it sees an input-output pair. Each pair requires two stages: a forward pass and a backward pass.
• The forward pass involves presenting a sample input to the network and letting activations flow until they reach the output layer.

Page 26

Backpropagation Network (Cont’d)

• During the backward pass, the network’s actual output (from the forward pass) is compared with the target output and error estimates are computed for the output units. The weights connected to the output units can be adjusted in order to reduce those errors.

• We can then use the error estimates of the output units to derive error estimates for the units in the hidden layers. Finally, errors are propagated back to the connections stemming from the input units.
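As an illustration of the two passes (my own simplified sketch, not the slides' implementation), the routine below updates a one-hidden-layer network on a single input-output pair using the `sigmoid` unit defined earlier, squared error, and a fixed learning rate; bias terms are omitted for brevity:

    import random

    def train_pair(x, target, W_hidden, W_output, lr=0.5):
        # one forward pass and one backward pass for a single input-output pair;
        # W_hidden[j] holds the weights of hidden unit j, W_output[k] those of output unit k
        # forward pass: activations flow from the input layer to the output layer
        hidden = [sigmoid(sum(w * xi for w, xi in zip(W_hidden[j], x)))
                  for j in range(len(W_hidden))]
        output = [sigmoid(sum(w * h for w, h in zip(W_output[k], hidden)))
                  for k in range(len(W_output))]
        # backward pass: error estimates for the output units (sigmoid derivative is o*(1-o))
        delta_out = [(target[k] - output[k]) * output[k] * (1.0 - output[k])
                     for k in range(len(output))]
        # derive error estimates for the hidden units from those of the output units
        delta_hidden = [hidden[j] * (1.0 - hidden[j]) *
                        sum(delta_out[k] * W_output[k][j] for k in range(len(output)))
                        for j in range(len(hidden))]
        # adjust the weights to reduce the errors
        for k in range(len(W_output)):
            for j in range(len(hidden)):
                W_output[k][j] += lr * delta_out[k] * hidden[j]
        for j in range(len(W_hidden)):
            for i in range(len(x)):
                W_hidden[j][i] += lr * delta_hidden[j] * x[i]

    # the network typically starts out with a random set of weights (here 2 inputs, 3 hidden, 1 output)
    W_h = [[random.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(3)]
    W_o = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(1)]
    for _ in range(1000):
        train_pair([0.0, 1.0], [1.0], W_h, W_o)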

Page 27

Issues of Backpropagation Network

• How to present the data.
• How to decide the number of layers.
• The learning strategy.

Page 28

Bayesian Classification

• Bayesian classification is based on Bayes theorem.

• Bayesian classifiers predict class membership probabilities, such as the probability that a given sample belongs to a particular class.

• Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes.

• Bayesian belief networks are graphical models, which unlike naïve Bayesian classifiers, allow the representation of dependencies among subsets of attributes.

Page 29

Bayes Theorem

• Let H be a hypothesis and X be a data sample.
• P(H|X) is the posterior probability of H given X.
• P(X|H) is the posterior probability of X given H.
• P(H) is the prior probability of H.
• P(X), P(H), and P(X|H) may be estimated from the given data.

P(H|X) = P(X|H) P(H) / P(X)

Page 30

Naïve Bayesian Classification

• Assume a data sample X = (x1, x2, …, xn) with n attributes and an unknown class. The naïve Bayesian classifier predicts which class (among C1, C2, …, Cm) X belongs to as follows:

1. Compute the posterior probability, conditioned on X, for each class.
2. Assign X to the class that has the highest posterior probability, i.e., P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.

Page 31

Naïve Bayesian Classification (Cont’d)

• Since P(Ci|X) = P(X|Ci) P(Ci) / P(X), and P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized.

• In addition, the naïve Bayesian classifier assumes that there are no dependence relationships among the attributes. Thus,

P(X|Ci) = Π(k=1..n) P(Xk|Ci)

Page 32

Example

• To classify the data sample X = (Outlook = Sunny, Temperature = Hot, Humidity = Normal, Windy = False), we need to maximize P(X|Ci)P(Ci).

– Compute P(Ci):
P(Class = P) = 9/14 = 0.643
P(Class = N) = 5/14 = 0.357

– Compute P(Xk|Ci):
P(Outlook = Sunny | Class = P) = 2/9 = 0.222
P(Outlook = Sunny | Class = N) = 3/5 = 0.600
P(Temperature = Hot | Class = P) = 2/9 = 0.222
P(Temperature = Hot | Class = N) = 2/5 = 0.400
P(Humidity = Normal | Class = P) = 6/9 = 0.667
P(Humidity = Normal | Class = N) = 1/5 = 0.200
P(Windy = False | Class = P) = 6/9 = 0.667
P(Windy = False | Class = N) = 2/5 = 0.400

Page 33

Example (Cont’d)

– Compute P(X|Ci):
P(X | Class = P) = 0.222 x 0.222 x 0.667 x 0.667 = 0.022
P(X | Class = N) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019

– Compute P(X|Ci)P(Ci):
P(X | Class = P)P(Class = P) = 0.022 x 0.643 = 0.014
P(X | Class = N)P(Class = N) = 0.019 x 0.357 = 0.007

• Conclusion: X belongs to Class P.
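The same calculation can be reproduced in a few lines of Python; the sketch below (my own illustration) reuses the `objects` list built from the Page 5 training set earlier, estimates P(Ci) and each P(Xk|Ci) by counting, and selects the class that maximizes P(X|Ci)P(Ci):

    def naive_bayes_classify(objects, sample, class_key="Class"):
        # return the class Ci maximizing P(X|Ci) * P(Ci),
        # with each P(Xk|Ci) estimated by simple counting
        n = len(objects)
        best_class, best_score = None, -1.0
        for ci, count in Counter(o[class_key] for o in objects).items():
            score = count / n                          # P(Ci)
            members = [o for o in objects if o[class_key] == ci]
            for attr, value in sample.items():         # P(X|Ci) = prod_k P(Xk|Ci)
                score *= sum(1 for o in members if o[attr] == value) / count
            if score > best_score:
                best_class, best_score = ci, score
        return best_class, best_score

    X = {"Outlook": "Sunny", "Temperature": "Hot", "Humidity": "Normal", "Windy": False}
    print(naive_bayes_classify(objects, X))   # ('P', ~0.014): X belongs to Class P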