Decision Tree Algorithm (twang/595DM/Slides/Week4.pdf)

Page 1:

Decision Tree Algorithm

Week 4

Page 2:

Team Homework Assignment #5

• Read pp. 105–117 of the textbook.
• Do Examples 3.1, 3.2, 3.3 and Exercise 3.4 (a). Prepare to present the results of the homework assignment.
• Due date: beginning of the lecture on Friday, February 25th.

Page 3:

Team Homework Assignment #6

• Decide on a data warehousing tool for your future homework assignments.
• Play with the data warehousing tool.
• Due date: beginning of the lecture on Friday, February 25th.

Page 4:

Classification - A Two-Step Process

• Model usage: classifying future or unknown objects
  – Estimate accuracy of the model
    • The known label of test data is compared with the classified result from the model
    • Accuracy rate is the percentage of test-set samples that are correctly classified by the model
  – If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
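The accuracy-rate step above is easy to make concrete. A minimal sketch in Python (the labels below are hypothetical, not from the slides' data):

```python
def accuracy_rate(known_labels, classified_labels):
    """Percentage of test-set samples correctly classified by the model."""
    correct = sum(k == c for k, c in zip(known_labels, classified_labels))
    return correct / len(known_labels)

# Hypothetical test set: compare known labels against the model's output
known = ["yes", "no", "yes", "yes", "no"]
classified = ["yes", "no", "no", "yes", "no"]
print(accuracy_rate(known, classified))  # 4 of 5 correct -> 0.8
```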

Page 5:

Process (1): Model Construction

Figure 6.1  The data classification process: (a) Learning: Training data are analyzed by a classification algorithm. Here, the class label attribute is loan_decision, and the learned model or classifier is represented in the form of classification rules.

Page 6:

Figure 6.1  The data classification process: (b) Classification: Test data are used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the rules can be applied to the classification of new data tuples.

Page 7:

Decision Tree Classification Example

Page 8:

Decision Tree Learning Overview

• Decision tree learning is one of the most widely used and practical methods for inductive inference over supervised data.
• A decision tree represents a procedure for classifying categorical data based on their attributes.
• It is also efficient for processing large amounts of data, so it is often used in data mining applications.
• The construction of a decision tree does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery.
• The representation of acquired knowledge in tree form is intuitive and easy for humans to assimilate.

Page 9:

Decision Tree Algorithm – ID3

• Decide which attribute (splitting point) to test at node N by determining the "best" way to separate or partition the tuples in D into individual classes
• The splitting criterion is determined so that, ideally, the resulting partitions at each branch are as "pure" as possible
  – A partition is pure if all of the tuples in it belong to the same class

Page 10:

Figure 6.3  Basic algorithm for inducing a decision tree from training examples.
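The figure itself is not reproduced in this transcript, but the induction loop it describes (pick the best splitting attribute, partition the tuples on it, recurse until a partition is pure or no attributes remain) can be sketched as follows. This is a simplified illustration with my own function names, not the textbook's exact pseudocode:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy of D minus the weighted entropy of the partitions on attr."""
    subsets = {}
    for row, lab in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(lab)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    # Stop when the partition is pure ...
    if len(set(labels)) == 1:
        return labels[0]
    # ... or when no attributes remain (fall back to a majority vote)
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise split on the attribute with the highest information gain
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    tree = {}
    for value in {row[best] for row in rows}:
        sub = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows, sub_labels = [r for r, _ in sub], [l for _, l in sub]
        tree[value] = build_tree(sub_rows, sub_labels,
                                 [a for a in attrs if a != best])
    return {best: tree}
```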

Page 11:

What is Entropy?

• Entropy is a measure of the uncertainty associated with a random variable
• As uncertainty and/or randomness increases for a result set, so does the entropy
• Values range from 0 to 1 to represent the entropy of information

Entropy(D) ≡ - Σ_{i=1}^{c} p_i log2(p_i)
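A direct Python sketch of the formula (the function name is my own):

```python
from math import log2

def entropy(probabilities):
    """Entropy(D) = -sum over classes of p_i * log2(p_i); zero-probability
    classes contribute nothing and are skipped."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Two equally likely classes give maximum uncertainty:
print(entropy([0.5, 0.5]))  # 1.0
# The 9-yes / 5-no class split used later in these slides:
print(round(entropy([9/14, 5/14]), 3))  # 0.94
```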

Page 12:

Entropy Example (1)

Page 13:

Entropy Example (2)

Page 14:

Entropy Example (3)

Page 15:

Entropy Example (4)

Page 16:

Information Gain

• Information gain is used as an attribute selection measure.
• Pick the attribute that has the highest information gain.

Gain(A, D) = Entropy(D) - Σ_{j=1}^{v} (|D_j| / |D|) Entropy(D_j)

D: a given data partition
A: an attribute
v: the number of distinct values of attribute A. Partitioning the tuples in D on A splits D into v partitions or subsets {D1, D2, ..., Dv}, where Dj contains those tuples in D that have outcome aj of A.
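This measure can be sketched directly in Python (helper names `entropy` and `info_gain` are mine; the age split used as a check comes from the worked example on the following slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(values, labels):
    """Gain(A, D): entropy of D minus the weighted entropy of the
    partitions D_j induced by attribute A's distinct values."""
    n = len(labels)
    partitions = {}
    for v, lab in zip(values, labels):
        partitions.setdefault(v, []).append(lab)
    weighted = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - weighted

# Splitting the 14 tuples on age (youth: 2 yes / 3 no, middle_aged: 4 yes,
# senior: 3 yes / 2 no) reproduces the slides' Gain(age, D) = 0.246
# (the exact value, 0.2468, rounds to 0.247; the slide rounds intermediates).
age = ["youth"] * 5 + ["middle_aged"] * 4 + ["senior"] * 5
buys = ["yes", "yes", "no", "no", "no"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]
print(info_gain(age, buys))
```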

Page 17:

Table 6.1  Class-labeled training tuples from AllElectronics customer database.

Page 18:

• Class P: buys_computer = "yes"
• Class N: buys_computer = "no"

Entropy(D) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

• Compute the expected information requirement for each attribute: start with the attribute age

Gain(age, D) = Entropy(D) - Σ_{v ∈ {youth, middle_aged, senior}} (|S_v| / |S|) Entropy(S_v)
             = Entropy(D) - (5/14) Entropy(S_youth) - (4/14) Entropy(S_middle_aged) - (5/14) Entropy(S_senior)
             = 0.246

Gain(income, D) = 0.029
Gain(student, D) = 0.151
Gain(credit_rating, D) = 0.048

Page 19:

Figure 6.5  The attribute age has the highest information gain and therefore becomes the splitting attribute at the root node of the decision tree. Branches are grown for each outcome of age. The tuples are shown partitioned accordingly.

Page 20:

Figure 6.2  A decision tree for the concept buys_computer, indicating whether a customer at AllElectronics is likely to purchase a computer. Each internal (nonleaf) node represents a test on an attribute. Each leaf node represents a class (either buys_computer = yes or buys_computer = no).

Page 21:

Exercise

Construct a decision tree to classify "golf play."

Weather and Possibility of Golf Play

Weather  Temperature  Humidity  Wind  Golf Play
fine     hot          high      none  no
fine     hot          high      few   no
cloud    hot          high      none  yes
rain     warm         high      none  yes
rain     cold         medium    none  yes
rain     cold         medium    few   no
cloud    cold         medium    few   yes
fine     warm         high      none  no
fine     cold         medium    none  yes
rain     warm         medium    none  yes
fine     warm         medium    few   yes
cloud    warm         high      few   yes
cloud    hot          medium    none  yes
rain     warm         high      few   no
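Applying the same information-gain computation to the table above identifies the root attribute of the exercise's tree. A sketch (the `info_gain` helper is my own, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, col):
    parts = {}
    for row, lab in zip(rows, labels):
        parts.setdefault(row[col], []).append(lab)
    rem = sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return entropy(labels) - rem

# The 14 rows of the "golf play" table: (weather, temperature, humidity, wind)
rows = [
    ("fine", "hot", "high", "none"), ("fine", "hot", "high", "few"),
    ("cloud", "hot", "high", "none"), ("rain", "warm", "high", "none"),
    ("rain", "cold", "medium", "none"), ("rain", "cold", "medium", "few"),
    ("cloud", "cold", "medium", "few"), ("fine", "warm", "high", "none"),
    ("fine", "cold", "medium", "none"), ("rain", "warm", "medium", "none"),
    ("fine", "warm", "medium", "few"), ("cloud", "warm", "high", "few"),
    ("cloud", "hot", "medium", "none"), ("rain", "warm", "high", "few"),
]
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

names = ["Weather", "Temperature", "Humidity", "Wind"]
gains = {name: info_gain(rows, play, i) for i, name in enumerate(names)}
best = max(gains, key=gains.get)
print(best)  # Weather has the highest gain, so it becomes the root split
```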
