CS690L Data Mining: Classification
Reference:
J. Han and M. Kamber, Data Mining: Concepts and Techniques
Yong Fu: http://web.umr.edu/~yongjian/cs401dm/
Classification
• Classification: determine the class or category of an object based on its properties
• Two stages
  – Learning stage: construction of a classification function or model
  – Classification stage: prediction of the classes of objects using the function or model
• Tools for classification
  – Decision trees
  – Bayesian networks
  – Neural networks
  – Regression
• Problem
  – Given a set of objects whose classes are known, called the training set, derive a classification model that can correctly classify future objects
Classification: Decision Tree
• Classification model: decision tree
• Method: Top-Down Induction of Decision Trees
• Data representation:
  – Every object is represented by a vector of values over a fixed set of attributes. If a relation is defined on the attributes, an object is a tuple in the relation.
  – A special attribute, the class attribute, gives the group/category the object belongs to; it is the dependent attribute to be predicted.
• Learning stage:
  – Induction of a decision tree that classifies the training set
• Classification stage:
  – The decision tree classifies new objects.
An Example
• Definitions: A decision tree is a tree in which each non-leaf node corresponds to an attribute of the objects, and each branch from a non-leaf node to a child represents a value of that attribute. Each leaf node is labeled by a class of the objects.
• Classification using decision trees: Starting from the root, an object follows a path to a leaf node, taking branches according to its attribute values along the way; the leaf gives the class of the object (a small code sketch of this walk follows).
• Alternative view of a decision tree
  – Node/branch: a discrimination test
  – Node: the subset of objects satisfying the tests on the path to it
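A minimal Python sketch of this tree walk, assuming a nested `(attribute, branches)` representation for non-leaf nodes and a bare class label for leaves (this representation is an assumption of these sketches, matching the ID3 sketch later in these notes, not something fixed by the slides):

```python
def classify(tree, obj):
    """Walk from the root to a leaf, taking the branch that matches
    the object's value for each tested attribute."""
    while isinstance(tree, tuple):       # non-leaf node: (attribute, branches)
        attribute, branches = tree
        tree = branches[obj[attribute]]  # follow the branch for this value
    return tree                          # leaf node: the predicted class
```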
Decision Tree Induction
• Induction of decision trees: Starting from the training set, recursively select attributes to split nodes, thus partitioning the objects.
  – Termination condition: when to stop splitting a node
  – Selection of the attribute for the splitting test:
    • Best split
    • A measure for splitting?
• ID3 algorithm
  – Selection: attribute information gain
  – Termination condition: all objects at a node are in a single class
ID3 Algorithm
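The slides show the algorithm itself only as images. As a stand-in, here is a minimal Python sketch of top-down induction with information gain, assuming each object is a row with categorical attribute values and the class label in the last column; continuous attributes would need the binary-split handling described under C4.5 below:

```python
import math
from collections import Counter

def entropy(rows):
    """Ent(C) = -sum p log2 p over the class distribution (class in last column)."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return -sum(n / total * math.log2(n / total) for n in counts.values())

def id3(rows, attrs):
    """Top-down induction: pick the attribute with the highest
    information gain, split the objects, and recurse."""
    classes = {row[-1] for row in rows}
    if len(classes) == 1 or not attrs:            # termination: pure node
        return Counter(row[-1] for row in rows).most_common(1)[0][0]

    def remainder(a):                             # Ent(C, A): expected
        groups = Counter(row[a] for row in rows)  # information after split
        return sum(n / len(rows) *
                   entropy([r for r in rows if r[a] == v])
                   for v, n in groups.items())

    best = min(attrs, key=remainder)              # max gain = min remainder
    branches = {v: id3([r for r in rows if r[best] == v],
                       [a for a in attrs if a != best])
                for v in {row[best] for row in rows}}
    return (best, branches)
```

Here `id3` returns either a class label (a leaf) or a pair `(attribute, branches)`; since Gain(C, A) = Ent(C) - Ent(C, A) and Ent(C) is fixed at a node, minimizing the remainder maximizes the gain.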
Example
• Information content of C (expected information for the classification):
  I(P) = Ent(C) = -((9/14) log2(9/14) + (5/14) log2(5/14)) = 0.940
• For each attribute Ai:
  – Step 1: Compute the entropy of each subset induced by Ai:
    Ent(Sunny) = -((2/5) log2(2/5) + (3/5) log2(3/5)) = 0.97
    Ent(Rainy) = 0.97
    Ent(Overcast) = 0
  – Step 2: Compute the expected information based on the partitioning into subsets by Ai:
    Ent(C, Outlook) = (5/14) Ent(Sunny) + (5/14) Ent(Rainy) + (4/14) Ent(Overcast)
                    = (5/14)(0.97) + (5/14)(0.97) + (4/14)(0) = 0.69
  – Step 3: Gain(C, Outlook) = Ent(C) - Ent(C, Outlook) = 0.940 - 0.69 = 0.25
• Select the attribute that maximizes information gain
• Build a node for the selected attribute
• Recursively build nodes (a short sketch of these computations follows).
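These numbers can be checked mechanically. A minimal Python sketch, using only the class counts from the worked example above (9 P / 5 DP overall; 2/3, 3/2, and 4/0 in the Sunny, Rainy, and Overcast subsets):

```python
import math

def ent(*counts):
    """Entropy of a class distribution given raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

ent_c = ent(9, 5)                                # I(P) = Ent(C) = 0.940
ent_c_outlook = (5/14) * ent(2, 3) \
              + (5/14) * ent(3, 2) \
              + (4/14) * ent(4)                  # Ent(C, Outlook) = 0.69
gain = ent_c - ent_c_outlook                     # Gain(C, Outlook) = 0.25
print(round(ent_c, 3), round(ent_c_outlook, 2), round(gain, 2))
```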
Example: Decision Tree Building
Level 1: Decision Tree Building
Split on Outlook:

Outlook = Sunny

| Temp | Hum | Wind  | Class |
|------|-----|-------|-------|
| 85   | 85  | False | DP    |
| 80   | 90  | True  | DP    |
| 72   | 95  | False | DP    |
| 69   | 70  | False | P     |
| 75   | 70  | True  | P     |

Outlook = Overcast

| Temp | Hum | Wind  | Class |
|------|-----|-------|-------|
| 83   | 88  | False | P     |
| 64   | 65  | True  | P     |
| 72   | 90  | False | P     |
| 81   | 75  | False | P     |

Outlook = Rainy

| Temp | Hum | Wind  | Class |
|------|-----|-------|-------|
| 70   | 96  | False | P     |
| 71   | 80  | False | P     |
| 72   | 70  | True  | DP    |
| 75   | 80  | False | P     |
| 71   | 96  | True  | DP    |
Decision Tree
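The tree appears only as an image in the source. From the Level 1 partitions above it presumably has this shape; the Overcast and Rainy branches follow directly from the tables, while the Humidity threshold is an assumption (any cut between 70 and 85 separates the Sunny subset):

• Outlook = Overcast → P
• Outlook = Sunny: Humidity <= 70 → P; Humidity > 70 → DP
• Outlook = Rainy: Wind = False → P; Wind = True → DP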
Generated Rules
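The rules likewise appear only as an image. Reading one rule off each root-to-leaf path of the tree sketched above (the Humidity threshold remains an assumption):

• IF Outlook = Overcast THEN P
• IF Outlook = Sunny AND Humidity <= 70 THEN P
• IF Outlook = Sunny AND Humidity > 70 THEN DP
• IF Outlook = Rainy AND Wind = False THEN P
• IF Outlook = Rainy AND Wind = True THEN DP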
C4.5 Extensions to ID3
• Gain ratio: Gain favors attributes with many values; the gain ratio corrects for this (a numeric sketch follows this list).
  GainRatio(C, A) = Gain(C, A) / Ent(P), where P = (|T1|/|C|, |T2|/|C|, ..., |Tm|/|C|)
  and the Ti are the partitions of C based on the objects' values of A. E.g.,
  GainRatio(Outlook) = Gain(Outlook) / -{(5/14) log2(5/14) + (5/14) log2(5/14) + (4/14) log2(4/14)}
• Missing values:
  – consider only the objects for which the attribute is defined.
• Continuous attributes:
  – consider all binary splits A <= ai and A > ai, where ai is the ith value of A.
  – compute the gain or gain ratio and choose the split that maximizes it.
• Over-fitting: change the termination condition; if a subtree is dominated by one class, stop splitting.
• Tree pruning: replace a subtree by a single leaf node when the expected classification error can be reduced.
• Rule deriving: a rule corresponds to a path from the root to a leaf; the LHS is the conjunction of the tests along the path and the RHS is the class prediction.
• Rule simplification: remove some conditions from the LHS.
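A minimal numeric sketch of the gain-ratio correction in Python, reusing Gain(Outlook) = 0.25 and the partition sizes 5/5/4 from the worked example (`ent` is the same assumed entropy-from-counts helper as in the earlier sketch):

```python
import math

def ent(*counts):
    """Entropy of a distribution given raw counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

gain_outlook = 0.25           # Gain(C, Outlook) from the earlier example
split_info = ent(5, 5, 4)     # Ent(P) = -sum (|Ti|/|C|) log2 (|Ti|/|C|)
print(round(gain_outlook / split_info, 3))   # GainRatio(Outlook) ~ 0.159
```

The split information for Outlook is about 1.577 bits, so GainRatio(Outlook) is roughly 0.159; an attribute with many small partitions gets a large Ent(P) and hence a reduced ratio.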
Evaluation of Decision Tree Methods
• Complexity
• Expressive power
• Robustness
• Effectiveness