Decision Tree Learning. ACM Student Chapter, Heritage Institute of Technology, 3rd February 2012. SIGKDD presentation by Satarupa Guha, Sudipto Banerjee, and Ashish Baheti.

TRANSCRIPT

  • Slide 1
  • ACM Student Chapter, Heritage Institute of Technology, 3rd February 2012. SIGKDD presentation by Satarupa Guha, Sudipto Banerjee, Ashish Baheti.
  • Slide 2
  • Machine Learning: A computer program is said to learn from experience E with respect to a class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.
  • Slide 3
  • An Example: Checkers Learning Problem. Task T: playing checkers. Performance P: percent of games won against opponents. Experience E: gained by playing against itself.
  • Slide 4
  • Concept Learning: Much of learning involves acquiring general concepts from specific training examples. Concept learning can be formulated as a problem of searching through a predefined space of potential hypotheses for the hypothesis that best fits the training examples.
  • Slide 5
  • Representing Hypotheses: Let H be a hypothesis space; each h belonging to H is a conjunction of literals. Let X be the set of possible instances, each described by a set of attributes. Target function c: X -> {0, 1}. Training examples D: positive and negative examples of the target function.
  • Slide 6
  • Types of Training Examples. Positive examples: those training examples that satisfy the target function, i.e., for which c(x) = 1 (TRUE). Negative examples: those training examples that do not satisfy the target function, i.e., for which c(x) = 0 (FALSE).
  • Slide 7
  • Attribute Types: Nominal / Categorical, Ordinal, Continuous.
  • Slide 8
  • Inductive learning hypothesis: Any hypothesis found to approximate the target function well over a sufficiently large set of training examples will also approximate the target function well over other unobserved examples. A hypothesis h is said to be consistent with a set of training examples D of target concept c iff h(x) = c(x) for each training example x in D.
  • Slide 9
  • Classification Techniques: Decision Tree based methods, Rule-based methods, Memory based reasoning, Neural Networks, Naïve Bayes and Bayesian Belief Networks, Support Vector Machines.
  • Slide 10
  • Decision Tree: The goal is to create a model that predicts the value of a target variable based on several input variables.
  • Slide 11
  • Decision tree representation: Each internal node tests an attribute. Each branch corresponds to an attribute value. Each leaf node assigns a classification.
  • Slide 12
  • A quick recap: CNF = Conjunctive Normal Form; DNF = Disjunctive Normal Form.
  • Slide 13
  • Disjunctive Normal Form In Boolean Algebra, a formula is in DNF if it is a disjunction of clauses, where a clause is a conjunction of literals. Also known as Sum of Products. Example: (A ^ B ^ C) V (B ^ C)
  • Slide 14
  • Conjunctive Normal Form: In Boolean algebra, a formula is in CNF if it is a conjunction of clauses, where a clause is a disjunction of literals. Also known as Product of Sums. Example: (A V B V C) ^ (B V C)
  • Slide 15
  • Decision Tree, contd.: Decision trees represent a disjunction (OR) of conjunctions (AND) of constraints on the attribute values of instances. Each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions. Hence, a decision tree represents a DNF.
  • Slide 16
  • Attribute splitting: 2-way split or multi-way split.
  • Slide 17
  • Splitting Based on Nominal Attributes. Multi-way split: use as many partitions as distinct values (e.g., CarType split into Family, Sports, Luxury). Binary split: divides the values into two subsets and requires finding the optimal partitioning (e.g., {Family, Luxury} vs {Sports}, or {Sports, Luxury} vs {Family}).
  • Slide 18
  • Splitting Based on Ordinal Attributes. Multi-way split: use as many partitions as distinct values (e.g., Size split into Small, Medium, Large). Binary split: divides the values into two subsets that respect the ordering; need to find the optimal partitioning (e.g., {Small} vs {Medium, Large}, or {Small, Medium} vs {Large}).
  • Slide 19
  • Splitting Based on Continuous Attributes: typically handled either by discretizing the attribute into ordinal ranges, or by a binary test of the form (A < v) versus (A >= v), with the threshold v chosen to optimize the impurity measure.
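The slide's figure is not reproduced in the transcript. As a rough illustration (not from the slides), the sketch below picks a binary threshold for a continuous attribute by trying midpoints between consecutive sorted values and scoring them with the Gini index; the income values and labels are made up for illustration.

```python
# Sketch: choosing a binary split threshold for a continuous attribute
# by minimising the weighted Gini index of the two partitions.
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels: 1 - sum_j p_j^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between consecutive sorted values; return the threshold
    with the lowest weighted Gini of the resulting two partitions."""
    pairs = sorted(zip(values, labels))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                      # no usable midpoint between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for v, lab in pairs if v < t]
        right = [lab for v, lab in pairs if v >= t]
        w = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if w < best[1]:
            best = (t, w)
    return best

# Hypothetical taxable-income values (in thousands) and class labels.
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["no", "no", "no", "yes", "yes", "yes", "no", "no", "no", "no"]
print(best_threshold(income, cheat))      # -> (97.5, 0.3) for this made-up data
```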
  • Slide 20
  • Example of a Decision Tree (figure): a model built from training data with categorical attributes (Refund, MarSt), a continuous attribute (TaxInc), and a class label. The splitting attributes are Refund (Yes / No), MarSt (Married vs Single, Divorced), and TaxInc (< 80K vs > 80K), with YES / NO class labels at the leaves.
  • Slide 21
  • DT Classification Task
  • Slide 22
  • Measures of Node Impurity: Entropy, GINI Index, Misclassification Error.
  • Slide 23
  • Entropy: It characterizes the impurity of an arbitrary collection of examples; it is a measure of randomness. Entropy(S) = -p+ log2 p+ - p- log2 p-, where S is a collection containing positive and negative examples of some target concept, p+ is the proportion of positive examples in S, and p- is the proportion of negative examples in S.
  • Slide 24
  • An example of Entropy: Let S be a collection of 14 examples, with 9 positive and 5 negative examples, denoted [9+, 5-]. Then Entropy([9+, 5-]) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940.
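As a quick check of the arithmetic above, a minimal Python sketch of the two-class entropy formula (the function name is ours, not from the slides):

```python
import math

def entropy(pos, neg):
    """Two-class entropy: -p+ log2 p+ - p- log2 p- (with 0 log 0 taken as 0)."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count:                      # skip empty classes: 0 * log(0) -> 0
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))    # 0.940...  (the [9+, 5-] collection above)
print(entropy(7, 7))    # 1.0       (equal numbers of + and -)
print(entropy(14, 0))   # 0.0       (all members in one class)
```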
  • Slide 25
  • More on Entropy: In general, entropy is 0 if all members belong to the same class; 1 if the collection contains equal numbers of positive and negative examples; and between 0 and 1 if there are unequal numbers of positive and negative examples.
  • Slide 26
  • GINI Index for a given node t: GINI(t) = 1 - sum over classes j of [p(j | t)]^2, where p(j | t) is the relative frequency of class j at node t. Maximum (1 - 1/n_c, with n_c the number of classes) when records are equally distributed among all classes, implying least interesting information. Minimum (0.0) when all records belong to one class, implying most interesting information.
  • Slide 27
  • Examples for computing GINI. P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 - 0^2 - 1^2 = 0. P(C1) = 1/6, P(C2) = 5/6: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278. P(C1) = 2/6, P(C2) = 4/6: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444.
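The three computations above can be reproduced with a small helper of our own (assuming the formula GINI(t) = 1 - sum_j p(j | t)^2):

```python
def gini(counts):
    """Gini index from per-class record counts at a node."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))   # 0.0   -> purest node (all records in one class)
print(round(gini([1, 5]), 3))   # 0.278
print(round(gini([2, 4]), 3))   # 0.444
print(round(gini([3, 3]), 3))   # 0.5   -> the maximum 1 - 1/n_c for n_c = 2 classes
```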
  • Slide 28
  • Splitting Based on GINI: when a node p is split into k partitions (children), the quality of the split is computed as GINI_split = sum for i = 1..k of (n_i / n) GINI(i), where n_i is the number of records at child i and n is the number of records at node p.
  • Slide 29
  • Binary Attributes: Computing GINI Index. The test B? splits the records into two partitions, nodes N1 and N2 (branches Yes / No). Effect of weighting partitions: larger and purer partitions are sought. GINI(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408. GINI(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320. GINI(children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371.
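The weighted (split) Gini above can likewise be checked; `gini_split` below is our own name for the quantity GINI_split = sum_i (n_i / n) GINI(i):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split: sum over children of (n_i / n) * GINI(i)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Node N1 holds [5, 2] records and node N2 holds [1, 4] records (7 + 5 = 12 total).
print(round(gini([5, 2]), 3))                   # 0.408
print(round(gini([1, 4]), 3))                   # 0.32
print(round(gini_split([[5, 2], [1, 4]]), 3))   # 0.371
```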
  • Slide 30
  • Categorical Attributes: Computing Gini Index. For each distinct value, gather counts for each class in the dataset and use the count matrix to make decisions. Either a multi-way split, or a two-way split (find the best partition of values).
  • Slide 31
  • A set of training examples:
    Day  Outlook   Humidity  Wind    Play tennis
    D1   sunny     high      weak    no
    D2   sunny     high      strong  no
    D3   overcast  high      weak    yes
    D4   rain      high      weak    yes
    D5   rain      normal    weak    yes
    D6   rain      normal    strong  no
    D7   overcast  normal    strong  yes
    D8   sunny     high      weak    no
    D9   sunny     normal    weak    yes
    D10  rain      normal    weak    yes
  • Slide 32
  • Decision Tree Learning Algorithms: variations of a core algorithm that employs a top-down, greedy search through the space of possible decision trees. Examples are Hunt's Algorithm, CART, ID3, C4.5, SLIQ, SPRINT, and MARS.
  • Slide 33
  • Algorithm ID3
  • Slide 34
  • Algorithm ID3: a greedy algorithm that grows the tree top-down. It begins with the question "which attribute should be tested at the root of the tree?" A statistical property called information gain is used to answer it.
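A compact sketch of the ID3 idea (our own simplified implementation, not the exact pseudocode from the slides): pick the attribute with the highest information gain, split on it, and recurse until the examples at a node are pure or no attributes remain.

```python
import math
from collections import Counter

def entropy(examples, target):
    counts = Counter(ex[target] for ex in examples)
    n = len(examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(examples, attr, target):
    n = len(examples)
    remainder = 0.0
    for value in {ex[attr] for ex in examples}:
        subset = [ex for ex in examples if ex[attr] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    labels = [ex[target] for ex in examples]
    if len(set(labels)) == 1:              # pure node -> leaf
        return labels[0]
    if not attributes:                     # no attributes left -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(examples, a, target))
    tree = {best: {}}
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)
    return tree

# Usage (hypothetical): each example is a dict, e.g.
# {"outlook": "sunny", "humidity": "high", "wind": "weak", "play": "no"}
# tree = id3(examples, ["outlook", "humidity", "wind"], target="play")
```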
  • Slide 35
  • Information Gain: the expected reduction in entropy caused by partitioning the examples according to a particular attribute. The gain of an attribute A relative to a collection of examples S is defined as Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|S_v| / |S|) Entropy(S_v), where Values(A) is the set of all possible values of attribute A, and S_v is the subset of S for which attribute A has value v.
  • Slide 36
  • Information Gain, contd.: Gain(S, A) is the information provided about the target function value, given the value of some other attribute A. Example: S is a collection described by attributes including Wind, which can take the values Weak or Strong. Assume S has 14 examples; then S = [9+, 5-], S_weak = [6+, 2-], S_strong = [3+, 3-].
  • Slide 37
  • Information Gain, contd.: Gain(S, Wind) = Entropy(S) - (8/14) Entropy(S_weak) - (6/14) Entropy(S_strong) = 0.940 - (8/14)(0.811) - (6/14)(1.000) = 0.048.
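The number above can be reproduced directly from the counts (a small check of ours in Python):

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

# S = [9+, 5-],  S_weak = [6+, 2-],  S_strong = [3+, 3-]
gain = entropy(9, 5) - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(round(gain, 3))   # 0.048
```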
  • Slide 38
  • Play Tennis example, revisited:
    Day  Outlook   Humidity  Wind    Play tennis
    D1   sunny     high      weak    no
    D2   sunny     high      strong  no
    D3   overcast  high      weak    yes
    D4   rain      high      weak    yes
    D5   rain      normal    weak    yes
    D6   rain      normal    strong  no
    D7   overcast  normal    strong  yes
    D8   sunny     high      weak    no
    D9   sunny     normal    weak    yes
    D10  rain      normal    weak    yes
  • Slide 39
  • Application of ID3 on Play Tennis: there are three attributes, Outlook, Humidity and Wind, and we need to choose one of them as the root of the tree. We make this choice based on the information gain (IG) of each attribute; the one with the highest IG becomes the root. The calculations are shown in the following slides.
  • Slide 40
  • Quick recap of formulae. Entropy(S) = p+ log2(1/p+) + p- log2(1/p-). Information Gain: Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|S_v| / |S|) Entropy(S_v), where S is the collection, A is a particular attribute, Values(A) is the set of all possible values of A, and S_v is the subset of S for which attribute A has value v.
  • Slide 41
  • Calculations, for Outlook: The training set has 6 positive and 4 negative examples, so Entropy(S) = (6/10) log2(10/6) + (4/10) log2(10/4) = 0.970. Outlook can take 3 values: sunny [1+, 3-], rain [3+, 1-], overcast [2+, 0-]. Entropy(sunny) = (1/4) log2 4 + (3/4) log2(4/3) = 0.811. Entropy(rain) = (3/4) log2(4/3) + (1/4) log2 4 = 0.811. Entropy(overcast) = 0, since all its examples are positive.
  • Slide 42
  • Calculations: |S_v|/|S| for each value is: sunny = 4/10 (4 of the 10 examples have sunny as their outlook), rain = 4/10, overcast = 2/10. Hence, information gain of Outlook = 0.970 - (4/10 * 0.811 + 4/10 * 0.811 + 2/10 * 0) ≈ 0.322.
  • Slide 43
  • Calculations, for Humidity: The training set has 6 positive and 4 negative examples, so Entropy(S) = 0.970. Humidity can take 2 values: high [2+, 3-] and normal [4+, 1-]. Entropy(high) = (2/5) log2(5/2) + (3/5) log2(5/3) = 0.970. Entropy(normal) = (1/5) log2 5 + (4/5) log2(5/4) = 0.722.
  • Slide 44
  • Calculations: |S_v|/|S| for high = 5/10 and for normal = 5/10. Hence IG(Humidity) = 0.970 - (5/10 * 0.970 + 5/10 * 0.722) ≈ 0.125. Similarly for Wind, the IG is 0.091. Hence IG(Outlook) = 0.322, IG(Humidity) = 0.125, IG(Wind) = 0.091. Comparing the IGs of the three attributes, Outlook has the highest IG (0.322).
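A small script (ours, not from the slides) that recomputes the three gains from the training table on slide 31:

```python
import math
from collections import Counter

# (outlook, humidity, wind, play) -- the table from slide 31
data = [
    ("sunny", "high", "weak", "no"),        ("sunny", "high", "strong", "no"),
    ("overcast", "high", "weak", "yes"),    ("rain", "high", "weak", "yes"),
    ("rain", "normal", "weak", "yes"),      ("rain", "normal", "strong", "no"),
    ("overcast", "normal", "strong", "yes"),("sunny", "high", "weak", "no"),
    ("sunny", "normal", "weak", "yes"),     ("rain", "normal", "weak", "yes"),
]

def entropy(rows):
    n = len(rows)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(r[-1] for r in rows).values())

def gain(rows, col):
    n = len(rows)
    remainder = 0.0
    for v in {r[col] for r in rows}:
        sub = [r for r in rows if r[col] == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(rows) - remainder

for name, col in [("outlook", 0), ("humidity", 1), ("wind", 2)]:
    print(name, round(gain(data, col), 3))
# outlook 0.322, humidity 0.125, wind 0.091 -> Outlook is chosen as the root
```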
  • Slide 45
  • Partially formed tree: Hence Outlook is chosen as the root of the decision tree. The partially formed tree has Outlook at the root with three branches: Sunny [1+, 3-], Overcast [2+, 0-] (a pure "yes" leaf), and Rain [3+, 1-].
  • Slide 46
  • Further calculations: Since the Sunny and Rain branches contain both positive and negative examples, they have a fair degree of randomness and need to be classified further. For Sunny: as computed earlier, Entropy(sunny) = 0.811. We now need the Humidity and Wind values of those training examples that have Outlook = sunny.
  • Slide 47
  • Further calculations. Training examples with Outlook = sunny:
    Day  Outlook  Humidity  Wind    Play tennis
    D1   sunny    high      weak    no
    D2   sunny    high      strong  no
    D8   sunny    high      weak    no
    D9   sunny    normal    weak    yes
    For Humidity: (|S_v|/|S|) * Entropy(high) = (3/4) * 0 = 0 and (|S_v|/|S|) * Entropy(normal) = (1/4) * 0 = 0.
  • Slide 48
  • Calculations: The entropies are zero because there is no randomness: all examples with Humidity = high have Play tennis = no, and those with Humidity = normal have Play tennis = yes. IG(S_sunny, Humidity) = 0.811 - 0 - 0 = 0.811. For Wind: (|S_v|/|S|) * Entropy(weak) = (3/4) * ((2/3) log2(3/2) + (1/3) log2 3) = (3/4) * 0.918 = 0.688, and (|S_v|/|S|) * Entropy(strong) = (1/4) * 0 = 0, so IG(S_sunny, Wind) = 0.811 - 0.688 = 0.123. Clearly, Humidity has the higher IG.
  • Slide 49
  • Partially formed tree after splitting the Sunny branch: Outlook at the root; the Sunny branch [1+, 3-] is split on Humidity, with High [0+, 3-] a "no" leaf and Normal [1+, 0-] a "yes" leaf; Overcast [2+, 0-] is a "yes" leaf; the Rain branch [3+, 1-] is still to be split.
  • Slide 50
  • Further Calculations. Now for Rain [3+, 1-]:
    Day  Outlook  Humidity  Wind    Play tennis
    D4   rain     high      weak    yes
    D5   rain     normal    weak    yes
    D6   rain     normal    strong  no
    D10  rain     normal    weak    yes
  • Slide 51
  • Further Calculations. Entropy(rain) = (3/4) log2(4/3) + (1/4) log2 4 = 0.811. Checking the IG of Wind: Entropy(weak) = 0 (all three weak examples are "yes") and Entropy(strong) = 0 (the single strong example is "no"), hence IG(S_rain, Wind) = 0.811 - 0 - 0 = 0.811. Checking the IG of Humidity: Entropy(high) = 0 and Entropy(normal) = (1/3) log2 3 + (2/3) log2(3/2) = 0.918, hence IG(S_rain, Humidity) = 0.811 - 0 - (3/4) * 0.918 = 0.123. Wind has the higher IG, so the Rain branch is split on Wind.
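The branch-level figures can be checked the same way; the helpers from the earlier sketch are repeated so this snippet stands alone (again our own code, not from the slides):

```python
import math
from collections import Counter

data = [  # (outlook, humidity, wind, play) -- same table as slide 31
    ("sunny", "high", "weak", "no"),        ("sunny", "high", "strong", "no"),
    ("overcast", "high", "weak", "yes"),    ("rain", "high", "weak", "yes"),
    ("rain", "normal", "weak", "yes"),      ("rain", "normal", "strong", "no"),
    ("overcast", "normal", "strong", "yes"),("sunny", "high", "weak", "no"),
    ("sunny", "normal", "weak", "yes"),     ("rain", "normal", "weak", "yes"),
]

def entropy(rows):
    n = len(rows)
    return -sum(c / n * math.log2(c / n)
                for c in Counter(r[-1] for r in rows).values())

def gain(rows, col):
    n = len(rows)
    remainder = 0.0
    for v in {r[col] for r in rows}:
        sub = [r for r in rows if r[col] == v]
        remainder += len(sub) / n * entropy(sub)
    return entropy(rows) - remainder

sunny = [r for r in data if r[0] == "sunny"]
rain = [r for r in data if r[0] == "rain"]
print(round(gain(sunny, 1), 3), round(gain(sunny, 2), 3))   # 0.811 0.123
print(round(gain(rain, 2), 3), round(gain(rain, 1), 3))     # 0.811 0.123
# Humidity wins under Sunny; Wind wins under Rain.
```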
  • Slide 52
  • Final Decision Tree: Outlook at the root. Outlook = Sunny leads to a Humidity test (High -> NO, Normal -> YES); Outlook = Overcast leads directly to YES; Outlook = Rain leads to a Wind test (Weak -> YES, Strong -> NO).
  • Slide 53
  • Play Tennis, contd.: This decision tree corresponds to the following expression: (Outlook = Sunny ^ Humidity = Normal) V (Outlook = Overcast) V (Outlook = Rain ^ Wind = Weak). As we can see, this is a disjunction of conjunctions (a DNF).
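Written as code, the learned tree is literally this disjunction of conjunctions (a hypothetical helper of ours, just to make the DNF reading concrete):

```python
def play_tennis(outlook, humidity, wind):
    """The final decision tree expressed as a DNF boolean expression."""
    return ((outlook == "sunny" and humidity == "normal")
            or outlook == "overcast"
            or (outlook == "rain" and wind == "weak"))

print(play_tennis("sunny", "high", "weak"))     # False (matches D1)
print(play_tennis("overcast", "high", "weak"))  # True  (matches D3)
```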
  • Slide 54
  • Features of ID3: Maintains only a single current hypothesis as it searches the space of decision trees. No backtracking at any step. Uses all training examples at each step in the search.
  • Slide 55
  • Inductive bias in decision tree learning: Inductive bias is the set of assumptions that, together with the training data, deductively justify the classifications assigned by the learner to future instances.
  • Slide 56
  • Inductive bias in decision tree learning: Roughly, the ID3 search strategy selects in favor of shorter trees over longer ones, and selects trees that place attributes with the highest information gain closest to the root. ID3 employs a preference bias.
  • Slide 57
  • Occam's razor: Prefer the simplest hypothesis that fits the data. Justification: there are fewer short (hence simple) hypotheses than long ones, so it is less likely that one will find a short hypothesis that coincidentally fits the training data. So we might believe a 5-node tree is less likely to be a statistical coincidence, and prefer this hypothesis over the 500-node hypothesis.
  • Slide 58
  • Slide 59
  • A few terms: Accuracy, Error Rate, Precision, Recall.
  • Slide 60
  • An Example: suppose there are 2 classes, Class 1 and Class 0.
  • Slide 61
  • Confusion matrix (rows: actual class, columns: predicted class):
                      Predicted Class 1   Predicted Class 0
    Actual Class 1        f11                 f10
    Actual Class 0        f01                 f00
  • Slide 62
  • Meanings of the terms: let us assume that class 1 is +ve and class 0 is -ve. Then f11 means a +ve example predicted as +ve; f10 means a +ve example predicted as -ve; f01 means a -ve example predicted as +ve; f00 means a -ve example predicted as -ve.
  • Slide 63
  • Hence f11 and f00 are the cases that have been predicted correctly, and f10 and f01 are the cases predicted with an error. Accuracy = (f11 + f00) / (f11 + f10 + f01 + f00). Error rate = (f01 + f10) / (f11 + f10 + f01 + f00). Clearly, accuracy + error rate = 1.
  • Slide 64
  • Precision = f11 / (f11 + f01). Recall = f11 / (f11 + f10). Example: suppose there are actually 8 batsmen present, and the prediction says there are 7 batsmen. Of these 7, 5 really are batsmen and 2 are bowlers. So precision is 5/7 and recall is 5/8.
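The four measures can be summarized in a small sketch (the function name is ours); the last lines re-check the batsman example:

```python
def metrics(f11, f10, f01, f00):
    """f11/f00: correctly predicted + and -;  f10/f01: misclassified + and -."""
    total = f11 + f10 + f01 + f00
    accuracy = (f11 + f00) / total
    error_rate = (f01 + f10) / total       # accuracy + error_rate == 1
    precision = f11 / (f11 + f01)          # of those predicted +, how many are +
    recall = f11 / (f11 + f10)             # of the actual +, how many were found
    return accuracy, error_rate, precision, recall

# Batsman example: 8 actual batsmen, 7 predicted, of which 5 really are batsmen.
# So f11 = 5, f01 = 2 (bowlers called batsmen), f10 = 3 (batsmen missed);
# f00 is not given in the example and is set to 0, which does not affect
# precision or recall.
_, _, precision, recall = metrics(f11=5, f10=3, f01=2, f00=0)
print(precision, recall)    # 0.714... (= 5/7) and 0.625 (= 5/8)
```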
  • Slide 65
  • Over-fitting: A hypothesis over-fits the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances. Formally: given a hypothesis space H, a hypothesis h in H is said to over-fit the training data if there exists some alternative hypothesis h' in H such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances.
  • Slide 66
  • Over-fitting (figure): solid line, accuracy over the training data; broken line, accuracy over an independent set of test examples not included in the training examples.
  • Slide 67
  • Causes of Over-fitting: When the training examples contain random errors / noise. When the training data consists of a small number of examples.
  • Slide 68
  • Two approaches to avoid over-fitting: 1. Stop growing the tree early, before it reaches the point where it perfectly classifies the training data. 2. Allow the tree to over-fit the data first, then post-prune it.
  • Slide 69
  • How to determine the correct final size of the tree: Use a separate set of examples, called the validation set, distinct from the training examples, to evaluate the utility of post-pruning nodes from the tree. Alternatively, use all the available data for training, but apply a statistical test to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set.
  • Slide 70
  • Pruning the decision tree There are many methods by which a decision tree can be pruned: 1. Reduced Error Pruning 2. Rule Post Pruning
  • Slide 71
  • Reduced Error Pruning Consider each of the decision nodes in the tree to be candidates for pruning. Pruning a node consists of removing the sub-tree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.
  • Slide 72
  • Reduced Error Pruning (contd.): Nodes are pruned iteratively, always choosing the node whose removal most increases the decision tree's accuracy over the validation set. Pruning continues until further pruning would reduce the accuracy of the decision tree over the validation set.
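A rough sketch of the procedure just described, using our own simplified tree representation (nested dicts with an assumed "majority" field holding the most common training class at each node); the tree and validation examples below are made up for illustration.

```python
def classify(node, example):
    while "label" not in node:
        node = node["branches"].get(example[node["attr"]],
                                    {"label": node["majority"]})
    return node["label"]

def accuracy(tree, rows):
    return sum(classify(tree, row[:-1]) == row[-1] for row in rows) / len(rows)

def internal_nodes(node):
    if "label" not in node:
        yield node
        for child in node["branches"].values():
            yield from internal_nodes(child)

def reduced_error_prune(tree, validation):
    """Greedily replace subtrees by majority-class leaves while the accuracy
    on the validation set does not decrease."""
    while True:
        best_acc, best_node = accuracy(tree, validation), None
        for node in list(internal_nodes(tree)):
            saved = dict(node)
            node.clear()
            node["label"] = saved["majority"]      # try pruning at this node
            acc = accuracy(tree, validation)
            node.clear()
            node.update(saved)                     # undo the trial prune
            if acc >= best_acc:
                best_acc, best_node = acc, node
        if best_node is None:
            return tree
        label = best_node["majority"]
        best_node.clear()
        best_node["label"] = label                 # commit the best prune

# The play-tennis tree from slide 52 with (made-up) majority labels, and a
# hypothetical validation set in which the Wind test under Rain does not help.
tree = {"attr": 0, "majority": "yes", "branches": {
    "sunny": {"attr": 1, "majority": "no",
              "branches": {"high": {"label": "no"}, "normal": {"label": "yes"}}},
    "overcast": {"label": "yes"},
    "rain": {"attr": 2, "majority": "yes",
             "branches": {"weak": {"label": "yes"}, "strong": {"label": "no"}}}}}
validation = [("sunny", "high", "weak", "no"), ("sunny", "normal", "weak", "yes"),
              ("overcast", "high", "weak", "yes"), ("rain", "normal", "strong", "yes"),
              ("rain", "normal", "weak", "yes")]
reduced_error_prune(tree, validation)
print(tree["branches"]["rain"])   # {'label': 'yes'} -> the Wind test was pruned
```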
  • Slide 73
  • Reduced Error Pruning Performance
  • Slide 74
  • Rule Post Pruning The steps of Post Pruning are: Infer the decision tree from the training set, and allow over-fitting to occur. Convert the learned tree into an equivalent set of rules by creating one rule for each path.
  • Slide 75
  • Rule Post Pruning: Prune each rule by removing any preconditions whose removal improves its estimated accuracy. Then sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
  • Slide 76
  • Post Pruning (contd.) In rule post-pruning, one rule is generated for each leaf node in the tree. Each attribute test along the path from the root to the leaf becomes a rule antecedent (pre-condition) and the classification at the leaf node becomes the rule consequent (post-condition). Let us consider an example:
  • Slide 77
  • Step 1: Infer the decision tree from the training set, growing the tree until the training data is fit as well as possible and allowing over-fitting to occur.
  • Slide 78
  • Step 2: Convert the learned tree into an equivalent set of rules by creating one rule for each path from the root node to a leaf node.
  • Slide 79
  • Step 3: Prune (generalize) each rule by removing any preconditions that result in improving its estimated accuracy.
  • Slide 80
  • Step 3 :contd.
  • Slide 81
  • Step 4. Sort the pruned rules by their estimated accuracy, and consider them in this sequence when classifying subsequent instances.
  • Slide 82
  • Post Pruning for the Binary Case. For any node S that is not a leaf, with children S1, ..., Sm: BackUpError(S) = sum over i of P_i * Error(S_i), where P_i = (number of examples in S_i) / (number of examples in S). Error(S) = MIN{ E(S), BackUpError(S) }. For leaf nodes, Error(S_i) = E(S_i). Decision: prune at S if BackUpError(S) >= E(S).
  • Slide 83
  • Example of Post Pruning (before pruning). [x, y] means x YES cases and y NO cases. The tree: root a [6, 4] (E = 0.417) has children b [4, 2] (E = 0.375) and c [2, 2] (E = 0.5); b has leaves [3, 2] (error 0.429) and [1, 0] (error 0.333); c has a leaf [1, 0] (error 0.333) and a node d [1, 2] (E = 0.4); d has leaves [0, 1] (error 0.333) and [1, 1] (error 0.5). BackUpError(b) = (5/6)(0.429) + (1/6)(0.333) = 0.413 >= 0.375, so PRUNE at b. BackUpError(d) = (1/3)(0.333) + (2/3)(0.5) = 0.444 >= 0.4, so PRUNE at d. BackUpError(c) = (1/4)(0.333) + (3/4)(0.4) = 0.383 < 0.5, so keep c (Error(c) = 0.383). BackUpError(a) = (6/10)(0.375) + (4/10)(0.383) = 0.378 < 0.417, so keep a. "PRUNE" means cut the sub-tree below that point.
  • Slide 84
  • Result of Pruning: after pruning, the root a [6, 4] keeps two children: a leaf [4, 2] (the pruned node b) and node c [2, 2]; c keeps a leaf [1, 0] and a leaf [1, 2] (the pruned node d).
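The arithmetic in this example can be checked with a short script. The slides do not spell out E(S); the leaf errors shown (0.429, 0.5, 0.333) are consistent with the Laplace estimate E(S) = (N - n_majority + 1) / (N + 2), so that estimate is assumed below.

```python
def laplace_error(yes, no):
    """Assumed error estimate E(S) = (N - n_majority + 1) / (N + 2)."""
    n = yes + no
    return (n - max(yes, no) + 1) / (n + 2)

def backed_up(node):
    """Return (Error(S), prune_here) for a node given as (yes, no, children)."""
    yes, no, children = node
    e = laplace_error(yes, no)
    if not children:                       # leaf: Error(S) = E(S)
        return e, False
    n = yes + no
    backup = sum((child[0] + child[1]) / n * backed_up(child)[0]
                 for child in children)
    return min(e, backup), backup >= e     # prune if BackUpError(S) >= E(S)

# Tree from the slide: a[6,4] -> b[4,2], c[2,2];  b -> [3,2], [1,0];
# c -> [1,0], d[1,2];  d -> [0,1], [1,1].
d = (1, 2, [(0, 1, []), (1, 1, [])])
b = (4, 2, [(3, 2, []), (1, 0, [])])
c = (2, 2, [(1, 0, []), d])
a = (6, 4, [b, c])
for name, node in [("a", a), ("b", b), ("c", c), ("d", d)]:
    err, prune = backed_up(node)
    print(name, round(err, 3), "PRUNE" if prune else "keep")
# a 0.378 keep, b 0.375 PRUNE, c 0.383 keep, d 0.4 PRUNE
```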
  • Slide 85
  • Advantages of DT: Simple to understand and interpret. Requires little data preparation. Able to handle both numerical and categorical data. Possible to validate a model using statistical tests. Robust. Performs well with large data in a short time.
  • Slide 86
  • Limitations of DT: The problem of learning an optimal decision tree is known to be NP-complete. Prone to over-fitting. There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. For data including categorical variables with different numbers of levels, information gain in decision trees is biased in favor of attributes with more levels.