machine learning lecture 3

Lecture No. 3 Ravi Gupta AU-KBC Research Centre, MIT Campus, Anna University Date: 12.3.2008

Upload: srinivasan-r

Post on 17-May-2015

DESCRIPTION

Machine learning lecture series by Ravi Gupta, AU-KBC in MIT

TRANSCRIPT

Page 1: Machine learning Lecture 3

Lecture No. 3

Ravi Gupta, AU-KBC Research Centre,

MIT Campus, Anna University

Date: 12.3.2008

Page 2: Machine learning Lecture 3

Today’s Agenda

• Recap of ID3 Algorithm

• Machine Learning Bias

• Occam’s razor principle

• Handling ID3 problems

Page 3: Machine learning Lecture 3

Decision Trees

• Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree.

• Decision trees can also be represented as sets of if-then-else rules.

• Decision tree learning is one of the most widely used approaches for inductive inference.

Page 4: Machine learning Lecture 3

Decision Trees

[Figure: schematic decision tree. Intermediate nodes are attributes (A1 at the root, A2 and A3 below it), edges are attribute values, and leaf nodes are output values.]

Page 5: Machine learning Lecture 3

Decision Trees Representation

[Figure: decision tree annotated with the labels "conjunction" and "disjunction".]

Page 6: Machine learning Lecture 3

Decision Trees as If-then-else Rules

[Figure labels: conjunction, disjunction]

• If (Outlook = Sunny AND Humidity = Normal) then PlayTennis = Yes

• If (Outlook = Overcast) then PlayTennis = Yes

• If (Outlook = Rain AND Wind = Weak) then PlayTennis = Yes
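The three rules can be written directly as code. This is only a sketch: the function name `play_tennis` is made up for illustration, and returning "No" when no rule fires is an assumption, since the slide lists only the positive rules.

```python
def play_tennis(outlook, humidity, wind):
    """Apply the three if-then rules; each rule is a conjunction of tests,
    and the rule set as a whole is their disjunction."""
    if outlook == "Sunny" and humidity == "Normal":
        return "Yes"
    if outlook == "Overcast":
        return "Yes"
    if outlook == "Rain" and wind == "Weak":
        return "Yes"
    # Assumption: anything not covered by a positive rule is classified "No".
    return "No"
```

For example, `play_tennis("Rain", "High", "Weak")` fires the third rule, whatever the humidity is.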

Page 7: Machine learning Lecture 3

Problems Suitable for Decision Trees

• Instances are represented by attribute-value pairs

• The target function has discrete output values

• Disjunctive descriptions may be required

• The training data may contain errors

• The training data may contain missing attribute values

Page 8: Machine learning Lecture 3

Building Decision Tree

[Figure: schematic decision tree with attribute A1 at the root, attributes A2 and A3 below it, attribute values on the edges, and output values at the leaves.]

Page 9: Machine learning Lecture 3

Building Decision Tree

Candidate attributes for the root node: Outlook, Temperature, Humidity, Wind.

Which attribute should be selected?

Page 10: Machine learning Lecture 3

Entropy

Given a collection S containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification (yes/no) is

$$Entropy(S) \equiv -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}$$

where $p_{\oplus}$ is the proportion of positive examples in S and $p_{\ominus}$ is the proportion of negative examples in S. In all calculations involving entropy we define 0 log 0 to be 0.
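A minimal sketch of this definition for the boolean case. The function name `entropy` and the choice to pass raw counts instead of proportions are assumptions made for illustration.

```python
import math

def entropy(pos, neg):
    """Entropy (in bits) of a collection with `pos` positive and `neg`
    negative examples. Zero counts are skipped, which implements the
    convention 0 log 0 = 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            result -= p * math.log2(p)
    return result
```

An evenly split collection gives `entropy(7, 7) == 1.0`, and a pure one gives `entropy(14, 0) == 0.0`.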

Page 11: Machine learning Lecture 3

Information Gain Measure

Information gain is simply the expected reduction in entropy caused by partitioning the examples according to a given attribute.

More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as

$$Gain(S, A) \equiv Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$

where Values(A) is the set of all possible values for attribute A, and $S_v$ is the subset of S for which attribute A has value v, i.e., $S_v = \{ s \in S \mid A(s) = v \}$.

Page 12: Machine learning Lecture 3

Information Gain Measure

Gain(S, A) = (entropy of S) − (expected entropy of S after partitioning on A)

Gain(S, A) is the expected reduction in entropy caused by knowing the value of attribute A.

Gain(S, A) is the information provided about the target function value, given the value of some other attribute A. The value of Gain(S, A) is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.

Page 13: Machine learning Lecture 3

Example

There are 14 examples: 9 positive and 5 negative [9+, 5-].

The entropy of S relative to this boolean (yes/no) classification is

$$Entropy([9+, 5-]) = -\frac{9}{14} \log_2 \frac{9}{14} - \frac{5}{14} \log_2 \frac{5}{14} = 0.940$$

Page 14: Machine learning Lecture 3

Gain (S, Attribute = Wind)
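The worked numbers for this slide did not survive extraction. The sketch below assumes the standard PlayTennis data from Mitchell's textbook, in which Wind = Weak covers [6+, 2-] and Wind = Strong covers [3+, 3-]; those counts, and the helper names `entropy` and `gain`, are assumptions and are not taken from the slide.

```python
import math

def entropy(pos, neg):
    """Boolean entropy in bits, with 0 log 0 treated as 0."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        if count > 0:
            p = count / total
            result -= p * math.log2(p)
    return result

def gain(parent, partitions):
    """Information gain of a split.

    `parent` is a (pos, neg) pair for S; `partitions` holds one
    (pos, neg) pair per attribute value, i.e. one per subset S_v."""
    total = sum(parent)
    expected = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(*parent) - expected

# Assumed counts: S = [9+, 5-], Weak = [6+, 2-], Strong = [3+, 3-].
wind_gain = gain((9, 5), [(6, 2), (3, 3)])
print(round(wind_gain, 3))  # 0.048
```

Under these assumed counts Wind gives a gain of about 0.048 bits, so it is a weak candidate for the root node.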

Page 15: Machine learning Lecture 3

Final Decision Tree

Page 16: Machine learning Lecture 3

Some Insights into Capabilities and Limitations of ID3 Algorithm

• ID3 searches a complete hypothesis space. [Advantage]

• ID3 maintains only a single current hypothesis as it searches through the space of decision trees. By determining only a single hypothesis, ID3 loses the capabilities that follow from explicitly representing all consistent hypotheses. [Disadvantage]

• ID3 in its pure form performs no backtracking in its search. Once it selects an attribute to test at a particular level in the tree, it never backtracks to reconsider this choice. Therefore, it is susceptible to the usual risks of hill-climbing search without backtracking: converging to locally optimal solutions that are not globally optimal. [Disadvantage]

Page 17: Machine learning Lecture 3

Some Insights into Capabilities and Limitations of ID3 Algorithm

• ID3 uses all training examples at each step in the search to make statistically based decisions regarding how to refine its current hypothesis. This contrasts with methods that make decisions incrementally, based on individual training examples (e.g., FIND-S or CANDIDATE-ELIMINATION). One advantage of using statistical properties of all the examples (e.g., information gain) is that the resulting search is much less sensitive to errors in individual training examples. [Advantage]

Page 18: Machine learning Lecture 3

Machine Learning Biases

• Language Bias/Restriction Bias: A restriction on the type of hypotheses that can be learned. (It limits the set of hypotheses that can be learned or expressed.)

• Preference Bias/Search Bias: A preference for certain hypotheses over others (e.g., shorter hypotheses), with no hard restriction on the hypothesis space.

Page 19: Machine learning Lecture 3

CANDIDATE-ELIMINATION Algorithm

Page 20: Machine learning Lecture 3

CANDIDATE-ELIMINATION Algorithm

Each hypothesis was assumed to be a conjunction of attribute constraints.

Page 21: Machine learning Lecture 3

CANDIDATE-ELIMINATION Algorithm

The CANDIDATE-ELIMINATION algorithm has a language (restriction) bias.

Page 22: Machine learning Lecture 3

CANDIDATE-ELIMINATION Algorithm

The problem is that the algorithm is biased to consider only the space of conjunctive hypotheses.

The following example requires a more expressive hypothesis space.

Page 23: Machine learning Lecture 3

Building Decision Tree

[Figure: schematic decision tree with attribute A1 at the root, attributes A2 and A3 below it, attribute values on the edges, and output values at the leaves.]

Page 24: Machine learning Lecture 3

Decision Tree

The ID3 algorithm has a preference (search) bias.

Page 25: Machine learning Lecture 3

ID3 Strategy for Selecting Hypothesis

• Selects trees that place the attributes with highest information gain closest to the root.

• Selects in favor of shorter trees over longer ones.

Page 26: Machine learning Lecture 3

Preference Bias or Restriction Bias ?

A preference bias is more desirable than a restriction bias, because it allows the learner to work within a complete hypothesis space that is assured to contain the unknown target function.

In contrast, a restriction bias that strictly limits the set of potential hypotheses is generally less desirable, because it introduces the possibility of excluding the unknown target function altogether.

Page 27: Machine learning Lecture 3

Preference Bias or Restriction Bias ?

ID3 exhibits a pure preference bias and CANDIDATE-ELIMINATION a pure restriction bias; some learning systems combine both.

Page 28: Machine learning Lecture 3

Preference Bias AND Restriction Bias ?

Page 29: Machine learning Lecture 3

Preference Bias AND Restriction Bias ?

• Task T: playing checkers

• Performance measure P: % of games won in the world tournament

• Training experience E: games played against itself

• Target function: F : Board → R

• Target function representation:

F'(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6

A linear combination of variables (Language Bias/Restriction Bias)

Page 30: Machine learning Lecture 3

Preference Bias AND Restriction Bias ?

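$$E \equiv \sum_{\langle b,\, F_{train}(b) \rangle \in \text{training examples}} \left( F_{train}(b) - F'(b) \right)^2$$

Here E is the error summed over the training examples.

Preference Bias (because the weights are found with the Least Mean Squares technique)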
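As a rough illustration of how Least Mean Squares searches the space of linear functions, the sketch below applies the usual LMS rule, moving each weight in proportion to the current error and the corresponding feature value. The function names, the learning rate `lr`, and the feature vectors are assumptions for illustration; the slides do not give the update rule itself.

```python
def predict(weights, features):
    """F'(b) = w0 + w1*x1 + ... + wn*xn; `features` excludes the constant term."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], features))

def lms_update(weights, features, target, lr=0.1):
    """One LMS step: w_i <- w_i + lr * (F_train(b) - F'(b)) * x_i, with x_0 = 1."""
    error = target - predict(weights, features)
    updated = [weights[0] + lr * error]
    updated += [w + lr * error * x for w, x in zip(weights[1:], features)]
    return updated

def squared_error(weights, examples):
    """E: sum of squared differences over (features, F_train value) pairs."""
    return sum((target - predict(weights, x)) ** 2 for x, target in examples)
```

The representation (a linear function) fixes what can be expressed, which is the restriction bias. The update rule only decides which of the expressible functions is preferred, which is the preference bias.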

Page 31: Machine learning Lecture 3

Issues in Decision Tree Learning

• Determining how deeply to grow the decision tree

• Handling continuous attributes

• Choosing an appropriate attribute selection measure

• Handling training data with missing attribute values

• Handling attributes with differing costs, and improving computational efficiency

Page 32: Machine learning Lecture 3

Occam’s Razor

Occam's razor (sometimes spelled Ockham's razor) is a principle attributed to the 14th-century English logician and Franciscan friar William of Ockham.

The principle states that the explanation of any phenomenon should make as few assumptions as possible, eliminating those that make no difference in the observable predictions of the explanatory hypothesis or theory.

Page 33: Machine learning Lecture 3

Occam’s Razor

This is often paraphrased as "All other things being equal, the simplest solution is the best."

In other words, when multiple competing theories are equal in other respects, the principle recommends selecting the theory that introduces the fewest assumptions and postulates the fewest entities. It is in this sense that Occam's razor is usually understood.

Prefer the simplest hypothesis that fits the data

Page 34: Machine learning Lecture 3

Why it’s called Occam’s Razor

Tom M. Mitchell says: Occam got this idea while shaving.

Wikipedia says: The term razor refers to the act of shaving away unnecessary assumptions to get to the simplest explanation.

Page 35: Machine learning Lecture 3

ID3 Strategy for Selecting Hypothesis

• Selects trees that place the attributes with highest information gain closest to the root.

• Selects in favor of shorter trees over longer ones.

Page 36: Machine learning Lecture 3

Problem with Occam’s Razor

Why should the simplest hypothesis that fits the data be the best solution? Why not the second simplest or the third simplest hypothesis?

The size of a hypothesis is determined by the particular representation used internally by the learner. Two learners using different internal representations could therefore arrive at different hypotheses, both justifying their contradictory conclusions by Occam's razor!

Page 37: Machine learning Lecture 3

Training and Testing

For classification problems, a classifier’s performance is measured in terms of the error rate.

The classifier predicts the class of each instance: if it is correct, that is counted as a success; if not, it is an error.

The error rate is just the proportion of errors made over a whole set of instances, and it measures the overall performance of the classifier.

Page 38: Machine learning Lecture 3

Training and Testing

What we are interested in is the likely future performance on new data, not the past performance on old data. We already know the classifications of each instance in the training set, which after all is why we can use it for training.

We are not generally interested in learning about those classifications—although we might be if our purpose is data cleansing rather than prediction.

So the question is, is the error rate on old data likely to be a good indicator of the error rate on new data?

The answer is a resounding no—not if the old data was used during the learning process to train the classifier.

Page 39: Machine learning Lecture 3

Training and Testing

Error rate on the training set is not likely to be a good indicator of future performance.

Page 40: Machine learning Lecture 3

Training and Testing

Self-consistency test: the training and test datasets are the same.

The error rate on the training data is called the resubstitution error, because it is calculated by resubstituting the training instances into a classifier that was constructed from them.

Page 41: Machine learning Lecture 3

Training and Testing

Holdout strategy: The holdout method reserves a certain amount of the data for testing and uses the remainder for training (and sets part of that aside for validation, if required).

In practical scenarios, however, only a limited number of examples is available.

Page 42: Machine learning Lecture 3

Training and Testing

K-fold Cross validation technique:

In k-fold cross-validation, the dataset is partitioned randomly into k equal-sized sets. Training and testing are carried out k times, each time using one distinct set for testing and the other k-1 sets for training.

Page 43: Machine learning Lecture 3

4-Fold Cross-validation

Page 44: Machine learning Lecture 3

4-Fold Cross-validation

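[Figure: round 1. One of the four folds is the test dataset and the other three form the training dataset; the resulting accuracy is ACC1.]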

Page 45: Machine learning Lecture 3

4-Fold Cross-validation

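[Figure: round 2. A different fold is the test dataset and the other three form the training dataset; the resulting accuracy is ACC2.]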

Page 46: Machine learning Lecture 3

4-Fold Cross-validation

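[Figure: round 3. A different fold is the test dataset and the other three form the training dataset; the resulting accuracy is ACC3.]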

Page 47: Machine learning Lecture 3

4-Fold Cross-validation

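[Figure: round 4. The remaining fold is the test dataset and the other three form the training dataset; the resulting accuracy is ACC4.]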

Page 48: Machine learning Lecture 3

4-Fold Cross-validation

ACC = (ACC1 + ACC2 + ACC3 + ACC4) / 4
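The procedure can be sketched in a few lines. Everything named here (`k_fold_indices`, `cross_validate`, and the toy majority-class learner) is hypothetical and only meant to show the bookkeeping; any learner with a train step and an accuracy measure could be plugged in.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(examples, k, train_fn, accuracy_fn, seed=0):
    """Train k times, each time testing on one fold and training on the other k-1."""
    folds = k_fold_indices(len(examples), k, seed)
    accuracies = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = train_fn([examples[j] for j in train_idx])
        accuracies.append(accuracy_fn(model, [examples[j] for j in test_idx]))
    return sum(accuracies) / k  # ACC = (ACC1 + ... + ACCk) / k

# Toy learner (an assumption for illustration): always predict the majority class.
def train_majority(train):
    labels = [label for _, label in train]
    return max(set(labels), key=labels.count)

def majority_accuracy(model, test):
    return sum(1 for _, label in test if label == model) / len(test)
```

With `k = 4` this reproduces the four rounds above: every example is used for testing exactly once and for training three times.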

Page 49: Machine learning Lecture 3

Issues in Decision Tree Learning

• Determining how deeply to grow the decision tree

• Handling continuous attributes

• Choosing an appropriate attribute selection measure

• Handling training data with missing attribute values

• Handling attributes with differing costs, and improving computational efficiency

Page 50: Machine learning Lecture 3

Avoiding Overfitting in Decision Trees…..

• A hypothesis is said to be over-fitting the training examples if some other hypothesis that fits the training examples less well actually performs better over the entire distribution of instances (i.e., including instances beyond the training set).

Page 51: Machine learning Lecture 3

Overfitting

H: Hypothesis Space

Page 52: Machine learning Lecture 3

Overfitting

[Figure legend: negative example, positive example]

Page 53: Machine learning Lecture 3

Overfitting

h1 h2

Page 54: Machine learning Lecture 3

Overfitting

h1 is more accurate than h2 on the training examples.

Page 55: Machine learning Lecture 3

Overfitting

h1 is less accurate than h2 on the unseen (test) examples.

Page 56: Machine learning Lecture 3

Overfitting

Is h1 more accurate than h2 on the training examples?

• Yes: Is h1 more accurate than h2 on the test examples? If yes, there is no over-fitting; if no, there is over-fitting.

• No: Is h1 more accurate than h2 on the test examples? If yes, there is over-fitting; if no, there is no over-fitting.

Page 57: Machine learning Lecture 3

Overfitting

Overfitting in decision tree learning. As ID3 adds new nodes to grow the decision tree, the accuracy of the tree measured over the training examples increases monotonically. However, when measured over a set of test examples independent of the training examples, accuracy first increases, then decreases.

Page 58: Machine learning Lecture 3

Overfitting in Decision Tree

Overfitting in decision tree learning. As ID3 adds new nodes to grow the decision tree, the accuracy of the tree measured over the training examples increases monotonically. However, when measured over a set of test examples independent of the training examples, accuracy first increases, then decreases.

Page 59: Machine learning Lecture 3

Why Overfitting Happens in Decision Tree Learning?

• Presence of error in the training examples. (In general in machine learning)

• When only a small number of examples is associated with a leaf node.

Page 60: Machine learning Lecture 3

Presence of Error and Over-fitting

Page 61: Machine learning Lecture 3

Presence of Error and Over-fitting

Page 62: Machine learning Lecture 3

Presence of Error and Over-fitting

[Figure: the tree learned from training data that contains an error is more complex and deeper.]

Page 63: Machine learning Lecture 3

Presence of Error and Over-fitting

Page 64: Machine learning Lecture 3

How to avoid Overfitting…

• Stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data

• Allow the tree to overfit the data, and then post-prune the tree.

Page 65: Machine learning Lecture 3

How to avoid Overfitting…

• Post-pruning overfit trees has been found to be more successful in practice. This is due to the difficulty in the first approach of estimating precisely when to stop growing the tree.

Page 66: Machine learning Lecture 3

How to avoid Overfitting…

• Regardless of whether the correct tree size is found by stopping early or by post-pruning, a key question is what criterion is to be used to determine the correct final tree size.

Page 67: Machine learning Lecture 3

Determining correct final tree size

• Use a separate set of examples for training and testing. [Training and Validation] <for pruning method>

• Use all the available data for training, but apply a statistical test (e.g., a chi-square test) to estimate whether expanding (or pruning) a particular node is likely to produce an improvement beyond the training set. <for pruning method>

• Use an explicit measure of the complexity for encoding the training examples and the decision tree, halting growth of the tree when this encoding size is minimized. This approach is based on a heuristic called the Minimum Description Length (MDL) principle.

Page 68: Machine learning Lecture 3

Pruning Methods

• Reduced-error pruning (Quinlan 1987)

• Rule post-pruning (Quinlan 1993)

Page 69: Machine learning Lecture 3

Reduced Error Pruning

• Pruning a decision node consists of removing the subtree rooted at that node, making it a leaf node, and assigning it the most common classification of the training examples affiliated with that node.

• Nodes are removed only if the resulting pruned tree performs no worse than the original over the validation set.
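A minimal sketch of the idea, under several assumptions that are not from the slides: trees are nested dicts with keys "attr", "branches" and "majority" (the most common training classification at that node, recorded when the tree was grown), leaves are plain labels, and pruning is done in a single bottom-up pass instead of repeatedly choosing the best node to remove. A node is turned into a leaf when that is no worse on the validation examples that reach it.

```python
def classify(tree, example):
    """Follow branches until a leaf label is reached; unseen values fall back
    to the node's stored majority label."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(example[tree["attr"]], tree["majority"])
    return tree

def reduced_error_prune(tree, validation):
    """Return the pruned tree; `validation` is a list of (example, label) pairs."""
    if not isinstance(tree, dict):
        return tree  # already a leaf
    # Prune the children first, each on the validation examples routed to it.
    for value, subtree in list(tree["branches"].items()):
        subset = [(x, y) for x, y in validation if x[tree["attr"]] == value]
        tree["branches"][value] = reduced_error_prune(subtree, subset)
    # Compare the subtree against a single leaf carrying the majority label.
    errors_subtree = sum(classify(tree, x) != y for x, y in validation)
    errors_leaf = sum(tree["majority"] != y for _, y in validation)
    return tree["majority"] if errors_leaf <= errors_subtree else tree
```

Only the examples that reach a node are affected by pruning it, so comparing error counts on that subset is equivalent to comparing the whole tree before and after the change.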

Page 70: Machine learning Lecture 3

Reduced Error Pruning

Page 71: Machine learning Lecture 3

Reduced Error Pruning

Page 72: Machine learning Lecture 3

Drawback of Training and Validation Method

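Using a separate set of data to guide pruning is an effective approach provided a large amount of data is available. The major drawback of this approach is that when data is limited, withholding part of it for the validation set further reduces the number of examples available for training.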

Page 73: Machine learning Lecture 3

Rule Post-Pruning

In practice, rule post-pruning is one quite successful method for finding high-accuracy hypotheses when post-pruning decision trees.

Page 74: Machine learning Lecture 3

Rule Post-Pruning (Step 1)

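Grow the decision tree from the training set until it fits the training data as well as possible, allowing overfitting to occur.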

Page 75: Machine learning Lecture 3

Rule Post-Pruning (Step 2)

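Convert the learned tree into an equivalent set of rules, one rule for each path from the root node to a leaf node: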

1: IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No

2: IF (Outlook = sunny and Temperature = Cold) THEN PlayTennis = Yes

3: IF (Outlook = sunny and Temperature = Mild and Humidity=High) THEN PlayTennis = No

4: IF (Outlook = sunny and Temperature = Mild and Humidity=Normal) THEN PlayTennis = Yes

5: IF (Outlook = overcast) THEN PlayTennis = Yes

6: IF (Outlook = rain and Wind = Strong) THEN PlayTennis = No

7: IF (Outlook = rain and Wind = Weak) THEN PlayTennis = Yes

Page 76: Machine learning Lecture 3

Rule Post-Pruning (Step 3)

1: IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No

Candidate versions of the rule, each evaluated on the test dataset (validation examples):

IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No

IF (Temperature = Hot) THEN PlayTennis = No

IF (Outlook = sunny) THEN PlayTennis = No

Page 77: Machine learning Lecture 3

Rule Post-Pruning (Step 3)

Candidate versions of rule 1, each scored on the test dataset (validation examples), giving accuracies Acc1, Acc2 and Acc3:

IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No

IF (Temperature = Hot) THEN PlayTennis = No

IF (Outlook = sunny) THEN PlayTennis = No

If Acc3 > Acc2 and Acc3 > Acc1, then the rule

1: IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No

is replaced by

IF (Temperature = Hot) THEN PlayTennis = No

Page 78: Machine learning Lecture 3

Rule Post-Pruning (Step 4)

Pruned rules with their accuracies on the validation examples: R1: Acc1, R2: Acc2, R3: Acc3, R4: Acc4, ..., R11: Acc11, R12: Acc12, R13: Acc13, R14: Acc14

Sort the rules in descending order of their accuracy on the test dataset (validation examples), giving the sorted sequence S1, ..., S14:

S1: Acc1 >= S2: Acc2 >= S3: Acc3 >= S4: Acc4 >= … >= S11: Acc11 >= S12: Acc12 >= S13: Acc13 >= S14: Acc14

Page 79: Machine learning Lecture 3

Handling Continuous-Valued Attribute

Page 80: Machine learning Lecture 3

Handling Continuous-Valued Attribute

Page 81: Machine learning Lecture 3

Handling Continuous-Valued Attribute

Continuous-valued attributes are handled by dynamically defining new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals.
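One common way to choose the interval boundaries is to sort the examples by the continuous attribute and place candidate thresholds midway between adjacent values whose classifications differ. The sketch below shows that idea; the function name and the sample Temperature values (taken from the usual textbook illustration, not from these slides) are assumptions.

```python
def candidate_thresholds(values, labels):
    """Midpoints between adjacent sorted values where the class label changes."""
    pairs = sorted(zip(values, labels))
    return [
        (a + b) / 2
        for (a, label_a), (b, label_b) in zip(pairs, pairs[1:])
        if label_a != label_b and a != b
    ]

# Assumed sample data: Temperature readings with their PlayTennis labels.
temperatures = [40, 48, 60, 72, 80, 90]
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(candidate_thresholds(temperatures, play_tennis))  # [54.0, 85.0]
```

Each candidate threshold c defines a boolean attribute such as Temperature > c, and these new attributes can then be compared by information gain like any other attribute.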

Page 82: Machine learning Lecture 3

Alternative Measures for Selecting Attributes

There is a natural bias in the information gain measure that favors attributes with many values over those with few values.

Consider the attribute Date, which has a very large number of possible values (e.g., March 11, 2008).

If we were to add this as an attribute to the data, it would have the highest information gain of any of the attributes. This is because Date alone perfectly predicts the target attribute over the training data. Thus, it would be selected as the decision attribute for the root node of the tree and lead to a (quite broad) tree of depth one, which perfectly classifies the training data.

However, this decision tree would fare poorly on subsequent examples, because it is not a useful predictor despite the fact that it perfectly separates the training data.

Page 83: Machine learning Lecture 3

Alternative Measures for Selecting Attributes

What is wrong with the attribute Date? It has so many possible values that it is bound to separate the training examples into very small subsets. Because of this, it will have a very high information gain relative to the training examples, despite being a very poor predictor of the target function over unseen instances.

One way to avoid this difficulty is to select decision attributes based on some measure other than information gain. One alternative measure that has been used successfully is the gain ratio (Quinlan 1986). The gain ratio measure penalizes attributes such as Date by incorporating a term, called split information, that is sensitive to how broadly and uniformly the attribute splits the data.

Page 85: Machine learning Lecture 3

Alternative Measures for Selecting Attributes

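$$SplitInformation(S, A) \equiv -\sum_{i=1}^{c} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$$

where $S_1$ through $S_c$ are the c subsets of examples resulting from partitioning S by the c-valued attribute A. The gain ratio is then defined as

$$GainRatio(S, A) \equiv \frac{Gain(S, A)}{SplitInformation(S, A)}$$

SplitInformation is actually the entropy of S with respect to the values of attribute A. This is in contrast to our previous uses of entropy, in which we considered only the entropy of S with respect to the target attribute whose value is to be predicted by the learned tree.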

Page 86: Machine learning Lecture 3

Alternative Measures for Selecting Attributes

The SplitInformation term discourages the selection of attributes with many uniformly distributed values.

For example, consider a collection of n examples that are completely separated by attribute A (e.g., Date). In this case, the SplitInformation value will be $\log_2 n$. In contrast, a boolean attribute B that splits the same n examples exactly in half will have SplitInformation of 1. If attributes A and B produce the same information gain, then clearly B will score higher according to the Gain Ratio measure.
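The comparison between A and B can be checked numerically. The sketch assumes the usual definitions, SplitInformation as the entropy of the subset sizes and GainRatio as Gain divided by SplitInformation; the function names and the choice to return 0.0 when SplitInformation is zero are illustrative assumptions.

```python
import math

def split_information(sizes):
    """Entropy (in bits) of S with respect to the subset sizes |S_1|, ..., |S_c|."""
    total = sum(sizes)
    return -sum(s / total * math.log2(s / total) for s in sizes if s > 0)

def gain_ratio(gain, sizes):
    """Gain divided by SplitInformation.

    Assumption: an attribute with a single value (SplitInformation of 0)
    gets a ratio of 0.0 instead of raising a division error."""
    info = split_information(sizes)
    return gain / info if info > 0 else 0.0

n = 14
date_like = [1] * n             # attribute A: every example in its own subset
boolean_like = [n // 2, n // 2]  # attribute B: an exact half/half split
```

Here `split_information(date_like)` equals log2(14), about 3.81, while `split_information(boolean_like)` is exactly 1, so for equal gain the boolean attribute receives the higher gain ratio.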

Page 87: Machine learning Lecture 3

Handling Missing Attributes

In certain cases, the available data may be missing values for some attributes. For example, in a medical domain in which we wish to predict patient outcome based on various laboratory tests, it may be that the lab test Blood-Test-Result is available only for a subset of the patients. In such cases, it is common to estimate the missing attribute value based on other examples for which this attribute has a known value.

Page 88: Machine learning Lecture 3

Handling Missing Attributes

• One strategy for dealing with the missing attribute value is to assign it the value that is most common among training examples at node n.

• Alternatively, we might assign it the most common value among examples at node n that have the classification c(x)

A more complex procedure is to assign a probability to each of the possible values of A rather than simply assigning the most common value to A(x). These probabilities can be estimated again based on the observed frequencies of the various values for A among the examples at node n. This method for handling missing attribute values is used in C4.5 (Quinlan 1993).
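The three strategies can be sketched as follows. The data layout (a list of `(attributes, classification)` pairs with `None` marking a missing value) and the function names are assumptions made for illustration.

```python
from collections import Counter

def most_common_value(examples, attr, label=None):
    """Most common known value of `attr` among the examples at a node.

    If `label` is given, only examples with that classification c(x) are
    counted, which corresponds to the second strategy."""
    values = [
        x[attr] for x, y in examples
        if x.get(attr) is not None and (label is None or y == label)
    ]
    return Counter(values).most_common(1)[0][0]

def value_probabilities(examples, attr):
    """Observed frequency of each known value of `attr` at the node, usable
    as the probabilities assigned to a missing value."""
    counts = Counter(x[attr] for x, _ in examples if x.get(attr) is not None)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}
```

With six known examples having A = 1 and four having A = 0, `value_probabilities` returns 0.6 and 0.4, so an example with A missing would be treated as 0.6 of an example on the A = 1 branch and 0.4 on the A = 0 branch.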

Page 89: Machine learning Lecture 3

Handling Attributes with Different Cost

In some learning tasks the instance attributes may have associated costs. For example, in learning to classify medical diseases we might describe patients in terms of attributes such as Temperature, BiopsyResult, Pulse, BloodTestResults, etc.

These attributes vary significantly in their costs, both in terms of monetary cost and cost to patient comfort.

In such tasks, we would prefer decision trees that use low-cost attributes where possible, relying on high-cost attributes only when needed to produce reliable classifications.

Page 90: Machine learning Lecture 3

Handling Attributes with Different Cost

ID3 can be modified to take attribute costs into account by introducing a cost term into the attribute selection measure. For example, we might divide the Gain by the cost of the attribute, so that lower-cost attributes would be preferred:

$$\frac{Gain(S, A)}{Cost(A)}$$

Although such cost-sensitive measures do not guarantee finding an optimal cost-sensitive decision tree, they do bias the search in favor of low-cost attributes.

Page 91: Machine learning Lecture 3

Handling Attributes with Different Cost

Tan and Schlimmer (1990) and Tan (1993) describe one such approach and apply it to a robot perception task in which the robot must learn to classify different objects according to how they can be grasped by the robot's manipulator. In this case the attributes correspond to different sensor readings obtained by a movable sonar on the robot.

Attribute cost is measured by the number of seconds required to obtain the attribute value by positioning and operating the sonar. They demonstrate that more efficient recognition strategies are learned, without sacrificing classification accuracy, by replacing the information gain attribute selection measure by the following measure:

$$\frac{Gain^2(S, A)}{Cost(A)}$$

Page 92: Machine learning Lecture 3

Handling Attributes with Different Cost

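Nunez (1988) describes a related approach and its application to learning medical diagnosis rules. Here the attributes are different symptoms and laboratory tests with differing costs. His system uses a somewhat different attribute selection measure:

$$\frac{2^{Gain(S, A)} - 1}{(Cost(A) + 1)^w}$$

where $w \in [0, 1]$ is a constant that determines the relative importance of cost versus information gain.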
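The cost-sensitive measures can be compared side by side. The formulas below are the forms given in Mitchell's textbook for these approaches (Gain/Cost, Gain squared over Cost, and (2^Gain - 1)/(Cost + 1)^w); the function names and the sample numbers are assumptions for illustration.

```python
def gain_per_cost(gain, cost):
    """Simple cost-sensitive measure: Gain(S, A) / Cost(A)."""
    return gain / cost

def tan_schlimmer(gain, cost):
    """Measure attributed to Tan and Schlimmer: Gain(S, A)^2 / Cost(A)."""
    return gain ** 2 / cost

def nunez(gain, cost, w=0.5):
    """Measure attributed to Nunez: (2^Gain(S, A) - 1) / (Cost(A) + 1)^w.

    w in [0, 1] sets how much cost matters; w = 0 ignores cost entirely."""
    return (2 ** gain - 1) / (cost + 1) ** w

# Hypothetical attributes: a cheap, weakly informative test versus an
# expensive, highly informative one.
cheap = {"gain": 0.2, "cost": 1.0}
expensive = {"gain": 0.8, "cost": 20.0}
```

With these made-up numbers, `tan_schlimmer` scores the cheap attribute higher (0.04 against 0.032), while `nunez` with `w = 0` ranks purely by gain and prefers the expensive one.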