Data Mining, Chapter 6
Implementations: Real Machine Learning Schemes
Kirk Scott

Page 1: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

1

Data Mining, Chapter 6

Implementations: Real Machine Learning Schemes

Kirk Scott

Page 2: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

2

The Little Brown Bat

Page 3: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

3

A Zombie Fly laying eggs inside a Honey Bee

Page 4: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

4

Argyria

Page 5: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

5

Page 6: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

6

Methemoglobinemia (from Wikipedia)

• Methemoglobinemia (or methaemoglobinaemia) is a disorder characterized by the presence of a higher than normal level of methemoglobin (metHb, i.e., ferric[Fe3+] rather than ferrous [Fe2+] haemoglobin) in the blood. Methemoglobin is an oxidized form of hemoglobin that has a decreased affinity for oxygen, resulting in an increased affinity of oxygen to other heme sites within the same red blood cell.

Page 7: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

7

• This leads to an overall reduced ability of the red blood cell to release oxygen to tissues, with the associated oxygen–hemoglobin dissociation curve therefore shifted to the left. When methemoglobin concentration is elevated in red blood cells, tissue hypoxia can occur.

• …

Page 8: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

8

• Carriers
• The Fugates, a family that lived in the hills of Kentucky, are the most famous example of this hereditary genetic condition. They are known as the "Blue Fugates." Martin Fugate settled near Hazard, Kentucky, circa 1800. His wife was a carrier of the recessive methemoglobinemia (met-H) gene, as was a nearby clan with whom the Fugates intermarried. As a result, many descendants of the Fugates were born with met-H.[7][8][9]

Page 9: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

9

• The "blue men of Lurgan" were a pair of Lurgan men suffering from what was described as "familial idiopathic methaemoglobinaemia" who were treated by Dr. James Deeny in 1942. Deeny, who would later become the Chief Medical Officer of the Republic of Ireland, prescribed a course of ascorbic acid and sodium bicarbonate. In case one, by the eighth day of treatment there was a marked change in appearance and by the twelfth day of treatment the patient's complexion was normal. In case two, the patient's complexion reached normality over a month-long duration of treatment.[10]

Page 10: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

10

Back to the Topic at Hand

• Chapter 4 provided an introduction to data mining algorithms and the motivations underlying them

• Chapter 5 provided a relatively in-depth treatment of how results are evaluated

Page 11: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

11

• Some evaluation features that exist in Weka were brought out, like lift charts

• Doubtless, other features have also been implemented in Weka

• When you are doing your project, you will have to look more closely into the evaluation tools in Weka

Page 12: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

12

• With the foregoing background, chapter 6 covers issues surrounding various data mining algorithms in some detail

• My goal is to present this information at a level where you would be an informed user of Weka

Page 13: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

13

• When doing your project, you will be using Weka

• If issues come up, you will be informed enough to recognize them, and you will be able to search around in Weka for how to make a decision about them

• You will have to become sort of an expert on the data mining algorithms you choose to use

Page 14: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

14

• At the end of the chapter, in section 6.11, the book lists all of the implementations in Weka

• I think it will be useful to list all of the implementations up front

• This provides a preview of what you’ll find in Weka

• It also provides context for the discussion of the issues in sections 6.1-6.10

Page 15: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

15

Subsections of the Chapter with Implementations in Weka

Page 16: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

16

6.1 Decision Trees

• J48 (implementation of C4.5)
• SimpleCart (minimum cost-complexity pruning a la CART)
• REPTree (reduced-error pruning)

Page 17: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

17

6.2 Classification Rules

• (For classifiers, see Section 11.4 and Table 11.5.)

• JRip (RIPPER rule learner)
• Part (rules from partial decision trees)
• Ridor (ripple-down rule learner)

Page 18: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

18

6.3 Association Rules

• (see Section 11.7 and Table 11.8)
• FPGrowth (frequent-pattern trees)
• GeneralizedSequentialPatterns (find large item trees in sequential data)

Page 19: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

19

6.4 Linear Models and Extensions

• SMO and variants for learning support vector machines

• LibSVM (uses third-party libsvm library)
• MultilayerPerceptron
• RBFNetwork (radial-basis function network)
• Spegasos (SVM using stochastic gradient descent)

Page 20: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

20

6.5 Instance-Based Learning

• IBk (k-nearest neighbor classifier)
• KStar (generalized distance functions)
• NNge (rectangular generalizations)

Page 21: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

21

6.6 Numeric Prediction

• M5P (model trees)
• M5Rules (rules from model trees)
• LWL (locally weighted learning)

Page 22: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

22

6.7 Bayesian Networks

• BayesNet
• AODE, WAODE (averaged one-dependence estimators)

Page 23: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

23

6.8 Clustering

• (For clustering methods, see Section 11.6 and Table 11.7.)

• Xmeans
• Cobweb (includes Classit)
• EM

Page 24: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

24

6.9 Semisupervised Learning

• No separate data mining implementations are listed for this section

Page 25: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

25

6.10 Multi-Instance Learning

• MISVM (iterative method for learning SVM by relabeling instances)

• MISMO (SVM with multi-instance kernel)
• CitationKNN (nearest-neighbor method with Hausdorff distance)

Page 26: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

26

• MILR (logistic regression for multi-instance data)

• MIOptimalBall (learning balls for multi-instance classification)

• MIDD (the diverse-density method using the noisy-OR function)

Page 27: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

27

6.1 Decision Trees

Page 28: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

28

Algorithm C4.5

• This was the algorithm introduced in chapter 4

• It is divide and conquer
• Splitting decisions are greedy, based on the purity/information function value of the results

Page 29: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

29

A Review of the Algorithm for Nominal Attributes

• 1. The fundamental question at each level of the tree is always which attribute to split on

• In other words, given attributes x1, x2, x3…, do you branch first on x1 or x2 or x3…?

• Having chosen the first to branch on, which of the remaining ones do you branch on next, and so on?

Page 30: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

30

• 2. Suppose you can come up with a function, the information (info) function

• This function is a measure of how much information is needed in order to make a decision at each node in a tree

• 3. You split on the attribute that gives the greatest information gain from level to level

Page 31: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

31

• 4. A split is good if it means that little information will be needed at the next level down

• You measure the gain by subtracting the amount of information needed at the next level down from the amount needed at the current level
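To make steps 2 through 4 concrete, here is a minimal Python sketch of an entropy-based information function and the resulting gain for one candidate nominal split; the function names and the tiny data set are hypothetical, chosen only to illustrate the computation.

from math import log2
from collections import Counter

def info(class_labels):
    # Information (entropy), in bits, needed to classify an instance
    # drawn from this collection of class labels.
    n = len(class_labels)
    return -sum((c / n) * log2(c / n) for c in Counter(class_labels).values())

def info_gain(class_labels, attribute_values):
    # Gain = info at the current node minus the weighted average info
    # of the nodes one level down after splitting on the attribute.
    n = len(class_labels)
    after = 0.0
    for v in set(attribute_values):
        subset = [c for c, a in zip(class_labels, attribute_values) if a == v]
        after += (len(subset) / n) * info(subset)
    return info(class_labels) - after

# Hypothetical toy data: class labels and one nominal attribute.
labels = ["yes", "yes", "no", "no", "yes"]
outlook = ["sunny", "rainy", "sunny", "sunny", "rainy"]
print(info_gain(labels, outlook))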

Page 32: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

32

Numeric Attributes

• Because most data sets include numeric attributes, the algorithm needs to be extended

• Obviously, numeric attributes fall into a range; they don’t fall into predefined categories

• That means you need to decide where to split (branch) on them

Page 33: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

33

• In general you handle numeric attributes by ordering the instances by value and splitting <, > at a single value

• The information function was used to determine which attribute to split on for the nominal case

• The information function can also be used to choose the best split point for a numeric attribute
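A minimal sketch of that idea, with hypothetical names: sort the instances on the numeric attribute, evaluate a candidate split point midway between each pair of adjacent distinct values, and keep the point with the largest information gain.

from math import log2
from collections import Counter

def info(class_labels):
    # Entropy, in bits, of a collection of class labels.
    n = len(class_labels)
    return -sum((c / n) * log2(c / n) for c in Counter(class_labels).values())

def best_numeric_split(values, class_labels):
    pairs = sorted(zip(values, class_labels))
    n = len(pairs)
    total = info([c for _, c in pairs])
    best_gain, best_point = -1.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal attribute values
        point = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [c for v, c in pairs if v <= point]
        right = [c for v, c in pairs if v > point]
        gain = total - (len(left) / n) * info(left) - (len(right) / n) * info(right)
        if gain > best_gain:
            best_gain, best_point = gain, point
    return best_point, best_gain

# Hypothetical toy data: a numeric attribute and a yes/no class.
print(best_numeric_split([64, 65, 68, 69, 70], ["yes", "no", "yes", "yes", "no"]))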

Page 34: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

34

Nominal vs. Numeric Splitting

• Differences between splitting on nominal and numeric:

• Nominal: split once on that attribute
• Numeric: may be split again at every succeeding level
• Or, may do a multi-way split on a numeric at a given level

Page 35: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

35

Numeric Cost Implications

• Computational cost/implementation question for numerics:

• For a numeric over a range there is a potentially infinite number of possible split points

• Whether you split into multiple branches at one level or split multiple times on the same attribute at different levels, the cost of deciding can be high

Page 36: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

36

• Also, if you split at different levels, you have a practical consideration:

• Do the instances have to be re-sorted on the attribute at every level?

• A suitable implementation can preserve the initial sorting so it’s available at all lower levels

Page 37: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

37

Missing Values in Trees

• As mentioned in chapter 4, you can handle missing values as a separate branch

• Logically, this makes sense if the absence of a value means something

• If the absence doesn’t mean anything, it makes sense to assign instances to the branches proportionally
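A minimal sketch of the proportional option, with hypothetical names: an instance whose value is missing goes down every branch, carrying a fractional weight equal to the proportion of known-value instances that went that way.

from collections import Counter

def split_with_missing(instances, attribute):
    # instances: list of dicts; a value of None stands for "missing".
    known = [x for x in instances if x[attribute] is not None]
    counts = Counter(x[attribute] for x in known)
    total = sum(counts.values())
    branches = {v: [] for v in counts}
    for x in instances:
        weight = x.get("weight", 1.0)
        if x[attribute] is not None:
            branches[x[attribute]].append((x, weight))
        else:
            # Missing value: send the instance down every branch proportionally.
            for v, c in counts.items():
                branches[v].append((x, weight * c / total))
    return branches

# Hypothetical toy data: three known values and one missing value.
data = [{"outlook": "sunny"}, {"outlook": "sunny"}, {"outlook": "rainy"}, {"outlook": None}]
for value, members in split_with_missing(data, "outlook").items():
    print(value, [(m["outlook"], round(w, 2)) for m, w in members])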

Page 38: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

38

• Recall that, in simple terms, the best information gain outcome is a split whose leaves are pure

• Practically speaking, this observation about missing values is worth noting:

• The information function and gain computations can be applied in situations where some of the attribute values are missing

Page 39: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

39

Pruning

• This wasn't discussed in detail in chapter 4
• It turns out to be a big deal
• There are two kinds of pruning: pre-pruning and post-pruning

Page 40: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

40

• Pre-pruning is a bit of a misnomer
• It means that the tree building algorithm includes heuristics that decide not to expand down a given branch

• This is the less common approach

Page 41: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

41

• Post-pruning means:
• Create a complete tree following the rules
• After the tree is finished, evaluate it, potentially removing nodes and branches
• This is more common

Page 42: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

42

• With post-pruning, on the one hand, you’ve wasted work in developing branches that are pruned

• On the other hand, you don’t throw anything out without having fully developed and evaluated it

• A pre-pruning algorithm will use fewer computational resources, but it may throw out something useful

Page 43: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

43

Why Pruning, and How?

• Pruning is important because it goes back to the concepts of training, overfitting, using a test set, and algorithm/result evaluation

• You develop a tree with a training set
• You potentially prune it with a test set
• The end result, obviously, is a smaller tree
• Hopefully it's also a better tree

Page 44: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

44

• By definition, a completed tree will be fitted by the algorithm as closely as possible to the training set

• Pruning involves creating pruned versions of the tree and applying them to the test set

• The error rates for the different pruned versions of the tree are checked against each other and against the original

Page 45: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

45

• It turns out that the error rate on the test set may be less if branches in the overfitted tree are merged or removed

• We don't know exactly how to prune yet
• But notice that this is entirely pragmatic and there is a logic to it

Page 46: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

46

• Up until now, the concept of overfitting has simply been asserted

• It may have seemed illogical that a “less well fitted” tree might be better

• But who knows—chopping bits out of the tree and trying it on the test set might give better results

• Why not try and see?

Page 47: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

47

The Two Kinds of Post-Pruning

• There are two kinds of post-pruning:
• Subtree replacement
• Subtree raising

Page 48: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

48

Subtree Replacement

• Subtree replacement refers to collapsing a subtree (branch) into a single leaf node

• The subtree replacement algorithm is bottom up

• Work from the leaves up, looking for branches where performance on the test set is better if the branch is collapsed

• See Figure 1.3 on the following overhead for illustration

Page 49: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

49

Subtree replacement, from (b) to (a), the whole left branch is replaced

Page 50: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

50

• Subtree replacement is not too computationally costly

• All of the instances from the collapsed branch go into the leaf that replaces it

• No additional computation is needed for this step
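A minimal sketch of bottom-up subtree replacement by reduced-error pruning, using a hypothetical tree representation: prune each child first, then collapse the node into a majority-class leaf whenever that does not increase the error on a held-out pruning set.

from collections import Counter

class Node:
    # Hypothetical representation: a leaf carries a label; an internal node
    # tests one nominal attribute and has one child per attribute value.
    def __init__(self, attribute=None, children=None, label=None):
        self.attribute, self.children, self.label = attribute, children or {}, label

def classify(node, x):
    while node.label is None:
        node = node.children.get(x.get(node.attribute)) or next(iter(node.children.values()))
    return node.label

def errors(node, data):
    return sum(1 for x, y in data if classify(node, x) != y)

def prune(node, prune_set):
    if node.label is not None or not prune_set:
        return node
    for value in list(node.children):
        subset = [(x, y) for x, y in prune_set if x.get(node.attribute) == value]
        node.children[value] = prune(node.children[value], subset)
    majority = Counter(y for _, y in prune_set).most_common(1)[0][0]
    leaf = Node(label=majority)
    # Collapse the subtree if a single leaf does at least as well on the pruning set.
    return leaf if errors(leaf, prune_set) <= errors(node, prune_set) else node

# Hypothetical tiny tree: split on "windy", each branch a leaf.
tree = Node(attribute="windy", children={"true": Node(label="no"), "false": Node(label="yes")})
pruning_data = [({"windy": "true"}, "yes"), ({"windy": "false"}, "yes"), ({"windy": "true"}, "yes")]
print(prune(tree, pruning_data).label)  # collapses to the single leaf "yes"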

Page 51: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

51

Subtree Raising

• Subtree raising refers to collapsing an internal node and raising one of its children to replace it

• The other children of the original have to be reapportioned into the branches of the replacement

• Typically only the child with the most descendants is a candidate for raising

• See Figure 6.1 on the following overhead for illustration

Page 52: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

52

Subtree raising, from (a) to (b): B is replaced by C, and the instances in 4 and 5 have to be reapportioned into 1, 2, and 3

Page 53: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

53

• The subtree raising algorithm is more computationally intensive than subtree replacement

• The expense comes from reapportioning the instances into the new branches/leaves and recalculating the purity/error rate

Page 54: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

54

Estimating Error Rates

• As explained earlier, a pruning algorithm can be based on error rates on test sets

• Generally, the test set is smaller than the training set

• It may not be representative of the overall population

• It may undo the overfitting from the training set, while not being perfect itself

Page 55: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

55

• It turns out that the C4.5 algorithm doesn’t actually use a test set

• Using certain statistical assumptions, it bases error estimates on the training set

• All we need to understand is that C4.5 does include pruning

• The statistics are explained in a box and, as usual, there’s no need to know the details
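For the curious, the estimate the book describes takes the upper confidence limit of the binomial distribution on the training data itself. A minimal sketch follows; f is the observed error rate at a node, N the number of training instances reaching it, and z the standard-normal deviate for the chosen confidence (roughly 0.69 for the default 25%).

from math import sqrt

def pessimistic_error(f, N, z=0.69):
    # Upper confidence limit on the true error rate, given an observed
    # error rate f over N training instances (z of about 0.69 corresponds
    # to the default 25% confidence value).
    return (f + z * z / (2 * N)
            + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))) / (1 + z * z / N)

# Example: 2 errors out of 6 instances observed at a node.
print(round(pessimistic_error(2 / 6, 6), 2))  # noticeably larger than 0.33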

Page 56: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

56

Complexity of Decision Tree Induction

• The deep details of the derivation of the computational complexity of the algorithm are not important

• However, it’s worth noting that the complexity is tractable

Page 57: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

57

• The book gives this as the overall figure for tree induction (creation) with (followed by post-) pruning:

• O(mn log n) + O(n (log n)²)
• m = attributes
• n = instances

Page 58: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

58

The Cost of Building the Tree

• O(mn log n)
• Informally:
• log n is the number of levels of the tree for some log base = average degree of branching

• At every level, in the worst case, you have to consider all n instances

• You do this for all of the m attributes

Page 59: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

59

The Cost of Pruning the Tree with Subtree Replacement

• O(n)
• This is smaller than the order for subtree raising, so it is not included separately in the formula

Page 60: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

60

The Cost of Pruning the Tree with Subtree Raising

• O(n (log n)²)
• n instances potentially reclassified at every level of the tree gives O(n log n)
• Reclassification itself is O(log n)
• Therefore, the total order of complexity is that shown above

Page 61: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

61

• All we really need to know is that decision tree induction can be implemented in a way that runs in log/polynomial time

• It is a computationally practical algorithm

Page 62: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

62

From Trees to Rules

• As noted in chapter 4, following every branch of a tree gives a complete set of rules for it

• Rule sets can be pruned just like a tree can

• A specific approach of making rules from trees will come up in the next numbered section

Page 63: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

63

C4.5: Choices and Options

• C4.5 has some tunable parameters
• They apply to both nominal and numeric attributes
• The parameters are:
• Confidence value
• Minimum outcomes and minimum instances
• To prune or not to prune

Page 64: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

64

Confidence Value

• As noted already, the C4.5 algorithm in Weka uses statistical tools with the training set instead of the test set to calculate error rates

• And the details are beyond the scope of this set of overheads

• The default confidence value in Weka is 25%, and for lack of a better understanding of what it means, we'll just accept it

Page 65: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

65

Minimum Outcomes, Minimum Instances

• The minimum outcomes and instances are easier to understand

• What good is a splitting condition on an attribute that doesn’t have at least two outcomes?

• And what good is a splitting condition that doesn’t have at least two instances per branch?

• The default values for these parameters in Weka are 2 and 2

Page 66: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

66

• It is apparent that these defaults are rock bottom values

• It is of some interest what effect changing them would have

• For lack of a better understanding, you might accept these defaults

• Or you might experiment with other values and see what effect that has on the results
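One hedged way to do that experimentation outside the Explorer GUI is sketched below, assuming the third-party python-weka-wrapper3 package and a local ARFF file; -C sets the confidence value and -M the minimum number of instances per leaf for J48.

# Assumes python-weka-wrapper3 is installed and a Java runtime is available.
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier

jvm.start()
data = Loader(classname="weka.core.converters.ArffLoader").load_file("weather.nominal.arff")
data.class_is_last()

# Try a non-default confidence value and minimum instances per leaf.
j48 = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.1", "-M", "5"])
j48.build_classifier(data)
print(j48)  # prints the resulting (pruned) tree

jvm.stop()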

Page 67: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

67

To Prune or Not to Prune

• Pruning can be turned off in C4.5 in Weka in order to obtain a more or less complete tree

• However, due to some parts of the algorithm as implemented, even with explicit post-pruning turned off, the output may have been pruned in some way

Page 68: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

68

Cost-Complexity Pruning

• The pruning algorithm in C4.5 is fast
• However, it doesn't always prune enough
• CART = Classification and Regression Trees
• This scheme has a more advanced, stringent, and costly approach to pruning
• It might profitably be applied to a C4.5-derived tree, giving a smaller, better result

Page 69: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

69

Discussion

• Tree induction is presented first in this chapter because it’s probably the most studied of the data mining schemes

• As presented up to this point, decision nodes have been on one attribute

• CART supports decision nodes on >1 (nominal) attributes at a time

Page 70: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

70

• For numeric attributes, a decision can be made based on a function of >1 attribute at a node

• Multivariate numeric test conditions are hyperplanes, not parallel to an axis like a single attribute compared to a constant

• Fancier schemes will take longer to run
• The results may be more compact, but also harder for humans to understand

Page 71: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

71

What’s in Weka?

• Implementation of C4.5: J48 in Weka
• Reduced-error pruning: REPTree in Weka
• Minimum cost-complexity pruning a la CART (classification and regression trees): SimpleCart in Weka

Page 72: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

72

6.2 Classification Rules

• Recall the basic idea presented in chapter 4:

• You can make rules by trying to “cover” classifications in the data

• Recall, also, that the question of whether to accept “imperfect” rules or only “perfect” rules came up

• It is one of the aspects that will be discussed more here

Page 73: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

73

• The basic question with rules is the same as with trees:

• A rule-producing algorithm will tend to overfit the training data

• That will mean that it is not such a good predictor

• How do you evaluate the error rate of a rule on a test set and decide whether it is good enough to keep?

Page 74: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

74

Criteria for Choosing Tests

• Recall that in chapter 4, tests, namely conditions, are added to a rule with AND under this criterion:

• p stands for the number of correct classifications (p = positive)

• t stands for the total number covered (t = total)

• You wanted to maximize p/t

Page 75: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

75

• Maximizing p/t is what led to "perfection"
• If there was a condition that gave p/t = 1, it would be chosen
• This is not necessarily ideal
• Which rule is better:
• A rule that covers one instance with p/t = 1
• Or a rule that covers 1,000 cases with p/t = 999/1000?

Page 76: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

76

An Alternative Rule Evaluation Criterion

• Let P and T be the values for a rule before a new condition is added

• Let p and t be the values after a condition is added

• You could compare different conditions by finding this product based on information gain:

• p * (log p/t – log P/T)
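A minimal sketch, with hypothetical numbers, comparing the two criteria on the earlier example: a condition yielding 1 correct out of 1 covered versus one yielding 999 correct out of 1,000, starting from an assumed previous rule with P = 1,200 correct out of T = 2,000 covered.

from math import log2

def success_fraction(p, t):
    # Chapter 4 criterion: fraction of covered instances classified correctly.
    return p / t

def information_gain_criterion(p, t, P, T):
    # Weights the improvement in (log) accuracy by the coverage p of the new rule.
    return p * (log2(p / t) - log2(P / T))

P, T = 1200, 2000  # hypothetical state before adding the new condition
for p, t in [(1, 1), (999, 1000)]:
    print((p, t), round(success_fraction(p, t), 3), round(information_gain_criterion(p, t, P, T), 1))
# The p/t criterion prefers the tiny "perfect" rule; the gain criterion
# strongly prefers the broad-coverage rule.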

Page 77: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

77

• This new criterion skews the judgment from perfection to the number of cases covered

• If you run an algorithm that quits only when ultimate perfection is achieved, you’ll get there eventually by selecting rules based on either criterion

Page 78: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

78

• There is no absolute best criterion for selecting rules

• The real problem is still trimming the rule set back until it is useful for prediction

Page 79: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

79

Missing Values, Numeric Attributes

• Covering algorithms tend to handle missing values pretty well (assuming that the majority of the values aren’t missing…)

• Informally, you could say the algorithm builds rules for positive hits

• It effectively ignores missing values
• Separate and conquer means that you slowly narrow down to a remainder of instances

Page 80: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

80

• Either instances with missing values are handled earlier in the process based on attributes with values

• Or at the end, the few remaining instances will be handled as special cases

• Conditions will be added to rules so that exceptional instances are classified—on attributes with values

• Handling these exceptional cases may actually constitute overfitting…

Page 81: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

81

• Numeric valued attributes can be handled for rules just like for trees

• Instances can be ordered on the attribute value and all candidate rules based on a <, > comparison or split can be evaluated

Page 82: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

82

Generating Good Rules

• This goes back to the idea that an imperfect rule is not overfitted and might make a good predictor

• The approach is similar to the approach with trees

• Divide the data 2/3, 1/3 into a growing set and a pruning set, for example

Page 83: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

83

• The growing set is used to make rules by adding conditions

• The pruning set is used to simplify rules by removing conditions

• The criterion for removing rules is a reduction in the error rate on the pruning set

• This is called reduced-error pruning

Page 84: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

84

• One algorithm for doing this is incremental reduced-error pruning

• The algorithm goes like this:
• For a given classification, grow a complete covering rule for it
• Now test the error rate of the rule on the test set and compare this with the error rate for all "sub-rules" with conditions removed

Page 85: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

85

• For that class, hold onto whichever rule has the lowest error rate

• Do this for all classes
• Compare the error rates for the rules for each class
• Keep the one rule with the lowest error rate
• Remove the covered instances
• Repeat (a sketch of this loop appears below)
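Here is the promised sketch of that loop, with hypothetical helper names; grow_rule and prune_rule stand in for the growing-set and pruning-set steps described above, and error_rate for the evaluation on held-out data.

def covers(rule, x):
    # A rule is a (class, conditions) pair; conditions maps attributes to required values.
    _, conditions = rule
    return all(x.get(a) == v for a, v in conditions.items())

def incremental_reduced_error_pruning(train, classes, grow_rule, prune_rule, error_rate):
    # grow_rule(data, cls): build a full covering rule for class cls (hypothetical helper)
    # prune_rule(rule, data): drop conditions while that lowers pruning-set error
    # error_rate(rule, data): error of the rule on the given data
    rules, remaining = [], list(train)
    while remaining:
        candidates = [prune_rule(grow_rule(remaining, c), remaining) for c in classes]
        best = min(candidates, key=lambda r: error_rate(r, remaining))
        covered = [(x, y) for x, y in remaining if covers(best, x)]
        if not covered:
            break  # nothing left that any acceptable rule covers
        rules.append(best)
        remaining = [(x, y) for x, y in remaining if not covers(best, x)]
    return rules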

Page 86: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

86

• Why do you have to repeat?
• Because there is a difference between the training set and the test set, and you are developing rules that may be imperfect for the training set

• You remove the instances covered by the accepted rule from the training set and then go again

• That’s what they mean by incremental

Page 87: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

87

• A non-incremental version of the algorithm would build a complete rule set first and then prune it

• This is more time-consuming

Page 88: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

88

Evaluating Error Rate to Select Rules

• The book suggests various alternatives to comparing on the basis of p/t or expressions including log p/t – log P/T

• They all have the same problem:
• We've decided to accept imperfect rules, not just perfect ones
• There is no perfect balancing point between percent covered correctly and total number covered correctly

Page 89: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

89

Algorithm Performance/Refinements

• Incremental reduced-error pruning produces good rule sets quickly

• It can be speeded up by simply picking a rule for each class in order of size, from smallest to largest

• It will also run more quickly with a suitable stopping condition

Page 90: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

90

• For example, once a rule of sufficiently low accuracy is produced, stop searching for more refinements

• Unfortunately, such a stopping condition may cause better solutions to be overlooked

• A better stopping condition may be based on the MDL principle

Page 91: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

91

Using Global Optimization

• Everything we've talked about is heuristic
• We can't claim we're finding optimal trees or rule sets
• After finishing an algorithm, further heuristics may be applied, which may lead to better (but not actually optimal) solutions

• In this context the idea is to run the incremental algorithm; then try to improve, taking all of the derived rules into account

Page 92: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

92

What is RIPPER?

• RIPPER stands for repeated incremental pruning to produce error reduction

• This is the name of a rule generation scheme

• In short, it has most of the bells and whistles noted above built in, in order to improve the rule sets generated

Page 93: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

93

Obtaining Rules from Partial Decision Trees

• The book asserts that rule building schemes tend to prune too much

• Tree building schemes tend to err in the opposite direction, pruning too little

• A balancing approach is to use trees to develop rule sets

Page 94: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

94

• In gross form, you could build a complete tree

• Then pick the best rule by tracing all the branches

• Then remove the covered instances and repeat

• However, building a complete tree each time is wasteful and unnecessary

Page 95: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

95

• The alternative is to build a partial decision tree

• In brief, what you do is a form of pre-pruning during tree creation

• You use metrics that tell you there’s no need to explore certain branches further

• The decision about which branches merit further expansion is based on their entropy (information function) values

Page 96: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

96

• Once the tree development is complete, you pick the best rule among those branches that can be traced to a leaf

• Then you throw out the partial tree and repeat the process with the instances that weren’t covered by the rule

Page 97: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

97

• This technique is simpler than other schemes in this sense:

• It doesn’t have a global optimization stage at the end

• It can give rule sets that match the performance of schemes that do require global optimization

Page 98: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

98

Rules with Exceptions

• Rules with exceptions get a more complete and friendlier presentation here than in chapter 4

• They are not a logical abomination
• They have a logic of their own

Page 99: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

99

• Imagine starting with a default case—namely the majority classification

• All instances which don’t fall into this classification are exceptions

• Among the exceptions, let the majority classification be the new default

• Then those instances which don’t fall into this classification are exceptions
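A minimal sketch of that default-plus-exceptions structure; the representation and the weather-style example are hypothetical, meant only to show how a majority default and nested exceptions compose.

class DefaultRule:
    # Predict a default class unless some exception rule fires; each
    # exception is a (condition, DefaultRule) pair, so exceptions nest.
    def __init__(self, default, exceptions=None):
        self.default = default
        self.exceptions = exceptions or []

    def classify(self, x):
        for condition, subrule in self.exceptions:
            if condition(x):
                return subrule.classify(x)
        return self.default

# Hypothetical example: default to the majority class "no",
# except that sunny and hot days are "yes" unless humidity is high.
rules = DefaultRule("no", [
    (lambda x: x["outlook"] == "sunny" and x["temp"] == "hot",
     DefaultRule("yes", [(lambda x: x["humidity"] == "high", DefaultRule("no"))])),
])
print(rules.classify({"outlook": "sunny", "temp": "hot", "humidity": "normal"}))  # yes
print(rules.classify({"outlook": "rainy", "temp": "mild", "humidity": "high"}))   # no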

Page 100: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

100

• With other techniques it might be the nth iteration (node split, condition added) before you ultimately nail something

• Here, you’re always thinking of the majority case first

• Within one or two levels at the top you have a good picture of the situation overall

• You go down for finer levels of detail

Page 101: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

101

• Effectively, this is an alternative way of building a representation of the data

• You could say that at every level, your thinking is, “All else being equal…”

• It is no accident that the organizing principle of going from broad default to fine exception mirrors human thinking in some problem domains

Page 102: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

102

Discussion

• This is a repetition of the rule building algorithms and the implementations mentioned in section 6.11

• Simple rule building for relatively noise-free data, covering, separate and conquer: PRISM

Page 103: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

103

• Incremental reduced-error pruning, RIPPER: JRip in Weka

• Rules from partial decision trees: Part in Weka

• Rules with exceptions, ripple down rules: Ridor in Weka

Page 104: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

104

6.3 Association Rules

• The algorithm given in chapter 4 for finding association rules is known as the apriori algorithm

• It was essentially an exhaustive search
• It was made somewhat more tolerable by observing that if a weaker rule didn't meet the threshold for acceptance, a stronger one wouldn't either

Page 105: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

105

• The bottom line turns out to be that there’s got to be a better way

• There is—it’s known as a frequent pattern or FP-tree implementation (FP not to be confused with false positive)

• Essentially, the FP-tree is based on a special kind of data structure, a prefix tree, a tree with additional information attached
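A minimal sketch of the prefix-tree idea underlying the FP-tree, with hypothetical names: transactions that share a leading sequence of items share a path, and every node carries a count. (A real FP-tree additionally orders items by frequency and maintains header links, which are omitted here.)

class PrefixNode:
    def __init__(self):
        self.count = 0
        self.children = {}  # item -> PrefixNode

def insert(root, transaction):
    # Walk or extend the path for this transaction, bumping counts along the way.
    node = root
    for item in transaction:
        node = node.children.setdefault(item, PrefixNode())
        node.count += 1

def dump(node, prefix=""):
    for item, child in node.children.items():
        print(prefix + item, child.count)
        dump(child, prefix + "  ")

# Hypothetical transactions, with items listed in a fixed order.
root = PrefixNode()
for t in [["bread", "milk"], ["bread", "milk", "eggs"], ["bread", "eggs"]]:
    insert(root, t)
dump(root)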

Page 106: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

106

• The details of the FP-tree implementation are of no interest

• However, there is a side comment of some interest

• The authors mention that it is desirable to use a data structure small enough to be memory resident

Page 107: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

107

• Compare this idea with things that came up in db systems, B-trees and hash join

• Even a high complexity algorithm is relatively tolerable if it can be executed in memory

• Even a low complexity algorithm is relatively costly if it involves access to secondary storage

Page 108: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

108

Association Rule Mining In Weka

• FPGrowth (frequent pattern trees)
• GeneralizedSequentialPatterns (GSP)
• GSP is an application of the idea of apriori rule generation to databases of event sequences
• If, by chance, you are considering such a database of sequential events, GSP might be just the piece of software to use

Page 109: 1 Data Mining Chapter 6 Implementations: Real Machine Learning Schemes Kirk Scott

109

The End