TRANSCRIPT
1
Data Mining, Chapter 6
Implementations: Real Machine Learning Schemes
Kirk Scott
2
The Little Brown Bat
3
A Zombie Fly laying eggs inside a Honey Bee
4
Argyria
5
6
Methemoglobinemia (from Wikipedia)
• Methemoglobinemia (or methaemoglobinaemia) is a disorder characterized by the presence of a higher than normal level of methemoglobin (metHb, i.e., ferric[Fe3+] rather than ferrous [Fe2+] haemoglobin) in the blood. Methemoglobin is an oxidized form of hemoglobin that has a decreased affinity for oxygen, resulting in an increased affinity of oxygen to other heme sites within the same red blood cell.
7
• This leads to an overall reduced ability of the red blood cell to release oxygen to tissues, with the associated oxygen–hemoglobin dissociation curve therefore shifted to the left. When methemoglobin concentration is elevated in red blood cells, tissue hypoxia can occur.
• …
8
• Carriers
• The Fugates, a family that lived in the hills of Kentucky, are the most famous example of this hereditary genetic condition. They are known as the "Blue Fugates." Martin Fugate settled near Hazard, Kentucky, circa 1800. His wife was a carrier of the recessive methemoglobinemia (met-H) gene, as was a nearby clan with whom the Fugates intermarried. As a result, many descendants of the Fugates were born with met-H.[7][8][9]
9
• The "blue men of Lurgan" were a pair of Lurgan men suffering from what was described as "familial idiopathic methaemoglobinaemia" who were treated by Dr. James Deeny in 1942. Deeny, who would later become the Chief Medical Officer of the Republic of Ireland, prescribed a course of ascorbic acid and sodium bicarbonate. In case one, by the eighth day of treatment there was a marked change in appearance and by the twelfth day of treatment the patient's complexion was normal. In case two, the patient's complexion reached normality over a month-long duration of treatment.[10]
10
Back to the Topic at Hand
• Chapter 4 provided an introduction to data mining algorithms and the motivations underlying them
• Chapter 5 provided a relatively in-depth treatment of how results are evaluated
11
• Some evaluation features that exist in Weka were brought out, like lift charts
• Doubtless, other features have also been implemented in Weka
• When you are doing your project, you will have to look more closely into the evaluation tools in Weka
12
• With the foregoing background, chapter 6 covers issues surrounding various data mining algorithms in some detail
• My goal is to present this information at a level where you would be an informed user of Weka
13
• When doing your project, you will be using Weka
• If issues come up, you will be informed enough to recognize them and will be able to search around in Weka for how to make a decision about them
• You will have to become sort of an expert on the data mining algorithms you choose to use
14
• At the end of the chapter, in section 6.11, the book lists all of the implementations in Weka
• I think it will be useful to list all of the implementations up front
• This provides a preview of what you’ll find in Weka
• It also provides context for the discussion of the issues in sections 6.1-6.10
15
Subsections of the Chapter with Implementations in Weka
16
6.1 Decision Trees
• J48 (implementation of C4.5)
• SimpleCart (minimum cost-complexity pruning a la CART)
• REPTree (reduced-error pruning)
17
6.2 Classification Rules
• (For classifiers, see Section 11.4 and Table 11.5.)
• JRip (RIPPER rule learner)
• Part (rules from partial decision trees)
• Ridor (ripple-down rule learner)
18
6.3 Association Rules
• (see Section 11.7 and Table 11.8)
• FPGrowth (frequent-pattern trees)
• GeneralizedSequentialPatterns (find large item sets in sequential data)
19
6.4 Linear Models and Extensions
• SMO and variants for learning support vector machines
• LibSVM (uses third-party libsvm library)
• MultilayerPerceptron
• RBFNetwork (radial-basis function network)
• SPegasos (SVM using stochastic gradient descent)
20
6.5 Instance-Based Learning
• IBk (k-nearest neighbor classifier)
• KStar (generalized distance functions)
• NNge (rectangular generalizations)
21
6.6 Numeric Prediction
• M5P (model trees)
• M5Rules (rules from model trees)
• LWL (locally weighted learning)
22
6.7 Bayesian Networks
• BayesNet
• AODE, WAODE (averaged one-dependence estimators)
23
6.8 Clustering
• (For clustering methods, see Section 11.6 and Table 11.7.)
• Xmeans
• Cobweb (includes Classit)
• EM
24
6.9 Semisupervised Learning
• No separate data mining implementations are listed for this section
25
6.10 Multi-Instance Learning
• MISVM (iterative method for learning SVM by relabeling instances)
• MISMO (SVM with multi-instance kernel)
• CitationKNN (nearest-neighbor method with Hausdorff distance)
26
• MILR (logistic regression for multi-instance data)
• MIOptimalBall (learning balls for multi-instance classification)
• MIDD (the diverse-density method using the noisy-OR function)
27
6.1 Decision Trees
28
Algorithm C4.5
• This was the algorithm introduced in chapter 4
• It is divide and conquer
• Splitting decisions are greedy, based on the purity/information function value of the results
29
A Review of the Algorithm for Nominal Attributes
• 1. The fundamental question at each level of the tree is always which attribute to split on
• In other words, given attributes x1, x2, x3…, do you branch first on x1 or x2 or x3…?
• Having chosen the first to branch on, which of the remaining ones do you branch on next, and so on?
30
• 2. Suppose you can come up with a function, the information (info) function
• This function is a measure of how much information is needed in order to make a decision at each node in a tree
• 3. You split on the attribute that gives the greatest information gain from level to level
31
• 4. A split is good if it means that little information will be needed at the next level down
• You measure the gain by subtracting the amount of information needed at the next level down from the amount needed at the current level
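As a minimal sketch of how the information function drives the split choice (a toy illustration, not Weka's J48 code), the entropy of a class distribution and the gain of a candidate split can be computed like this:

```python
import math

def entropy(class_counts):
    """Information, in bits, needed to classify an instance
    drawn from a node with this class distribution."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Gain = info needed at the current node minus the
    weighted info needed across the children after the split."""
    total = sum(parent_counts)
    after = sum(sum(child) / total * entropy(child)
                for child in child_counts_list)
    return entropy(parent_counts) - after

# Weather-data example: 9 yes / 5 no at the root; a three-way
# split produces branches with (2,3), (4,0), and (3,2) instances.
gain = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])  # ~0.247 bits
```

The split is good precisely because one branch is pure and the others are smaller and no worse than the parent, so little information is needed at the next level down.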
32
Numeric Attributes
• Because most data sets include numeric attributes, the algorithm needs to be extended
• Obviously, numeric attributes fall into a range; they don’t fall into predefined categories
• That means you need to decide where to split (branch) on them
33
• In general you handle numeric attributes by ordering the instances by value and splitting <, > at a single value
• The information function was used to determine which attribute to split on for the nominal case
• The information function can also be used to choose the best split point for a numeric attribute
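A minimal sketch of that idea (hypothetical helper names, not Weka's implementation): sort by attribute value, then evaluate a <, >= split at the midpoint between each pair of adjacent distinct values and keep the best one.

```python
import math

def _info(counts):
    """Entropy, in bits, of a list of class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def best_numeric_split(values, labels):
    """Sort instances by attribute value, then evaluate a <, >= split
    at the midpoint between each pair of adjacent distinct values,
    keeping the threshold with the highest information gain."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))

    def count(part):
        return [sum(1 for _, y in part if y == c) for c in classes]

    n = len(pairs)
    parent_info = _info(count(pairs))
    best_gain, best_split = -1.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no candidate threshold between equal values
        remainder = (i / n) * _info(count(pairs[:i])) \
            + ((n - i) / n) * _info(count(pairs[i:]))
        gain = parent_info - remainder
        if gain > best_gain:
            best_gain = gain
            best_split = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_split, best_gain

# Four instances whose classes separate cleanly at 2.5:
split, gain = best_numeric_split([1, 2, 3, 4], ["a", "a", "b", "b"])
```

Note that sorting once dominates the cost; the candidate thresholds themselves are only n - 1 at most, one per boundary between adjacent distinct values.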
34
Nominal vs. Numeric Splitting
• Differences between splitting on nominal and numeric:
• Nominal—split once on that attribute
• Numeric—may be split again at every succeeding level
• Or, may do a multi-way split on a numeric at a given level
35
Numeric Cost Implications
• Computational cost/implementation question for numerics:
• For a numeric over a range there is a potentially infinite number of possible split points
• Whether you split into multiple branches at one level or split multiple times on the same attribute at different levels, the cost of deciding can be high
36
• Also, if you split at different levels, you have a practical consideration:
• Do the instances have to be re-sorted on the attribute at every level?
• A suitable implementation can preserve the initial sorting so it’s available at all lower levels
37
Missing Values in Trees
• As mentioned in chapter 4, you can handle missing values as a separate branch
• Logically, this makes sense if the absence of a value means something
• If the absence doesn’t mean anything, it makes sense to assign instances to the branches proportionally
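The proportional assignment can be sketched as follows (the function name is illustrative, not from Weka): an instance with a missing value for the split attribute is sent down every branch with a fractional weight proportional to how many known-value instances took each branch.

```python
def distribute_missing(branch_counts, weight=1.0):
    """Split an instance with a missing value for the split attribute
    across all branches, weighting each branch by the share of
    known-value instances that went down it."""
    total = sum(branch_counts)
    return [weight * c / total for c in branch_counts]

# 6, 3, and 1 instances took the three branches, so one instance with
# a missing value contributes fractional weights 0.6, 0.3, and 0.1:
weights = distribute_missing([6, 3, 1])
```

The weights sum to the instance's original weight, so the fractional pieces still count as one instance overall when purity is computed further down the tree.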
38
• Recall that, in simple terms, the best information gain outcome is a leaf that is pure
• Practically speaking, this observation about missing values is worth noting:
• The information function and gain computations can be applied in situations where some of the attribute values are missing
39
Pruning
• This wasn’t discussed in detail in chapter 4
• It turns out to be a big deal
• There are two kinds of pruning: Pre-pruning and post-pruning
40
• Pre-pruning is a bit of a misnomer
• It means that the tree building algorithm includes heuristics that decide not to expand down a given branch
• This is the less common approach
41
• Post-pruning means:
• Create a complete tree following the rules
• After the tree is finished, evaluate it, potentially removing nodes and branches
• This is more common
42
• With post-pruning, on the one hand, you’ve wasted work in developing branches that are pruned
• On the other hand, you don’t throw anything out without having fully developed and evaluated it
• A pre-pruning algorithm will use fewer computational resources, but it may throw out something useful
43
Why Pruning, and How?
• Pruning is important because it goes back to the concepts of training, overfitting, using a test set, and algorithm/result evaluation
• You develop a tree with a training set
• You potentially prune it with a test set
• The end result, obviously, is a smaller tree
• Hopefully it’s also a better tree
44
• By definition, a completed tree will be fitted by the algorithm as closely as possible to the training set
• Pruning involves creating pruned versions of the tree and applying them to the test set
• The error rates for different, pruned versions of the tree are checked against each other and against the original
45
• It turns out that the error rate on the test set may be less if branches in the overfitted tree are merged or removed
• We don’t know exactly how to prune yet
• But notice that this is entirely pragmatic and there is a logic to it
46
• Up until now, the concept of overfitting has simply been asserted
• It may have seemed illogical that a “less well fitted” tree might be better
• But who knows—chopping bits out of the tree and trying it on the test set might give better results
• Why not try and see?
47
The Two Kinds of Post-Pruning
• There are two kinds of post-pruning:
• Subtree replacement
• Subtree raising
48
Subtree Replacement
• Subtree replacement refers to collapsing a subtree (branch) into a single leaf node
• The subtree replacement algorithm is bottom up
• Work from the leaves up, looking for branches where performance on the test set is better if the branch is collapsed
• See Figure 1.3 on the following overhead for illustration
49
Subtree replacement, from (b) to (a), the whole left branch is replaced
50
• Subtree replacement is not too computationally costly
• All of the instances from the collapsed branch go into the leaf that replaces it
• No additional computation is needed for this step
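A toy sketch of the bottom-up replacement check (using raw training counts for simplicity, not the pessimistic error estimates C4.5 actually uses): a node whose children are all leaves is collapsed into a single majority-class leaf whenever the pooled leaf makes no more errors than the separate children do.

```python
def errors(counts):
    """Misclassifications if a leaf predicts its majority class."""
    return sum(counts) - max(counts)

def prune(tree):
    """Bottom-up subtree replacement over a toy tree structure:
    a node is ('node', [children]); a leaf is ('leaf', class_counts).
    Collapse a node into a leaf over the pooled counts when the
    pooled leaf makes no more errors than its children combined."""
    kind, body = tree
    if kind == "leaf":
        return tree
    children = [prune(child) for child in body]
    if all(k == "leaf" for k, _ in children):
        pooled = [sum(c) for c in zip(*(cnt for _, cnt in children))]
        if errors(pooled) <= sum(errors(cnt) for _, cnt in children):
            return ("leaf", pooled)
    return ("node", children)

# Two leaves with counts [3 yes, 1 no] and [2 yes, 1 no]: the pooled
# leaf [5, 2] makes 2 errors, the separate leaves also make 1 + 1 = 2,
# so the branch is collapsed:
pruned = prune(("node", [("leaf", [3, 1]), ("leaf", [2, 1])]))
```

Because the recursion prunes children first, a collapse at one level can enable further collapses above it, which is exactly the "work from the leaves up" behavior described.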
51
Subtree Raising
• Subtree raising refers to collapsing an internal node and raising one of its children to replace it
• The other children of the original have to be reapportioned into the branches of the replacement
• Typically only the child with the most descendants is a candidate for raising
• See Figure 6.1 on the following overhead for illustration
52
Subtree raising, from (a) to (b), B is replaced by C, and the instances in 4 and 5 have to be reapportioned into 1, 2, and 3
53
• The subtree raising algorithm is more computationally intensive than subtree replacement
• The expense comes from reapportioning the instances into the new branches/leaves and recalculating the purity/error rate
54
Estimating Error Rates
• As explained earlier, a pruning algorithm can be based on error rates on test sets
• Generally, the test set is smaller than the training set
• It may not be representative of the overall population
• It may undo the overfitting from the training set, while not being perfect itself
55
• It turns out that the C4.5 algorithm doesn’t actually use a test set
• Using certain statistical assumptions, it bases error estimates on the training set
• All we need to understand is that C4.5 does include pruning
• The statistics are explained in a box and, as usual, there’s no need to know the details
56
Complexity of Decision Tree Induction
• The deep details of the derivation of the computational complexity of the algorithm are not important
• However, it’s worth noting that the complexity is tractable
57
• The book gives this as the overall figure for tree induction (creation) followed by post-pruning:
• O(mn log n) + O(n (log n)²)
• m = attributes
• n = instances
58
The Cost of Building the Tree
• O(mn log n)
• Informally:
• log n is the number of levels of the tree for some log base = average degree of branching
• At every level, in the worst case, you have to consider all n instances
• You do this for all of the m attributes
59
The Cost of Pruning the Tree with Subtree Replacement
• O(n)
• This is smaller than the order for subtree raising, so it is not included separately in the formula
60
The Cost of Pruning the Tree with Subtree Raising
• O(n (log n)²)
• n instances potentially reclassified at every level of the tree gives O(n log n) reclassification operations
• Each reclassification itself costs O(log n)
• Therefore, the total order of complexity is that shown above
61
• All we really need to know is that decision tree induction can be implemented in a way that runs in log/polynomial time
• It is a computationally practical algorithm
62
From Trees to Rules
• As noted in chapter 4, following every branch of a tree gives a complete set of rules for it
• Rule sets can be pruned just like a tree can
• A specific approach of making rules from trees will come up in the next numbered section
63
C4.5: Choices and Options
• C4.5 has some tunable parameters
• They apply to both nominal and numeric attributes
• The parameters are:
• Confidence value
• Minimum outcomes and minimum instances
• To prune or not to prune
64
Confidence Value
• As noted already, the C4.5 algorithm in Weka uses statistical tools with the training set instead of the test set to calculate error rates
• And the details are beyond the scope of this set of overheads
• The default confidence value in Weka is 25%, and for lack of a better understanding of what it means, we’ll just accept it
65
Minimum Outcomes, Minimum Instances
• The minimum outcomes and instances are easier to understand
• What good is a splitting condition on an attribute that doesn’t have at least two outcomes?
• And what good is a splitting condition that doesn’t have at least two instances per branch?
• The default values for these parameters in Weka are 2 and 2
66
• It is apparent that these defaults are rock bottom values
• It is of some interest what effect changing them would have
• For lack of a better understanding, you might accept these defaults
• Or you might experiment with other values and see what effect that has on the results
67
To Prune or Not to Prune
• Pruning can be turned off in C4.5 in Weka in order to obtain a more or less complete tree
• However, due to some parts of the algorithm as implemented, even with explicit post-pruning turned off, the output may have been pruned in some way
68
Cost-Complexity Pruning
• The pruning algorithm in C4.5 is fast
• However, it doesn’t always prune enough
• CART = Classification and Regression Trees
• This scheme has a more advanced, stringent, and costly approach to pruning
• It might profitably be applied to a C4.5-derived tree, giving a smaller, better result
69
Discussion
• Tree induction is presented first in this chapter because it’s probably the most studied of the data mining schemes
• As presented up to this point, decision nodes have been on one attribute
• CART supports decision nodes on >1 (nominal) attributes at a time
70
• For numeric attributes, a decision can be made based on a function of >1 attribute at a node
• Multivariate numeric test conditions are hyperplanes, not parallel to an axis like a single attribute compared to a constant
• Fancier schemes will take longer to run
• The results may be more compact, but also harder for humans to understand
71
What’s in Weka?
• Implementation of C4.5: J48 in Weka
• Reduced-error pruning: REPTree in Weka
• Minimum cost-complexity pruning a la CART (classification and regression trees): SimpleCart in Weka
72
6.2 Classification Rules
• Recall the basic idea presented in chapter 4:
• You can make rules by trying to “cover” classifications in the data
• Recall, also, that the question of whether to accept “imperfect” rules or only “perfect” rules came up
• It is one of the aspects that will be discussed more here
73
• The basic question with rules is the same as with trees:
• A rule-producing algorithm will tend to overfit the training data
• That will mean that it is not such a good predictor
• How do you evaluate the error rate of a rule on a test set and decide whether it is good enough to keep?
74
Criteria for Choosing Tests
• Recall that in chapter 4, tests, namely conditions, are added to a rule with AND under this criterion:
• p stands for the number of correct classifications (p = positive)
• t stands for the total number covered (t = total)
• You wanted to maximize p/t
75
• Maximizing p/t is what led to “perfection”
• If there was a condition that gave p/t = 1, it would be chosen
• This is not necessarily ideal
• Which rule is better:
• A rule that covers one instance with p/t = 1
• Or a rule that covers 1,000 cases with p/t = 999/1000?
76
An Alternative Rule Evaluation Criterion
• Let P and T be the values for a rule before a new condition is added
• Let p and t be the values after a condition is added
• You could compare different conditions by finding this product based on information gain:
• p × [log(p/t) − log(P/T)]
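Applied to the 1-instance versus 1,000-instance comparison from the previous slide (with arbitrary illustrative values for P and T), the two criteria rank the candidates differently:

```python
import math

def accuracy_criterion(p, t):
    """Chapter 4 criterion: the fraction p/t of covered instances
    that are classified correctly."""
    return p / t

def info_gain_criterion(p, t, P, T):
    """p * (log(p/t) - log(P/T)): the log term rewards improved
    accuracy, and the factor p rewards keeping coverage high."""
    return p * (math.log2(p / t) - math.log2(P / T))

# Starting from a rule covering T = 2000 instances, P = 1200 of them
# correctly, compare adding a condition that leaves 1 covered (perfect)
# against one that leaves 1000 covered with 999 correct:
narrow = info_gain_criterion(1, 1, 1200, 2000)
broad = info_gain_criterion(999, 1000, 1200, 2000)
# p/t prefers the narrow rule (1.0 vs. 0.999); the gain-based
# criterion strongly prefers the broad one.
```

This makes the skew described on the next slide concrete: multiplying by p means a near-perfect condition covering many instances scores far higher than a perfect condition covering one.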
77
• This new criterion skews the judgment from perfection to the number of cases covered
• If you run an algorithm that quits only when ultimate perfection is achieved, you’ll get there eventually by selecting rules based on either criterion
78
• There is no absolute best criterion for selecting rules
• The real problem is still trimming the rule set back until it is useful for prediction
79
Missing Values, Numeric Attributes
• Covering algorithms tend to handle missing values pretty well (assuming that the majority of the values aren’t missing…)
• Informally, you could say the algorithm builds rules for positive hits
• It effectively ignores missing values
• Separate and conquer means that you slowly narrow down to a remainder of instances
80
• Either instances with missing values are handled earlier in the process based on attributes with values
• Or at the end, the few remaining instances will be handled as special cases
• Conditions will be added to rules so that exceptional instances are classified—on attributes with values
• Handling these exceptional cases may actually constitute overfitting…
81
• Numeric valued attributes can be handled for rules just like for trees
• Instances can be ordered on the attribute value and all candidate rules based on a <, > comparison or split can be evaluated
82
Generating Good Rules
• This goes back to the idea that an imperfect rule is not overfitted and might make a good predictor
• The approach is similar to the approach with trees
• Divide the data 2/3, 1/3 into a growing set and a pruning set, for example
83
• The growing set is used to make rules by adding conditions
• The pruning set is used to simplify rules by removing conditions
• The criterion for removing rules is a reduction in the error rate on the pruning set
• This is called reduced-error pruning
84
• One algorithm for doing this is incremental reduced-error pruning
• The algorithm goes like this:
• For a given classification, grow a complete covering rule for it
• Now test the error rate of the rule on the pruning set and compare this with the error rate for all “sub-rules” with conditions removed
• For that class, hold onto whichever rule has the lowest error rate
• Do this for all classes
• Compare the error rates for the rules for each class
• Keep the one rule with the lowest error rate
• Remove the covered instances
• Repeat
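The pruning step inside this loop can be sketched in highly simplified form (toy rule representation, hypothetical attribute names): a grown rule is compared against every sub-rule obtained by dropping trailing conditions, and whichever does best on the pruning set is kept.

```python
def error_rate(rule, cls, instances):
    """A rule is a list of (attribute, value) tests, all of which must
    hold; its error is the fraction of covered instances whose class
    differs from cls (1.0 if it covers nothing)."""
    covered = [x for x in instances if all(x[a] == v for a, v in rule)]
    if not covered:
        return 1.0
    return sum(1 for x in covered if x["class"] != cls) / len(covered)

def prune_rule(rule, cls, pruning_set):
    """Compare the grown rule against every sub-rule formed by dropping
    trailing conditions; keep the one with the lowest error on the
    pruning set, breaking ties in favor of the shorter rule."""
    candidates = [rule[:i] for i in range(1, len(rule) + 1)]
    return min(candidates, key=lambda r: error_rate(r, cls, pruning_set))

pruning_set = [
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "play"},
    {"outlook": "rainy", "windy": "no", "class": "stay"},
]
# The second condition adds nothing on the pruning set, so it is dropped:
pruned = prune_rule([("outlook", "sunny"), ("windy", "no")], "play",
                    pruning_set)
```

The full incremental algorithm wraps this in the outer loop above: pick the best pruned rule across classes, remove the instances it covers, and go again.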
86
• Why do you have to repeat?
• Because there is a difference between the training set and the test set and you are developing rules that may be imperfect for the training set
• You remove the instances covered by the accepted rule from the training set and then go again
• That’s what they mean by incremental
87
• A non-incremental version of the algorithm would build a complete rule set first and then prune it
• This is more time-consuming
88
Evaluating Error Rate to Select Rules
• The book suggests various alternatives to comparing on the basis of p/t or expressions including log p/t – log P/T
• They all have the same problem:
• We’ve decided to accept imperfect rules, not just perfect ones
• There is no perfect balancing point between percent covered correctly and total number covered correctly
89
Algorithm Performance/Refinements
• Incremental reduced-error pruning produces good rule sets quickly
• It can be sped up by simply picking a rule for each class in order of size, from smallest to largest
• It will also run more quickly with a suitable stopping condition
90
• For example, once a rule is accepted with a sufficiently low error rate, stop searching for more refinements
• Unfortunately, such a stopping condition may cause better solutions to be overlooked
• A better stopping condition may be based on the MDL principle
91
Using Global Optimization
• Everything we’ve talked about is heuristic
• We can’t claim we’re finding optimal trees or rule sets
• After finishing an algorithm, further heuristics may be applied, which may lead to better (but not actually optimal) solutions
• In this context the idea is to run the incremental algorithm, then try to improve, taking all of the derived rules into account
92
What is RIPPER?
• RIPPER stands for repeated incremental pruning to produce error reduction
• This is the name of a rule generation scheme
• In short, it has most of the bells and whistles noted above built in, in order to improve the rule sets generated
93
Obtaining Rules from Partial Decision Trees
• The book asserts that rule building schemes tend to prune too much
• Tree building schemes tend to err in the opposite direction, pruning too little
• A balancing approach is to use trees to develop rule sets
94
• In gross form, you could build a complete tree
• Then pick the best rule by tracing all the branches
• Then remove the covered instances and repeat
• However, building a complete tree each time is wasteful and unnecessary
95
• The alternative is to build a partial decision tree
• In brief, what you do is a form of pre-pruning during tree creation
• You use metrics that tell you there’s no need to explore certain branches further
• The decision about which branches merit further expansion is based on their entropy (information function) values
96
• Once the tree development is complete, you pick the best rule among those branches that can be traced to a leaf
• Then you throw out the partial tree and repeat the process with the instances that weren’t covered by the rule
97
• This technique is simpler than other schemes in this sense:
• It doesn’t have a global optimization stage at the end
• It can give rule sets that match the performance of schemes that do require global optimization
98
Rules with Exceptions
• Rules with exceptions get a more complete and friendlier presentation here than in chapter 4
• They are not a logical abomination
• They have a logic of their own
99
• Imagine starting with a default case—namely the majority classification
• All instances which don’t fall into this classification are exceptions
• Among the exceptions, let the majority classification be the new default
• Then those instances which don’t fall into this classification are exceptions
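The nesting of defaults and exceptions can be sketched on bare class labels (a toy illustration only; real ripple-down rules attach test conditions at each level):

```python
from collections import Counter

def defaults_with_exceptions(labels, depth=0):
    """The majority class becomes the default at this level; the
    instances it gets wrong are handled recursively as exceptions."""
    if not labels:
        return []
    default, _ = Counter(labels).most_common(1)[0]
    exceptions = [y for y in labels if y != default]
    line = "  " * depth + "default: " + default
    return [line] + defaults_with_exceptions(exceptions, depth + 1)

# Mostly "a", some "b", one "c": each level peels off a majority class,
# and what remains becomes the exceptions for the next level down.
structure = defaults_with_exceptions(["a"] * 5 + ["b"] * 3 + ["c"])
```

The indented output mirrors the "all else being equal" reading: the top level captures the broad default, and each deeper level covers a progressively narrower exception.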
100
• With other techniques it might be the nth iteration (node split, condition added) before you ultimately nail something
• Here, you’re always thinking of the majority case first
• Within one or two levels at the top you have a good picture of the situation overall
• You go down for finer levels of detail
101
• Effectively, this is an alternative way of building a representation of the data
• You could say that at every level, your thinking is, “All else being equal…”
• It is no accident that the organizing principle of going from broad default to fine exception mirrors human thinking in some problem domains
102
Discussion
• This is a repetition of the rule building algorithms and the implementations mentioned in section 6.11
• Simple rule building for relatively noise-free data, covering, separate and conquer: PRISM
103
• Incremental reduced-error pruning, RIPPER: JRip in Weka
• Rules from partial decision trees: Part in Weka
• Rules with exceptions, ripple down rules: Ridor in Weka
104
6.3 Association Rules
• The algorithm given in chapter 4 for finding association rules is known as the apriori algorithm
• It was essentially an exhaustive search
• It was made somewhat more tolerable by observing that if a weaker rule didn’t meet the threshold for acceptance, a stronger one wouldn’t either
105
• The bottom line turns out to be that there’s got to be a better way
• There is—it’s known as a frequent pattern or FP-tree implementation (FP not to be confused with false positive)
• Essentially, the FP-tree is based on a special kind of data structure, a prefix tree, a tree with additional information attached
106
• The details of the FP-tree implementation are of no interest
• However, there is a side comment of some interest
• The authors mention that it is desirable to use a data structure small enough to be memory resident
107
• Compare this idea with things that came up in db systems, B-trees and hash join
• Even a high complexity algorithm is relatively tolerable if it can be executed in memory
• Even a low complexity algorithm is relatively costly if it involves access to secondary storage
108
Association Rule Mining In Weka
• FPGrowth (frequent pattern trees)
• GeneralizedSequentialPatterns (GSP)
• GSP is an application of the idea of apriori rule generation to databases of event sequences
• If, by chance, you are considering such a database of sequential events, GSP might be just the piece of software to use
109
The End