TRANSCRIPT
1
Data Mining, Chapter 6
Implementations: Real Machine Learning Schemes
Kirk Scott
2
The Little Brown Bat
3
A Zombie Fly laying eggs inside a Honey Bee
4
Argyria
5
6
Methemoglobinemia (from Wikipedia)
• Methemoglobinemia (or methaemoglobinaemia) is a disorder characterized by the presence of a higher than normal level of methemoglobin (metHb, i.e., ferric[Fe3+] rather than ferrous [Fe2+] haemoglobin) in the blood. Methemoglobin is an oxidized form of hemoglobin that has a decreased affinity for oxygen, resulting in an increased affinity of oxygen to other heme sites within the same red blood cell.
7
• This leads to an overall reduced ability of the red blood cell to release oxygen to tissues, with the associated oxygen–hemoglobin dissociation curve therefore shifted to the left. When methemoglobin concentration is elevated in red blood cells, tissue hypoxia can occur.
• …
8
• Carriers
• The Fugates, a family that lived in the hills of Kentucky, are the most famous example of this hereditary genetic condition. They are known as the "Blue Fugates." Martin Fugate settled near Hazard, Kentucky, circa 1800. His wife was a carrier of the recessive methemoglobinemia (met-H) gene, as was a nearby clan with whom the Fugates intermarried. As a result, many descendants of the Fugates were born with met-H.[7][8][9]
9
• The "blue men of Lurgan" were a pair of Lurgan men suffering from what was described as "familial idiopathic methaemoglobinaemia" who were treated by Dr. James Deeny in 1942. Deeny, who would later become the Chief Medical Officer of the Republic of Ireland, prescribed a course of ascorbic acid and sodium bicarbonate. In case one, by the eighth day of treatment there was a marked change in appearance and by the twelfth day of treatment the patient's complexion was normal. In case two, the patient's complexion reached normality over a month-long duration of treatment.[10]
10
Back to the Topic at Hand
• Chapter 4 provided an introduction to data mining algorithms and the motivations underlying them
• Chapter 5 provided a relatively in-depth treatment of how results are evaluated
11
• Some evaluation features that exist in Weka were brought out, like lift charts
• Doubtless, other features have also been implemented in Weka
• When you are doing your project, you will have to look more closely into the evaluation tools in Weka
12
• With the foregoing background, chapter 6 covers issues surrounding various data mining algorithms in some detail
• My goal is to present this information at a level where you would be an informed user of Weka
13
• When doing your project, you will be using Weka
• If issues come up, you will be informed enough to recognize them and will be able to search around in Weka for how to make a decision about them
• You will have to become sort of an expert on the data mining algorithms you choose to use
14
• At the end of the chapter, in section 6.11, the book lists all of the implementations in Weka
• I think it will be useful to list all of the implementations up front
• This provides a preview of what you’ll find in Weka
• It also provides context for the discussion of the issues in sections 6.1-6.10
15
Subsections of the Chapter with Implementations in Weka
16
6.1 Decision Trees
• J48 (implementation of C4.5)
• SimpleCart (minimum cost-complexity pruning a la CART)
• REPTree (reduced-error pruning)
17
6.2 Classification Rules
• (For classifiers, see Section 11.4 and Table 11.5.)
• JRip (RIPPER rule learner)
• Part (rules from partial decision trees)
• Ridor (ripple-down rule learner)
18
6.3 Association Rules
• (see Section 11.7 and Table 11.8)
• FPGrowth (frequent-pattern trees)
• GeneralizedSequentialPatterns (find large item sets in sequential data)
19
6.4 Linear Models and Extensions
• SMO and variants for learning support vector machines
• LibSVM (uses third-party libsvm library)
• MultilayerPerceptron
• RBFNetwork (radial-basis function network)
• SPegasos (SVM using stochastic gradient descent)
20
6.5 Instance-Based Learning
• IBk (k-nearest neighbor classifier)
• KStar (generalized distance functions)
• NNge (rectangular generalizations)
21
6.6 Numeric Prediction
• M5P (model trees)
• M5Rules (rules from model trees)
• LWL (locally weighted learning)
22
6.7 Bayesian Networks
• BayesNet
• AODE, WAODE (averaged one-dependence estimators)
23
6.8 Clustering
• (For clustering methods, see Section 11.6 and Table 11.7.)
• Xmeans
• Cobweb (includes Classit)
• EM
24
6.9 Semisupervised Learning
• No separate data mining implementations are listed for this section
25
6.10 Multi-Instance Learning
• MISVM (iterative method for learning SVM by relabeling instances)
• MISMO (SVM with multi-instance kernel)
• CitationKNN (nearest-neighbor method with Hausdorff distance)
26
• MILR (logistic regression for multi-instance data)
• MIOptimalBall (learning balls for multi-instance classification)
• MIDD (the diverse-density method using the noisy-OR function)
27
6.1 Decision Trees
28
Algorithm C4.5
• This was the algorithm introduced in chapter 4
• It is divide and conquer
• Splitting decisions are greedy, based on the purity/information function value of the results
29
A Review of the Algorithm for Nominal Attributes
• 1. The fundamental question at each level of the tree is always which attribute to split on
• In other words, given attributes x1, x2, x3…, do you branch first on x1 or x2 or x3…?
• Having chosen the first to branch on, which of the remaining ones do you branch on next, and so on?
30
• 2. Suppose you can come up with a function, the information (info) function
• This function is a measure of how much information is needed in order to make a decision at each node in a tree
• 3. You split on the attribute that gives the greatest information gain from level to level
31
• 4. A split is good if it means that little information will be needed at the next level down
• You measure the gain by subtracting the amount of information needed at the next level down from the amount needed at the current level
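As a minimal sketch of how the information function drives the split choice (a toy illustration, not Weka's J48 code), the entropy of a class distribution and the gain of a candidate split can be computed like this:

```python
import math

def entropy(class_counts):
    """Information, in bits, needed to classify an instance
    drawn from a node with this class distribution."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def information_gain(parent_counts, child_counts_list):
    """Gain = info needed at the current node minus the
    weighted info needed across the children after the split."""
    total = sum(parent_counts)
    after = sum(sum(child) / total * entropy(child)
                for child in child_counts_list)
    return entropy(parent_counts) - after

# Weather-data example: 9 yes / 5 no at the root; a three-way
# split produces branches with (2,3), (4,0), and (3,2) instances.
gain = information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])  # ~0.247 bits
```

The split is good precisely because one branch is pure and the others are smaller and no worse than the parent, so little information is needed at the next level down.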
32
Numeric Attributes
• Because most data sets include numeric attributes, the algorithm needs to be extended
• Obviously, numeric attributes fall into a range; they don’t fall into predefined categories
• That means you need to decide where to split (branch) on them
33
• In general you handle numeric attributes by ordering the instances by value and splitting <, > at a single value
• The information function was used to determine which attribute to split on for the nominal case
• The information function can also be used to choose the best split point for a numeric attribute
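A minimal sketch of that idea (hypothetical helper names, not Weka's implementation): sort by attribute value, then evaluate a <, >= split at the midpoint between each pair of adjacent distinct values and keep the best one.

```python
import math

def _info(counts):
    """Entropy, in bits, of a list of class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

def best_numeric_split(values, labels):
    """Sort instances by attribute value, then evaluate a <, >= split
    at the midpoint between each pair of adjacent distinct values,
    keeping the threshold with the highest information gain."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))

    def count(part):
        return [sum(1 for _, y in part if y == c) for c in classes]

    n = len(pairs)
    parent_info = _info(count(pairs))
    best_gain, best_split = -1.0, None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no candidate threshold between equal values
        remainder = (i / n) * _info(count(pairs[:i])) \
            + ((n - i) / n) * _info(count(pairs[i:]))
        gain = parent_info - remainder
        if gain > best_gain:
            best_gain = gain
            best_split = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_split, best_gain

# Four instances whose classes separate cleanly at 2.5:
split, gain = best_numeric_split([1, 2, 3, 4], ["a", "a", "b", "b"])
```

Note that sorting once dominates the cost; the candidate thresholds themselves are only n - 1 at most, one per boundary between adjacent distinct values.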
34
Nominal vs. Numeric Splitting
• Differences between splitting on nominal and numeric:
• Nominal—split once on that attribute
• Numeric—may be split again at every succeeding level
• Or, may do a multi-way split on a numeric at a given level
35
Numeric Cost Implications
• Computational cost/implementation question for numerics:
• For a numeric over a range there is a potentially infinite number of possible split points
• Whether you split into multiple branches at one level or split multiple times on the same attribute at different levels, the cost of deciding can be high
36
• Also, if you split at different levels, you have a practical consideration:
• Do the instances have to be re-sorted on the attribute at every level?
• A suitable implementation can preserve the initial sorting so it’s available at all lower levels
37
Missing Values in Trees
• As mentioned in chapter 4, you can handle missing values as a separate branch
• Logically, this makes sense if the absence of a value means something
• If the absence doesn’t mean anything, it makes sense to assign instances to the branches proportionally
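The proportional assignment can be sketched as follows (the function name is illustrative, not from Weka): an instance with a missing value for the split attribute is sent down every branch with a fractional weight proportional to how many known-value instances took each branch.

```python
def distribute_missing(branch_counts, weight=1.0):
    """Split an instance with a missing value for the split attribute
    across all branches, weighting each branch by the share of
    known-value instances that went down it."""
    total = sum(branch_counts)
    return [weight * c / total for c in branch_counts]

# 6, 3, and 1 instances took the three branches, so one instance with
# a missing value contributes fractional weights 0.6, 0.3, and 0.1:
weights = distribute_missing([6, 3, 1])
```

The weights sum to the instance's original weight, so the fractional pieces still count as one instance overall when purity is computed further down the tree.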
38
• Recall that, in simple terms, the best information gain outcome is a leaf that is pure
• Practically speaking, this observation about missing values is worth noting:
• The information function and gain computations can be applied in situations where some of the attribute values are missing
39
Pruning
• This wasn’t discussed in detail in chapter 4
• It turns out to be a big deal
• There are two kinds of pruning: Pre-pruning and post-pruning
40
• Pre-pruning is a bit of a misnomer
• It means that the tree building algorithm includes heuristics that decide not to expand down a given branch
• This is the less common approach
41
• Post-pruning means:
• Create a complete tree following the rules
• After the tree is finished, evaluate it, potentially removing nodes and branches
• This is more common
42
• With post-pruning, on the one hand, you’ve wasted work in developing branches that are pruned
• On the other hand, you don’t throw anything out without having fully developed and evaluated it
• A pre-pruning algorithm will use fewer computational resources, but it may throw out something useful
43
Why Pruning, and How?
• Pruning is important because it goes back to the concepts of training, overfitting, using a test set, and algorithm/result evaluation
• You develop a tree with a training set
• You potentially prune it with a test set
• The end result, obviously, is a smaller tree
• Hopefully it’s also a better tree
44
• By definition, a completed tree will be fitted by the algorithm as closely as possible to the training set
• Pruning involves creating pruned versions of the tree and applying them to the test set
• The error rates for different, pruned versions of the tree are checked against each other and against the original
45
• It turns out that the error rate on the test set may be less if branches in the overfitted tree are merged or removed
• We don’t know exactly how to prune yet
• But notice that this is entirely pragmatic and there is a logic to it
46
• Up until now, the concept of overfitting has simply been asserted
• It may have seemed illogical that a “less well fitted” tree might be better
• But who knows—chopping bits out of the tree and trying it on the test set might give better results
• Why not try and see?
47
The Two Kinds of Post-Pruning
• There are two kinds of post-pruning:
• Subtree replacement
• Subtree raising
48
Subtree Replacement
• Subtree replacement refers to collapsing a subtree (branch) into a single leaf node
• The subtree replacement algorithm is bottom up
• Work from the leaves up, looking for branches where performance on the test set is better if the branch is collapsed
• See Figure 1.3 on the following overhead for illustration
49
Subtree replacement, from (b) to (a), the whole left branch is replaced
50
• Subtree replacement is not too computationally costly
• All of the instances from the collapsed branch go into the leaf that replaces it
• No additional computation is needed for this step
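A toy sketch of the bottom-up replacement check (using raw training counts for simplicity, not the pessimistic error estimates C4.5 actually uses): a node whose children are all leaves is collapsed into a single majority-class leaf whenever the pooled leaf makes no more errors than the separate children do.

```python
def errors(counts):
    """Misclassifications if a leaf predicts its majority class."""
    return sum(counts) - max(counts)

def prune(tree):
    """Bottom-up subtree replacement over a toy tree structure:
    a node is ('node', [children]); a leaf is ('leaf', class_counts).
    Collapse a node into a leaf over the pooled counts when the
    pooled leaf makes no more errors than its children combined."""
    kind, body = tree
    if kind == "leaf":
        return tree
    children = [prune(child) for child in body]
    if all(k == "leaf" for k, _ in children):
        pooled = [sum(c) for c in zip(*(cnt for _, cnt in children))]
        if errors(pooled) <= sum(errors(cnt) for _, cnt in children):
            return ("leaf", pooled)
    return ("node", children)

# Two leaves with counts [3 yes, 1 no] and [2 yes, 1 no]: the pooled
# leaf [5, 2] makes 2 errors, the separate leaves also make 1 + 1 = 2,
# so the branch is collapsed:
pruned = prune(("node", [("leaf", [3, 1]), ("leaf", [2, 1])]))
```

Because the recursion prunes children first, a collapse at one level can enable further collapses above it, which is exactly the "work from the leaves up" behavior described.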
51
Subtree Raising
• Subtree raising refers to collapsing an internal node and raising one of its children to replace it
• The other children of the original have to be reapportioned into the branches of the replacement
• Typically only the child with the most descendants is a candidate for raising
• See Figure 6.1 on the following overhead for illustration
52
Subtree raising, from (a) to (b), B is replaced by C, and the instances in 4 and 5 have to be reapportioned into 1, 2, and 3
53
• The subtree raising algorithm is more computationally intensive than subtree replacement
• The expense comes from reapportioning the instances into the new branches/leaves and recalculating the purity/error rate
54
Estimating Error Rates
• As explained earlier, a pruning algorithm can be based on error rates on test sets
• Generally, the test set is smaller than the training set
• It may not be representative of the overall population
• It may undo the overfitting from the training set, while not being perfect itself
55
• It turns out that the C4.5 algorithm doesn’t actually use a test set
• Using certain statistical assumptions, it bases error estimates on the training set
• All we need to understand is that C4.5 does include pruning
• The statistics are explained in a box and, as usual, there’s no need to know the details
56
Complexity of Decision Tree Induction
• The deep details of the derivation of the computational complexity of the algorithm are not important
• However, it’s worth noting that the complexity is tractable
57
• The book gives this as the overall figure for tree induction (creation) followed by post-pruning:
• O(mn log n) + O(n (log n)²)
• m = attributes
• n = instances
58
The Cost of Building the Tree
• O(mn log n)
• Informally:
• log n is the number of levels of the tree for some log base = average degree of branching
• At every level, in the worst case, you have to consider all n instances
• You do this for all of the m attributes
59
The Cost of Pruning the Tree with Subtree Replacement
• O(n)
• This is smaller than the order for subtree raising, so it is not included separately in the formula
60
The Cost of Pruning the Tree with Subtree Raising
• O(n (log n)²)
• n instances potentially reclassified at every level of the tree gives O(n log n) reclassification operations
• Each reclassification itself costs O(log n)
• Therefore, the total order of complexity is that shown above
61
• All we really need to know is that decision tree induction can be implemented in a way that runs in log/polynomial time
• It is a computationally practical algorithm
62
From Trees to Rules
• As noted in chapter 4, following every branch of a tree gives a complete set of rules for it
• Rule sets can be pruned just like a tree can
• A specific approach of making rules from trees will come up in the next numbered section
63
C4.5: Choices and Options
• C4.5 has some tunable parameters
• They apply to both nominal and numeric attributes
• The parameters are:
• Confidence value
• Minimum outcomes and minimum instances
• To prune or not to prune
64
Confidence Value
• As noted already, the C4.5 algorithm in Weka uses statistical tools with the training set instead of the test set to calculate error rates
• And the details are beyond the scope of this set of overheads
• The default confidence value in Weka is 25%, and for lack of a better understanding of what it means, we’ll just accept it
65
Minimum Outcomes, Minimum Instances
• The minimum outcomes and instances are easier to understand
• What good is a splitting condition on an attribute that doesn’t have at least two outcomes?
• And what good is a splitting condition that doesn’t have at least two instances per branch?
• The default values for these parameters in Weka are 2 and 2
66
• It is apparent that these defaults are rock bottom values
• It is of some interest what effect changing them would have
• For lack of a better understanding, you might accept these defaults
• Or you might experiment with other values and see what effect that has on the results
67
To Prune or Not to Prune
• Pruning can be turned off in C4.5 in Weka in order to obtain a more or less complete tree
• However, due to some parts of the algorithm as implemented, even with explicit post-pruning turned off, the output may have been pruned in some way
68
Cost-Complexity Pruning
• The pruning algorithm in C4.5 is fast
• However, it doesn’t always prune enough
• CART = Classification and Regression Trees
• This scheme has a more advanced, stringent, and costly approach to pruning
• It might profitably be applied to a C4.5-derived tree, giving a smaller, better result
69
Discussion
• Tree induction is presented first in this chapter because it’s probably the most studied of the data mining schemes
• As presented up to this point, decision nodes have been on one attribute
• CART supports decision nodes on >1 (nominal) attributes at a time
70
• For numeric attributes, a decision can be made based on a function of >1 attribute at a node
• Multivariate numeric test conditions are hyperplanes, not parallel to an axis like a single attribute compared to a constant
• Fancier schemes will take longer to run
• The results may be more compact, but also harder for humans to understand
71
What’s in Weka?
• Implementation of C4.5: J48 in Weka
• Reduced-error pruning: REPTree in Weka
• Minimum cost-complexity pruning a la CART (classification and regression trees): SimpleCart in Weka
72
6.2 Classification Rules
• Recall the basic idea presented in chapter 4:
• You can make rules by trying to “cover” classifications in the data
• Recall, also, that the question of whether to accept “imperfect” rules or only “perfect” rules came up
• It is one of the aspects that will be discussed more here
73
• The basic question with rules is the same as with trees:
• A rule-producing algorithm will tend to overfit the training data
• That will mean that it is not such a good predictor
• How do you evaluate the error rate of a rule on a test set and decide whether it is good enough to keep?
74
Criteria for Choosing Tests
• Recall that in chapter 4, tests, namely conditions, are added to a rule with AND under this criterion:
• p stands for the number of correct classifications (p = positive)
• t stands for the total number covered (t = total)
• You wanted to maximize p/t
75
• Maximizing p/t is what led to “perfection”
• If there was a condition that gave p/t = 1, it would be chosen
• This is not necessarily ideal
• Which rule is better:
• A rule that covers one instance with p/t = 1
• Or a rule that covers 1,000 cases with p/t = 999/1000?
76
An Alternative Rule Evaluation Criterion
• Let P and T be the values for a rule before a new condition is added
• Let p and t be the values after a condition is added
• You could compare different conditions by finding this product based on information gain:
• p × [log(p/t) − log(P/T)]
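Applied to the 1-instance versus 1,000-instance comparison from the previous slide (with arbitrary illustrative values for P and T), the two criteria rank the candidates differently:

```python
import math

def accuracy_criterion(p, t):
    """Chapter 4 criterion: the fraction p/t of covered instances
    that are classified correctly."""
    return p / t

def info_gain_criterion(p, t, P, T):
    """p * (log(p/t) - log(P/T)): the log term rewards improved
    accuracy, and the factor p rewards keeping coverage high."""
    return p * (math.log2(p / t) - math.log2(P / T))

# Starting from a rule covering T = 2000 instances, P = 1200 of them
# correctly, compare adding a condition that leaves 1 covered (perfect)
# against one that leaves 1000 covered with 999 correct:
narrow = info_gain_criterion(1, 1, 1200, 2000)
broad = info_gain_criterion(999, 1000, 1200, 2000)
# p/t prefers the narrow rule (1.0 vs. 0.999); the gain-based
# criterion strongly prefers the broad one.
```

This makes the skew described on the next slide concrete: multiplying by p means a near-perfect condition covering many instances scores far higher than a perfect condition covering one.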
77
• This new criterion skews the judgment from perfection to the number of cases covered
• If you run an algorithm that quits only when ultimate perfection is achieved, you’ll get there eventually by selecting rules based on either criterion
78
• There is no absolute best criterion for selecting rules
• The real problem is still trimming the rule set back until it is useful for prediction
79
Missing Values, Numeric Attributes
• Covering algorithms tend to handle missing values pretty well (assuming that the majority of the values aren’t missing…)
• Informally, you could say the algorithm builds rules for positive hits
• It effectively ignores missing values
• Separate and conquer means that you slowly narrow down to a remainder of instances
80
• Either instances with missing values are handled earlier in the process based on attributes with values
• Or at the end, the few remaining instances will be handled as special cases
• Conditions will be added to rules so that exceptional instances are classified—on attributes with values
• Handling these exceptional cases may actually constitute overfitting…
81
• Numeric valued attributes can be handled for rules just like for trees
• Instances can be ordered on the attribute value and all candidate rules based on a <, > comparison or split can be evaluated
82
Generating Good Rules
• This goes back to the idea that an imperfect rule is not overfitted and might make a good predictor
• The approach is similar to the approach with trees
• Divide the data 2/3, 1/3 into a growing set and a pruning set, for example
83
• The growing set is used to make rules by adding conditions
• The pruning set is used to simplify rules by removing conditions
• The criterion for removing rules is a reduction in the error rate on the pruning set
• This is called reduced-error pruning
84
• One algorithm for doing this is incremental reduced-error pruning
• The algorithm goes like this:
• For a given classification, grow a complete covering rule for it
• Now test the error rate of the rule on the pruning set and compare this with the error rate for all “sub-rules” with conditions removed
• For that class, hold onto whichever rule has the lowest error rate
• Do this for all classes
• Compare the error rates for the rules for each class
• Keep the one rule with the lowest error rate
• Remove the covered instances
• Repeat
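The pruning step inside this loop can be sketched in highly simplified form (toy rule representation, hypothetical attribute names): a grown rule is compared against every sub-rule obtained by dropping trailing conditions, and whichever does best on the pruning set is kept.

```python
def error_rate(rule, cls, instances):
    """A rule is a list of (attribute, value) tests, all of which must
    hold; its error is the fraction of covered instances whose class
    differs from cls (1.0 if it covers nothing)."""
    covered = [x for x in instances if all(x[a] == v for a, v in rule)]
    if not covered:
        return 1.0
    return sum(1 for x in covered if x["class"] != cls) / len(covered)

def prune_rule(rule, cls, pruning_set):
    """Compare the grown rule against every sub-rule formed by dropping
    trailing conditions; keep the one with the lowest error on the
    pruning set, breaking ties in favor of the shorter rule."""
    candidates = [rule[:i] for i in range(1, len(rule) + 1)]
    return min(candidates, key=lambda r: error_rate(r, cls, pruning_set))

pruning_set = [
    {"outlook": "sunny", "windy": "no", "class": "play"},
    {"outlook": "sunny", "windy": "yes", "class": "play"},
    {"outlook": "rainy", "windy": "no", "class": "stay"},
]
# The second condition adds nothing on the pruning set, so it is dropped:
pruned = prune_rule([("outlook", "sunny"), ("windy", "no")], "play",
                    pruning_set)
```

The full incremental algorithm wraps this in the outer loop above: pick the best pruned rule across classes, remove the instances it covers, and go again.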
86
• Why do you have to repeat?
• Because there is a difference between the training set and the test set and you are developing rules that may be imperfect for the training set
• You remove the instances covered by the accepted rule from the training set and then go again
• That’s what they mean by incremental
87
• A non-incremental version of the algorithm would build a complete rule set first and then prune it
• This is more time-consuming
88
Evaluating Error Rate to Select Rules
• The book suggests various alternatives to comparing on the basis of p/t or expressions including log p/t – log P/T
• They all have the same problem:
• We’ve decided to accept imperfect rules, not just perfect ones
• There is no perfect balancing point between percent covered correctly and total number covered correctly
89
Algorithm Performance/Refinements
• Incremental reduced-error pruning produces good rule sets quickly
• It can be sped up by simply picking a rule for each class in order of size, from smallest to largest
• It will also run more quickly with a suitable stopping condition
90
• For example, once a rule is accepted with a sufficiently low error rate, stop searching for more refinements
• Unfortunately, such a stopping condition may cause better solutions to be overlooked
• A better stopping condition may be based on the MDL principle
91
Using Global Optimization
• Everything we’ve talked about is heuristic
• We can’t claim we’re finding optimal trees or rule sets
• After finishing an algorithm, further heuristics may be applied, which may lead to better (but not actually optimal) solutions
• In this context the idea is to run the incremental algorithm, then try to improve, taking all of the derived rules into account
92
What is RIPPER?
• RIPPER stands for repeated incremental pruning to produce error reduction
• This is the name of a rule generation scheme
• In short, it has most of the bells and whistles noted above built in, in order to improve the rule sets generated
93
Obtaining Rules from Partial Decision Trees
• The book asserts that rule building schemes tend to prune too much
• Tree building schemes tend to err in the opposite direction, pruning too little
• A balancing approach is to use trees to develop rule sets
94
• In gross form, you could build a complete tree
• Then pick the best rule by tracing all the branches
• Then remove the covered instances and repeat
• However, building a complete tree each time is wasteful and unnecessary
95
• The alternative is to build a partial decision tree
• In brief, what you do is a form of pre-pruning during tree creation
• You use metrics that tell you there’s no need to explore certain branches further
• The decision about which branches merit further expansion is based on their entropy (information function) values
96
• Once the tree development is complete, you pick the best rule among those branches that can be traced to a leaf
• Then you throw out the partial tree and repeat the process with the instances that weren’t covered by the rule
97
• This technique is simpler than other schemes in this sense:
• It doesn’t have a global optimization stage at the end
• It can give rule sets that match the performance of schemes that do require global optimization
98
Rules with Exceptions
• Rules with exceptions get a more complete and friendlier presentation here than in chapter 4
• They are not a logical abomination
• They have a logic of their own
99
• Imagine starting with a default case—namely the majority classification
• All instances which don’t fall into this classification are exceptions
• Among the exceptions, let the majority classification be the new default
• Then those instances which don’t fall into this classification are exceptions
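The nesting of defaults and exceptions can be sketched on bare class labels (a toy illustration only; real ripple-down rules attach test conditions at each level):

```python
from collections import Counter

def defaults_with_exceptions(labels, depth=0):
    """The majority class becomes the default at this level; the
    instances it gets wrong are handled recursively as exceptions."""
    if not labels:
        return []
    default, _ = Counter(labels).most_common(1)[0]
    exceptions = [y for y in labels if y != default]
    line = "  " * depth + "default: " + default
    return [line] + defaults_with_exceptions(exceptions, depth + 1)

# Mostly "a", some "b", one "c": each level peels off a majority class,
# and what remains becomes the exceptions for the next level down.
structure = defaults_with_exceptions(["a"] * 5 + ["b"] * 3 + ["c"])
```

The indented output mirrors the "all else being equal" reading: the top level captures the broad default, and each deeper level covers a progressively narrower exception.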
100
• With other techniques it might be the nth iteration (node split, condition added) before you ultimately nail something
• Here, you’re always thinking of the majority case first
• Within one or two levels at the top you have a good picture of the situation overall
• You go down for finer levels of detail
101
• Effectively, this is an alternative way of building a representation of the data
• You could say that at every level, your thinking is, “All else being equal…”
• It is no accident that the organizing principle of going from broad default to fine exception mirrors human thinking in some problem domains
102
Discussion
• This is a repetition of the rule building algorithms and the implementations mentioned in section 6.11
• Simple rule building for relatively noise-free data, covering, separate and conquer: PRISM
103
• Incremental reduced-error pruning, RIPPER: JRip in Weka
• Rules from partial decision trees: Part in Weka
• Rules with exceptions, ripple down rules: Ridor in Weka
104
6.3 Association Rules
• The algorithm given in chapter 4 for finding association rules is known as the apriori algorithm
• It was essentially an exhaustive search
• It was made somewhat more tolerable by observing that if a weaker rule didn’t meet the threshold for acceptance, a stronger one wouldn’t either
105
• The bottom line turns out to be that there’s got to be a better way
• There is—it’s known as a frequent pattern or FP-tree implementation (FP not to be confused with false positive)
• Essentially, the FP-tree is based on a special kind of data structure, a prefix tree, a tree with additional information attached
106
• The details of the FP-tree implementation are of no interest
• However, there is a side comment of some interest
• The authors mention that it is desirable to use a data structure small enough to be memory resident
107
• Compare this idea with things that came up in db systems, B-trees and hash join
• Even a high complexity algorithm is relatively tolerable if it can be executed in memory
• Even a low complexity algorithm is relatively costly if it involves access to secondary storage
108
Association Rule Mining In Weka
• FPGrowth (frequent pattern trees)
• GeneralizedSequentialPatterns (GSP)
• GSP is an application of the idea of apriori rule generation to databases of event sequences
• If, by chance, you are considering such a database of sequential events, GSP might be just the piece of software to use
109
The End