1
Mining Decision Trees from Data Streams
Tong Suk Man Ivy
CSIS DB Seminar, February 12, 2003
2
Contents
- Introduction: problems in mining data streams
- Classification of stream data
- VFDT algorithm
- Window approach
- CVFDT algorithm
- Experimental results
- Conclusions
- Future work
3
Data Streams
Characteristics:
- Large volume of ordered data points, possibly infinite
- Arrive continuously
- Fast changing
An appropriate model for many applications:
- Phone call records
- Network and security monitoring
- Financial applications (stock exchange)
- Sensor networks
4
Problems in Mining Data Streams
Traditional data mining techniques usually require:
- The entire data set to be present
- Random access (or multiple passes) over the data
- Substantial time per data item
Challenges of stream mining:
- Impractical to store the whole data set
- Random access is expensive
- Only simple calculations per data item are feasible, due to time and space constraints
5
Classification of Stream Data
VFDT algorithm: “Mining High-Speed Data Streams”, KDD 2000. Pedro Domingos, Geoff Hulten.
CVFDT algorithm (window approach): “Mining Time-Changing Data Streams”, KDD 2001. Geoff Hulten, Laurie Spencer, Pedro Domingos.
6
Hoeffding Trees
7
Definitions
A classification problem is defined as follows:
- N is a set of training examples of the form (x, y)
- x is a vector of d attributes
- y is a discrete class label
Goal: produce from the examples a model y = f(x) that predicts the class y of future examples x with high accuracy.
8
Decision Tree Learning
One of the most effective and widely-used classification methods.
Induces models in the form of decision trees:
- Each internal node contains a test on an attribute
- Each branch from a node corresponds to a possible outcome of the test
- Each leaf contains a class prediction
A decision tree is learned by recursively replacing leaves with test nodes, starting at the root.
[Figure: an example decision tree with root test "Age < 30?", an inner test "Car Type = Sports Car?", and Yes/No class predictions at the leaves.]
9
Challenges
- Classic decision tree learners assume all training data can be stored simultaneously in main memory.
- Disk-based decision tree learners repeatedly read training data from disk sequentially; this is prohibitively expensive when learning complex trees.
- Goal: design decision tree learners that read each example at most once, and use a small constant time to process it.
10
Key Observation
To find the best attribute at a node, it may be sufficient to consider only a small subset of the training examples that pass through that node.
- Given a stream of examples, use the first ones to choose the root attribute.
- Once the root attribute is chosen, successive examples are passed down to the corresponding leaves and used to choose the attributes there, and so on recursively.
- Use the Hoeffding bound to decide how many examples are enough at each node.
11
Hoeffding Bound
Consider a real-valued random variable $a$ whose range is $R$. Suppose we have $n$ independent observations of $a$, with sample mean $\bar{a}$. The Hoeffding bound states that, with probability $1 - \delta$, the true mean of $a$ is at least $\bar{a} - \epsilon$, where

$\epsilon = \sqrt{\dfrac{R^2 \ln(1/\delta)}{2n}}$
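As a minimal sketch (Python; the function name and the example values are illustrative, not from the paper), the bound can be computed directly from $R$, $\delta$, and $n$:

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the true mean of a
    random variable with the given range lies within epsilon of the
    sample mean of n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# Information gain with 2 classes has range R = log2(2) = 1.
print(hoeffding_bound(1.0, 1e-7, 200))    # ~0.20
print(hoeffding_bound(1.0, 1e-7, 20000))  # ~0.02 -- shrinks as n grows
```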
12
How many examples are enough?
- Let $G(X_i)$ be the heuristic measure used to choose test attributes (e.g., information gain, Gini index).
- $X_a$: the attribute with the highest evaluation value after seeing $n$ examples.
- $X_b$: the attribute with the second-highest evaluation value after seeing $n$ examples.
- Given a desired $\delta$, if after seeing $n$ examples at a node $\Delta\bar{G} = \bar{G}(X_a) - \bar{G}(X_b) > \epsilon$, with $\epsilon = \sqrt{R^2 \ln(1/\delta) / (2n)}$, then the Hoeffding bound guarantees that the true $\Delta G > 0$ with probability $1 - \delta$.
- The node can then be split using $X_a$, and succeeding examples are passed to the new leaves.
13
Algorithm
- Calculate the information gain for the attributes and determine the best two attributes.
- Pre-pruning: also consider a "null" attribute that corresponds to not splitting the node.
- At each node, check the condition $\Delta\bar{G} = \bar{G}(X_a) - \bar{G}(X_b) > \epsilon$.
- If the condition is satisfied, create child nodes based on the test at the node.
- If not, stream in more examples and repeat the calculation until the condition is satisfied.
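A minimal sketch of this per-leaf loop (Python; `info_gain` and the statistics layout are illustrative assumptions, not the authors' implementation):

```python
import math
from collections import defaultdict

class Leaf:
    def __init__(self):
        # Sufficient statistics: counts[attribute][value][class]
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        self.n = 0

def update_leaf(leaf, x, y, delta=1e-7, value_range=1.0, n_min=200):
    """Fold one example (x, y) into a leaf; return the attribute to split
    on once the Hoeffding test passes, else None."""
    leaf.n += 1
    for attr, val in enumerate(x):
        leaf.counts[attr][val][y] += 1
    if leaf.n % n_min != 0:        # check for a split only periodically
        return None
    # info_gain(leaf, attr) is an assumed helper computing G for one attribute.
    scored = sorted(((info_gain(leaf, a), a) for a in leaf.counts), reverse=True)
    if len(scored) < 2:
        return None
    (g_a, x_a), (g_b, _) = scored[0], scored[1]
    epsilon = math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * leaf.n))
    if g_a - g_b > epsilon:        # Hoeffding test: splitting on x_a is safe
        return x_a                 # caller replaces the leaf with a test node
    return None
```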
14
[Figure: the Hoeffding tree algorithm on a data stream. Statistics accumulate at a node until $\Delta\bar{G} = \bar{G}(\text{Car\_Type}) - \bar{G}(\text{Gender})$ clears the bound; the tree grows from the single test "Age < 30?" to also include "Car Type = Sports Car?".]
15
Performance Analysis
- p: the probability that an example passing through the Hoeffding tree to level i falls into a leaf at that point.
- The expected disagreement between the tree produced by the Hoeffding tree algorithm and the one produced using infinite examples at each node is no greater than δ/p.
- Required memory: O(leaves * attributes * values * classes)
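A rough worked example of this memory bound (a sketch; the leaf count is an assumption, while the 100 binary attributes and 2 classes match the experimental setup later in the deck):

```python
# One counter per (leaf, attribute, value, class) combination.
leaves, attributes, values, classes = 1_000, 100, 2, 2
print(leaves * attributes * values * classes)  # 400,000 counters,
                                               # independent of stream length
```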
16
VFDT
17
VFDT (Very Fast Decision Tree)
A decision-tree learning system based on the Hoeffding tree algorithm, with several refinements:
- Tie breaking: split on the current best attribute if the difference falls below a user-specified threshold τ, since it is wasteful to keep deciding between nearly identical attributes.
- Compute G and check for a split only periodically.
- Memory management: memory is dominated by the sufficient statistics; deactivate or drop less promising leaves when needed.
- Bootstrap with a traditional learner.
- Rescan old data when time is available.
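A sketch of the split rule with tie breaking (Python; the default threshold follows the τ = 5% setting used in the experiments, and the function itself is illustrative):

```python
def vfdt_should_split(g_best: float, g_second: float,
                      epsilon: float, tau: float = 0.05) -> bool:
    """Split when the observed gap beats the Hoeffding bound, or when
    epsilon has shrunk below tau: the top attributes are then so close
    that waiting longer to separate them is wasteful."""
    return (g_best - g_second) > epsilon or epsilon < tau
```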
18
VFDT (2)
- Scales better than pure memory-based or pure disk-based learners: accesses data sequentially, and uses subsampling to potentially require much less than one scan.
- VFDT is incremental and anytime: new examples can be quickly incorporated as they arrive, and a usable model is available after the first few examples and is then progressively refined.
19
Experiment Results (VFDT vs. C4.5)
Compared VFDT and C4.5 (Quinlan, 1993):
- Same memory limit for both (40 MB); 100k examples for C4.5
- VFDT settings: δ = 10⁻⁷, τ = 5%, n_min = 200
- Domains: 2 classes, 100 binary attributes
- Fifteen synthetic trees with 2.2k – 500k leaves
- Noise from 0% to 30%
20
Experiment Results
Accuracy as a function of the number of training examples
21
Experiment Results
Tree size as a function of the number of training examples
22
Mining Time-Changing Data Stream
- Most KDD systems, including VFDT, assume that training data is a sample drawn from a stationary distribution.
- Most large databases and data streams violate this assumption. Concept drift: data is generated by a time-changing concept function, e.g., seasonal effects, economic cycles.
- Goal: mine continuously changing data streams, and scale well.
23
Window Approach
Common approach: when a new example arrives, reapply a traditional learner to a sliding window of the w most recent examples.
- Sensitive to window size: if w is small relative to the concept shift rate, the model reflects the current concept, but too small a w may leave insufficient examples to learn it.
- If examples arrive at a rapid rate or the concept changes quickly, the computational cost of reapplying the learner may be prohibitively high.
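For contrast, a minimal sketch of this window baseline (Python; `retrain` is a stand-in for any traditional batch learner):

```python
from collections import deque

def windowed_learner(stream, w, retrain):
    """Naive sliding-window baseline: keep the w most recent examples and
    rebuild the model from scratch on every arrival. The O(w) rebuild per
    example is exactly the cost CVFDT is designed to avoid."""
    window = deque(maxlen=w)         # oldest examples fall off automatically
    for example in stream:
        window.append(example)
        yield retrain(list(window))  # full relearn on the current window
```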
24
CVFDT
25
CVFDT
CVFDT (Concept-adapting Very Fast Decision Tree learner):
- Extends VFDT.
- Maintains VFDT's speed and accuracy.
- Detects and responds to changes in the example-generating process.
26
Observations
- With a time-changing concept, the current splitting attribute of some nodes may no longer be the best.
- An outdated subtree may still be better than the best single leaf, particularly if it is near the root.
- Grow an alternate subtree with the new best attribute at its root when the old attribute seems out-of-date.
- Periodically use a batch of examples to evaluate the quality of the trees.
- Replace the old subtree when the alternate one becomes more accurate.
27
CVFDT algorithm
- Alternate trees for each node in HT start empty.
- Process examples from the stream indefinitely. For each example (x, y):
  - Pass (x, y) down to a set of leaves using HT and all alternate trees of the nodes (x, y) passes through.
  - Add (x, y) to the sliding window of examples.
  - Remove and forget the effect of the oldest example if the sliding window overflows.
  - CVFDTGrow.
  - CheckSplitValidity, if f examples have been seen since the last check of alternate trees.
- Return HT.
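A sketch of this outer loop (Python; `cvfdt_grow`, `forget_example` and `check_split_validity` are sketched under the slides that follow, and `ht.max_node_id` is an assumed counter of node IDs issued so far):

```python
from collections import deque

def cvfdt(stream, ht, window_size, f):
    """Outer loop of CVFDT: route each example through HT and all
    alternate trees, maintain the sliding window, revisit splits."""
    window = deque()
    seen = 0
    for example in stream:
        window.append((example, ht.max_node_id))  # IDs existing at insertion
        if len(window) > window_size:
            old, old_max_id = window.popleft()
            forget_example(ht, old, old_max_id)   # undo the expired example
        cvfdt_grow(ht, example)   # pass down HT + alternates, update stats
        seen += 1
        if seen % f == 0:
            check_split_validity(ht)              # rescan internal nodes
    return ht
```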
28
CVFDT algorithm: process each example
[Flowchart: read new example; pass example down to leaves; add example to sliding window; if the window overflows, forget the oldest example; CVFDTGrow; if f examples seen since the last check, CheckSplitValidity; then read the next example.]
30
CVFDTGrow
For each node reached by the example in HT, Increment the corresponding statistics at the node. For each alternate tree Talt of the node,
CVFDTGrow
If enough examples seen at the leaf in HT which the example reaches, Choose the attribute that has the highest average
value of the attribute evaluation measure (information gain or gini index).
If the best attribute is not the “null” attribute, create a node for each possible value of this attribute
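A recursive sketch of CVFDTGrow (Python; the node methods are assumed names, and `maybe_split` stands for the Hoeffding test from the VFDT section):

```python
def cvfdt_grow(node, example, delta=1e-7):
    """Update statistics along the path the example takes through the tree.
    update_stats, is_leaf, child_for and maybe_split are assumed helpers."""
    node.update_stats(example)            # internal nodes keep stats too
    for alt in node.alternate_trees:
        cvfdt_grow(alt, example, delta)   # alternates are grown the same way
    if node.is_leaf():
        maybe_split(node, delta)          # Hoeffding test at the leaf
    else:
        cvfdt_grow(node.child_for(example), example, delta)
```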
32
Forget old example
- Maintain sufficient statistics at every node in HT to monitor the validity of its previous decisions (VFDT only maintains such statistics at leaves).
- HT might have grown or changed since the example was initially incorporated, so each node is assigned a unique, monotonically increasing ID as it is created.
- forgetExample(HT, example, maxID): for each node reached by the old example whose node ID is no larger than the ID of the leaf the example originally reached, decrement the corresponding statistics at the node; for each alternate tree T_alt of the node, call forgetExample(T_alt, example, maxID).
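A sketch of forgetExample (Python; `id`, `decrement_stats`, `child_for` and `alternate_trees` are illustrative attribute names — the key point is the node-ID guard):

```python
def forget_example(node, example, max_id):
    """Subtract an expired example's contribution, but only at nodes that
    already existed when it was added (node.id <= max_id); newer nodes
    never counted it."""
    if node is None or node.id > max_id:
        return
    node.decrement_stats(example)
    for alt in node.alternate_trees:
        forget_example(alt, example, max_id)
    if not node.is_leaf():
        forget_example(node.child_for(example), example, max_id)
```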
34
CheckSplitValidity
- Periodically scans the internal nodes of HT.
- Starts a new alternate tree when a new winning attribute is found, using tighter criteria than the original split test to avoid excessive alternate-tree creation.
- Limits the total number of alternate trees.
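A sketch of the periodic scan (Python; `internal_nodes`, `best_attribute`, `gain`, `new_leaf` and the cap value are assumptions, and `hoeffding_bound` is the function sketched earlier):

```python
MAX_ALTERNATES = 5  # assumed cap on alternates per node

def check_split_validity(ht, delta=1e-7):
    """Rescan internal nodes; when a new attribute beats the installed
    split by more than epsilon, start an alternate tree as a fresh leaf."""
    for node in internal_nodes(ht):
        g_best, x_best = best_attribute(node)   # by the node's current stats
        g_current = gain(node, node.split_attr)
        epsilon = hoeffding_bound(1.0, delta, node.n)
        if (x_best != node.split_attr
                and g_best - g_current > epsilon          # tighter criterion
                and len(node.alternate_trees) < MAX_ALTERNATES):
            node.alternate_trees.append(new_leaf())       # grow from scratch
```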
35
Smoothly adjust to concept drift
- Alternate trees are grown the same way HT is.
- Periodically, each node with non-empty alternate trees enters a testing mode: the next M training examples are used to compare accuracies.
- Alternate trees whose accuracy does not improve over time are pruned.
- If an alternate tree is more accurate, it replaces the old subtree.
[Figure: subtree replacement. Alongside the original tests "Age < 30?" and "Car Type = Sports Car?", alternate subtrees rooted at "Married?" and "Experience < 1 year?" are grown and swapped in once they prove more accurate.]
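A sketch of this test-and-replace step (Python; `error_rate`, `replace_subtree` and the `best_err` bookkeeping are illustrative assumptions):

```python
def test_alternates(node, examples):
    """Testing mode over the next M examples: swap in an alternate that
    beats the current subtree, prune alternates whose accuracy stalls."""
    current_err = error_rate(node, examples)
    for alt in list(node.alternate_trees):
        err = error_rate(alt, examples)
        if err < current_err:
            replace_subtree(node, alt)            # alternate wins: swap it in
        elif err >= getattr(alt, "best_err", float("inf")):
            node.alternate_trees.remove(alt)      # no improvement: prune
        else:
            alt.best_err = err                    # best accuracy seen so far
```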
36
Adjust to concept drift (2)
Dynamically change the window size:
- Shrink the window when many nodes become questionable or the data rate changes rapidly.
- Grow the window when few nodes are questionable.
37
Performance
- Required memory: O(nodes * attributes * attribute values * classes), independent of the total number of examples.
- Running time: O(L_c * attributes * attribute values * classes), where L_c is the longest path an example takes through the tree multiplied by the number of alternate trees.
- The model learned by CVFDT is similar in accuracy to the one learned by VFDT-Window, at O(1) rather than O(window size) cost per new example.
38
Experiment Results
- Compared CVFDT, VFDT, and VFDT-Window.
- 5 million training examples; the concept changed every 50k examples.
- Drift level: the average percentage of test points that change label at each concept change; about 8% of test points change label at each drift.
- 100,000 examples in the window; 5% noise.
- Tested the model every 10k examples throughout the run and averaged the results.
39
Experiment Results (CVFDT vs. VFDT)
Error rate as a function of the number of attributes, shown for several drift levels
40
Experiment Results (CVFDT vs. VFDT)
Tree size as a function of the number of attributes
41
Experiment Results (CVFDT vs. VFDT)
Error rates of learners as a function of the number of examples seen; a second curve shows the portion of the data set that is labelled negative
42
Experiment Results (CVFDT vs. VFDT)
Error rates as a function of the amount of concept drift
43
Experiment Results
CVFDT’s drift characteristics
44
Experiment Results (CVFDT vs. VFDT vs. VFDT-window)
Error rates over time of CVFDT, VFDT, and VFDT-window
VFDT-Window simulated by running VFDT on the window W every 100k examples instead of for every example
Error rate: VFDT 19.4%, CVFDT 16.3%, VFDT-Window 15.3%
Running time: VFDT 10 minutes, CVFDT 46 minutes, VFDT-Window (projected) 548 days
45
Experiment Results
- CVFDT does not use too much RAM: with D = 50, it never uses more than 70 MB, and it can use as little as half the RAM of VFDT.
- VFDT often had twice as many leaves as the number of nodes in CVFDT's HT and alternate subtrees combined.
- Reason: VFDT considers many more outdated examples and is forced to grow larger trees to make up for earlier wrong decisions caused by concept drift.
46
Conclusions
- CVFDT: a decision-tree induction system capable of learning accurate models from high-speed, concept-drifting data streams.
- Grows an alternate subtree whenever an old one becomes questionable, and replaces the old subtree when the new one is more accurate.
- Similar in accuracy to applying VFDT to a moving window of examples, at far lower cost per example.
47
Future Work
- Concepts may change periodically, so removed subtrees could become useful again.
- Comparisons with related systems.
- Continuous attributes.
- Weighting examples.
48
Reference List
P. Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2000.
G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring evolving data. In Proceedings of the Sixteenth International Conference on Data Engineering, 2000.
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT: Optimistic decision tree construction. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, 1999.
49
The end
Q & A
50
Thank You!