Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
Talk about tree mining on evolving data streams.
Albert Bifet and Ricard Gavaldà
Universitat Politècnica de Catalunya
14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'08)
2008 Las Vegas, USA
Data Streams
  Sequence is potentially infinite
  High amount of data: sublinear space
  High speed of arrival: sublinear time per example
Tree Mining
  Mining frequent trees is becoming an important task
  Applications: chemical informatics, computer vision, text retrieval, bioinformatics, Web analysis.
  Many link-based structures may be studied formally by means of unordered trees
Introduction: Data Streams
Data Streams
  Sequence is potentially infinite
  High amount of data: sublinear space
  High speed of arrival: sublinear time per example
  Once an element from a data stream has been processed it is discarded or archived
Example
Puzzle: Finding Missing Numbers
  Let π be a permutation of {1, ..., n}.
  Let π_{-1} be π with one element missing.
  π_{-1}[i] arrives in increasing order
Task: Determine the missing number
Use an n-bit vector to memorize all the numbers (O(n) space)
Data Streams: O(log(n)) space.
Store $\frac{n(n+1)}{2} - \sum_{j \le i} \pi_{-1}[j]$.
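The running-difference idea can be sketched in a few lines of Python. This is only an illustration: the function name `missing_number` and the use of a Python iterable as the stream are assumptions made here, not part of the talk.

```python
def missing_number(stream, n):
    """Return the element of {1, ..., n} that never appears in `stream`.

    Only one counter is kept, so the state needs O(log n) bits,
    and each element is discarded as soon as it has been subtracted.
    """
    remaining = n * (n + 1) // 2  # sum of 1..n
    for x in stream:
        remaining -= x            # running value n(n+1)/2 - sum of items seen so far
    return remaining


# The permutation of {1, ..., 5} with the element 3 removed.
print(missing_number(iter([1, 2, 4, 5]), 5))
```

The order of arrival does not matter for this sketch, since only the sum of the elements seen so far is stored.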
Introduction: Trees
Our trees are:
  Unlabeled
  Ordered and Unordered
Our subtrees are:
  Induced
[Figure: two different ordered trees but the same unordered tree]
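One common way to decide whether two ordered trees are the same unordered tree is to compare canonical forms. The sketch below is a generic illustration and not necessarily the representation used in the talk: it assumes an unlabeled tree is encoded as a nested tuple of its children (a leaf is the empty tuple), and the name `canonical` is hypothetical.

```python
def canonical(tree):
    """Canonical form of an unlabeled rooted tree given as nested tuples.

    Children are canonicalized recursively and then sorted, so any two
    orderings of the same unordered tree map to the same value.
    """
    return tuple(sorted((canonical(child) for child in tree), reverse=True))


# Root with two children: a leaf, and a node that has one leaf child.
a = (((),), ())   # the deeper child comes first
b = ((), ((),))   # the deeper child comes second

print(a == b)                        # compared as ordered trees
print(canonical(a) == canonical(b))  # compared as unordered trees
```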
Introduction
Induced subtrees: obtained by repeatedly removing leaf nodes
Embedded subtrees: obtained by contracting some of the edges
Introduction
What Is Tree Pattern Mining?
Given a dataset of trees, find the complete set of frequent subtrees
Frequent Tree Pattern (FS):
  Include all the trees whose support is no less than min_sup
Closed Frequent Tree Pattern (CS):
  Include no tree which has a super-tree with the same support
CS ⊆ FS
Closed Frequent Tree Mining provides a compact representation of frequent trees without loss of information
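The two definitions can be illustrated with a brute-force sketch. Several simplifications are assumed here and are not taken from the talk: trees are ordered and encoded as nested tuples of children, inclusion is the root-preserving (top-down) induced variant, and candidates are enumerated exhaustively, which is only feasible for tiny trees. The names `subtrees`, `included` and `mine` are hypothetical.

```python
from itertools import product


def subtrees(tree):
    """All root-preserving induced subtrees of an ordered tree (nested tuples)."""
    # For every child: either drop it (None) or keep one of its own subtrees.
    options = [[None] + list(subtrees(child)) for child in tree]
    return {tuple(c for c in choice if c is not None)
            for choice in product(*options)}


def included(pattern, tree):
    """True if `pattern` is a root-preserving induced subtree of `tree`."""
    i = 0
    for child in pattern:
        # Greedily match each pattern child to the leftmost usable tree child.
        while i < len(tree) and not included(child, tree[i]):
            i += 1
        if i == len(tree):
            return False
        i += 1
    return True


def mine(dataset, min_sup):
    """Return (FS, CS): the frequent patterns and the closed frequent patterns."""
    candidates = set().union(*(subtrees(t) for t in dataset))
    support = {p: sum(included(p, t) for t in dataset) for p in candidates}
    frequent = {p for p, s in support.items() if s >= min_sup}
    # A frequent pattern is closed if no proper supertree has the same support.
    closed = {p for p in frequent
              if not any(q != p and support[q] == support[p] and included(p, q)
                         for q in frequent)}
    return frequent, closed


# Two transactions: a root with two leaves, and a path of three nodes.
fs, cs = mine([((), ()), (((),),)], min_sup=2)
print(len(fs), len(cs))
```

In this toy dataset the single-node tree is frequent but not closed, because the two-node path has the same support; CS is therefore a strict subset of FS.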
Introduction
Unordered Subtree Mining
[Figure: trees A and B, and subtrees X and Y]
D = {A, B}, min_sup = 2
# Closed Subtrees: 2
# Frequent Subtrees: 9
Closed Subtrees: X, Y
Frequent Subtrees: [figure]
Introduction
Problem
Given a data stream D of rooted, unlabelled and unordered trees, find frequent closed trees.
We provide three algorithms, of increasing power:
  Incremental
  Sliding Window
  Adaptive
Outline
1 Introduction
2 Data Streams
3 ADWIN : Concept Drift Mining
4 Adaptive Closed Frequent Tree Mining
5 Summary
Data Streams
Data Streams
At any time t in the data stream, we would like the per-item processing time and storage to be simultaneously $O(\log^k(N, t))$.
Approximation algorithms
  Small error rate with high probability
  An algorithm $(\varepsilon, \delta)$-approximates $F$ if it outputs $\tilde{F}$ for which $\Pr[|\tilde{F} - F| > \varepsilon F] < \delta$.
Data Streams Approximation Algorithms
1011000111 1010101
Sliding Window
We can maintain simple statistics over sliding windows, using $O(\frac{1}{\varepsilon} \log^2 N)$ space, where
  N is the length of the sliding window
  ε is the accuracy parameter
M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002
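A simplified sketch in the spirit of the exponential histograms of Datar et al. shows how the number of 1's in the last N bits can be approximated with few buckets. The class name `ExpHistogram`, the parameter `max_per_size` and the fixed limit of two buckets per size are assumptions chosen for brevity; the cited paper ties the bucket limit to ε.

```python
class ExpHistogram:
    """Approximate count of 1's among the last `window` bits of a stream."""

    def __init__(self, window, max_per_size=2):
        self.window = window
        self.max_per_size = max_per_size
        self.buckets = []  # (timestamp of most recent 1, size), newest first
        self.time = 0

    def add(self, bit):
        self.time += 1
        # Expire the oldest bucket once its most recent 1 leaves the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if not bit:
            return
        self.buckets.insert(0, (self.time, 1))
        i = 0
        while i < len(self.buckets):
            size = self.buckets[i][1]
            j = i
            while j < len(self.buckets) and self.buckets[j][1] == size:
                j += 1
            if j - i <= self.max_per_size:
                break
            # Too many buckets of this size: merge the two oldest ones.
            newer_timestamp = self.buckets[j - 2][0]
            del self.buckets[j - 1]
            self.buckets[j - 2] = (newer_timestamp, 2 * size)
            i = j - 2  # the merge may overflow the next size, so keep cascading

    def estimate(self):
        if not self.buckets:
            return 0
        total = sum(size for _, size in self.buckets)
        # Only part of the oldest bucket may still be inside the window.
        return total - self.buckets[-1][1] // 2


eh = ExpHistogram(window=50)
for i in range(2000):
    eh.add((i * 7 % 10) < 3)  # deterministic stream with 30% ones
print(eh.estimate(), len(eh.buckets))
```

Bucket sizes are powers of two, so the number of buckets grows only logarithmically with the window length; with two buckets per size the estimate stays within 50% of the true count, and allowing more buckets per size tightens this.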
ADWIN: Adaptive sliding window
ADWIN
An adaptive sliding window whose size is recomputed online according to the rate of change observed.
ADWIN has rigorous guarantees (theorems)
  On ratio of false positives and negatives
  On the relation of the size of the current window and change rates
ADWIN, using a Data Stream Sliding Window Model,
  can provide the exact counts of 1's in O(1) time per point.
  tries O(log W) cutpoints
  uses $O(\frac{1}{\varepsilon} \log W)$ memory words
  the processing time per example is O(log W) (amortized and worst-case).
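The adaptive window can be illustrated with a deliberately naive sketch that stores the whole window and tests every cutpoint, so it does not have the O(log W) costs listed above. The threshold used is the Hoeffding-style bound from the ADWIN paper, which these slides do not state; the class name `NaiveAdwin`, the default `delta` and the assumption that inputs lie in [0, 1] are choices made for this example.

```python
import math


class NaiveAdwin:
    """Naive adaptive window: drops old items while two subwindows disagree."""

    def __init__(self, delta=0.01):
        self.delta = delta   # confidence parameter (assumed default)
        self.window = []     # explicit window, oldest element first

    def update(self, x):
        """Add x (assumed in [0, 1]); return True if the window was shrunk."""
        self.window.append(x)
        changed = False
        shrunk = True
        while shrunk and len(self.window) > 1:
            shrunk = False
            n = len(self.window)
            total = sum(self.window)
            delta_prime = self.delta / n
            head = 0.0
            for split in range(1, n):
                head += self.window[split - 1]
                n0, n1 = split, n - split
                mean0 = head / n0
                mean1 = (total - head) / n1
                m = 1.0 / (1.0 / n0 + 1.0 / n1)  # harmonic-style mean of sizes
                eps_cut = math.sqrt(math.log(4.0 / delta_prime) / (2.0 * m))
                if abs(mean0 - mean1) > eps_cut:
                    self.window.pop(0)  # drop the oldest element and re-check
                    shrunk = changed = True
                    break
        return changed


adwin = NaiveAdwin()
flags = [adwin.update(x) for x in [0.0] * 200 + [1.0] * 200]
print(any(flags), len(adwin.window))
```

On this stream the window keeps growing while the data is stable and shrinks shortly after the mean jumps from 0 to 1, so the surviving window consists almost entirely of recent values.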
Time Change Detectors and Predictors: A General Framework
[Diagram: input x_t, Estimator (output: Estimation), Change Detect. (output: Alarm), Memory]
Window Management Models
W = 101010110111111
Equal & fixed size subwindows: 1010 1011011 1111 [Kifer+ 04]
Equal size adjacent subwindows: 1010101 1011 1111 [Dasu+ 06]
Total window against subwindow: 10101011011 1111 [Gama+ 04]
ADWIN: All adjacent subwindows: 1 01010110111111 (every split point of W is tried in turn, up to 10101011011111 1)
Pattern Relaxed Support
Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng, Yunfeng Liu and Kunqing Xie.
CLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets Mining over Stream Data
Linear Relaxed Interval: The support space of all subpatterns can be divided into $n = \lceil 1/\varepsilon_r \rceil$ intervals, where $\varepsilon_r$ is a user-specified relaxed factor, and each interval can be denoted by $I_i = [l_i, u_i)$, where $l_i = (n - i)\,\varepsilon_r \ge 0$, $u_i = (n - i + 1)\,\varepsilon_r \le 1$ and $i \le n$.
Linear Relaxed closed subpattern t: if and only if there exists no proper superpattern t′ of t such that their supports belong to the same interval $I_i$.
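As a small worked example of the linear intervals, the index i of the interval containing a given relative support can be computed directly from the definition. The function name `linear_interval_index` is hypothetical, and exact fractions are used only to avoid floating-point surprises at interval boundaries.

```python
import math
from fractions import Fraction


def linear_interval_index(support, eps_r):
    """Index i such that (n - i) * eps_r <= support < (n - i + 1) * eps_r.

    `support` is a relative support in [0, 1) and n = ceil(1 / eps_r).
    A support of exactly 1 would give i = 0, outside the range u_i <= 1.
    """
    support, eps_r = Fraction(support), Fraction(eps_r)
    n = math.ceil(1 / eps_r)
    return n - math.floor(support / eps_r)


# With eps_r = 0.25 there are n = 4 intervals; I_3 = [0.25, 0.5).
print(linear_interval_index("0.30", "0.25"))
print(linear_interval_index("0.45", "0.25"))
print(linear_interval_index("0.55", "0.25"))
```

Under this definition, a pattern with support 0.45 that has a proper superpattern with support 0.30 is not linear relaxed closed, since both supports fall in the same interval.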
Pattern Relaxed Support
As the number of closed frequent patterns is not linear with respect to support, we introduce a new relaxed support:
Logarithmic Relaxed Interval: The support space of all subpatterns can be divided into $n = \lceil 1/\varepsilon_r \rceil$ intervals, where $\varepsilon_r$ is a user-specified relaxed factor, and each interval can be denoted by $I_i = [l_i, u_i)$, where $l_i = \lceil c^i \rceil$, $u_i = \lceil c^{i+1} - 1 \rceil$ and $i \le n$.
Logarithmic Relaxed closed subpattern t: if and only if there exists no proper superpattern t′ of t such that their supports belong to the same interval $I_i$.
Galois Lattice of closed set of trees
We need
  a Galois connection pair
  a closure operator
[Figure: lattice over the dataset D with nodes 1, 2, 3; 12, 13, 23; 123]
Incremental mining on closed frequent trees
1 Adding a tree transaction does not decrease the number of closed trees for D.
2 Adding a transaction with a closed tree does not modify the number of closed trees for D.
[Figure: lattice with nodes 1, 2, 3; 12, 13, 23; 123]
Sliding Window mining on closed frequent trees
1 Deleting a tree transaction does not increase the number of closed trees for D.
2 Deleting a tree transaction that is repeated does not modify the number of closed trees for D.
[Figure: lattice with nodes 1, 2, 3; 12, 13, 23; 123]
Algorithms
  Incremental: INCTREENAT
  Sliding Window: WINTREENAT
  Adaptive: ADATREENAT, uses ADWIN to monitor change
Experimental Validation: TN1
[Figure: Time on experiments on ordered trees on TN1 dataset. Time (sec.) against Size (Millions) for INCTREENAT and CMTreeMiner]
Experimental Validation
[Figure: Number of closed trees maintaining the same number of closed datasets on input data. Number of Closed Trees against Number of Samples for AdaTreeInc 1 and AdaTreeInc 2]
Summary
Conclusions
  New logarithmic relaxed closed support
  Using Galois Lattice Theory, we present methods for mining closed trees
    Incremental: INCTREENAT
    Sliding Window: WINTREENAT
    Adaptive: ADATREENAT using ADWIN to monitor change
Future Work
  Labeled Trees and XML data.