Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams Albert Bifet and Ricard Gavaldà Universitat Politècnica de Catalunya 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08) 2008 Las Vegas, USA

Upload: albert-bifet

Post on 08-May-2015


DESCRIPTION

Talk about tree mining on evolving data streams.

TRANSCRIPT

Page 1: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Albert Bifet and Ricard Gavaldà

Universitat Politècnica de Catalunya

14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’08)

2008 Las Vegas, USA

Page 2: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example

Tree Mining
Mining frequent trees is becoming an important task
Applications: chemical informatics, computer vision, text retrieval, bioinformatics, Web analysis.
Many link-based structures may be studied formally by means of unordered trees

Page 3: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Introduction: Data Streams

Data Streams
Sequence is potentially infinite
High amount of data: sublinear space
High speed of arrival: sublinear time per example
Once an element from a data stream has been processed, it is discarded or archived

Example
Puzzle: Finding Missing Numbers

Let $\pi$ be a permutation of $\{1, \ldots, n\}$.
Let $\pi_{-1}$ be $\pi$ with one element missing.
$\pi_{-1}[i]$ arrives in increasing order

Task: Determine the missing number

Use an n-bit vector to memorize all the numbers (O(n) space)

Data Streams: O(log(n)) space. Store
$\frac{n(n+1)}{2} - \sum_{j \le i} \pi_{-1}[j]$.
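
The puzzle can be sketched in a few lines of Python. This is only an illustration of the running-sum idea; the function name `find_missing` and the sample stream are made up for the example. The only state kept is one integer of O(log n) bits.

```python
def find_missing(stream, n):
    """Return the element of {1, ..., n} that never appears in `stream`.

    Keeps a single running value: n(n+1)/2 minus the sum of the items seen so far.
    """
    remaining = n * (n + 1) // 2  # sum of 1..n
    for x in stream:              # each item is read once and then discarded
        remaining -= x
    return remaining


# Hypothetical stream: 1..6 with the number 4 left out.
print(find_missing(iter([1, 2, 3, 5, 6]), 6))  # prints 4
```

After i items the stored value equals the expression above; once the stream ends, what is left is the missing number.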

Page 7: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Introduction: Trees

Our trees are:
Unlabeled
Ordered and Unordered

Our subtrees are:
Induced

[Figure: two different ordered trees but the same unordered tree]
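
One common way to decide whether two ordered trees are the same unordered tree is to compare a canonical form. The sketch below is an assumption for illustration, not the representation used in the talk: a tree is a tuple of its child subtrees, and `canonical` is a hypothetical helper that sorts children recursively.

```python
def canonical(tree):
    """Canonical form of an unlabeled rooted tree given as a tuple of child subtrees.

    Children are canonicalized first and then sorted, so sibling order no longer matters.
    """
    return tuple(sorted(canonical(child) for child in tree))


leaf = ()
t1 = (leaf, (leaf, leaf))  # root whose children are: a leaf, then a node with two leaves
t2 = ((leaf, leaf), leaf)  # same children in the opposite order

print(t1 == t2)                        # False: different as ordered trees
print(canonical(t1) == canonical(t2))  # True: equal as unordered trees
```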

Page 8: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Introduction

Induced subtrees: obtained by repeatedly removing leaf nodes

Embedded subtrees: obtained by contracting some of the edges

Page 9: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Introduction

What Is Tree Pattern Mining?

Given a dataset of trees, find the complete set of frequent subtrees

Frequent Tree Pattern (FS):

Include all the trees whose support is no less than min_sup

Closed Frequent Tree Pattern (CS):

Include no tree which has a super-tree with the same support

CS ⊆ FS
Closed Frequent Tree Mining provides a compact representation of frequent trees without loss of information
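
The two definitions only need a support count and a containment relation, so they can be tried out on a simpler pattern type. The sketch below uses itemsets as a stand-in for trees (an assumption made to keep the example short); `support`, `frequent_patterns` and `closed_patterns` are hypothetical helper names, and the brute-force enumeration is only suitable for toy data.

```python
from itertools import combinations


def support(pattern, dataset):
    """Number of transactions that contain `pattern`."""
    return sum(1 for t in dataset if pattern <= t)


def frequent_patterns(dataset, min_sup):
    """FS: every non-empty pattern whose support is at least min_sup (brute force)."""
    items = sorted(set().union(*dataset))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            p = frozenset(combo)
            s = support(p, dataset)
            if s >= min_sup:
                result[p] = s
    return result


def closed_patterns(frequent):
    """CS: frequent patterns with no proper super-pattern of the same support."""
    return {p: s for p, s in frequent.items()
            if not any(p < q and s == frequent[q] for q in frequent)}


D = [frozenset("abc"), frozenset("abc"), frozenset("ab")]
fs = frequent_patterns(D, min_sup=2)
cs = closed_patterns(fs)
print(len(fs), len(cs))  # 7 frequent patterns, 2 closed ones
```

Here the closed patterns are {a, b} with support 3 and {a, b, c} with support 2; the support of every other frequent pattern can be read off from these two, which is the sense in which no information is lost.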

Page 10: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Introduction

Unordered Subtree Mining

[Figure: input trees A and B, and subtrees X and Y]

D = {A, B}, min_sup = 2

# Closed Subtrees: 2
# Frequent Subtrees: 9

Closed Subtrees: X, Y

Frequent Subtrees: [Figure: the frequent subtrees]

Page 11: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Introduction

Problem
Given a data stream D of rooted, unlabelled and unordered trees, find frequent closed trees.

We provide three algorithms, of increasing power
Incremental
Sliding Window
Adaptive

Page 12: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Outline

1 Introduction

2 Data Streams

3 ADWIN : Concept Drift Mining

4 Adaptive Closed Frequent Tree Mining

5 Summary

Page 13: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Data Streams

Data Streams
At any time t in the data stream, we would like the per-item processing time and storage to be simultaneously $O(\log^k(N, t))$.

Approximation algorithms
Small error rate with high probability
An algorithm $(\varepsilon, \delta)$-approximates $F$ if it outputs $\tilde{F}$ for which $\Pr[|\tilde{F} - F| > \varepsilon F] < \delta$.

Page 14: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Data Streams Approximation Algorithms

[Animation over six slides: on the bit stream 1011000111101010111010, a window of the 7 most recent bits slides one position per step: 1010101, 0101011, 1010111, 0101110, 1011101, 0111010]

Sliding Window
We can maintain simple statistics over sliding windows, using $O(\frac{1}{\varepsilon} \log^2 N)$ space, where
N is the length of the sliding window
ε is the accuracy parameter

M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. 2002
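
To give a feel for how a count over a sliding window can be kept in small space, here is a simplified sketch in the spirit of the exponential histograms of Datar et al. It is an assumption-laden toy rather than the algorithm from the cited paper: it allows at most two buckets per size, which gives a fixed relative error of at most 50% instead of a tunable ε. The class name `BitCounter` is made up for the example.

```python
class BitCounter:
    """Approximate count of 1s among the last `window` bits of a stream."""

    def __init__(self, window):
        self.window = window
        self.time = 0
        # Each bucket is (timestamp of its most recent 1, number of 1s it holds).
        # Newest bucket first; sizes are powers of two and never decrease along the list.
        self.buckets = []

    def add(self, bit):
        self.time += 1
        # Drop the oldest bucket once its most recent 1 has left the window.
        if self.buckets and self.buckets[-1][0] <= self.time - self.window:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.insert(0, (self.time, 1))
        i = 0
        # Whenever three buckets share a size, merge the two oldest of them.
        while i + 2 < len(self.buckets) and self.buckets[i][1] == self.buckets[i + 2][1]:
            newer, older = self.buckets[i + 1], self.buckets[i + 2]
            self.buckets[i + 1:i + 3] = [(newer[0], newer[1] + older[1])]
            i += 1

    def estimate(self):
        """All buckets counted in full, except the oldest, which counts for half."""
        if not self.buckets:
            return 0.0
        total = sum(size for _, size in self.buckets)
        return total - self.buckets[-1][1] / 2


counter = BitCounter(window=7)
for ch in "1011000111101010111010":  # bit stream shown on the slides
    counter.add(int(ch))
print(counter.estimate())  # approximation of the number of 1s in the last 7 bits
```

Only the bucket list is stored, and its length grows logarithmically with the window length because bucket sizes double.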

Page 21: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

ADWIN: Adaptive sliding window

ADWIN

An adaptive sliding window whose size is recomputed online according to the rate of change observed.

ADWIN has rigorous guarantees (theorems)
On ratio of false positives and negatives
On the relation of the size of the current window and change rates

ADWIN, using a Data Stream Sliding Window Model,
can provide the exact counts of 1’s in O(1) time per point.
tries O(log W) cutpoints
uses $O(\frac{1}{\varepsilon} \log W)$ memory words
the processing time per example is O(log W) (amortized and worst-case).

Page 22: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Time Change Detectors and Predictors: A General Framework

[Diagram, built up over three slides: input x_t, an Estimator producing an Estimation, a Change Detector producing an Alarm, and a Memory module]

Page 25: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Window Management Models

W = 101010110111111

Equal & fixed size subwindows: 1010 1011011 1111 [Kifer+ 04]

Equal size adjacent subwindows: 1010101 1011 1111 [Dasu+ 06]

Total window against subwindow: 10101011011 1111 [Gama+ 04]

ADWIN: All adjacent subwindows. Over fourteen slides, every split of W into two adjacent subwindows is shown in turn:
1 01010110111111
10 1010110111111
101 010110111111
1010 10110111111
10101 0110111111
101010 110111111
1010101 10111111
10101011 0111111
101010110 111111
1010101101 11111
10101011011 1111
101010110111 111
1010101101111 11
10101011011111 1
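
The "all adjacent subwindows" idea can be sketched naively as follows. This is not the efficient ADWIN with O(log W) cutpoints: it stores the whole window and tries every split. The threshold is a Hoeffding-style bound as usually stated in the ADWIN literature; the slides do not give it, so treat the exact formula, the class name `NaiveAdwin` and the default `delta` as assumptions.

```python
import math


class NaiveAdwin:
    """Naive adaptive window: drop old items while some split shows differing means."""

    def __init__(self, delta=0.01):
        self.delta = delta  # confidence parameter (assumed default)
        self.window = []

    def _has_cut(self):
        n = len(self.window)
        total = sum(self.window)
        head = 0.0
        for n0 in range(1, n):                   # every split W = W0 . W1
            head += self.window[n0 - 1]
            n1 = n - n0
            m = 1.0 / (1.0 / n0 + 1.0 / n1)      # harmonic-mean style sample size
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(head / n0 - (total - head) / n1) >= eps:
                return True
        return False

    def add(self, x):
        """Insert x; return True if the window was shrunk (change detected)."""
        self.window.append(x)
        changed = False
        while len(self.window) > 1 and self._has_cut():
            self.window.pop(0)                   # discard the oldest item
            changed = True
        return changed

    def mean(self):
        return sum(self.window) / len(self.window)


detector = NaiveAdwin()
alarms = [detector.add(x) for x in [0] * 300 + [1] * 300]  # abrupt change at item 300
print(any(alarms), len(detector.window), round(detector.mean(), 3))
```

While the stream is stable the window keeps growing; after the change, old items are dropped until no split separates two significantly different means.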

Page 40: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Pattern Relaxed Support

Guojie Song, Dongqing Yang, Bin Cui, Baihua Zheng, Yunfeng Liu and Kunqing Xie. CLAIM: An Efficient Method for Relaxed Frequent Closed Itemsets Mining over Stream Data

Linear Relaxed Interval: The support space of all subpatterns can be divided into $n = \lceil 1/\varepsilon_r \rceil$ intervals, where $\varepsilon_r$ is a user-specified relaxed factor, and each interval can be denoted by $I_i = [l_i, u_i)$, where $l_i = (n - i) \cdot \varepsilon_r \ge 0$, $u_i = (n - i + 1) \cdot \varepsilon_r \le 1$ and $i \le n$.

Linear Relaxed closed subpattern t: if and only if there exists no proper superpattern t′ of t such that their supports belong to the same interval $I_i$.

Page 41: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Pattern Relaxed Support

As the number of closed frequent patterns is not linear with respect to support, we introduce a new relaxed support:

Logarithmic Relaxed Interval: The support space of all subpatterns can be divided into $n = \lceil 1/\varepsilon_r \rceil$ intervals, where $\varepsilon_r$ is a user-specified relaxed factor, and each interval can be denoted by $I_i = [l_i, u_i)$, where $l_i = \lceil c^i \rceil$, $u_i = \lceil c^{i+1} - 1 \rceil$ and $i \le n$.

Logarithmic Relaxed closed subpattern t: if and only if there exists no proper superpattern t′ of t such that their supports belong to the same interval $I_i$.
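
A small sketch of how the interval definitions can be evaluated. For the linear case, with n = ceil(1/ε_r), a relative support s lies in the interval with lower bound (n − i)·ε_r and upper bound (n − i + 1)·ε_r when i = n − floor(s/ε_r). For the logarithmic case only the bounds ceil(c^i) and ceil(c^(i+1) − 1) are computed. All function names are hypothetical, and the treatment of the edge case s = 1 (which lands at i = 0) is an assumption.

```python
import math


def linear_interval(support, eps_r):
    """Index i and bounds (l_i, u_i) of the linear relaxed interval containing `support`.

    `support` is a relative support in [0, 1); eps_r is the relaxed factor.
    """
    n = math.ceil(1 / eps_r)
    i = n - math.floor(support / eps_r)
    return i, ((n - i) * eps_r, (n - i + 1) * eps_r)


def logarithmic_bounds(i, c):
    """Bounds (l_i, u_i) of the i-th logarithmic relaxed interval for base c."""
    return math.ceil(c ** i), math.ceil(c ** (i + 1) - 1)


def linear_relaxed_closed(pattern_support, superpattern_supports, eps_r):
    """True if no proper superpattern has its support in the pattern's own interval."""
    own = linear_interval(pattern_support, eps_r)[0]
    return all(linear_interval(s, eps_r)[0] != own for s in superpattern_supports)


print(linear_interval(0.30, 0.25))                # (3, (0.25, 0.5))
print(logarithmic_bounds(3, 2))                   # (8, 15)
print(linear_relaxed_closed(0.45, [0.30], 0.25))  # False: 0.45 and 0.30 share an interval
```

With a larger ε_r more patterns share an interval with a superpattern, so fewer of them are reported as relaxed closed.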

Page 42: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Galois Lattice of closed set of trees

We need
a Galois connection pair
a closure operator

[Lattice diagram over D with levels 1, 2, 3 / 12, 13, 23 / 123]
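
As a rough illustration of a closure operator obtained from a Galois connection, the sketch below again uses itemsets in place of trees (an assumption; for trees the intersection step is more involved). The first map takes a pattern to the transactions containing it, the second intersects those transactions; a pattern is closed when the composition leaves it unchanged. The name `closure` and the toy dataset are hypothetical.

```python
def closure(itemset, dataset):
    """Intersection of all transactions in `dataset` that contain `itemset`.

    If no transaction contains it, the result is the set of all items (top of the lattice).
    """
    universe = frozenset().union(*dataset)
    covering = [t for t in dataset if itemset <= t]   # first half of the Galois pair
    return frozenset.intersection(universe, *covering)  # second half


D = [frozenset({1, 2, 3}), frozenset({1, 2}), frozenset({1, 2})]

print(sorted(closure(frozenset({1}), D)))     # [1, 2]: {1} is not closed
print(sorted(closure(frozenset({1, 2}), D)))  # [1, 2]: {1, 2} is closed
print(sorted(closure(frozenset({3}), D)))     # [1, 2, 3]
```

Applying `closure` twice gives the same result as applying it once, which is one of the defining properties of a closure operator.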

Page 43: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Incremental mining on closed frequent trees

1 Adding a tree transaction does not decrease the number of closed trees for D.

2 Adding a transaction with a closed tree does not modify the number of closed trees for D.

[Lattice diagram with levels 1, 2, 3 / 12, 13, 23 / 123]

Page 44: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Sliding Window mining on closed frequent trees

1 Deleting a tree transaction does not increase the number of closed trees for D.

2 Deleting a tree transaction that is repeated does not modify the number of closed trees for D.

[Lattice diagram with levels 1, 2, 3 / 12, 13, 23 / 123]

Page 45: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Algorithms

Algorithms
Incremental: INCTREENAT

Sliding Window: WINTREENAT

Adaptive: ADATREENAT. Uses ADWIN to monitor change

ADWIN

An adaptive sliding window whose size is recomputed online according to the rate of change observed.

ADWIN has rigorous guarantees (theorems)
On ratio of false positives and negatives
On the relation of the size of the current window and change rates

Page 46: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Experimental Validation: TN1

[Plot: Time (sec.) against Size (Millions) for INCTREENAT and CMTreeMiner]

Figure: Time on experiments on ordered trees on TN1 dataset

Page 47: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Experimental Validation

[Plot: Number of Closed Trees against Number of Samples for AdaTreeInc 1 and AdaTreeInc 2]

Figure: Number of closed trees maintaining the same number of closed datasets on input data

Page 49: Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams

Summary

Conclusions
New logarithmic relaxed closed support
Using Galois Lattice Theory, we present methods for mining closed trees

Incremental: INCTREENAT
Sliding Window: WINTREENAT
Adaptive: ADATREENAT, using ADWIN to monitor change

Future Work
Labeled Trees and XML data.