CEMiner – An Efficient Algorithm for Mining Closed Patterns from Time Interval-based Data Yi-Cheng Chen, Wen-Chih Peng and Suh-Yin Lee ICDM 2011



Outline

2012/6/13

Motivation
Preliminaries
Endpoint representation
CEMiner algorithm
Experimental result
Conclusion

Motivation

Existing studies only focus on mining closed sequential patterns from time point-based data.

Cont.

In this paper, we discuss and design an efficient method to discover closed temporal patterns from interval-based data.

Three contributions:

(1) We simplify the processing of complex relations: only "before", "after" and "equal" need to be considered.

(2) Endpoint representation.

(3) A novel algorithm, CEMiner (Closed Endpoint Temporal Miner).

Preliminaries

Definition 1. Event interval and event sequence

Let E = {e1, e2, …, ek} be the set of event symbols, e.g., {A, B, C, D, E}.

The triplet (ei, si, fi) is an event interval, where si is the start time and fi is the finish time of event ei, e.g., (A, 2, 7).

An event sequence is a series of event interval triplets, e.g., <(A, 2, 7), (B, 5, 10), …, (E, 18, 20)>.

Cont.

Definition 2. Temporal database

Consider a database DB = {r1, r2, …, rm}, where each record ri consists of a sequence-id (SID) and an event. DB is called a temporal database.

Endpoint representation

When describing relationships among more than three events, Allen's temporal logic may suffer from several problems.

A suitable representation is very important for describing a temporal pattern.

A new expression, the endpoint representation, is proposed to address the ambiguity and scalability problems.

Cont.

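Definition 3. Endpoint sequence

Event sequence: q = <(A, 2, 7), (B, 5, 10), (C, 5, 12), (D, 16, 22), (E, 18, 20)>

Tq = {2, 7, 5, 10, 5, 12, 16, 22, 18, 20}

Endpoint sequence: qe = <2, 5, 5, 7, 10, 12, 16, 18, 20, 22>

Endpoint representation: <A+ (B+ C+) A- B- C- D+ E+ E- D->, where X+ is the starting endpoint of event X, X- is its finishing endpoint, and endpoints that share a time point are grouped in parentheses.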
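The conversion in Definition 3 can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `endpoint_representation`, the `"A+"`/`"A-"` string labels, and the alphabetical tie-breaking among endpoints with the same timestamp are all assumptions made for this example.

```python
from itertools import groupby


def endpoint_representation(event_sequence):
    """Turn (symbol, start, finish) triplets into a time-ordered endpoint sequence.

    Endpoints that share a timestamp are grouped into one tuple.
    """
    endpoints = []
    for symbol, start, finish in event_sequence:
        endpoints.append((start, symbol + "+"))   # starting endpoint
        endpoints.append((finish, symbol + "-"))  # finishing endpoint
    # Sort by time; ties are broken by label (an assumption of this sketch).
    endpoints.sort()
    return [
        tuple(label for _, label in group)
        for _, group in groupby(endpoints, key=lambda e: e[0])
    ]


q = [("A", 2, 7), ("B", 5, 10), ("C", 5, 12), ("D", 16, 22), ("E", 18, 20)]
rep = endpoint_representation(q)
print(rep)
```

For the event sequence q of Definition 3, B+ and C+ both occur at time 5 and therefore end up in the same group.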

Cont.

The endpoint representation has several benefits:

Scalability

Nonambiguity

Simplicity

CEMiner algorithm

CEMiner (Closed Endpoint temporal Miner) uses the arrangement of endpoints to mine closed temporal patterns.

Closure Checking

Subsequence and supersequence

Ex. Given two sequences α = <A, B, C> and β = <A, D, B, C, E>, we say α is a subsequence of β, and β is a supersequence of α.
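The subsequence relation used for closure checking can be tested with a short helper. A minimal sketch, assuming sequences are plain Python lists of single symbols; the name `is_subsequence` is chosen for this example only.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha can be obtained from beta by deleting elements."""
    it = iter(beta)
    # `x in it` consumes the iterator up to the match, so order is respected.
    return all(x in it for x in alpha)


alpha = ["A", "B", "C"]
beta = ["A", "D", "B", "C", "E"]
print(is_subsequence(alpha, beta))  # alpha is a subsequence of beta
print(is_subsequence(beta, alpha))  # but not the other way round
```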

Cont.

Definition 4. Closed temporal pattern

CTP = { α | (α ∈ TP) ∧ (∄ β ∈ TP such that (α ⊂ β) ∧ (support(α) = support(β))) }

Given two sequences α and β, α is a closed temporal pattern if α is a temporal pattern and there does not exist a proper supersequence β of α with support(α) = support(β).
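Definition 4 can be read directly as a filter over an already mined pattern set. The sketch below applies the definition naively, comparing every pair of patterns; it is not how CEMiner works (the slides describe bidirectional extension checking instead). The pattern set `tp` and its support values are made up for illustration.

```python
def is_subsequence(alpha, beta):
    """Return True if alpha can be obtained from beta by deleting elements."""
    it = iter(beta)
    return all(x in it for x in alpha)


def closed_patterns(patterns):
    """Keep the patterns that have no proper supersequence with equal support.

    `patterns` maps a pattern (tuple of endpoints) to its support.
    """
    closed = {}
    for alpha, sup_a in patterns.items():
        absorbed = any(
            alpha != beta and sup_a == sup_b and is_subsequence(alpha, beta)
            for beta, sup_b in patterns.items()
        )
        if not absorbed:
            closed[alpha] = sup_a
    return closed


# Hypothetical temporal patterns with their supports.
tp = {
    ("A+",): 3,
    ("A+", "A-"): 3,
    ("A+", "B+", "A-"): 2,
    ("B+",): 2,
}
print(closed_patterns(tp))
```

Here `("A+",)` is absorbed by `("A+", "A-")` and `("B+",)` by `("A+", "B+", "A-")`, because in each case the longer pattern has the same support.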

Cont.

Ex. min_sup = 2

The endpoint sequence <> is a temporal pattern but not a closed temporal pattern, because <> ⊂ <> and both have support = 2.

Cont.

Closure Checking

To verify a new closed temporal pattern p, we must check whether p is a subsequence or supersequence of an existing temporal pattern p' and whether the projected databases of p and p' are equal.

This paper borrows BI-Directional Extension [WH04] to check the closure of patterns:

Forward-extension
Backward-extension

Cont.

Definition 5. Forward-extension and backward-extension

If α = <> is non-closed, there must exist at least one endpoint x that can be used to extend α to a new endpoint sequence α' with support(α) = support(α').

α can be extended in five ways:

(1) α' = 〈〉 (2) α' = 〈〉: α' is a forward-extension sequence.

(3) α' = 〈〉 (4) α' = 〈〉 (5) α' = 〈〉: α' is a backward-extension sequence.

Cont.

If there exists neither a forward-extension endpoint nor a backward-extension endpoint, α must be a closed endpoint sequence.

CEMiner checks closure in two directions:

Forward directional checking
Backward directional checking

Cont.

Definition. First instance of a prefix sequence

Ex. The first instance of the prefix sequence AB in sequence CAABC is CAAB.
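The first instance can be found with one left-to-right scan that greedily matches the prefix. A minimal sketch, assuming sequences of single symbols (a string works as well as a list); the function name `first_instance` is an assumption of this example.

```python
def first_instance(sequence, prefix):
    """Shortest leading part of `sequence` that contains `prefix` as a subsequence.

    Returns None when `prefix` does not occur in `sequence`.
    """
    if not prefix:
        return sequence[:0]
    matched = 0
    for i, item in enumerate(sequence):
        if item == prefix[matched]:
            matched += 1
            if matched == len(prefix):
                return sequence[: i + 1]
    return None


print(first_instance("CAABC", "AB"))
```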

Cont.

Definition 6. The i-th last-in-first appearance

Ex. 〈ABAB(AB)(AB)〉, p = 〈〉

1. The last-in-first appearance w.r.t. prefix p, case (1): 1 ≤ i < n, with n = 4, i = 2. First instance: 〈ABAB(AB)(AB)〉

2. The last-in-first appearance w.r.t. prefix p, case (2): i = n, with i = n = 4. First instance: 〈ABAB(AB)(AB)〉

Cont.

Definition 7. The i-th semi-maximum period

Ex. 〈ABAB(AB)(AB)〉, p = 〈〉

1. Semi-maximum period of prefix p, case (1): i = 1. It is the part before the last-in-first appearance: 〈ABAB(AB)(AB)〉

2. Semi-maximum period of prefix p, case (2): 1 < i ≤ n, with n = 4, i = 2. It is the part between
a. the end of the first instance of 〈〉: 〈AB〉
b. the 2nd last-in-first appearance w.r.t. p: B
〈ABAB(AB)(AB)〉
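Definitions 6 and 7 can be made concrete with a small sketch. It follows the usual BIDE-style reading: the n-th last-in-first appearance is the last occurrence of the final prefix element inside the first instance, each earlier one is the last occurrence that still lies before its successor, and the i-th semi-maximum period runs from the end of the first instance of the first i-1 prefix elements up to the i-th last-in-first appearance. The code assumes a non-empty prefix and sequences of single symbols, so endpoints grouped in parentheses are not handled; all function names are illustrative.

```python
def greedy_match(sequence, prefix):
    """Positions of the leftmost match of `prefix` in `sequence`, or None."""
    positions = []
    for i, item in enumerate(sequence):
        if len(positions) < len(prefix) and item == prefix[len(positions)]:
            positions.append(i)
    return positions if len(positions) == len(prefix) else None


def last_in_first(sequence, prefix):
    """0-based positions of the 1st..n-th last-in-first appearances, or None."""
    match = greedy_match(sequence, prefix)
    if match is None:
        return None
    positions = []
    bound = match[-1] + 1  # the first instance ends at match[-1]
    for item in reversed(prefix):
        j = bound - 1
        while sequence[j] != item:  # always terminates: the greedy match exists
            j -= 1
        positions.append(j)
        bound = j
    return positions[::-1]


def semi_maximum_periods(sequence, prefix):
    """List of the 1st..n-th semi-maximum periods of `prefix` in `sequence`."""
    match = greedy_match(sequence, prefix)
    if match is None:
        return None
    lf = last_in_first(sequence, prefix)
    # Period 1 starts at the beginning; period i > 1 starts right after the
    # end of the first instance of the first i-1 prefix elements.
    starts = [0] + [p + 1 for p in match[:-1]]
    return [sequence[s:e] for s, e in zip(starts, lf)]


print(last_in_first("CAABC", "AB"))
print(semi_maximum_periods("CAABC", "AB"))
```

For prefix AB in CAABC, the last-in-first appearances are the second A and the B, so the first semi-maximum period is CA and the second is A.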

Cont.

EbackScan search

Let α be an endpoint sequence. If there exists an i, 1 ≤ i ≤ n, and an endpoint x that appears in each of the i-th semi-maximum periods of the prefix α in the database, we can derive a new endpoint sequence and stop growing α.

Ex. Prefix sequence p = <A, C>

B appears in the 2nd semi-maximum period of the prefix p in the database.

We can derive a new prefix sequence p' = <A, B, C>.
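The backward check can be sketched as follows, again restricted to sequences of single symbols and a non-empty prefix. The helper recomputes the semi-maximum periods with the BIDE-style rule, and `backward_extension` then looks for a symbol shared by the i-th period of every sequence that contains the prefix. The three-sequence database is invented for illustration, and the function names are not taken from the paper.

```python
def semi_maximum_periods(sequence, prefix):
    """1st..n-th semi-maximum periods of `prefix` in `sequence`, or None."""
    # Leftmost (greedy) match gives the ends of the first instances.
    match = []
    for i, item in enumerate(sequence):
        if len(match) < len(prefix) and item == prefix[len(match)]:
            match.append(i)
    if len(match) < len(prefix):
        return None
    # Last-in-first appearances, found right to left inside the first instance.
    lf = []
    bound = match[-1] + 1
    for item in reversed(prefix):
        j = bound - 1
        while sequence[j] != item:
            j -= 1
        lf.append(j)
        bound = j
    lf.reverse()
    starts = [0] + [p + 1 for p in match[:-1]]
    return [sequence[s:e] for s, e in zip(starts, lf)]


def backward_extension(database, prefix):
    """Return (i, x) if x occurs in the i-th semi-maximum period (1-based) of
    every sequence containing `prefix`; return None if no such x exists."""
    periods = [
        p for p in (semi_maximum_periods(s, prefix) for s in database)
        if p is not None
    ]
    if not periods:
        return None
    for i in range(len(prefix)):
        common = set(periods[0][i])
        for p in periods[1:]:
            common &= set(p[i])
        if common:
            return i + 1, min(common)
    return None


# Hypothetical database in which B always sits between A and C.
db = [list("ABC"), list("DABC"), list("ABEC")]
print(backward_extension(db, ["A", "C"]))
```

With prefix <A, C>, B is found in the 2nd semi-maximum period of every sequence, which corresponds to deriving <A, B, C> and stopping the growth of <A, C>.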

CEMiner Algorithm

We use three pruning strategies to reduce the search space efficiently and effectively:

(1) pre-pruning (2) post-pruning (3) pair-pruning

CEMiner Algo.

Pair-pruning: If the endpoint is a starting endpoint, we can omit the closure checking, because the starting endpoint and the finishing endpoint always occur in pairs in an endpoint sequence.

CEMiner Algo.

Ex. Prefix p = <>

Endpoint B+ is a backward-extension endpoint of p, so we can stop growing p.

CEMiner Algo.

Pre-pruning: A finishing endpoint y is considered only if it has a corresponding starting endpoint in the prefix.
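One possible reading of pre-pruning is a filter on the candidate endpoints used to grow a prefix: starting endpoints are always allowed, while a finishing endpoint is allowed only when its starting endpoint is already in the prefix. The slide does not spell out the consequence, so this interpretation, the `"A+"`/`"A-"` labels and the function name `pre_prune` are assumptions of the sketch.

```python
def pre_prune(prefix, candidates):
    """Drop finishing endpoints whose starting endpoint is not in `prefix`."""
    present = set(prefix)
    kept = []
    for y in candidates:
        is_finishing = y.endswith("-")
        start = y[:-1] + "+"  # the matching starting endpoint
        if not is_finishing or start in present:
            kept.append(y)
    return kept


# Hypothetical prefix and candidates: C- is dropped because C+ is not in the prefix.
print(pre_prune(["A+", "B+"], ["A-", "C-", "C+", "B-"]))
```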

CEMiner Algo.

Post-pruning: A finishing endpoint is called significant if it has a corresponding starting endpoint in the projected postfix or in the prefix.
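The significance test can be sketched as a filter over one projected postfix. The sketch assumes that insignificant finishing endpoints are simply removed from the postfix, which the slide text does not state explicitly; the labels and the function name `post_prune` are likewise illustrative.

```python
def post_prune(prefix, postfix):
    """Keep starting endpoints and significant finishing endpoints only.

    A finishing endpoint counts as significant when its starting endpoint
    occurs in the prefix or in the projected postfix.
    """
    available = set(prefix) | set(postfix)
    return [
        e for e in postfix
        if not e.endswith("-") or e[:-1] + "+" in available
    ]


# Hypothetical data: A- has no A+ in the prefix or the postfix, so it is dropped.
print(post_prune(["B+"], ["A-", "B-", "C+", "C-"]))
```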

Experimental result

Conclusion

We develop an efficient algorithm, CEMiner, to discover closed temporal patterns without candidate generation, based on the proposed endpoint representation.

The algorithm further employs three pruning methods to reduce the search space effectively.