Reza Sherkat and Davood Rafiei Department of Computing Science University of Alberta Canada Efficiently Evaluating Order Preserving Similarity Queries over Historical Market-Basket Data Travel assistance provided by the Mary Louise Imrie Graduate Student Award

Post on 20-Dec-2015


Reza Sherkat ICDE06 1

Reza Sherkat and Davood Rafiei

Department of Computing Science

University of Alberta

Canada

Efficiently Evaluating Order Preserving Similarity Queries over

Historical Market-Basket Data

Travel assistance provided by the Mary Louise Imrie Graduate Student Award

Overview

• Introduction

– Histories and Time-series

– Similarity model for histories

• Problem Definition

• Proposed Approach

• Results Highlight

• Conclusions

Querying Histories: Introduction

• Querying multiple snapshots of data

– Temporal selection, projection, and join queries

• Finding similar time-series

– Finding companies having similar stocks

• Is it possible to define a notion of similarity for objects based on the similarity of their histories?

Histories

History: A sequence of time-stamped observations

– Time-series: observations are real-values

– Observations can be more general, e.g. a bag of words

Examples: the history of a web-page, the history of a patient

[Figure: timeline of observations recorded at time-stamps $t_i$, $t_j$, $t_k$]

when     observation
day 1    {a, b}
day 2    {a, b, c}
day 3    { }
day 4    {h, i}

Similarity Model for Histories

Similarity of two histories depends on:

• Pair-wise similarity of their observations

day   h1          h2          h3
1     {a, b}      {a, b, e}   {f, g, h}
2     {a, b, c}   {b, c, z}   {a, b, c}
3     {f, g}      { }         {f, g, h}
4     {h, i}      {f, g, h}   {b, c}

Histories of 3 patients

• The order in which similar observations are recorded

– Constraints on time-stamps of observations
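As a rough illustration of the two factors above, the sketch below scores pairs of observations in the three patient histories. The Jaccard coefficient is only one possible choice of observation similarity and is assumed here; the names `jaccard` and `same_day_scores` are illustrative and do not come from the talk.

```python
def jaccard(x, y):
    """Jaccard coefficient of two sets; two empty sets score 0 here (an assumption)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

# The three patient histories from the table, one set of items per day.
h1 = [set("ab"), set("abc"), set("fg"), set("hi")]
h2 = [set("abe"), set("bcz"), set(), set("fgh")]
h3 = [set("fgh"), set("abc"), set("fgh"), set("bc")]

def same_day_scores(a, b):
    """Pair-wise similarity of observations recorded on the same day."""
    return [round(jaccard(x, y), 2) for x, y in zip(a, b)]

print(same_day_scores(h1, h2))  # [0.67, 0.5, 0.0, 0.25]
print(same_day_scores(h1, h3))  # [0.0, 1.0, 0.67, 0.0]
```

Comparing only same-day observations misses that the day-3 observation {f, g} of h1 resembles the day-4 observation {f, g, h} of h2 (Jaccard 2/3). This is where the order of observations and the constraints on their time-stamps come in.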

Problem Definition

Given a history as a query:

– Evaluate k-NN and Range queries efficiently.

– For each history in the result, find its common signature with the query, i.e. where the similarity comes from.

Similarity Measure for Histories

Alignment of histories:

– An approach to line up subsequences of two histories

– Denoted by a sequence of matches: $\langle \ldots, \alpha_i \leftrightarrow \beta_i, \ldots \rangle$

– $\alpha_i$ ($\beta_i$) is an observation in A (B) or a gap ($-$).

– $score(\alpha_i \leftrightarrow \beta_i)$ is the score of a match.

– Alignment score measures the quality of an alignment.

Alignments of Histories

Alignment score can be the sum of the scores of the matches in the alignment.

$A = \langle a_1, \ldots, a_4 \rangle$, $B = \langle b_1, \ldots, b_4 \rangle$

[Figure: 4 × 4 matrix of match scores with rows $a_1, \ldots, a_4$ and columns $b_1, \ldots, b_4$; $score(a_1 \leftrightarrow b_1) = 1$ and $score(a_3 \leftrightarrow b_4) = 1$, three further entries are the fractions 1/4, 1/3 and 2/4, and the remaining entries are 0]

The best alignment of two histories:

$\langle a_1 \leftrightarrow b_1, a_3 \leftrightarrow b_4 \rangle$, with $score(\langle a_1 \leftrightarrow b_1, a_3 \leftrightarrow b_4 \rangle) = 2$

What is the best alignment of length 3?

$\langle a_1 \leftrightarrow b_1, a_3 \leftrightarrow b_3, a_4 \leftrightarrow b_4 \rangle$

If the match $a_3 \leftrightarrow b_4$ could not be considered, what would be the best alignment of length 2?

Constraints on the Alignments of Histories

1. The number of matches in the alignment.

• l-alignment: alignment with l matches

2. The r-neighborhood constraint

• For each match $\alpha_i \leftrightarrow \beta_i$: $|ts(\alpha_i) - ts(\beta_i)| \le r$.

r, l: parameters of the similarity query.
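A minimal sketch of checking the two constraints on a candidate alignment. It assumes the r-neighborhood constraint bounds the absolute difference of the two time-stamps in each match by r, and it represents an alignment simply as a list of time-stamp pairs for its non-gap matches; both function names are hypothetical.

```python
def is_l_alignment(matches, l):
    """True if the alignment has exactly l (non-gap) matches."""
    return len(matches) == l

def satisfies_r_neighborhood(matches, r):
    """matches: list of (ts_a, ts_b) time-stamp pairs, one per non-gap match.
    Assumed reading of the constraint: |ts_a - ts_b| <= r for every match."""
    return all(abs(ts_a - ts_b) <= r for ts_a, ts_b in matches)

# An alignment that matches time-stamps 1 with 1 and 3 with 4.
alignment = [(1, 1), (3, 4)]
print(is_l_alignment(alignment, 2))            # True
print(satisfies_r_neighborhood(alignment, 1))  # True
print(satisfies_r_neighborhood(alignment, 0))  # False: |3 - 4| > 0
```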

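Principle of Optimality

$A = \langle a_1, \ldots, a_i, a_{i+1}, \ldots, a_m \rangle$, with prefix $p(A) = \langle a_1, \ldots, a_i \rangle$ and suffix $s(A) = \langle a_{i+1}, \ldots, a_m \rangle$

$B = \langle b_1, \ldots, b_j, b_{j+1}, \ldots, b_n \rangle$, with prefix $p(B) = \langle b_1, \ldots, b_j \rangle$ and suffix $s(B) = \langle b_{j+1}, \ldots, b_n \rangle$

$\psi_p^*$: optimal alignment of p(A) and p(B)

$\psi_s^*$: optimal alignment of s(A) and s(B)

$\psi^*$: optimal alignment of A and B

$\circ$: concatenation operator

The principle of optimality holds if:

$score(\psi^*) = score(\psi_p^* \circ \psi_s^*)$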

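Score of Optimal l-alignment

$A = \langle a_1, \ldots, a_i, a_{i+1}, \ldots, a_m \rangle$, $B = \langle b_1, \ldots, b_j, b_{j+1}, \ldots, b_n \rangle$

$G^l_{i,j}$: score of the optimal l-alignment of the suffixes $\langle a_i, \ldots, a_m \rangle$ and $\langle b_j, \ldots, b_n \rangle$

The optimal l-alignment of these suffixes can be formed by:

• Concatenating $a_i \leftrightarrow b_j$ with the optimal (l-1)-alignment of the suffixes $\langle a_{i+1}, \ldots, a_m \rangle$ and $\langle b_{j+1}, \ldots, b_n \rangle$: $score(a_i \leftrightarrow b_j) + G^{l-1}_{i+1,j+1}$

• Matching $a_i$ with a gap, and considering the l-alignment of the suffixes $\langle a_{i+1}, \ldots, a_m \rangle$ and $\langle b_j, \ldots, b_n \rangle$: $score(a_i \leftrightarrow -) + G^l_{i+1,j}$

• Matching $b_j$ with a gap, and considering the l-alignment of the suffixes $\langle a_i, \ldots, a_m \rangle$ and $\langle b_{j+1}, \ldots, b_n \rangle$: $score(- \leftrightarrow b_j) + G^l_{i,j+1}$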
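The three cases above can be sketched as a memoized recursion that takes the best of the three options. This is an illustrative sketch, not the authors' implementation: it assumes Jaccard as the match score, a gap score of 0, allows any pair (even one with score 0) to count as a match, and leaves out the r-neighborhood constraint. The function names are hypothetical.

```python
from functools import lru_cache

def jaccard(x, y):
    """Assumed match score: Jaccard coefficient of two sets."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def optimal_l_alignment_score(A, B, l, score=jaccard, gap=0.0):
    """Best total score over alignments of A and B with exactly l non-gap
    matches, or -inf if no such alignment exists."""
    m, n = len(A), len(B)

    @lru_cache(maxsize=None)
    def G(i, j, k):  # 0-based suffix starts i, j; k matches still to place
        if k == 0:
            # Every remaining observation is matched with a gap.
            return gap * ((m - i) + (n - j))
        if i == m or j == n:
            return float("-inf")  # k > 0 matches can no longer be placed
        return max(
            score(A[i], B[j]) + G(i + 1, j + 1, k - 1),  # match a_i with b_j
            gap + G(i + 1, j, k),                        # a_i with a gap
            gap + G(i, j + 1, k),                        # b_j with a gap
        )

    return G(0, 0, l)

# Two of the patient histories shown earlier in the talk.
h1 = [set("ab"), set("abc"), set("fg"), set("hi")]
h2 = [set("abe"), set("bcz"), set(), set("fgh")]
for l in (1, 2, 3, 4):
    print(l, round(optimal_l_alignment_score(h1, h2, l), 3))
```

Under these assumptions the scores for l = 1, 2, 3, 4 are 2/3, 4/3, 11/6 and 17/12. The score drops at l = 4 because the only alignment with four matches pairs each day with the same day, including the empty observation.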

Similarity Measure for Histories

$sim_l(A, B) = \dfrac{G^l_{1,1}}{\min\{|A|, |B|\}}$

$G^l_{1,1}$: the score of the optimal l-alignment of two histories.

$G^l_{i,j}$ can be used to find the common signature of histories:

• A sequence of observations that appear in the same order in two histories.

• Generalizes the notion of longest common subsequence.
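A small worked example, assuming the measure is the optimal l-alignment score divided by the length of the shorter history. For the two histories of four observations each from the earlier alignment example, the best alignment with two matches has score 2, so for l = 2:

```latex
sim_2(A, B) = \frac{G^2_{1,1}}{\min\{|A|, |B|\}} = \frac{2}{\min\{4, 4\}} = 0.5
```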

Similarity Queries over Collection of Histories

• Straightforward (not practical) approach: naïve scan

• Indexing techniques are proposed for metric spaces, but $1 - sim_l(A, B)$ is not metric:

– when the distance between observations is not metric.

– when an r-neighborhood constraint is specified.

• We propose upper bounds to prune the history search space.
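How an upper bound prunes a k-NN search can be sketched with a generic filter-and-refine loop. This is a standard pattern and not necessarily the authors' algorithm. To keep it short, the toy example compares plain sets rather than full histories, with Jaccard as the exact similarity and a size-based bound that never underestimates it; all names are hypothetical.

```python
import heapq

def knn_with_pruning(query, items, k, similarity, upper_bound):
    """Return the k most similar items as (similarity, index) pairs and the
    number of exact similarity evaluations that were needed."""
    # Cheap pass: bound every candidate, most promising first.
    bounds = sorted(((upper_bound(query, x), idx) for idx, x in enumerate(items)),
                    reverse=True)
    top = []        # min-heap holding the current k best (similarity, index)
    evaluated = 0
    for ub, idx in bounds:
        if len(top) == k and ub <= top[0][0]:
            break   # no remaining candidate can beat the current k-th best
        s = similarity(query, items[idx])
        evaluated += 1
        if len(top) < k:
            heapq.heappush(top, (s, idx))
        elif s > top[0][0]:
            heapq.heapreplace(top, (s, idx))
    return sorted(top, reverse=True), evaluated

def jaccard(x, y):
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def size_bound(x, y):
    """Upper bound on Jaccard: |x & y| <= min size, |x | y| >= max size."""
    big = max(len(x), len(y))
    return min(len(x), len(y)) / big if big else 0.0

data = [set("abc"), set("ab"), set("x"), set("abcdefghi")]
result, evaluated = knn_with_pruning(set("abc"), data, 1, jaccard, size_bound)
print(result, evaluated)  # [(1.0, 0)] 1
```

In this toy run only one exact similarity is computed, since the bound of every other candidate is already below the best score found.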

A General Upper Bound for the Similarity Measure

Intuition: The score of an optimal relaxed l-alignment is not less than the score of optimal l-alignment.

1. For each observation, find an optimal match.

2. Aggregate the scores of the top l optimal matches to find an upper bound for $G^l_{1,1}$.

This upper bound avoids some exact computations, but every history must still be accessed to evaluate a query.
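The two steps can be sketched as follows. The sketch reads "relaxed" as ignoring the order of matches and letting several observations pick the same partner, which is an assumption; it also assumes non-negative match scores and a gap score of 0. The function name and the score matrix (Jaccard scores between two of the patient histories shown earlier) are illustrative.

```python
def general_upper_bound(score_matrix, l):
    """Sum of the l largest per-observation best match scores.
    score_matrix[i][j] is the score of matching a_(i+1) with b_(j+1)."""
    best_match = [max(row) for row in score_matrix]       # step 1
    return sum(sorted(best_match, reverse=True)[:l])      # step 2

# Jaccard scores between histories h1 (rows) and h2 (columns).
S = [
    [2/3, 1/4, 0.0, 0.0],
    [1/2, 1/2, 0.0, 0.0],
    [0.0, 0.0, 0.0, 2/3],
    [0.0, 0.0, 0.0, 1/4],
]
for l in (2, 4):
    print(l, round(general_upper_bound(S, l), 3))
```

Any alignment with l matches uses l distinct observations of the first history, each scoring at most its own best match, so the sum cannot be exceeded. For this matrix the bound is 4/3 for l = 2 and 25/12 for l = 4.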

An Index-based Upper Bound for the Similarity Measure

Intuitions:

• Observations are sparse in real-life applications.

• The score of an optimal relaxed match is not less than the score of an optimal match.

• The score of an optimal relaxed alignment is not less than the score of an optimal relaxed l-alignment.

This upper bound can be evaluated efficiently by exploiting an inverted index if the score of a match is the Cosine or the Extended Jaccard Coefficient.
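Why an inverted index helps with sparse observations can be sketched as below. This is not the authors' index-based bound, only an illustration of the underlying idea: with Jaccard or cosine as the match score, a history that shares no item with the query scores 0 on every match, so only histories reachable through the index need to be examined. The function names and the fourth history are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(histories):
    """Map each item to the (history id, position) pairs where it occurs."""
    index = defaultdict(set)
    for hid, history in enumerate(histories):
        for pos, observation in enumerate(history):
            for item in observation:
                index[item].add((hid, pos))
    return index

def candidate_histories(index, query):
    """Ids of histories sharing at least one item with the query history."""
    candidates = set()
    for observation in query:
        for item in observation:
            candidates.update(hid for hid, _ in index.get(item, ()))
    return candidates

histories = [
    [set("ab"), set("abc"), set("fg"), set("hi")],    # h1
    [set("abe"), set("bcz"), set(), set("fgh")],      # h2
    [set("fgh"), set("abc"), set("fgh"), set("bc")],  # h3
    [set("x"), set("y")],                             # made-up extra history
]
index = build_inverted_index(histories)
print(sorted(candidate_histories(index, [set("a"), set("z")])))  # [0, 1, 2]
```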

Experiments

• Experiments performed on AMD/XP 2600, 512 MB RAM

• Datasets:

– DBLP

– Synth1: Our synthetic data

– Synth2: Modified IBM synthetic data generator

• Investigated:

– Effectiveness of similarity measure

– Efficiency of our approach: pruning power, running time, scalability

Effectiveness of our Similarity Measure

Synth2 dataset contains:

• 20,000 histories

• $\lambda$ for each history is selected randomly from {1,…,10}

• Length of histories: {32,…,64}

i-th observation: document modeled as bit string

First observation: randomly selected

V(i+1): bit string following V(i) in a pre-determined order

$Poisson(\lambda)$: Poisson distribution [Cho et al. VLDB 2000]

[Figure: chain of versions V(1), …, V(i), V(i+1), …, V(n)]

Effectiveness of our Similarity Measure (cont.)

Mean deviation of $\lambda_{r_i}$ from $\lambda_q$ for k-NN queries:

$MD(q, k) = \frac{1}{k} \sum_{i=1}^{k} |\lambda_{r_i} - \lambda_q|$

* For 2,000 randomly generated queries

Pruning Power vs. k

[Figure: fraction of database examined (0 to 100) vs. no. of neighbours in k-NN query (log scale, 1 to 1024), for Naive scan, Prune by Length, General UB, and Index-based UB]

Running Time vs. k

Dataset: Synth2, 8,000 histories, 1,000 items

[Figure: time (msec, 0 to 600) vs. no. of neighbours in k-NN query (log scale, 1 to 1024), for Naive scan, Prune by Length, General UB, and Index-based UB]

Scalability for 1-NN queries

[Figure: time (msec) vs. no. of histories in the collection (8,000 to 64,000), for Naive scan, General UB, and Index-based UB]

Running time vs. Sparseness of Observations

[Figure: time (msec) vs. no. of items in store (log scale), for Naive scan, General UB, and Index-based UB]

Conclusions

• Introduced a domain-independent framework to formulate and evaluate similarity queries over historical data.

• Generalized a few concepts, including edit distance and longest common subsequence, to histories.

• Developed upper bounds to efficiently evaluate queries. One of our upper bounds can directly take advantage of an index even though the similarity measure is not metric.

• Our experiments confirm the effectiveness and efficiency of our approach.

Thank you for your attention!

Related Works

• Detecting, representing, querying histories

– [Chawathe 1998], [Chien 2001]

• Similarity-based sequence matching

– [Altschul 1990], [Pearson 1990], [Bieganski 1994]

• Finding similar sequences of events

– [Wang 2003]

• Finding similar time series

– [Agrawal 1995], [Rafiei 1997], [Keogh 2002], [Vlachos 2002, 2003], ...