
Reza Sherkat, ICDE06

Reza Sherkat and Davood Rafiei

Department of Computing Science

University of Alberta

Canada

Efficiently Evaluating Order Preserving Similarity Queries over Historical Market-Basket Data

Travel assistance provided by the Mary Louise Imrie Graduate Student Award


Overview

• Introduction

– Histories and time-series

– Similarity model for histories

• Problem Definition

• Proposed Approach

• Results Highlight

• Conclusions


Querying Histories: Introduction

• Querying multiple snapshots of data

– Temporal selection, projection, and join queries

• Finding similar time-series

– Finding companies having similar stocks

• Is it possible to define a notion of similarity for objects based on the similarity of their histories?


Histories

History: A sequence of time-stamped observations

– Time-series: observations are real values

– Observations can be more general, e.g. a bag of words

– Examples: the history of a web page, the history of a patient

[Figure: timeline of observations recorded at times $t_i$, $t_j$, $t_k$]

when     observation
day 1    {a, b}
day 2    {a, b, c}
day 3    { }
day 4    {h, i}


Similarity Model for Histories

Similarity of two histories depends on:

• Pair-wise similarity of their observations

day   h1          h2          h3
1     {a, b}      {a, b, e}   {f, g, h}
2     {a, b, c}   {b, c, z}   {a, b, c}
3     {f, g}      { }         {f, g, h}
4     {h, i}      {f, g, h}   {b, c}

Histories of 3 patients



• The order in which similar observations are recorded

– Constraints on time-stamps of observations


Problem Definition

Given a history as a query:

– Evaluate k-NN and Range queries efficiently.

– For each history in the result, find its common signature with the query, i.e. where does the similarity come from?


Similarity Measure for Histories

Alignment of histories:

– An approach to line up subsequences of two histories

– Denoted by a sequence of matches $\langle \alpha_i, \beta_i \rangle$

– $\alpha_i$ ($\beta_i$) is an observation in A (B) or a gap ($-$).

– $score(\langle \alpha_i, \beta_i \rangle)$ is the score of a match.

– Alignment score measures the quality of an alignment.


Alignments of Histories

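Alignment score can be the sum of the scores of the matches in the alignment.

$A = \langle a_1, \ldots, a_4 \rangle$, $B = \langle b_1, \ldots, b_4 \rangle$

[Figure: 4 × 4 matrix of match scores between $a_1, \ldots, a_4$ and $b_1, \ldots, b_4$]

The best alignment of two histories: $\langle a_1, b_1 \rangle, \langle a_3, b_4 \rangle$

$score(\langle a_1, b_1 \rangle, \langle a_3, b_4 \rangle) = 2$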








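$A = \langle a_1, \ldots, a_4 \rangle$, $B = \langle b_1, \ldots, b_4 \rangle$

[Figure: 4 × 4 matrix of match scores between $a_1, \ldots, a_4$ and $b_1, \ldots, b_4$]

Best alignment: $\langle a_1, b_1 \rangle, \langle a_3, b_4 \rangle$

What is the best alignment of length 3?

$\langle a_1, b_1 \rangle, \langle a_3, b_3 \rangle, \langle a_4, b_4 \rangle$

If the match $\langle a_3, b_4 \rangle$ could not be considered, what would be the best alignment of length 2?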


Constraints on the Alignments of Histories

1. The number of matches in the alignment.

• l-alignment: alignment with l matches

2. The r-neighborhood constraint

• For each match $\langle \alpha_i, \beta_i \rangle$: $|ts(\alpha_i) - ts(\beta_i)| \le r$.

r, l: parameters of the similarity query.
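The two constraints can be checked mechanically. The sketch below is only an illustration under stated assumptions: a history is a list of (timestamp, observation) pairs, an alignment is given as a list of (i, j) index pairs with gaps left implicit, and the function name `satisfies_constraints` is hypothetical rather than taken from the talk.

```python
def satisfies_constraints(alignment, hist_a, hist_b, l, r):
    """Check the l and r constraints for a list of matches.

    alignment: list of (i, j) index pairs, one per match (gaps are implicit).
    hist_a, hist_b: lists of (timestamp, observation) pairs.
    """
    # Constraint 1: the alignment must contain exactly l matches.
    if len(alignment) != l:
        return False
    # Order preservation: indices must strictly increase in both histories.
    for (i1, j1), (i2, j2) in zip(alignment, alignment[1:]):
        if i2 <= i1 or j2 <= j1:
            return False
    # Constraint 2: time-stamps of matched observations differ by at most r.
    return all(abs(hist_a[i][0] - hist_b[j][0]) <= r for i, j in alignment)


# Histories h1 and h2 of the patient example, one observation per day.
h1 = [(1, {"a", "b"}), (2, {"a", "b", "c"}), (3, {"f", "g"}), (4, {"h", "i"})]
h2 = [(1, {"a", "b", "e"}), (2, {"b", "c", "z"}), (3, set()), (4, {"f", "g", "h"})]

print(satisfies_constraints([(0, 0), (2, 3)], h1, h2, l=2, r=0))  # False
print(satisfies_constraints([(0, 0), (2, 3)], h1, h2, l=2, r=1))  # True
```

With r = 0 the match between day 3 of h1 and day 4 of h2 is rejected because the time-stamps differ by one day; with r = 1 it is allowed.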


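Principle of Optimality

$A = \langle a_1, \ldots, a_i, a_{i+1}, \ldots, a_m \rangle$, split into prefix $p(A)$ and suffix $s(A)$

$B = \langle b_1, \ldots, b_j, b_{j+1}, \ldots, b_n \rangle$, split into prefix $p(B)$ and suffix $s(B)$

$\Gamma^*_p$: optimal alignment of p(A) and p(B)

$\Gamma^*_s$: optimal alignment of s(A) and s(B)

$\Gamma^*$: optimal alignment of A and B

$\circ$: concatenation operator

The principle of optimality holds if:

$score(\Gamma^*) = score(\Gamma^*_p \circ \Gamma^*_s)$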


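Score of Optimal l-alignment

$G^l_{i,j}$: score of the optimal l-alignment of the suffixes starting at $a_i$ and $b_j$

$A = \langle a_1, \ldots, a_{i-1}, a_i, \ldots, a_m \rangle$, $B = \langle b_1, \ldots, b_{j-1}, b_j, \ldots, b_n \rangle$

Optimal l-alignment of suffixes can be formed by:

• Concatenating $\langle a_i, b_j \rangle$ with the optimal (l-1)-alignment of suffixes $\langle a_{i+1}, \ldots, a_m \rangle$, $\langle b_{j+1}, \ldots, b_n \rangle$:

$score(\langle a_i, b_j \rangle) + G^{l-1}_{i+1,j+1}$

• Matching $a_i$ with a gap, and considering the l-alignment of suffixes $\langle a_{i+1}, \ldots, a_m \rangle$, $\langle b_j, \ldots, b_n \rangle$:

$score(\langle a_i, - \rangle) + G^l_{i+1,j}$

• Matching $b_j$ with a gap, and considering the l-alignment of suffixes $\langle a_i, \ldots, a_m \rangle$, $\langle b_{j+1}, \ldots, b_n \rangle$:

$score(\langle -, b_j \rangle) + G^l_{i,j+1}$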
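The three cases translate into a dynamic program over suffixes. The following is a minimal sketch, not the authors' implementation: it assumes that a match with a gap scores 0, that match scores are supplied by a caller-provided `score` function, and it uses 0-based indices, so `G[l][0][0]` plays the role of the score of the optimal l-alignment of the full histories. The function name is hypothetical.

```python
def optimal_l_alignment_score(a, b, l, score):
    """Best total score of an alignment of a and b with exactly l matches.

    Returns float('-inf') when no alignment with l matches exists.
    Assumption: matching an observation with a gap contributes 0.
    """
    m, n = len(a), len(b)
    neg = float("-inf")
    # G[k][i][j]: best score of a k-alignment of the suffixes a[i:], b[j:].
    # With k = 0 every remaining observation is matched with a gap (score 0).
    G = [[[0.0 if k == 0 else neg for _ in range(n + 1)]
          for _ in range(m + 1)] for k in range(l + 1)]
    for k in range(1, l + 1):
        for i in range(m - 1, -1, -1):
            for j in range(n - 1, -1, -1):
                G[k][i][j] = max(
                    score(a[i], b[j]) + G[k - 1][i + 1][j + 1],  # match a_i with b_j
                    G[k][i + 1][j],                              # a_i with a gap
                    G[k][i][j + 1],                              # b_j with a gap
                )
    return G[l][0][0]


def equal(x, y):
    """Toy match score: 1 for identical observations, 0 otherwise."""
    return 1.0 if x == y else 0.0


print(optimal_l_alignment_score("abc", "xbz", 1, equal))  # 1.0
print(optimal_l_alignment_score("abc", "abc", 4, equal))  # -inf
```

The table has (l+1)(m+1)(n+1) cells, so the running time is O(l·m·n). An r-neighborhood constraint could be added under the same assumptions by letting `score` return negative infinity for pairs whose time-stamps are more than r apart.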


Similarity Measure for Histories

$sim_l(A, B) = \frac{G^l_{1,1}}{\min\{|A|, |B|\}}$

$G^l_{1,1}$: the score of the optimal l-alignment of two histories.

$G^l_{i,j}$ can be used to find the common signature of histories:

• A sequence of observations that appear in the same order in two histories.

• Generalizes the notion of longest common subsequence.
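One way to recover a common signature is to trace back through the table of l-alignment scores. The sketch below is illustrative only. It assumes gap matches score 0, uses the Jaccard coefficient on item sets as a stand-in match score, and normalizes by the length of the shorter history. The names `jaccard` and `common_signature` are hypothetical.

```python
def jaccard(x, y):
    """Jaccard coefficient of two item sets (0 when both are empty)."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0


def common_signature(a, b, l, score):
    """Return (matches, similarity) for the optimal l-alignment of a and b.

    matches is a list of 0-based (i, j) index pairs, or None if no
    alignment with l matches exists. Gap matches are assumed to score 0.
    """
    m, n = len(a), len(b)
    if min(m, n) == 0:
        return ([], 0.0) if l == 0 else (None, 0.0)
    neg = float("-inf")
    G = [[[0.0 if k == 0 else neg for _ in range(n + 1)]
          for _ in range(m + 1)] for k in range(l + 1)]
    for k in range(1, l + 1):
        for i in range(m - 1, -1, -1):
            for j in range(n - 1, -1, -1):
                G[k][i][j] = max(score(a[i], b[j]) + G[k - 1][i + 1][j + 1],
                                 G[k][i + 1][j], G[k][i][j + 1])
    if G[l][0][0] == neg:
        return None, 0.0
    # Trace back: at each cell, follow the case that produced the stored value.
    matches, i, j, k = [], 0, 0, l
    while k > 0:
        if G[k][i][j] == score(a[i], b[j]) + G[k - 1][i + 1][j + 1]:
            matches.append((i, j))
            i, j, k = i + 1, j + 1, k - 1
        elif G[k][i][j] == G[k][i + 1][j]:
            i += 1
        else:
            j += 1
    return matches, G[l][0][0] / min(m, n)


# Observations of h1 and h2 from the patient example (days 1 to 4).
h1 = [{"a", "b"}, {"a", "b", "c"}, {"f", "g"}, {"h", "i"}]
h2 = [{"a", "b", "e"}, {"b", "c", "z"}, set(), {"f", "g", "h"}]
signature, sim = common_signature(h1, h2, 2, jaccard)
print(signature)  # [(0, 0), (2, 3)]
```

Under these assumptions the signature pairs day 1 of h1 with day 1 of h2, and day 3 of h1 with day 4 of h2, which are the observations that explain the similarity.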


Similarity Queries over Collection of Histories

• Straightforward (not practical) approach: naïve scan

• Indexing techniques are proposed for metric spaces, but $1 - sim_l(A, B)$ is not metric:

– when the distance between observations is not metric.

– when an r-neighborhood constraint is specified.

• We propose upper bounds to prune the history search space.


A General Upper Bound for the Similarity Measure

Intuition: The score of an optimal relaxed l-alignment is not less than the score of optimal l-alignment.

1. For each observation, find an optimal match.

2. Aggregate the scores of the top l optimal matches to find an upper bound for $G^l_{1,1}$.

This upper bound can prune some computation, but every history must still be accessed to evaluate a query.
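The two steps can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes match scores are non-negative and gap matches score 0, takes "each observation" to mean each observation of the query, and the name `general_upper_bound` is hypothetical.

```python
import heapq


def general_upper_bound(query, history, l, score):
    """Upper bound on the score of the optimal l-alignment.

    Assumes non-negative match scores and a gap score of 0.
    """
    # Step 1: best match for each query observation, ignoring order and
    # allowing observations of the history to be reused (the relaxation).
    best = [max((score(q, h) for h in history), default=0.0) for q in query]
    # Step 2: aggregate the l largest of these scores.
    return sum(heapq.nlargest(l, best))


def equal(x, y):
    """Toy match score: 1 for identical observations, 0 otherwise."""
    return 1.0 if x == y else 0.0


print(general_upper_bound("abc", "xbz", 2, equal))  # 1.0
```

Every match in an l-alignment uses a distinct query observation and cannot score more than that observation's best match, so the sum of the l largest best-match scores cannot be smaller than the score of the optimal l-alignment. The bound costs O(|query| · |history|) score evaluations per history, which is why it saves alignment work but not accesses.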


An Index-based Upper Bound for the Similarity Measure

Intuitions:

• Observations are sparse in real-life applications.

• The score of an optimal relaxed match is not less than the score of an optimal match.

• The score of an optimal relaxed alignment is not less than the score of an optimal relaxed l-alignment.

This upper bound can be evaluated efficiently by exploiting an inverted index if the match score is the Cosine or Extended Jaccard Coefficient.
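To illustrate why sparseness and an inverted index go together, the sketch below computes the Cosine and Extended Jaccard coefficients for observations modeled as item sets, and uses an inverted index from items to history ids to find the histories that share at least one item with a query. It is not the index-based bound from the talk, only a minimal example of the building blocks. All function names and the extra history `h4` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(x, y):
    """Cosine coefficient of two item sets viewed as binary vectors."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))


def extended_jaccard(x, y):
    """Extended Jaccard coefficient of two item sets (binary vectors)."""
    inter = len(x & y)
    union = len(x) + len(y) - inter
    return inter / union if union else 0.0


def build_inverted_index(histories):
    """Map each item to the ids of the histories that contain it."""
    index = defaultdict(set)
    for hid, history in histories.items():
        for observation in history:
            for item in observation:
                index[item].add(hid)
    return index


def candidates(index, query):
    """Ids of histories sharing at least one item with the query."""
    found = set()
    for observation in query:
        for item in observation:
            found |= index.get(item, set())
    return found


histories = {
    "h1": [{"a", "b"}, {"a", "b", "c"}, {"f", "g"}, {"h", "i"}],
    "h2": [{"a", "b", "e"}, {"b", "c", "z"}, set(), {"f", "g", "h"}],
    "h3": [{"f", "g", "h"}, {"a", "b", "c"}, {"f", "g", "h"}, {"b", "c"}],
    "h4": [{"x"}, {"y"}],  # hypothetical history with unrelated items
}
index = build_inverted_index(histories)
print(sorted(candidates(index, [{"a"}, {"f"}])))  # ['h1', 'h2', 'h3']
```

Both coefficients are 0 for observations with no common item, so a history outside the candidate set cannot contribute a positive match score and need not be read at all.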


Experiments

• Experiments performed on an AMD XP 2600 with 512 MB RAM

• Datasets:

– DBLP

– Synth1: Our synthetic data

– Synth2: Modified IBM synthetic data generator

• Investigated:

– Effectiveness of similarity measure

– Efficiency of our approach: pruning power, running time, scalability


Effectiveness of our Similarity Measure

Synth2 dataset contains:

• 20,000 histories

• Poisson parameter $\lambda$ for each history is selected randomly from {1,…,10}

• Length of histories: {32,…,64}

i-th observation: document modeled as bit string V(i)

First observation: randomly selected

V(i+1): bit string following V(i) in a pre-determined order

[Figure: sequence of observations V(1), …, V(i), V(i+1), …, V(n); changes follow a Poisson distribution [Cho et al. VLDB 2000]]


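Effectiveness of our Similarity Measure (cont.)

Mean deviation of $\lambda_{r_i}$ from $\lambda_q$ for k-NN queries (q: query, $r_i$: i-th result):

$MD(q, k) = \frac{1}{k} \sum_{i=1}^{k} |\lambda_{r_i} - \lambda_q|$

* For 2,000 randomly generated queries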


[Figure: Pruning Power vs. k. x-axis: No. of neighbours in k-NN query (log scale, 1 to 1024). y-axis: Fraction of database examined (0 to 100). Series: Naive scan, Prune by Length, General UB, Index-based UB.]


[Figure: Running Time vs. k. Dataset: Synth2, 8,000 histories, 1,000 items. x-axis: No. of neighbours in k-NN query (log scale, 1 to 1024). y-axis: Time (msec), 0 to 600. Series: Naive scan, Prune by Length, General UB, Index-based UB.]


[Figure: Scalability for 1-NN queries. x-axis: No. of histories in the collection (8,000 to 64,000). y-axis: Time (msec), 0 to 3500. Series: Naive scan, General UB, Index-based UB.]


[Figure: Running time vs. Sparseness of Observations. x-axis: No. of items (log scale). y-axis: Time (msec). Series: Naive scan, General UB, Index-based UB.]


Conclusions

• Introduced a domain-independent framework to formulate and evaluate similarity queries over historical data.

• Generalized a few concepts, including edit distance and longest common subsequence, to histories.

• Developed upper bounds to efficiently evaluate queries. One of our upper bounds can directly take advantage of an index even though the similarity measure is not metric.

• Our experiments confirm the effectiveness and efficiency of our approach.


Thank you for your attention!


Related Works

• Detecting, representing, querying histories

– [Chawathe 1998], [Chien 2001]

• Similarity-based sequence matching

– [Altschul 1990], [Pearson 1990], [Bieganski 1994]

• Finding similar sequence of events

– [Wang 2003]

• Finding similar time series

– [Agrawal 1995], [Rafiei 1997], [Keogh 2002], [Vlachos 2002, 2003], ...
