Post on 20-Dec-2015
Reza Sherkat ICDE06 1
Reza Sherkat and Davood Rafiei
Department of Computing Science
University of Alberta
Canada
Efficiently Evaluating Order Preserving Similarity Queries over
Historical Market-Basket Data
Travel assistance provided by the Mary Louise Imrie Graduate Student Award
Overview
• Introduction
– Histories and Time-series
– Similarity model for histories
• Problem Definition
• Proposed Approach
• Results Highlight
• Conclusions
Querying Histories: Introduction
• Querying multiple snapshots of data
– Temporal selection, projection, and join queries
• Finding similar time-series
– Finding companies having similar stocks
• Is it possible to define a notion of similarity for objects based on the similarity of their histories?
Histories
History: A sequence of time-stamped observations
– Time-series: observations are real values
– Observations can be more general, e.g., a bag of words recorded at each of the time-stamps $t_i$, $t_j$, $t_k$
– Examples: the history of a web page, the history of a patient
when     observation
day 1    {a, b}
day 2    {a, b, c}
day 3    { }
day 4    {h, i}
Similarity Model for Histories
Similarity of two histories depends on:
• Pair-wise similarity of their observations
day   h1           h2           h3
1     {a, b}       {a, b, e}    {f, g, h}
2     {a, b, c}    {b, c, z}    {a, b, c}
3     {f, g}       { }          {f, g, h}
4     {h, i}       {f, g, h}    {b, c}
Histories of 3 patients
• The order in which similar observations are recorded
– Constraints on time-stamps of observations
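The pair-wise similarity of observations can be illustrated with a small sketch. It assumes Jaccard similarity over item sets, which is only one possible choice of observation similarity; the function names `jaccard` and `score_matrix` are hypothetical, and treating two empty observations as having similarity 0 is an assumption made here.

```python
def jaccard(x: set, y: set) -> float:
    """Jaccard similarity of two item sets (assumed to be 0.0 when both are empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

# The three patient histories from the table above, one set per day.
h1 = [{"a", "b"}, {"a", "b", "c"}, {"f", "g"}, {"h", "i"}]
h2 = [{"a", "b", "e"}, {"b", "c", "z"}, set(), {"f", "g", "h"}]
h3 = [{"f", "g", "h"}, {"a", "b", "c"}, {"f", "g", "h"}, {"b", "c"}]

def score_matrix(A, B):
    """Entry [i][j] is the similarity of the i-th observation of A and the j-th of B."""
    return [[jaccard(a, b) for b in B] for a in A]

M12 = score_matrix(h1, h2)
M13 = score_matrix(h1, h3)
```

A matrix like `M13` shows where two histories resemble each other; whether those resemblances occur in a compatible order is what the alignment model has to decide.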
Problem Definition
Given a history as a query:
– Evaluate k-NN and Range queries efficiently.
– For each history in the result, find its common signature with the query, i.e., where the similarity comes from.
Similarity Measure for Histories
Alignment of histories:
– An approach to line up subsequences of two histories
– Denoted by a sequence of matches: $\langle (\alpha_1, \beta_1), \ldots, (\alpha_l, \beta_l) \rangle$
– $\alpha_i$ ($\beta_i$) is an observation in A (B) or a gap ($-$).
– $score(\alpha_i, \beta_i)$ is the score of a match.
– Alignment score measures the quality of an alignment.
Alignments of Histories
Alignment score can be the sum of the score of matches in the alignment.
$A = \langle a_1, \ldots, a_4 \rangle$, $B = \langle b_1, \ldots, b_4 \rangle$
[Figure: 4×4 matrix of match scores with rows $a_1, \ldots, a_4$ and columns $b_1, \ldots, b_4$. The entries for $(a_1, b_1)$ and $(a_3, b_4)$ are 1; three further entries are fractions that did not survive extraction; the remaining entries are 0.]
• The best alignment of two histories: $\langle (a_1, b_1), (a_3, b_4) \rangle$, with $score(\langle (a_1, b_1), (a_3, b_4) \rangle) = 2$
• What is the best alignment of length 3? $\langle (a_1, b_1), (a_3, b_3), (a_4, b_4) \rangle$
• If the match $(a_3, b_4)$ could not be considered, what would be the best alignment of length 2?
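As a rough illustration of "best alignment of length l", the sketch below enumerates every order-preserving choice of l matches and sums their scores. The matrix `M` is hypothetical (it is not the matrix from the slide, whose fractional entries are lost), gaps are assumed to score 0, and `best_alignment` is a name introduced here. Exhaustive enumeration is only meant to make the definition concrete; it is not an efficient method.

```python
from itertools import combinations

def best_alignment(M, l):
    """Exhaustively find the highest-scoring order-preserving alignment with exactly l matches.

    M[i][j] is the score of matching observation i of A with observation j of B.
    Unmatched observations are assumed to contribute 0 (gap score 0).
    """
    m, n = len(M), len(M[0])
    best_score, best_matches = float("-inf"), None
    for rows in combinations(range(m), l):      # increasing positions in A
        for cols in combinations(range(n), l):  # increasing positions in B
            score = sum(M[i][j] for i, j in zip(rows, cols))
            if score > best_score:
                best_score, best_matches = score, list(zip(rows, cols))
    return best_score, best_matches

# Hypothetical score matrix (0-indexed rows a_1..a_4, columns b_1..b_4).
M = [
    [1, 0, 0,    0],
    [0, 0, 0,    0],
    [0, 0, 0.75, 1],
    [0, 0, 0,    0.75],
]

score2, matches2 = best_alignment(M, 2)
score3, matches3 = best_alignment(M, 3)
```

With this matrix the best 2-match alignment pairs (a_1, b_1) with (a_3, b_4), while forcing a third match makes it preferable to use (a_3, b_3) and (a_4, b_4) instead, which mirrors the kind of trade-off the questions above point at.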
Constraints on the Alignments of Histories
1. The number of matches in the alignment.
• l-alignment: alignment with l matches
2. The r-neighborhood constraint
• For each match $(\alpha_i, \beta_i)$: $|ts(\alpha_i) - ts(\beta_i)| \le r$
r, l: parameters of the similarity query.
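The r-neighborhood constraint can be checked with a one-line predicate. This is a minimal sketch assuming that matches are given as index pairs, that time-stamps are numeric, and that the bound is inclusive; the function name and argument layout are hypothetical.

```python
def satisfies_r_neighborhood(matches, ts_a, ts_b, r):
    """True if every match (i, j) pairs observations recorded at most r time units apart.

    matches: list of (i, j) index pairs into histories A and B
    ts_a, ts_b: time-stamps of the observations of A and B
    """
    return all(abs(ts_a[i] - ts_b[j]) <= r for i, j in matches)

# Both histories observed on days 1..4; the second match pairs day 3 with day 4.
days = [1, 2, 3, 4]
alignment = [(0, 0), (2, 3)]
within_one_day = satisfies_r_neighborhood(alignment, days, days, r=1)
same_day_only = satisfies_r_neighborhood(alignment, days, days, r=0)
```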
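Principle of Optimality
$A = \langle a_1, \ldots, a_i, a_{i+1}, \ldots, a_m \rangle$, with prefix $p(A) = \langle a_1, \ldots, a_i \rangle$ and suffix $s(A) = \langle a_{i+1}, \ldots, a_m \rangle$
$B = \langle b_1, \ldots, b_j, b_{j+1}, \ldots, b_n \rangle$, with prefix $p(B) = \langle b_1, \ldots, b_j \rangle$ and suffix $s(B) = \langle b_{j+1}, \ldots, b_n \rangle$
$\gamma^*_p$: optimal alignment of p(A) and p(B)
$\gamma^*_s$: optimal alignment of s(A) and s(B)
$\gamma^*$: optimal alignment of A and B
$\oplus$: concatenation operator
The principle of optimality holds if: $score(\gamma^*) = score(\gamma^*_p \oplus \gamma^*_s)$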
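Score of Optimal l-alignment
$G^l_{i,j}$: the score of the optimal l-alignment of the suffixes $\langle a_i, \ldots, a_m \rangle$ and $\langle b_j, \ldots, b_n \rangle$ of $A = \langle a_1, \ldots, a_m \rangle$ and $B = \langle b_1, \ldots, b_n \rangle$
The optimal l-alignment of these suffixes can be formed by:
• Concatenating $(a_i, b_j)$ with the optimal $(l-1)$-alignment of suffixes $\langle a_{i+1}, \ldots, a_m \rangle$ and $\langle b_{j+1}, \ldots, b_n \rangle$: $score(a_i, b_j) + G^{l-1}_{i+1,j+1}$
• Matching $a_i$ with a gap, and considering the l-alignment of suffixes $\langle a_{i+1}, \ldots, a_m \rangle$ and $\langle b_j, \ldots, b_n \rangle$: $score(a_i, -) + G^l_{i+1,j}$
• Matching $b_j$ with a gap, and considering the l-alignment of suffixes $\langle a_i, \ldots, a_m \rangle$ and $\langle b_{j+1}, \ldots, b_n \rangle$: $score(-, b_j) + G^l_{i,j+1}$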
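The three options above suggest a dynamic program over suffixes. The following is a sketch under stated assumptions rather than the authors' implementation: it takes the maximum of the three options (the slide lists them without saying how they are combined), uses a single gap score that defaults to 0, and is 0-indexed, so `G[l][0][0]` plays the role of $G^l_{1,1}$. The function name and the matrix `M` are hypothetical.

```python
NEG_INF = float("-inf")

def l_alignment_table(M, max_l, gap=0.0):
    """Fill G[l][i][j]: best score of an alignment with exactly l matches
    between the suffixes A[i:] and B[j:], given match scores M[i][j]."""
    m, n = len(M), len(M[0])
    G = [[[NEG_INF] * (n + 1) for _ in range(m + 1)] for _ in range(max_l + 1)]
    # With zero matches left, every remaining observation is paired with a gap.
    for i in range(m + 1):
        for j in range(n + 1):
            G[0][i][j] = gap * ((m - i) + (n - j))
    for l in range(1, max_l + 1):
        # If either suffix is empty, l >= 1 matches are impossible (stays -inf).
        for i in range(m - 1, -1, -1):
            for j in range(n - 1, -1, -1):
                G[l][i][j] = max(
                    M[i][j] + G[l - 1][i + 1][j + 1],  # match a_i with b_j
                    gap + G[l][i + 1][j],              # match a_i with a gap
                    gap + G[l][i][j + 1],              # match b_j with a gap
                )
    return G

# Hypothetical score matrix for two histories of length 4.
M = [
    [1, 0, 0,    0],
    [0, 0, 0,    0],
    [0, 0, 0.75, 1],
    [0, 0, 0,    0.75],
]
G = l_alignment_table(M, max_l=4)
```

The table has (max_l + 1)(m + 1)(n + 1) entries and each is filled in constant time, so one pass yields the optimal scores for every alignment length up to `max_l`.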
Similarity Measure for Histories
$sim_l(A, B) = \frac{G^l_{1,1}}{\min\{|A|, |B|\}}$
$G^l_{1,1}$: the score of the optimal l-alignment of two histories.
$G^l_{i,j}$ can be used to find the common signature of histories:
• A sequence of observations that appear in the same order in two histories.
• Generalizes the notion of longest common subsequence.
Similarity Queries over Collection of Histories
• Straightforward (not practical) approach: naïve scan
• Indexing techniques are proposed for metric spaces, but $1 - sim_l(A, B)$ is not metric:
– when the distance between observations is not metric.
– when an r-neighborhood constraint is specified.
• We propose upper bounds to prune the history search space.
A General Upper Bound for the Similarity Measure
Intuition: The score of an optimal relaxed l-alignment is not less than the score of optimal l-alignment.
1. For each observation, find an optimal match.
2. Aggregate the scores of the top l optimal matches to find an upper bound for $G^l_{1,1}$.
This upper bound can prune some extra computations, but still all histories will be accessed to evaluate a query.
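The two steps can be sketched as follows, assuming "relaxed" means that the order and one-to-one constraints on matches are dropped, and that gap scores are never positive. Under those assumptions, any l-alignment uses l distinct observations of A, each scoring at most its best match, so the sum of the l largest row maxima cannot be exceeded. The function name and the matrix `M` are hypothetical.

```python
def general_upper_bound(M, l):
    """Upper bound on the score of any alignment with l matches.

    Step 1: for each observation of A (row of M), take its best match score.
    Step 2: sum the l largest of these scores.
    Assumes gap scores are <= 0, so gaps cannot raise an alignment's score.
    """
    best_per_observation = sorted((max(row) for row in M), reverse=True)
    return sum(best_per_observation[:l])

# Hypothetical score matrix for two histories of length 4.
M = [
    [1, 0, 0,    0],
    [0, 0, 0,    0],
    [0, 0, 0.75, 1],
    [0, 0, 0,    0.75],
]
bound2 = general_upper_bound(M, 2)
bound3 = general_upper_bound(M, 3)
```

For this matrix the best 3-match alignment scores 2.5, so `bound3` is valid but not tight. In a k-NN scan, a history whose bound falls below the current k-th best score can be skipped without computing its exact alignment score.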
An Index-based Upper Bound for the Similarity Measure
Intuitions:
• Observations are sparse in real-life applications.
• The score of an optimal relaxed match is not less than the score of an optimal match.
• The score of an optimal relaxed alignment is not less than the score of an optimal relaxed l-alignment.
This upper bound can be evaluated efficiently by exploiting an inverted index if the match score is the Cosine or Extended Jaccard Coefficient.
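Why sparseness helps can be seen from a small inverted-index sketch. It is not the authors' index-based bound, only an illustration of the underlying observation: under Cosine or Jaccard-style scores, two observations with no item in common score 0, so a history that shares no item with the query cannot contribute a non-zero match. The function names and the dictionary layout are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(histories):
    """Map each item to the set of history ids whose observations contain it."""
    index = defaultdict(set)
    for hid, history in histories.items():
        for observation in history:
            for item in observation:
                index[item].add(hid)
    return index

def candidate_histories(index, query):
    """Ids of histories sharing at least one item with some observation of the query."""
    items = set().union(*query)
    found = set()
    for item in items:
        found |= index.get(item, set())
    return found

# The three patient histories from the earlier table.
histories = {
    "h1": [{"a", "b"}, {"a", "b", "c"}, {"f", "g"}, {"h", "i"}],
    "h2": [{"a", "b", "e"}, {"b", "c", "z"}, set(), {"f", "g", "h"}],
    "h3": [{"f", "g", "h"}, {"a", "b", "c"}, {"f", "g", "h"}, {"b", "c"}],
}
index = build_inverted_index(histories)
only_h2 = candidate_histories(index, [{"z"}, {"e"}])
nobody = candidate_histories(index, [{"x"}])
```

Only the posting lists of the query's items are read, so histories outside the candidate set are never accessed.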
Experiments
• Experiments performed on an AMD/XP 2600 with 512 MB RAM
• Datasets:
– DBLP
– Synth1: Our synthetic data
– Synth2: Modified IBM synthetic data generator
• Investigated:
– Effectiveness of similarity measure
– Efficiency of our approach: pruning power, running time, scalability
Effectiveness of our Similarity Measure
Synth2 dataset contains:
• 20,000 histories
• $\lambda$ for each history is selected randomly from {1,…,10}
• Length of histories: {32,…,64}
i-th observation: document modeled as bit string V(i)
– First observation V(1): randomly selected
– V(i+1): bit string following V(i) in a pre-determined order
– Poisson($\lambda$): Poisson distribution [Cho et al. VLDB 2000]
[Figure: sequence of bit strings V(1), …, V(i), V(i+1), …, V(n)]
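Effectiveness of our Similarity Measure (cont.)
Mean deviation of $\lambda_{r_i}$ from $\lambda_q$ for k-NN queries*, where $r_i$ is the i-th result for query $q$ and $\lambda$ is the Poisson parameter of a history:
$MD(q, k) = \frac{\sum_{i=1}^{k} |\lambda_q - \lambda_{r_i}|}{k}$
* For 2,000 randomly generated queries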
Pruning Power vs. k
[Figure: fraction of the database examined (0 to 100) vs. no. of neighbours in the k-NN query (log scale, 1 to 1024), for Naive scan, Prune by Length, General UB, and Index-based UB.]
Running Time vs. k
Dataset: Synth2, 8,000 Histories, 1,000 items
[Figure: time in msec (0 to 600) vs. no. of neighbours in the k-NN query (log scale, 1 to 1024), for Naive scan, Prune by Length, General UB, and Index-based UB.]
Scalability for 1-NN queries
[Figure: time in msec (0 to 3,500) vs. no. of histories in the collection (8,000, 16,000, 32,000, 64,000), for Naive scan, General UB, and Index-based UB.]
Running time vs. Sparseness of Observations
[Figure: time in msec vs. no. of items in store (log scale; ticks 256, 512, 1,024, 2,048, 4,096, 8,092), for Naive scan, General UB, and Index-based UB.]
Conclusions
• Introduced a domain-independent framework to formulate and evaluate similarity queries over historical data.
• Generalized a few concepts, including edit distance and longest common subsequence, to histories.
• Developed upper bounds to efficiently evaluate queries. One of our upper bounds can directly take advantage of an index even though the similarity measure is not metric.
• Our experiments confirm the effectiveness and efficiency of our approach.
Related Works
• Detecting, representing, querying histories
– [Chawathe 1998], [Chien 2001]
• Similarity-based sequence matching
– [Altschul 1990], [Pearson 1990], [Bieganski 1994]
• Finding similar sequences of events
– [Wang 2003]
• Finding similar time series
– [Agrawal 1995], [Rafiei 1997], [Keogh 2002], [Vlachos 2002, 2003], ...