An Overview of Similarity Query Processing
2014. 2. 26.
Jongik Kim (김종익), Division of Computer Science and Engineering, Chonbuk National University (전북대학교)
Table of Contents
01. Applications of similarity query processing
02. Problem Formulation
03. String Decomposition
04. Similarity Function
05. A naïve approach
06. Overlap Similarity
07. Similarity Query Processing with Inverted lists
08. Similarity Function Revisited
09. Filter and Verification Framework
10. Prefix Filtering based Approach
11. Exploiting Document Frequency Ordering
Some examples and figures in this presentation are taken from the following materials:
Marios Hadjieleftheriou and Chen Li, Efficient Approximate Search on String Collections (tutorial), ICDE 2009 and VLDB 2009
Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu, Efficient Similarity Joins for Near Duplicate Detection, WWW 2008 (slide)
Jongik Kim and Hongrae Lee, Efficient Exact Similarity Searches using Multiple Token Orderings, ICDE 2012 (slide)
Applications of similarity query processing (1/8)
Web Search
Actual queries gathered by Google
Applications of similarity query processing (2/8)
Data Integration and Data Cleaning
[Figure annotation: should be “Niels Bohr”]
Table R (rows): informix, microsoft, …
Table S (rows): infromix, …, mcrosoft, …
Applications of similarity query processing (3/8)
Duplicate (Web) Document Detection
Applications of similarity query processing (4/8)
Identify Spam
SPAM TEMPLATE
Sir/Madam, We happily announce to you the draw of the EURO MILLIONS SPANISH LOTTERY INTERNATIONAL WINNINGS PROGRAM PROMOTIONS held on the 27TH MARCH 2008 in SPAIN. Your company or your personal e-mail address attached to ticket number 653-908-321-675 with serial main number <NUMBER> drew lucky star winning numbers <NUMBER> which consequently won in the 2ND category, you have therefore been approved for a lump sum pay out of 960.000.00 Euros. (NINE HUNDRED AND SIXTY THOUSAND EUROS). CONGRATULATIONS!!!
Sincerely yours, <NAME> <AFFILIATION>
Applications of similarity query processing (5/8)
Detect Plagiarism
Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a unique check disk has been eliminated. 2. Read requests have a higher level of parallelism. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a dedicated check disk the check disk never participates in read.
Q. What are the advantages of RAID5 over RAID4? A. 1. Several write requests could be processed in parallel, since the bottleneck of a single check disk has been eliminated. 2. Read requests have a higher level of parallelism on RAID5. Since the data is distributed over all disks, read requests involve all disks, whereas in systems with a check disk the check disk never participates in read.
Applications of similarity query processing (6/8)
Recommendation of friends in an SNS service
The friends of a person can be represented as a binary vector.
Friends vector: 1 0 0 1 1 0 0 1
Friends vector: 1 0 0 1 1 1 0 1
Applications of similarity query processing (7/8)
Read (a fragment of a genome sequence) Alignment
Reference sequence:
GCTGATGTGCCGCCTCACTCCGGTGG …
Short reads:
CACTCCTGTGG
CTCACTCCTGTGG
GCTGATGTGCCACCTCA
GATGTGCCACCTCACTC
GTGCCGCCTCACTCCTG
CTCCTGTGG
Applications of similarity query processing (8/8)
Query Relaxation
Supported by Oracle Text
    CREATE TABLE engdict(word VARCHAR(20), len INT);
Create preferences for text indexing:
    begin
      ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
    end;
    /
    CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
      INDEXTYPE IS ctxsys.context PARAMETERS ('Wordlist STEM_FUZZY_PREF');
Usage:
    SELECT * FROM engdict
    WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;
Limitation: cannot handle errors in the first letters, e.g., Katherine versus Catherine
Problem Formulation (1/2)
Find strings similar to a given string
Problem Formulation (2/2)
Similar to: a domain-specific function returns a similarity value between two strings
Common similarity functions: Jaccard coefficient, cosine similarity, Dice similarity, edit distance
These functions require set data
String Decomposition
Word tokens for long strings (e.g., web pages)
x = “yes as soon as possible”
y = “as soon as possible please”
word:  yes  as  soon  as1  possible  please
token: A    B   C     D    E         F
(as1 denotes the second occurrence of “as”)
x = {A, B, C, D, E}
y = {B, C, D, E, F}
q-gram tokens for short strings (e.g., keyword queries)
x = “universal”
G(x, 2) = {un, ni, iv, ve, er, rs, sa, al}
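The two decompositions can be sketched in a few lines of Python. This is only an illustration: the function names `word_tokens` and `qgrams` are made up here, and numbering repeated words (as, as1) is an assumption read off the word/token table of this slide.

```python
from collections import Counter


def word_tokens(text):
    """Split on whitespace; number repeated words so the result is a set."""
    seen = Counter()
    tokens = set()
    for word in text.split():
        seen[word] += 1
        # The first occurrence keeps the bare word; later ones get a suffix (as -> as1).
        tokens.add(word if seen[word] == 1 else f"{word}{seen[word] - 1}")
    return tokens


def qgrams(text, q):
    """Return all overlapping substrings of length q, in order."""
    return [text[i:i + q] for i in range(len(text) - q + 1)]


x = word_tokens("yes as soon as possible")
y = word_tokens("as soon as possible please")
shared = x & y                      # tokens B, C, D, E of the slide
grams = qgrams("universal", 2)      # G(x, 2)
print(sorted(shared), grams)
```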
Similarity Function
x = {A, B, C, D, E}, y = {B, C, D, E, F}
Jaccard similarity: $J(x, y) = \frac{|x \cap y|}{|x \cup y|} = \frac{4}{6} \approx 0.67$
Cosine similarity: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x|\,|y|}} = \frac{4}{5} = 0.8$
Dice similarity: $D(x, y) = \frac{2\,|x \cap y|}{|x| + |y|} = \frac{8}{10} = 0.8$
Edit distance: ED(x, y) = minimum number of edit operations (insertion, deletion, substitution) needed to change x into y
x: Tom Hanks, y: Ton Hank, ED(x, y) = 2
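The four functions on this slide can be checked with a short Python sketch. The function names are illustrative, and the edit distance uses the standard dynamic-programming formulation, which the slide itself does not spell out.

```python
import math


def jaccard(x, y):
    return len(x & y) / len(x | y)


def cosine(x, y):
    return len(x & y) / math.sqrt(len(x) * len(y))


def dice(x, y):
    return 2 * len(x & y) / (len(x) + len(y))


def edit_distance(a, b):
    """Levenshtein distance, computed one row of the DP table at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


x = {"A", "B", "C", "D", "E"}
y = {"B", "C", "D", "E", "F"}
print(round(jaccard(x, y), 2), cosine(x, y), dice(x, y))
print(edit_distance("Tom Hanks", "Ton Hank"))
```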
A naïve approach
Given a collection of strings C, a query string x, and a threshold t of a similarity function sim:
1. Decompose each string in C and the query string into tokens.
2. Output those strings y ∈ C such that sim(x, y) ≥ t.
Every string in C must be compared with the query, so the cost grows linearly with |C|; this is inefficient for large collections.
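A minimal sketch of the naïve approach, assuming 2-gram token sets and Jaccard similarity. The helper names, the four sample strings (borrowed from the inverted-list example later in the deck), and the threshold 0.5 are all chosen here for illustration.

```python
def qgram_set(s, q=2):
    """Token set of a string: all overlapping substrings of length q."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def jaccard(x, y):
    return len(x & y) / len(x | y)


def naive_search(collection, query, t):
    """Compare the query against every string; cost is linear in len(collection)."""
    q_tokens = qgram_set(query)
    return [s for s in collection if jaccard(q_tokens, qgram_set(s)) >= t]


collection = ["area", "artisan", "artist", "tisk"]
answers = naive_search(collection, "artist", 0.5)
print(answers)
```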
Overlap Similarity (1/2)
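Overlap similarity: $O(x, y) = |x \cap y|$ (4 for the example sets x and y)
Given a similarity threshold t:
Jaccard: $J(x, y) = \frac{|x \cap y|}{|x \cup y|} = \frac{O(x, y)}{|x| + |y| - O(x, y)} \ge t \iff O(x, y) \ge \frac{t}{1 + t}\,(|x| + |y|)$
Cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x|\,|y|}} \ge t \iff O(x, y) \ge t\,\sqrt{|x|\,|y|}$
Dice: $D(x, y) = \frac{2\,|x \cap y|}{|x| + |y|} \ge t \iff O(x, y) \ge \frac{t\,(|x| + |y|)}{2}$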
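As a quick check of these conversions, the following sketch turns a similarity threshold into the smallest integer overlap that satisfies it. The function names and the small epsilon that guards against floating-point round-off are assumptions of this sketch, not part of the slides.

```python
import math


def _ceil(value):
    # Subtract a tiny epsilon so that values such as 4.000000000000001 round to 4.
    return math.ceil(value - 1e-9)


def jaccard_overlap(t, nx, ny):
    return _ceil(t / (1 + t) * (nx + ny))


def cosine_overlap(t, nx, ny):
    return _ceil(t * math.sqrt(nx * ny))


def dice_overlap(t, nx, ny):
    return _ceil(t * (nx + ny) / 2)


# |x| = |y| = 5, as in the running example x = {A..E}, y = {B..F}.
print(jaccard_overlap(0.6, 5, 5), cosine_overlap(0.8, 5, 5), dice_overlap(0.8, 5, 5))
```

For two sets of size 5 and a Jaccard threshold of 0.6, an overlap of 4 gives J = 4/6 and passes, while an overlap of 3 gives J = 3/7 and fails, which agrees with the computed bound.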
Overlap Similarity (2/2)
Given an edit distance threshold d:
$ED(x, y) \le d \implies O(G(x, q), G(y, q)) \ge \max(|G(x, q)|, |G(y, q)|) - d \cdot q$
d edit operations can affect at most d × q q-grams; that is, d edit operations on x can mutate at most d × q q-grams of x.
x = “universal” and G(x, 2) = {un, ni, iv, ve, er, rs, sa, al}
2 edit operations on x mutate at most 2 × 2 = 4 q-grams.
Hence, y should contain at least |G(x, 2)| − 2 × 2 = 4 q-grams in G(x, 2).
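The count bound can be used as a cheap filter before computing the edit distance. The sketch below is one possible reading of it: the function names and the two test strings “universe” and “unusual” are illustrative, and q-grams are counted as a multiset so that repeated q-grams are handled.

```python
from collections import Counter


def qgrams(s, q):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def overlap(a, b):
    """Multiset overlap of two q-gram lists."""
    return sum((Counter(a) & Counter(b)).values())


def may_be_within(x, y, d, q):
    """False means ED(x, y) > d for sure; True means y is still a candidate."""
    gx, gy = qgrams(x, q), qgrams(y, q)
    return overlap(gx, gy) >= max(len(gx), len(gy)) - d * q


# "universe" is 2 edits away from "universal" (substitute a->e, delete l).
print(may_be_within("universal", "universe", 2, 2))
print(may_be_within("universal", "unusual", 2, 2))
```

A string that passes the filter still has to be verified with the real edit distance, because sharing enough q-grams is necessary but not sufficient.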
Similarity Query Processing with Inverted lists
ID  String   Record (token set)
1   area     {ar, re, ea}
2   artisan  {ar, rt, ti, is, sa, an}
3   artist   {ar, rt, ti, is, st}
4   tisk     {ti, is, sk}
…   …        …
Make inverted lists:
an: 2
ar: 1, 2, 3
ea: 1
is: 2, 3, 4
re: 1
rt: 2, 3
sa: 2
sk: 4
st: 3
ti: 2, 3, 4
Query: “artist” = {ar, rt, ti, is, st}, overlap threshold: 4
Merge to count occurrences:
ID 1: 1, ID 2: 4, ID 3: 5, ID 4: 2
Answers of the query: 2: “artisan”, 3: “artist”
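The index construction and the counting merge can be sketched as follows, using the four example strings of this slide. The function names are illustrative, and the merge shown here is a plain scan-and-count, the simplest of several possible merge strategies.

```python
from collections import Counter, defaultdict


def qgram_set(s, q=2):
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def build_inverted_lists(strings):
    """Map each token to the sorted list of IDs of strings that contain it."""
    index = defaultdict(list)
    for sid in sorted(strings):
        for token in qgram_set(strings[sid]):
            index[token].append(sid)
    return index


def count_merge(index, query_tokens, threshold):
    """Scan the lists of the query tokens and count how often each ID occurs."""
    counts = Counter()
    for token in query_tokens:
        for sid in index.get(token, []):
            counts[sid] += 1
    answers = sorted(sid for sid, c in counts.items() if c >= threshold)
    return answers, counts


strings = {1: "area", 2: "artisan", 3: "artist", 4: "tisk"}
index = build_inverted_lists(strings)
answers, counts = count_merge(index, qgram_set("artist"), 4)
print(answers, dict(counts))
```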
Merge Algorithm – HeapMerge
Count threshold: t = 3
A min-heap (minHeap) holds the current front ID of each inverted list. IDs are popped in increasing order, and the number of times an ID is popped is its count.
1: count 2 < t (X)
2: count 3 = t (O)
…
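A minimal HeapMerge sketch using Python's `heapq`. It assumes every inverted list is sorted in increasing ID order and contains each ID at most once; the input lists are the “artist” query lists from the previous slide rather than the ones drawn in this slide's figure, which could not be recovered.

```python
import heapq


def heap_merge(lists, t):
    """Return the IDs that occur in at least t of the sorted input lists."""
    # Each heap entry is (current ID, list number, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        current = heap[0][0]
        count = 0
        # Pop every list whose front element equals the smallest ID.
        while heap and heap[0][0] == current:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= t:
            results.append(current)
    return results


# Lists of ar, rt, ti, is, st for the query "artist".
query_lists = [[1, 2, 3], [2, 3], [2, 3, 4], [2, 3, 4], [3]]
print(heap_merge(query_lists, 4))
```

Because the heap always exposes the smallest unprocessed ID, each ID is counted completely before the next one is touched, so no per-ID counter table is needed.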
Similarity Function Revisited
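To determine the overlap threshold, we need to know the size of y, which varies according to each string in a collection. The following lower bounds depend only on the query x.
Given a query x with a similarity threshold t, for all y:
Jaccard: $J(x, y) = \frac{|x \cap y|}{|x \cup y|} \ge t \implies |y| \ge t\,|x| \implies O(x, y) \ge \frac{t}{1 + t}\,(|x| + |y|) \ge t\,|x|$
Cosine: $C(x, y) = \frac{|x \cap y|}{\sqrt{|x|\,|y|}} \ge t \implies O(x, y) \ge t^2\,|x|$
Dice: $D(x, y) = \frac{2\,|x \cap y|}{|x| + |y|} \ge t \implies O(x, y) \ge \frac{t\,|x|}{2 - t}$
Edit distance: $ED(x, y) \le t \implies O(G(x, q), G(y, q)) \ge |G(x, q)| - t \cdot q$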
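One way to see where the y-independent bounds come from is to use $O(x, y) \le |y|$ and $|x \cup y| \ge |x|$. The derivation below is a sketch reconstructed for this transcript; the slide states only the results.

```latex
\begin{align*}
\text{Jaccard:}\quad
  & O(x,y) \ge t\,|x \cup y| \ge t\,|x| \\[4pt]
\text{Cosine:}\quad
  & |y| \ge O(x,y) \ge t\sqrt{|x|\,|y|}
    \;\Rightarrow\; |y| \ge t^{2}|x|
    \;\Rightarrow\; O(x,y) \ge t\sqrt{|x| \cdot t^{2}|x|} = t^{2}|x| \\[4pt]
\text{Dice:}\quad
  & O(x,y) \ge \tfrac{t}{2}\bigl(|x| + |y|\bigr) \ge \tfrac{t}{2}\bigl(|x| + O(x,y)\bigr)
    \;\Rightarrow\; O(x,y) \ge \frac{t\,|x|}{2 - t}
\end{align*}
```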
Filter and Verification Framework
FILTER: Find those strings that share at least α tokens with the query string, where α is an overlap lower bound.
VERIFICATION: Verify each string found in the filtering stage by directly applying a similarity function.
The filtering stage is further divided into two steps:
FILTER: Quickly generate initial candidates using a minimum constraint.
REFINEMENT: Refine candidates using α.
Prefix Filtering based Approach
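Query x = “artist” {ar, rt, ti, is, st} and overlap threshold α = 4
Inverted lists for the query:
ar: 1, 2, 3
is: 2, 3, 4
rt: 2, 3
st: 3
ti: 2, 3, 4
Sort the tokens by their document frequencies, i.e., sort the lists by their sizes (document frequency ordering):
st: 3
rt: 2, 3
ar: 1, 2, 3
is: 2, 3, 4
ti: 2, 3, 4
Prefix lists: the first |G(x, 2)| – α + 1 lists (here st and rt)
Suffix lists: the remaining α – 1 lists (here ar, is, and ti)
A string that shares at least α tokens with the query must appear in at least one prefix list, because it can miss at most |G(x, 2)| – α of the query's lists.
Filtering phase (the prefix filtering): merge the prefix lists to generate candidates.
Candidates after merging the prefix lists: ID 2 (count 1), ID 3 (count 2)
Refinement phase: search the suffix lists for each candidate. Each candidate is looked up in every suffix list to check whether the list contains it. Binary search is used because suffix lists are usually very long.
Counts after searching ar, is, and ti: ID 2: 2, 3, 4; ID 3: 3, 4, 5
Both candidates reach the threshold α = 4.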
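The filtering and refinement phases can be sketched with `bisect` for the binary search. The function names are made up for this example, and the `index` dictionary simply hard-codes the five inverted lists of the “artist” query.

```python
from bisect import bisect_left


def contains(sorted_list, sid):
    """Binary search for sid in a sorted inverted list."""
    i = bisect_left(sorted_list, sid)
    return i < len(sorted_list) and sorted_list[i] == sid


def prefix_filter_search(index, query_tokens, alpha):
    # Document frequency ordering: shortest lists first.
    lists = sorted((index.get(tok, []) for tok in query_tokens), key=len)
    n_prefix = max(len(lists) - alpha + 1, 0)
    prefix, suffix = lists[:n_prefix], lists[n_prefix:]

    # Filtering phase: merge the prefix lists to generate candidates.
    counts = {}
    for lst in prefix:
        for sid in lst:
            counts[sid] = counts.get(sid, 0) + 1
    candidates = sorted(counts)

    # Refinement phase: probe every suffix list for each candidate.
    for sid in candidates:
        for lst in suffix:
            if contains(lst, sid):
                counts[sid] += 1
    answers = [sid for sid in candidates if counts[sid] >= alpha]
    return answers, candidates, counts


index = {"ar": [1, 2, 3], "rt": [2, 3], "ti": [2, 3, 4], "is": [2, 3, 4], "st": [3]}
answers, candidates, counts = prefix_filter_search(
    index, ["ar", "rt", "ti", "is", "st"], 4)
print(answers, candidates, counts)
```

Only IDs 2 and 3 are ever touched; IDs 1 and 4 never become candidates because they do not occur in the two short prefix lists.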
Exploiting Document Frequency Ordering (1/2)
General goal: minimize the number of candidates initially generated by making use of the document frequency ordering
Query x = “artist” {ar, rt, ti, is, st} and overlap threshold α = 4
Prefix lists: the first |G(x, 2)| – α + 1 lists
Suffix lists: the remaining α – 1 lists
[Figure: the inverted lists of the query split into prefix and suffix lists under two token orderings, one of them the document frequency ordering (prefix: st, rt; suffix: ar, is, ti), with the candidates generated in each case]
Sort the tokens by their document frequencies
We can reduce
1. the time for merging, because the prefix lists are short
2. the number of candidates, and therefore the time for verifying candidates
Exploiting Document Frequency Ordering (2/2)
Query x = {w1, w2} and overlap threshold α = 2
[Figure: without partitioning, w2 is the prefix list and the number of candidates is 5. After partitioning, w2 is the prefix list in one partition (0 candidates) and w1 is the prefix list in the other (0 candidates), so the total number of candidates is 0.]
Observation
By partitioning a data set, we can artificially modify the document frequencies of tokens in each partition.
We evaluate a query in each partition and take the union of the results.
We can reduce the number of candidates by utilizing different token orderings among partitions.
Because partitions have different token orderings, we need to sort the tokens of a query record in each partition.
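The effect of partitioning can be reproduced with toy data. In the sketch below the records are invented so that the candidate counts match the numbers on the slide (5 without partitioning, 0 with it); the figure's real data is not recoverable, and the way the data set is split here is only one hypothetical choice.

```python
def prefix_candidates(records, query, alpha):
    """Candidates produced by the prefix lists under the local frequency ordering."""
    index = {}
    for rid, tokens in records.items():
        for tok in tokens:
            index.setdefault(tok, []).append(rid)
    # Shortest lists first; an absent token has an empty list.
    lists = sorted((index.get(tok, []) for tok in query), key=len)
    n_prefix = max(len(lists) - alpha + 1, 0)
    return {rid for lst in lists[:n_prefix] for rid in lst}


# Toy data: six records contain only w1, five records contain only w2.
records = {i: {"w1"} for i in range(1, 7)}
records.update({i: {"w2"} for i in range(7, 12)})
query, alpha = ["w1", "w2"], 2

whole = prefix_candidates(records, query, alpha)

# Hypothetical split: one partition per dominant token.
part1 = {r: t for r, t in records.items() if "w1" in t}
part2 = {r: t for r, t in records.items() if "w2" in t}
split = prefix_candidates(part1, query, alpha) | prefix_candidates(part2, query, alpha)

print(len(whole), len(split))
```

In the first partition w2 has an empty list and becomes the prefix; in the second partition the same happens with w1, so neither partition generates a candidate.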
Q&A