Notes 06: Efficient Fuzzy Search
Professor Chen Li
Department of Computer Science
UC Irvine
CS122B: Projects in Databases and Web Applications Spring 2015
Example: a movie database

Star             Title            Year  Genre
Keanu Reeves     The Matrix       1999  Sci-Fi
Samuel Jackson   Iron Man         2008  Sci-Fi
Schwarzenegger   The Terminator   1984  Sci-Fi
Samuel Jackson   The Man          2006  Crime

Find movies starring "Schwarrzenger" (the query is misspelled).
Problem definition: approximate string searches
Query q: Schwarrzenger
Collection of strings s (Star): Keanu Reeves, Samuel Jackson, Schwarzenger, …
Search output: strings s that satisfy Sim(q,s) ≤ δ
  (≤ applies to a distance such as edit distance; a similarity such as Jaccard or cosine is thresholded with ≥ instead)
Sim functions: edit distance, Jaccard coefficient, and cosine similarity
Similarity Functions
Similar to: a domain-specific function that returns a similarity value between two strings
Examples:
  Edit distance
  Hamming distance
  Jaccard similarity
  Soundex
  TF/IDF, BM25, DICE
Edit Distance
A widely used metric to define string similarity
ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1,s2) = 2
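
The definition above can be computed with the standard dynamic program over prefixes of the two strings. The following is a minimal sketch, not code from the lecture; the function name `edit_distance` is an illustrative choice.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string: i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

Only two rows of the table are kept, so memory is proportional to the length of `s2` while time stays proportional to the product of the two lengths.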
State-of-the-art: Oracle 10g and older versions
Supported by Oracle Text

  CREATE TABLE engdict(word VARCHAR(20), len INT);

Create preferences for text indexing:

  begin
    ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
    ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
  end;
  /

  CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
    INDEXTYPE IS ctxsys.context
    PARAMETERS ('Wordlist STEM_FUZZY_PREF');

Usage:

  SELECT * FROM engdict
  WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;

Limitation: cannot handle errors in the first letters:
  Katherine versus Catherine
Microsoft SQL Server
Data cleaning tools available in SQL Server 2005
  Part of Integration Services
  Supports fuzzy lookups
  Uses data flow pipeline of transformations
  Similarity function: tokens with TF/IDF scores
Lucene
Uses Levenshtein distance (edit distance)
Example: roam~0.8
Prefix pruning followed by a scan (efficiency?)
Outline
  Gram-based approaches
  Trie-based approaches
String Grams
q-grams: the substrings of length q
For example, the 2-grams of "universal":
(un),(ni),(iv),(ve),(er),(rs),(sa),(al)
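
Extracting q-grams is a sliding window over the string. A minimal sketch, with `qgrams` as an illustrative name and no padding characters at the ends (the slide's example does not use any):

```python
def qgrams(s: str, q: int = 2) -> list[str]:
    """Return the overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 grams, so strings shorter than q produce none.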
Inverted lists
Convert strings to gram inverted lists

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-gram  ids
at      4
ch      0, 2
ck      1, 3
ic      0, 1, 2, 4
ri      0
st      1, 2, 3, 4
ta      4
ti      1, 2, 4
tu      3
uc      3
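
The conversion from strings to gram inverted lists can be sketched as follows. The helper names `qgrams` and `build_inverted_index` are illustrative, and each string id is recorded at most once per gram, which is an assumption about how repeated grams are handled.

```python
from collections import defaultdict


def qgrams(s: str, q: int = 2) -> list[str]:
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_inverted_index(strings: list[str], q: int = 2) -> dict[str, list[int]]:
    """Map each q-gram to the ascending list of ids of the strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):   # ids are visited in ascending order,
        for gram in set(qgrams(s, q)):  # so every list ends up sorted
            index[gram].append(sid)
    return dict(index)


index = build_inverted_index(["rich", "stick", "stich", "stuck", "static"])
for gram in sorted(index):
    print(gram, index[gram])
```

Keeping every list in ascending id order is what the merge algorithms later in the notes rely on.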
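Main Example
Query: stick, with ed(s,q) ≤ 1
Query grams: (st,ti,ic,ck)

Data
id  string
0   rich
1   stick
2   stich
3   stuck
4   static

Grams and their inverted lists
ck  1, 3
ic  0, 1, 2, 4
st  1, 2, 3, 4
ta  4
ti  1, 2, 4
…

Merge the lists of the query grams: the candidates are the ids with count >= 2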
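
The whole pipeline of this example (gram lookup, count filter, verification) can be sketched end to end. The threshold used here is the usual count-filter bound T = (number of query grams) - k*q, which gives 4 - 1*2 = 2 for this query and matches the "count >= 2" on the slide. The function names are illustrative, and the counting step is a plain pass over the lists rather than one of the merge algorithms discussed next.

```python
from collections import Counter


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def fuzzy_search(strings, query, k, q=2):
    """Return (candidate ids, answer ids) for ed(s, query) <= k."""
    index = {}
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index.setdefault(gram, []).append(sid)

    query_grams = qgrams(query, q)
    threshold = len(query_grams) - k * q
    if threshold <= 0:
        # The count filter cannot prune anything; fall back to checking every string.
        candidates = list(range(len(strings)))
    else:
        counts = Counter(sid for g in query_grams for sid in index.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)

    # Candidates are only a superset of the answers, so verify each one.
    answers = [sid for sid in candidates if edit_distance(strings[sid], query) <= k]
    return candidates, answers


data = ["rich", "stick", "stich", "stuck", "static"]
print(fuzzy_search(data, "stick", k=1))  # ([1, 2, 3, 4], [1, 2, 3])
```

Here "static" (id 4) shares three grams with the query and passes the filter, but verification rejects it because its edit distance to "stick" exceeds 1.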
Problem definition:
Merge the inverted lists, each sorted in ascending order
Find elements whose occurrences ≥ T
Example
T = 4
Inverted lists (ascending order):
  1, 3, 5, 10, 13
  10, 13, 15
  5, 7, 13
  13
  15
Result: 13
Five Merge Algorithms
  HeapMerger [Sarawagi, SIGMOD 2004]
  MergeOpt [Sarawagi, SIGMOD 2004]
  ScanCount
  MergeSkip
  DivideSkip
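Heap-based Algorithm
Push the current head of each list to a min-heap
Count # of the occurrences of each element by popping equal elements from the heap
After each pop, push the next element of the same list to the heap ……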
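
A minimal sketch of the heap-based merge, assuming each list is sorted in ascending order and holds no repeated ids. The name `heap_merge` and the example lists (read from the T = 4 example slide, whose layout was partly lost in extraction) are assumptions.

```python
import heapq


def heap_merge(lists, T):
    """Return the elements that occur on at least T of the sorted lists."""
    # Each heap entry is (value, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every copy of the smallest value and advance those lists.
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= T:
            results.append(value)
    return results


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(heap_merge(lists, T=4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.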
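MergeOpt Algorithm
Long lists: the T-1 longest lists
Short lists: the remaining lists
Merge the short lists, then look up each merged element in the long lists by binary search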
Example of MergeOpt [Sarawagi et al 2004]
Count threshold T ≥ 4
Long lists: 3
  1, 3, 5, 10, 13
  10, 13, 15
  5, 7, 13
Short lists: 2
  13
  15
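
The long/short split can be sketched as below. This is a simplified reading of MergeOpt rather than the published implementation: it merges the short lists with `heapq.merge`, looks up every merged element in all T-1 long lists, and does not stop a lookup early. The names `merge_opt` and `contains` are illustrative.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def contains(sorted_list, value):
    """Binary search for value in an ascending list."""
    i = bisect_left(sorted_list, value)
    return i < len(sorted_list) and sorted_list[i] == value


def merge_opt(lists, T):
    """Return the elements that occur on at least T of the sorted lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    results = []
    # An answer needs T occurrences but only T-1 lists are long,
    # so every answer shows up on at least one short list.
    for value, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)
        count += sum(contains(lst, value) for lst in long_lists)
        if count >= T:
            results.append(value)
    return results


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_opt(lists, T=4))  # [13]
```

With T = 4 only the two one-element lists are scanned; the three long lists are touched only through binary searches for 13 and 15.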
ScanCount Example
Count threshold T ≥ 4
Inverted lists:
  1, 3, 5, 10, 13
  10, 13, 15
  5, 7, 13
  13
  15
Keep an array with the # of occurrences of every string id (1, 2, 3, …, 13, 14, 15), initially 0
Scan each list and increment the count of each id on it by 1
# of occurrences after the scan: id 13 → 4, id 14 → 0, id 15 → 2
Result: 13
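
ScanCount needs no heap at all, only an array of counters indexed by string id. A minimal sketch, where `scan_count` and the `num_strings` parameter (the size of the id space) are illustrative names, and each list is assumed to hold an id at most once:

```python
def scan_count(lists, T, num_strings):
    """Return the ids that occur on at least T lists, using one counter per id."""
    counts = [0] * num_strings
    results = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1       # increment by 1
            if counts[sid] == T:   # report an id the moment it reaches T
                results.append(sid)
    return sorted(results)


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(scan_count(lists, T=4, num_strings=16))  # [13]
```

The scan reads every list element once, and the counter array costs memory proportional to the number of strings in the collection.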
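MergeSkip algorithm
Keep the current head of each list in a min-heap
If the top element occurs fewer than T times, pop T-1 elements from the heap
Jump: on each popped list, move to the smallest element greater than or equal to the new top of the heap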
Example of MergeSkip
Count threshold T ≥ 4
[Figure: a min-heap over the heads of the inverted lists; the popped lists jump directly to their first element greater than or equal to the new heap top.]
Skip is safe
A skipped element is smaller than the new top of the min-heap, so it can appear only on the popped lists
# of occurrences of skipped elements ≤ T-1
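
The pop-and-jump steps can be sketched as follows, assuming ascending lists without repeated ids. The function name `merge_skip` is illustrative, the jump uses `bisect_left` as the binary search, and the example lists are the ones read from the earlier T = 4 example.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return the elements that occur on at least T of the sorted lists."""
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap))
        if len(popped) >= T:
            results.append(value)
            for _, i, pos in popped:  # advance each list by one element
                if pos + 1 < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
            continue
        # value cannot be an answer: pop more heads until T-1 are out.
        while heap and len(popped) < T - 1:
            popped.append(heapq.heappop(heap))
        if not heap:
            break  # fewer than T lists are still live, so no further answers
        target = heap[0][0]
        for _, i, pos in popped:
            # Jump to the smallest element >= target on the popped list.
            new_pos = bisect_left(lists[i], target, pos)
            if new_pos < len(lists[i]):
                heapq.heappush(heap, (lists[i][new_pos], i, new_pos))
    return results


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(merge_skip(lists, T=4))  # [13]
```

On this input the first round pops 1, 5, and 10, sees 13 on top of the heap, and jumps all three popped lists straight to 13, so 3 and 7 are never inspected.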
DivideSkip Algorithm
Short lists: MergeSkip
Long lists: binary search
How many lists are treated as long lists?
Short lists: merge
Long lists: lookup
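
The division into a merge phase and a lookup phase can be sketched as below. Two things here are assumptions rather than content from the slides: the number of long lists is simply passed in as the hypothetical parameter `num_long` (the slide leaves the choice open), and the short-list merge is a plain counting pass standing in for MergeSkip, so the sketch shows the structure of DivideSkip, not its performance.

```python
from bisect import bisect_left
from collections import Counter


def divide_skip(lists, T, num_long):
    """Return the elements on at least T lists, treating num_long lists as long."""
    if not 0 <= num_long < T:
        raise ValueError("num_long must satisfy 0 <= num_long < T")
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:num_long], by_length[num_long:]

    # Merge phase (stand-in for MergeSkip): count occurrences on the short lists.
    short_counts = Counter(v for lst in short_lists for v in lst)

    results = []
    for value, count in sorted(short_counts.items()):
        # An answer needs at least T - num_long occurrences on the short lists.
        if count < T - num_long:
            continue
        # Lookup phase: binary search the candidate in each long list.
        for lst in long_lists:
            i = bisect_left(lst, value)
            if i < len(lst) and lst[i] == value:
                count += 1
        if count >= T:
            results.append(value)
    return results


lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13], [15]]
print(divide_skip(lists, T=4, num_long=2))  # [13]
```

More long lists mean fewer elements to merge but more binary searches per candidate, which is the trade-off behind the question on the slide.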
Performance (DBLP)
DivideSkip is the best one
Trie-Based Approach
Trie Indexing
Strings: exam, example, exemplar, exempt, sample
[Figure: trie over the strings; each path from the root to a $ node spells one string.]
Active nodes on Trie
Query: "example"
Edit-distance threshold = 2

Prefix    Distance
examp     2
exampl    1
example   0
exempl    2
exempla   2
sample    2
Initialization
Q = ε
Initial active nodes: all nodes within depth δ

Prefix  Distance
ε       0
e       1
ex      2
s       1
sa      2
Incremental Algorithm
Return leaf nodes as answers.
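
One way to maintain the active nodes one query character at a time is sketched below. It follows the usual update rules (delete the new character, substitute it with a child's character, or match a child and then extend into its descendants); the dict-based trie and all function names are illustrative assumptions, and strings are assumed not to contain the `$` marker.

```python
END = "$"  # marks the end of a complete string, as in the trie figure


def build_trie(strings):
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def _add(active, prefix, node, dist, delta):
    # Keep the smallest distance seen for each trie node (identified by its prefix).
    if dist <= delta and (prefix not in active or dist < active[prefix][0]):
        active[prefix] = (dist, node)


def _add_subtree(active, prefix, node, dist, delta):
    # Add the node and its descendants; each extra level costs one insertion.
    _add(active, prefix, node, dist, delta)
    if dist < delta:
        for ch, child in node.items():
            if ch != END:
                _add_subtree(active, prefix + ch, child, dist + 1, delta)


def initial_active_nodes(root, delta):
    active = {}
    _add_subtree(active, "", root, 0, delta)  # all nodes within depth delta
    return active


def extend(active, x, delta):
    """Active nodes after appending character x to the query."""
    new_active = {}
    for prefix, (dist, node) in active.items():
        _add(new_active, prefix, node, dist + 1, delta)  # x is deleted
        for ch, child in node.items():
            if ch == END:
                continue
            if ch == x:  # x matches this child
                _add_subtree(new_active, prefix + ch, child, dist, delta)
            else:        # x is substituted by ch
                _add(new_active, prefix + ch, child, dist + 1, delta)
    return new_active


def fuzzy_lookup(root, query, delta):
    active = initial_active_nodes(root, delta)
    for x in query:
        active = extend(active, x, delta)
    distances = {p: d for p, (d, _) in active.items()}
    answers = sorted(p for p, (_, node) in active.items() if END in node)
    return distances, answers


trie = build_trie(["exam", "example", "exemplar", "exempt", "sample"])
print(fuzzy_lookup(trie, "example", delta=2)[1])  # ['example', 'sample']
```

Because each step only needs the previous set of active nodes, the same state can be kept between keystrokes, which is what makes searching as the user types possible.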
Good and bad
Advantages:
  Trie size is small
  Can do search as the user types
Disadvantages:
  Works for edit distance only
References
1. Efficient Merging and Filtering Algorithms for Approximate String Searches, Chen Li, Jiaheng Lu, and Yiming Lu, ICDE 2008
2. Efficient Interactive Fuzzy Keyword Search, Shengyue Ji, Guoliang Li, Chen Li, and Jianhua Feng, WWW 2009