
Post on 21-Dec-2015


Notes 06: Efficient Fuzzy Search

Professor Chen Li
Department of Computer Science
UC Irvine

CS122B: Projects in Databases and Web Applications, Spring 2015

Example: a movie database

Star            Title           Year  Genre
Keanu Reeves    The Matrix      1999  Sci-Fi
Samuel Jackson  Iron man        2008  Sci-Fi
Schwarzenegger  The Terminator  1984  Sci-Fi
Samuel Jackson  The man         2006  Crime

Find movies starring "Schwarrzenger" (the star's name is misspelled in the query).

Problem definition: approximate string searches

Query q: Schwarrzenger
Collection of strings s (Star): Schwarzenger, Samuel Jackson, Keanu Reeves
Search output: strings s that satisfy Sim(q,s) ≤ δ (for a distance such as edit distance; for a similarity score such as Jaccard the test is Sim(q,s) ≥ δ)
Sim functions: edit distance, Jaccard coefficient, and cosine similarity

Similarity Functions

"Similar to": a domain-specific function returns a similarity value between two strings

Examples:
Edit distance
Hamming distance
Jaccard similarity
Soundex
TF/IDF, BM25, DICE

Edit Distance

A widely used metric to define string similarity
ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2

Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1,s2) = 2
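The edit-distance definition above can be checked with a short dynamic-programming sketch. This is the textbook Levenshtein recurrence with unit costs, not code from the slides; the function name edit_distance is an illustrative choice.

```python
def edit_distance(s1, s2):
    """Unit-cost Levenshtein distance (insertion, deletion, substitution)."""
    # prev[j] = distance between the current prefix of s1 and s2[:j]
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The two operations in the example are the substitution m -> n and the deletion of the final s.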

State-of-the-art: Oracle 10g and older versions

Supported by Oracle Text

CREATE TABLE engdict(word VARCHAR(20), len INT);

Create preferences for text indexing:

begin
  ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
  ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
end;
/

CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
  INDEXTYPE IS ctxsys.context
  PARAMETERS ('Wordlist STEM_FUZZY_PREF');

Usage:

SELECT * FROM engdict

WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;

Limitation: cannot handle errors in the first letters:

Katherine versus Catherine

Microsoft SQL Server

Data cleaning tools available in SQL Server 2005
Part of Integration Services
Supports fuzzy lookups
Uses data flow pipeline of transformations
Similarity function: tokens with TF/IDF scores

Lucene

Using Levenshtein Distance (Edit Distance)
Example: roam~0.8
Prefix pruning followed by a scan (Efficiency?)

Outline

Gram-based approaches
Trie-based approaches

String Grams

q-grams
For example: 2-grams of "universal"

u n i v e r s a l

(un),(ni),(iv),(ve),(er),(rs),(sa),(al)
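As a minimal sketch of the q-gram decomposition, the sliding window below reproduces the 2-grams listed for "universal". The helper name qgrams is an assumption for illustration, and no padding characters are added at the string ends.

```python
def qgrams(s, q=2):
    """Return the overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string of length n has n - q + 1 such grams, which is the quantity the count threshold in gram-based filtering is derived from.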

Inverted lists

Convert strings to gram inverted lists

id  string
0   rich
1   stick
2   stich
3   stuck
4   static

2-gram  inverted list (string ids)
at      4
ch      0, 2
ck      1, 3
ic      0, 1, 2, 4
ri      0
st      1, 2, 3, 4
ta      4
ti      1, 2, 4
tu      3
uc      3
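The conversion from strings to gram inverted lists can be sketched as follows. The function name build_inverted_index is hypothetical, and each string contributes a gram at most once per list, which is a simplifying assumption (a full implementation may track repeated grams).

```python
from collections import defaultdict


def build_inverted_index(strings, q=2):
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        grams = {s[i:i + q] for i in range(len(s) - q + 1)}
        for g in grams:
            index[g].append(sid)  # ids arrive in increasing order, so lists stay sorted
    return dict(index)


index = build_inverted_index(["rich", "stick", "stich", "stuck", "static"])
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are appended in increasing order, every list is already sorted, which the merge algorithms rely on.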

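Main Example

Query: stick (st,ti,ic,ck), ed(s,q) ≤ 1

Data:
id  string
0   rich
1   stick
2   stich
3   stuck
4   static

Grams and their inverted lists:
ck  1,3
ic  0,1,2,4
st  1,2,3,4
ta  4
ti  1,2,4
…

Merge the inverted lists of the query's grams; ids with count ≥ 2 are the candidates, which are then verified against ed(s,q) ≤ 1.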
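The filter-then-verify flow of the main example can be sketched end to end. The threshold uses the standard count-filter bound T = (|q| - n + 1) - k*n for gram length n and edit-distance threshold k, which gives 4 - 2 = 2 for "stick"; the function names and the plain counting merge are assumptions made for illustration.

```python
from collections import Counter


def grams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def fuzzy_search(query, strings, k=1, q=2):
    """Return (candidate ids, answer ids) for ed(s, query) <= k."""
    index = {}
    for sid, s in enumerate(strings):
        for g in set(grams(s, q)):
            index.setdefault(g, []).append(sid)
    threshold = len(grams(query, q)) - k * q
    if threshold <= 0:
        candidates = list(range(len(strings)))  # the filter cannot prune anything
    else:
        # Count how many query grams hit each string id (may overcount, never undercount).
        counts = Counter(sid for g in grams(query, q) for sid in index.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= threshold)
    answers = [sid for sid in candidates if edit_distance(query, strings[sid]) <= k]
    return candidates, answers


strings = ["rich", "stick", "stich", "stuck", "static"]
print(fuzzy_search("stick", strings, k=1))  # ([1, 2, 3, 4], [1, 2, 3])
```

Here "static" passes the count filter but is rejected in verification, since its edit distance to "stick" is 3.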

Problem definition:

Merge the inverted lists, each sorted in ascending order, to find the elements whose number of occurrences is ≥ T.

Example: T = 4

(Figure: sorted inverted lists over the ids 1, 3, 5, 7, 10, 13, 15; only id 13 occurs at least 4 times.)

Result: 13

Five Merge Algorithms

HeapMerger [Sarawagi, SIGMOD 2004]
MergeOpt [Sarawagi, SIGMOD 2004]
ScanCount
MergeSkip
DivideSkip

Heap-based Algorithm

Push the current element of each list to a min-heap
Count # of the occurrences of each element by the heap
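A minimal sketch of the heap-based merge follows, using Python's heapq module. The function name heap_merge and the five input lists are hypothetical: the lists are loosely based on the ids in the example (the exact lists in the figure are not recoverable here) and are chosen so that only id 13 occurs 4 times.

```python
import heapq


def heap_merge(lists, T):
    """Return elements occurring in at least T of the sorted lists."""
    # Heap entries: (element, list index, position inside that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    current, count = None, 0
    while heap:
        value, i, pos = heapq.heappop(heap)
        if value == current:
            count += 1
        else:
            if current is not None and count >= T:
                result.append(current)
            current, count = value, 1
        if pos + 1 < len(lists[i]):  # push the next element of the same list
            heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
    if current is not None and count >= T:
        result.append(current)
    return result


# Hypothetical lists, not the exact ones from the slide figure.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15], [1, 5]]
print(heap_merge(lists, 4))  # [13]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.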

MergeOpt Algorithm

Long Lists: the T-1 longest lists, probed by binary search
Short Lists: the remaining lists, merged first

Example of MergeOpt [Sarawagi et al 2004]

Count threshold T ≥ 4
Long Lists: 3
Short Lists: 2

(Figure: sorted lists over the ids 1, 3, 5, 7, 10, 13, 15, split into 3 long lists and 2 short lists.)
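The long/short split can be sketched as below. This is an illustration of the idea rather than the published implementation: a Counter stands in for the heap merge of the short lists, the name merge_opt is hypothetical, the input lists are made-up values based on the ids in the example, and each list is assumed to contain no duplicates.

```python
from bisect import bisect_left
from collections import Counter


def merge_opt(lists, T):
    """Return elements occurring in at least T of the sorted lists."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:T - 1], by_length[T - 1:]
    # An answer can occur in at most T-1 long lists,
    # so it must occur at least once in the short lists.
    counts = Counter(x for lst in short_lists for x in lst)
    result = []
    for x in sorted(counts):
        total = counts[x]
        for lst in long_lists:
            pos = bisect_left(lst, x)  # binary search in a long list
            if pos < len(lst) and lst[pos] == x:
                total += 1
        if total >= T:
            result.append(x)
    return result


# Hypothetical lists: with T = 4 there are 3 long lists and 2 short lists.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15], [1, 5]]
print(merge_opt(lists, 4))  # [13]
```

Only the short lists are read in full; the long lists are touched once per candidate through binary search.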


ScanCount Example

Count threshold T ≥ 4

(Figure: an array of occurrence counters indexed by string id. Each id read from an inverted list increments its counter by 1; the counter of id 13 reaches 4, so 13 is reported as a result.)
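ScanCount as described above fits in a few lines. The sketch assumes string ids are integers in the range 0 to num_strings - 1, a parameter introduced here for illustration, and reuses hypothetical lists based on the ids in the example.

```python
def scan_count(lists, T, num_strings):
    """Return ids whose counter reaches T after scanning every list once."""
    counts = [0] * num_strings  # one counter per string id
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1      # increment by 1
            if counts[sid] == T:  # report an id the moment it reaches T
                result.append(sid)
    return sorted(result)


# Hypothetical lists, not the exact ones from the slide figure.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15], [1, 5]]
print(scan_count(lists, 4, num_strings=16))  # [13]
```

No heap is needed and the lists do not even have to be sorted; the trade-off is a counter array as large as the string collection.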


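MergeSkip algorithm

Keep the current element of each list in a min-heap
Pop T-1 elements from the heap
Jump: in each popped list, move to the first element greater than or equal to the new top of the heap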

Example of MergeSkip

Count threshold T ≥ 4

(Figure: a min-heap over the sorted lists of ids, with the popped lists jumping ahead past the skipped elements.)

Skip is safe

# of occurrences of skipped elements ≤ T-1, so none of them can reach the threshold T
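The pop-and-jump steps can be sketched as follows. This is one possible reading of MergeSkip written for illustration, not the authors' code; the name merge_skip and the input lists are hypothetical, and each list is assumed to be sorted and free of duplicates.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, T):
    """Return elements occurring in at least T of the sorted lists."""
    pos = [0] * len(lists)  # current position in each list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while len(heap) >= T:  # fewer than T live lists cannot yield an answer
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            result.append(top)
            for i in popped:  # advance each of these lists by one element
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            while len(popped) < T - 1:  # pop until T-1 lists are out of the heap
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]  # new top of the heap
            for i in popped:
                # Jump to the first element >= target; skipped elements occur
                # in at most T-1 lists, so they cannot be answers.
                pos[i] = bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


# Hypothetical lists, not the exact ones from the slide figure.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15], [1, 5]]
print(merge_skip(lists, 4))  # [13]
```

With T = 4 on these lists, the ids 3 and 7 are jumped over and never enter the heap.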


DivideSkip Algorithm

Short Lists: MergeSkip
Long Lists: binary search

How many lists are treated as long lists?

Short Lists: merge
Long Lists: lookup
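A hedged sketch of the DivideSkip combination is shown below: MergeSkip runs on the short lists with a reduced threshold, and each surviving candidate is looked up in the long lists by binary search. The number of long lists is left as the parameter num_long, since the slide poses that choice as an open tuning question; all names and input lists are hypothetical, and ids are assumed to be integers.

```python
import heapq
from bisect import bisect_left


def merge_skip_counts(lists, T):
    """MergeSkip variant returning (element, count) for elements in >= T lists."""
    T = max(T, 1)
    pos = [0] * len(lists)
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    out = []
    while len(heap) >= T:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= T:
            out.append((top, len(popped)))
            target = top + 1  # integer ids: just step past the reported element
        else:
            while len(popped) < T - 1:
                popped.append(heapq.heappop(heap)[1])
            target = heap[0][0]  # jump to the new top of the heap
        for i in popped:
            pos[i] = bisect_left(lists[i], target, pos[i])
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return out


def divide_skip(lists, T, num_long):
    """Return elements occurring in at least T of the sorted lists."""
    by_length = sorted(lists, key=len, reverse=True)
    L = max(0, min(num_long, T - 1))  # keep the short-list threshold at least 1
    long_lists, short_lists = by_length[:L], by_length[L:]
    result = []
    for value, count in merge_skip_counts(short_lists, T - L):
        for lst in long_lists:  # look the candidate up in each long list
            j = bisect_left(lst, value)
            if j < len(lst) and lst[j] == value:
                count += 1
        if count >= T:
            result.append(value)
    return result


# Hypothetical lists, not the exact ones from the slide figure.
lists = [[1, 3, 5, 10, 13], [10, 13, 15], [5, 7, 13], [13, 15], [1, 5]]
print(divide_skip(lists, 4, num_long=2))  # [13]
```

More long lists mean fewer elements merged but more lookups per candidate, which is the trade-off behind the question on the slide.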

Performance (DBLP)

On the DBLP dataset, DivideSkip performs best among the five merge algorithms.

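Trie-Based Approach

Trie Indexing

Strings:
exam
example
exemplar
exempt
sample

(Figure: trie over these strings, one character per node, with $ marking the end of each string; strings with a common prefix share the path for that prefix.)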
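A trie over the five strings can be sketched with a small node class. The names TrieNode, build_trie, and count_nodes are illustrative, and an is_end flag plays the role of the $ marker in the figure.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> child node
        self.is_end = False  # True if a string ends here (the $ marker)


def build_trie(strings):
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True
    return root


def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.children.values())


strings = ["exam", "example", "exemplar", "exempt", "sample"]
root = build_trie(strings)
print(count_nodes(root))  # 21 nodes (including the root) for 31 characters
```

Shared prefixes such as "ex" and "exemp" are stored once, which is why the trie is smaller than the raw strings.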

Active nodes on Trie

Query: "example"
Edit-distance threshold = 2

Prefix    Distance
examp     2
exampl    1
example   0
exempl    2
exempla   2
sample    2
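The active-node table can be reproduced by walking the trie while carrying one row of the edit-distance matrix per node. This is the common row-per-node trie traversal, used here as a stand-in to check the table; it is not the incremental algorithm of the next slides, and the function name active_nodes is hypothetical.

```python
def active_nodes(strings, query, delta):
    """Return {trie prefix: ed(prefix, query)} for prefixes within delta."""
    trie = {}
    for s in strings:  # trie as nested dicts: character -> subtree
        node = trie
        for ch in s:
            node = node.setdefault(ch, {})

    result = {}

    def visit(node, prefix, row):
        # row[j] = edit distance between prefix and query[:j]
        if row[-1] <= delta:
            result[prefix] = row[-1]
        if min(row) > delta:
            return  # row minima never decrease deeper in the trie
        for ch, child in node.items():
            new_row = [row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # skip a query character
                                   row[j] + 1,           # extra trie character
                                   row[j - 1] + cost))   # match or substitute
            visit(child, prefix + ch, new_row)

    visit(trie, "", list(range(len(query) + 1)))
    return result


strings = ["exam", "example", "exemplar", "exempt", "sample"]
for prefix, dist in sorted(active_nodes(strings, "example", 2).items()):
    print(prefix, dist)
```

The six prefixes printed are the ones in the table above; prefixes such as "exam" and "exempt" are at distance 3 from the query and are left out.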

Initialization

Q = ε (empty query)

Prefix    Distance
ε         0
e         1
ex        2
s         1
sa        2

Initial active nodes: all nodes within depth δ

Incremental Algorithm

Return leaf nodes as answers.
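One way to sketch the incremental computation is to start from the initial active nodes and update the set once per typed character, using deletion, substitution, and match transitions. The code follows the general idea in reference 2 but is an illustrative reconstruction, not the authors' implementation; trie nodes are represented by their prefix strings, and all function names are assumptions.

```python
def build_children(strings):
    """Map every trie prefix to the set of characters that can follow it."""
    children = {"": set()}
    for s in strings:
        for i, ch in enumerate(s):
            children[s[:i]].add(ch)
            children.setdefault(s[:i + 1], set())
    return children


def initial_active(children, delta):
    # Empty query: every node within depth delta is active, at distance = depth.
    return {p: len(p) for p in children if len(p) <= delta}


def extend(active, c, children, delta):
    """Active nodes after typing one more character c."""
    new = {}

    def add(prefix, dist):
        if dist <= delta and dist < new.get(prefix, delta + 1):
            new[prefix] = dist

    for prefix, dist in active.items():
        add(prefix, dist + 1)  # deletion: c matches nothing in the trie
        for ch in children[prefix]:
            child = prefix + ch
            if ch != c:
                add(child, dist + 1)  # substitution
            else:
                add(child, dist)      # match
                frontier = [child]    # descendants reachable by extra trie characters
                for depth in range(1, delta - dist + 1):
                    frontier = [f + x for f in frontier for x in children[f]]
                    for f in frontier:
                        add(f, dist + depth)
    return new


def search_as_you_type(strings, query, delta):
    children = build_children(strings)
    active = initial_active(children, delta)
    for c in query:  # one update per keystroke
        active = extend(active, c, children, delta)
    # Answers: strings (leaf nodes) that lie under some active node.
    answers = sorted(s for s in strings if any(s.startswith(p) for p in active))
    return active, answers


strings = ["exam", "example", "exemplar", "exempt", "sample"]
active, answers = search_as_you_type(strings, "example", 2)
print(answers)  # ['example', 'exemplar', 'sample']
```

Each keystroke only touches the previous active set and its nearby descendants, which is what makes search-as-you-type feasible on a trie.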

Good and bad

Advantages:
Trie size is small
Can do search as the user types

Disadvantages:
Works for edit distance only

References

1. Efficient Merging and Filtering Algorithms for Approximate String Searches, Chen Li, Jiaheng Lu, and Yiming Lu, ICDE 2008

2. Efficient Interactive Fuzzy Keyword Search, Shengyue Ji, Guoliang Li, Chen Li, and Jianhua Feng, WWW 2009