
Notes 06: Efficient Fuzzy Search

Professor Chen Li

Department of Computer Science

UC Irvine

CS122B: Projects in Databases and Web Applications Spring 2015

Example: a movie database

Star             Title            Year   Genre
Keanu Reeves     The Matrix       1999   Sci-Fi
Samuel Jackson   Iron man         2008   Sci-Fi
Schwarzenegger   The Terminator   1984   Sci-Fi
Samuel Jackson   The man          2006   Crime

Find movies starring Schwarrzenger.

Problem definition: approximate string searches

Query q: Schwarrzenger

Collection of strings s (Star): Schwarzenger, Samuel Jackson, Keanu Reeves

Search output: strings s that satisfy Sim(q,s) ≤ δ (for a distance such as edit distance; with a similarity such as Jaccard or cosine the condition becomes Sim(q,s) ≥ δ)

Sim functions: edit distance, Jaccard coefficient, and cosine similarity

Similarity Functions

Similar to: a domain-specific function that returns a similarity value between two strings

Examples:
  Edit distance
  Hamming distance
  Jaccard similarity
  Soundex
  TF/IDF, BM25, DICE

Edit Distance

A widely used metric to define string similarity

ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2

Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1,s2) = 2
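The definition above can be sketched with the standard dynamic-programming recurrence. This is a minimal illustration, not code from the lecture; the function name edit_distance is an assumption made here.

```python
def edit_distance(s1, s2):
    """Levenshtein distance, keeping only one row of the DP table at a time."""
    prev = list(range(len(s2) + 1))  # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1 from s1
                curr[j - 1] + 1,           # insert c2 into s1
                prev[j - 1] + (c1 != c2),  # substitute, free when equal
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The slide's example costs one substitution (m to n) and one deletion (the final s).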

State-of-the-art: Oracle 10g and older versions

Supported by Oracle Text

    CREATE TABLE engdict(word VARCHAR(20), len INT);

Create preferences for text indexing:

    begin
      ctx_ddl.create_preference('STEM_FUZZY_PREF', 'BASIC_WORDLIST');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_MATCH','ENGLISH');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_SCORE','0');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','FUZZY_NUMRESULTS','5000');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','SUBSTRING_INDEX','TRUE');
      ctx_ddl.set_attribute('STEM_FUZZY_PREF','STEMMER','ENGLISH');
    end;
    /

    CREATE INDEX fuzzy_stem_subst_idx ON engdict ( word )
      INDEXTYPE IS ctxsys.context
      PARAMETERS ('Wordlist STEM_FUZZY_PREF');

Usage:

    SELECT * FROM engdict
    WHERE CONTAINS(word, 'fuzzy(universisty, 70, 6, weight)', 1) > 0;

Limitation: cannot handle errors in the first letters, e.g., Katherine versus Catherine

Microsoft SQL Server

  Data cleaning tools available in SQL Server 2005
  Part of Integration Services
  Supports fuzzy lookups
  Uses data flow pipeline of transformations
  Similarity function: tokens with TF/IDF scores

Lucene

  Uses Levenshtein distance (edit distance)
  Example: roam~0.8
  Prefix pruning followed by a scan (Efficiency?)

Outline

  Gram-based approaches
  Trie-based approaches

String Grams

q-grams: the overlapping substrings of length q of a string

For example, the 2-grams of "universal":

(un),(ni),(iv),(ve),(er),(rs),(sa),(al)
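A short sketch of gram extraction, reproducing the 2-grams listed above. The function name qgrams is an assumption for illustration, and the sketch does not pad the string with special start or end characters.

```python
def qgrams(s, q=2):
    """Return the overlapping substrings of length q, in position order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


print(qgrams("universal"))
# ['un', 'ni', 'iv', 've', 'er', 'rs', 'sa', 'al']
```

A string shorter than q yields no grams in this sketch.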

Inverted lists

Convert strings to gram inverted lists

  id   string
  0    rich
  1    stick
  2    stich
  3    stuck
  4    static

  2-gram   string ids
  at       4
  ch       0, 2
  ck       1, 3
  ic       0, 1, 2, 4
  ri       0
  st       1, 2, 3, 4
  ta       4
  ti       1, 2, 4
  tu       3
  uc       3
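Building the gram inverted lists for the five strings can be sketched as follows. The names qgrams and build_gram_index are assumptions for illustration, and the sketch records each string at most once per gram, which is a simplification.

```python
from collections import defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_gram_index(strings, q=2):
    """Map each q-gram to the ascending list of ids of strings containing it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):      # ids are assigned in input order
        for gram in set(qgrams(s, q)):     # set(): one posting per string
            index[gram].append(sid)
    return dict(index)


strings = ["rich", "stick", "stich", "stuck", "static"]
index = build_gram_index(strings)
for gram in sorted(index):
    print(gram, index[gram])
```

Because ids are appended in increasing order, every list comes out sorted, which the merge algorithms rely on.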

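Main Example

Query: stick, with grams (st,ti,ic,ck)

Data:

  id   string
  0    rich
  1    stick
  2    stich
  3    stuck
  4    static

Inverted lists of the query grams:

  st   1,2,3,4
  ti   1,2,4
  ic   0,1,2,4
  ck   1,3

For ed(s,q)≤1, keep the string ids with count >= 2. One edit operation changes at most 2 of the 4 query grams, so any string within distance 1 shares at least 4 - 2 = 2 grams with the query.

Candidates: 1,2,3,4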
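The example can be reproduced end to end with a small sketch: count shared grams, keep ids that reach the threshold, then verify the candidates with the real edit distance. The function names and the simple Counter-based counting are assumptions for illustration, not the lecture's implementation.

```python
from collections import Counter, defaultdict


def qgrams(s, q=2):
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def edit_distance(s1, s2):
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        curr = [i]
        for j, c2 in enumerate(s2, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (c1 != c2)))
        prev = curr
    return prev[-1]


def fuzzy_search(query, strings, k=1, q=2):
    """Return (candidate ids, verified ids) for ed(s, query) <= k."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for gram in set(qgrams(s, q)):
            index[gram].append(sid)
    grams = qgrams(query, q)
    t = len(grams) - k * q                  # count threshold T
    if t <= 0:                              # the filter cannot prune anything
        candidates = list(range(len(strings)))
    else:
        counts = Counter(sid for g in grams for sid in index.get(g, []))
        candidates = sorted(sid for sid, c in counts.items() if c >= t)
    verified = [sid for sid in candidates
                if edit_distance(strings[sid], query) <= k]
    return candidates, verified


strings = ["rich", "stick", "stich", "stuck", "static"]
candidates, verified = fuzzy_search("stick", strings, k=1)
print(candidates)  # [1, 2, 3, 4]
print(verified)    # [1, 2, 3]
```

Candidate 4 (static) passes the count filter but fails verification, which is why the filter step is followed by an exact check.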

Problem definition: merge

Given the inverted lists, each sorted in ascending order, find elements whose occurrences ≥ T

Example: T = 4

Result: 13

Five Merge Algorithms

  HeapMerger [Sarawagi et al., SIGMOD 2004]
  MergeOpt [Sarawagi et al., SIGMOD 2004]
  ScanCount
  MergeSkip
  DivideSkip

Heap-based Algorithm

  Push the current element of each list to a min-heap
  Count # of the occurrences of each element by the heap
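A minimal sketch of the heap-based merge, assuming every list is ascending and free of duplicates. The name heap_merge is an assumption for illustration; the sample lists are the inverted lists of the grams of "stick" from the main example.

```python
import heapq


def heap_merge(lists, t):
    """Return the elements that occur on at least t of the sorted lists."""
    # Heap entries are (element, list index, position in that list).
    heap = [(lst[0], i, 0) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        count = 0
        # Pop every copy of the smallest element and advance those lists.
        while heap and heap[0][0] == value:
            _, i, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos + 1], i, pos + 1))
        if count >= t:
            result.append(value)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]
print(heap_merge(lists, 2))  # [1, 2, 3, 4]
```

Every element of every list passes through the heap once, which is the cost the later algorithms try to avoid.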

MergeOpt Algorithm

  Long lists: the T-1 longest lists, probed by binary search
  Short lists: the remaining lists, merged

Example of MergeOpt [Sarawagi et al 2004]

  Count threshold: occurrences ≥ T = 4
  Long lists: 3
  Short lists: 2
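The split into long and short lists can be sketched as below. This is a simplified illustration under the assumptions that t ≥ 1 and that lists are ascending and duplicate-free; the name merge_opt is hypothetical, and heapq.merge stands in for the heap-based merge of the short lists.

```python
import heapq
from bisect import bisect_left
from itertools import groupby


def merge_opt(lists, t):
    """Return elements found on at least t of the sorted lists (t >= 1)."""
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:t - 1], by_length[t - 1:]
    result = []
    # Any answer must appear on a short list: only t-1 lists are long.
    for value, group in groupby(heapq.merge(*short_lists)):
        count = sum(1 for _ in group)
        for lst in long_lists:
            if count >= t:
                break
            pos = bisect_left(lst, value)   # binary search in a long list
            if pos < len(lst) and lst[pos] == value:
                count += 1
        if count >= t:
            result.append(value)
    return result


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]
print(merge_opt(lists, 4))  # [1]
```

With T = 4 and these four lists, three lists are treated as long and only the shortest one is scanned.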


ScanCount Example

[Figure: an array indexed by string id holds the # of occurrences; scanning each list increments the count of every string id on it by 1; an id whose count reaches T is a result.]
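ScanCount as described by the figure can be sketched in a few lines. The name scan_count and the num_strings parameter (the size of the count array) are assumptions for illustration.

```python
def scan_count(lists, t, num_strings):
    """Return ids occurring on at least t lists, using one counter per id."""
    counts = [0] * num_strings
    result = []
    for lst in lists:
        for sid in lst:
            counts[sid] += 1          # increment by 1
            if counts[sid] == t:      # reported once, when T is reached
                result.append(sid)
    return sorted(result)


lists = [[1, 2, 3, 4], [1, 2, 4], [0, 1, 2, 4], [1, 3]]
print(scan_count(lists, 2, num_strings=5))  # [1, 2, 3, 4]
```

The lists need not be sorted for this approach, at the price of an array as large as the string collection.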


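MergeSkip algorithm

  Keep the current element of each list in a min-heap
  When the top element occurs fewer than T times, pop T-1 elements in total
  In each popped list, jump to the first element greater than or equal to the new top of the heap

Example of MergeSkip

[Figure: min-heap with top element 10 and jumps over the lists; the values shown include 13, 15, and 17.]

Skip is safe

  # of occurrences of skipped elements ≤ T-1, so a skipped element cannot be a result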
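A sketch of MergeSkip under the assumptions that lists are ascending and duplicate-free. The name merge_skip and the sample lists are made up for illustration (they are not the lists from the slide figure), and the jump is done with a binary search from the current position.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, t):
    """Return elements found on at least t of the sorted lists."""
    pos = [0] * len(lists)                       # current position per list
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:      # pop all copies of the top
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= t:
            result.append(value)
            for i in popped:                     # advance by one element
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            while heap and len(popped) < t - 1:  # pop T-1 lists in total
                popped.append(heapq.heappop(heap)[1])
            if not heap:                         # fewer than T lists remain
                break
            target = heap[0][0]
            for i in popped:                     # jump to first element >= target
                pos[i] = bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


# Hypothetical lists chosen so that only 13 occurs four times.
lists = [[1, 3, 5, 10, 13, 15], [10, 13, 15], [5, 7, 13], [13], [15, 17]]
print(merge_skip(lists, 4))  # [13]
```

Elements smaller than the new heap top can only sit on the T-1 popped lists, which is why jumping over them loses no answers.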


DivideSkip Algorithm

  Long lists: lookup by binary search
  Short lists: merge with MergeSkip

How many lists are treated as long lists?
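DivideSkip can be sketched by combining the two ideas: MergeSkip over the short lists with a reduced threshold, then binary-search lookups in the long lists. The names are assumptions for illustration, and the number of long lists is left as a caller-chosen parameter (num_long) because the slide poses it as an open question. Here merge_skip returns (element, count) pairs so the lookup step can continue the count.

```python
import heapq
from bisect import bisect_left


def merge_skip(lists, t):
    """MergeSkip sketch returning (element, count) pairs with count >= t."""
    pos = [0] * len(lists)
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value = heap[0][0]
        popped = []
        while heap and heap[0][0] == value:
            popped.append(heapq.heappop(heap)[1])
        if len(popped) >= t:
            result.append((value, len(popped)))
            for i in popped:
                pos[i] += 1
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
        else:
            while heap and len(popped) < t - 1:
                popped.append(heapq.heappop(heap)[1])
            if not heap:
                break
            target = heap[0][0]
            for i in popped:
                pos[i] = bisect_left(lists[i], target, pos[i])
                if pos[i] < len(lists[i]):
                    heapq.heappush(heap, (lists[i][pos[i]], i))
    return result


def divide_skip(lists, t, num_long):
    """Return elements on at least t lists; num_long longest lists are probed."""
    num_long = max(0, min(num_long, t - 1, len(lists)))
    by_length = sorted(lists, key=len, reverse=True)
    long_lists, short_lists = by_length[:num_long], by_length[num_long:]
    result = []
    # An answer needs at least t - num_long occurrences on the short lists.
    for value, count in merge_skip(short_lists, t - num_long):
        for lst in long_lists:
            p = bisect_left(lst, value)
            if p < len(lst) and lst[p] == value:
                count += 1
        if count >= t:
            result.append(value)
    return result


# Hypothetical lists, not the ones from the slide figure.
lists = [[1, 3, 5, 10, 13, 15], [10, 13, 15], [5, 7, 13], [13], [15, 17]]
print(divide_skip(lists, 4, num_long=2))  # [13]
```

Setting num_long to 0 reduces this sketch to plain MergeSkip, and setting it to T-1 gives the MergeOpt-style split.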

Performance (DBLP)

On the DBLP data set, DivideSkip is the best of the five merge algorithms

Trie-Based Approach

Trie Indexing

Strings:
  example
  exemplar
  exempt
  sample

Active nodes on Trie

Query: “example”

Edit-distance threshold = 2

  Prefix     Distance
  examp      2
  exampl     1
  example    0
  exempl     2
  exempla    2
  sample     2

Initialization

Q = ε

  Prefix     Distance
  ε          0

Initial active nodes: all nodes within depth δ

Incremental Algorithm

Return leaf nodes as answers.
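The initialization and the incremental step can be sketched as follows. This is an illustration under stated assumptions, not the exact update rules of the cited paper: it applies the plain edit-distance recurrence over trie nodes and keeps only nodes with distance ≤ δ. The names TrieNode, build_trie, initial_active, extend, and answers are hypothetical, and answers assumes prefix (search-as-you-type) semantics, returning every indexed string at or below an active node.

```python
class TrieNode:
    def __init__(self, prefix=""):
        self.prefix = prefix
        self.children = {}
        self.is_word = False


def build_trie(words):
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = TrieNode(node.prefix + ch)
            node = node.children[ch]
        node.is_word = True
    return root


def initial_active(root, delta):
    """Active nodes for the empty query: all nodes within depth delta."""
    active = {root: 0}
    frontier = [root]
    for depth in range(1, delta + 1):
        frontier = [c for n in frontier for c in n.children.values()]
        for node in frontier:
            active[node] = depth
    return active


def extend(active, ch, delta):
    """Active nodes after the user types one more character ch."""
    new = {}

    def relax(node, dist):
        if dist <= delta and dist < new.get(node, delta + 1):
            new[node] = dist
            return True
        return False

    for node, d in active.items():
        relax(node, d + 1)                        # ch has no counterpart
        for label, child in node.children.items():
            relax(child, d + (label != ch))       # match or substitution
    stack = list(new)
    while stack:                                  # extra trie characters
        node = stack.pop()
        for child in node.children.values():
            if relax(child, new[node] + 1):
                stack.append(child)
    return new


def answers(active):
    """Indexed strings at or below any active node, sorted."""
    found, stack = set(), list(active)
    while stack:
        node = stack.pop()
        if node.is_word:
            found.add(node.prefix)
        stack.extend(node.children.values())
    return sorted(found)


root = build_trie(["example", "exemplar", "exempt", "sample"])
active = initial_active(root, 2)
for ch in "example":
    active = extend(active, ch, 2)
print(sorted((n.prefix, d) for n, d in active.items()))
print(answers(active))  # ['example', 'exemplar', 'sample']
```

After the full query is typed, the active nodes of this sketch match the prefix and distance table shown for the query with threshold 2.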

Good and bad

Advantages:
  Trie size is small
  Can do search as the user types

Disadvantages:
  Works for edit distance only

References

1. Efficient Merging and Filtering Algorithms for Approximate String Searches, Chen Li, Jiaheng Lu, and Yiming Lu. ICDE 2008

2. Efficient Interactive Fuzzy Keyword Search, Shengyue Ji, Guoliang Li, Chen Li, and Jianhua Feng. WWW 2009
