VLDB 2013 RIVA DEL GARDA

Database Group – CSE – UNSW

Efficient Error-tolerant Query Autocompletion

Chuan Xiao (1), Jianbin Qin (2), Wei Wang (2), Yoshiharu Ishikawa (1), Koji Tsuda (3), Kunihiko Sadakane (4)

(1) Nagoya University, Japan
(2) University of New South Wales, Australia
(3) AIST and JST ERATO, Japan
(4) NII, Japan

Presenter: Jianbin Qin


MOTIVATION: ERROR TOLERANT QUERY SUGGESTION


MOTIVATION: ERROR TOLERANT CODE COMPLETION


PRELIMINARY: EDIT DISTANCE PREFIX SEARCH

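[Figure: system overview. A user on a mobile phone or in a browser sends the query string q to the core, an edit distance prefix searcher backed by an index over the target string set S, and receives the result set R.]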

Target string set S = {s1, s2, …, sn}.

Edit distance threshold τ.

Return the result set R containing every string s ∈ S for which ∃ s′ ≼ s with ed(s′, q) ≤ τ, where s′ ≼ s means that s′ is a prefix of s.

Example: τ = 1, q = “abc”, S = {“acdefg”, “cda”, …}. Then R = {“acdefg”}, as ed(“abc”, “ac”) = 1 ≤ τ.

Challenges: the string set S is usually very large, and query response time is critical.
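A minimal brute-force sketch of the problem definition above, intended as a correctness baseline rather than an efficient method. The function names `prefix_edit_distance` and `prefix_search` are illustrative and do not come from the slides.

```python
def prefix_edit_distance(q: str, s: str) -> int:
    """Smallest ed(q, s') over all prefixes s' of s (including the empty prefix)."""
    # prev[j] holds ed(q[:i], s[:j]) for the current i; start with i = 0.
    prev = list(range(len(s) + 1))
    for i, qc in enumerate(q, start=1):
        cur = [i]  # ed(q[:i], "") = i
        for j, sc in enumerate(s, start=1):
            cur.append(min(prev[j] + 1,                # delete qc
                           cur[j - 1] + 1,             # insert sc
                           prev[j - 1] + (qc != sc)))  # match or substitute
        prev = cur
    # prev[j] is now ed(q, s[:j]); the best prefix is the minimum entry.
    return min(prev)


def prefix_search(strings, q, tau):
    """Return every s in strings that has a prefix within edit distance tau of q."""
    return [s for s in strings if prefix_edit_distance(q, s) <= tau]
```

On the slide's example, `prefix_search(["acdefg", "cda"], "abc", 1)` keeps only “acdefg”, because the prefix “ac” is one edit away from “abc” while every prefix of “cda” is at least two edits away.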


EXISTING METHODS (CK09, JLLF09)

Directly index the string set S into a trie.

S:
SID  String
S1   “abcd”
S2   “abdc”
S4   “bcd”

[Figure: trie built over S, with nodes numbered 0 to 9 and leaves S1, S2 and S4. Nodes are marked as ED = 0, ED = 1 or ED > 1 with respect to the current query.]

Simulate the edit distance calculation while traversing the trie.

Example: τ = 1. The user types q = “”, q = “a”, q = “ab”, q = “abc”.

Drawback: too many nodes are tracked during the process: O(|Σ|^τ).
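A simplified sketch of simulating the edit distance calculation while traversing a trie. It recomputes the answer for one fixed q, whereas the cited methods maintain the tracked nodes incrementally per keystroke, so it only illustrates which nodes end up being tracked. The helper names `build_trie` and `active_nodes` are assumptions made for this example.

```python
def build_trie(strings):
    """Plain trie as nested dicts: each node maps a character to its child."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def active_nodes(trie, q, tau):
    """Return the trie prefixes p with ed(p, q) <= tau."""
    result = []

    def visit(node, prefix, row):
        # row[i] = ed(q[:i], prefix)
        if row[-1] <= tau:
            result.append(prefix)
        for ch, child in node.items():
            new = [row[0] + 1]
            for i in range(1, len(row)):
                new.append(min(new[i - 1] + 1,
                               row[i] + 1,
                               row[i - 1] + (q[i - 1] != ch)))
            # The row minimum never decreases with depth, so prune here.
            if min(new) <= tau:
                visit(child, prefix + ch, new)

    visit(trie, "", list(range(len(q) + 1)))
    return result
```

For S = {“abcd”, “abdc”, “bcd”}, q = “abc” and τ = 1, six of the ten trie nodes are within the threshold: “ab”, “abc”, “abcd”, “abd”, “abdc” and “bc”.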


CONTRIBUTION: TRADE SPACE FOR PERFORMANCE

We offer another option: trade space for runtime performance.

Build a deletion variants trie as the index of the error-tolerant prefix searcher. This transforms an edit prefix search problem into an exact prefix search problem.

Index size: up to 20 times larger.

Query speed: up to 1000 times faster.

One server can serve up to 1000 times more users simultaneously.


DELETION VARIANT [T. BOCEK ET AL. 2007]

Deletion neighborhood generation.

⟨x, Dx⟩ is called a variant-list pair; Dx is the deletion list.

V(s, k) is the union of the 0-variant to k-variant list pairs of s. It is called the k-variant family of s.

s = abcd

0-variants: abcd {}
1-variants: bcd {1}, acd {2}, abd {3}, abc {4}
2-variants: cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}

Together these form the 2-variant family of s, V(s, 2).
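A small sketch that reproduces the variant-list pairs listed for s = abcd. It assumes, based on the lists shown, that deletion positions are 1-based, refer to the string after the earlier deletions, and appear in non-decreasing order. The function name `variant_family` is illustrative.

```python
def variant_family(s, k):
    """Return V(s, k) as a set of (x, Dx) pairs, with Dx stored as a tuple."""
    family = {(s, ())}
    frontier = [(s, ())]
    for _ in range(k):
        next_frontier = []
        for x, dx in frontier:
            # Assumption: the next position is never smaller than the previous one.
            start = dx[-1] if dx else 1
            for pos in range(start, len(x) + 1):
                pair = (x[:pos - 1] + x[pos:], dx + (pos,))
                family.add(pair)
                next_frontier.append(pair)
        frontier = next_frontier
    return family
```

`variant_family("abcd", 2)` yields 11 pairs (1 + 4 + 6), including `("bd", (1, 2))` and `("ab", (3, 3))`, which matches the slide.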


DELETION VARIANTS MATCHING PRINCIPLE

Variants Matching Principle: Given two strings s and t, ED(s, t) ≤ τ iff there exist ⟨x, Dx⟩ ∈ V(s, τ) and ⟨y, Dy⟩ ∈ V(t, τ) such that x = y and |Dx ∪ Dy| ≤ τ.

Two conditions need to be satisfied:
1. x = y: identical check. (Efficiently processed with an index.)
2. |Dx ∪ Dy| ≤ τ: deletion list union size check. (No efficient methods.)

s = abcd, V(s, 2):
abcd {}, bcd {1}, acd {2}, abd {3}, abc {4}, cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}

q = abxd, V(q, 2):
abxd {}, bxd {1}, axd {2}, abd {3}, abx {4}, xd {1,1}, bd {1,2}, bx {1,3}, ad {2,2}, ax {2,3}, ab {3,3}


ONE MORE ENUMERATION

q = abxd, V(q, 2):
abxd {}, bxd {1}, axd {2}, abd {3}, abx {4}, xd {1,1}, bd {1,2}, bx {1,3}, ad {2,2}, ax {2,3}, ab {3,3}

Each pair in V(q, 2) is expanded into its enumerated pairs:
abxd {} → abxd {}, abxd {1}, abxd {2}, abxd {3,4}, ……, abxd {4,4}
abd {3} → abd {}, abd {1}, abd {2}, ……, abd {3,3}
……
ab {3,3} → ab {}, ab {3}, ab {3,3}

Highlighted pair: abd {3}.

This is the enumerated 2-variant family of q, EnumV(q, 2).

s = abcd, V(s, 2):
abcd {}, bcd {1}, acd {2}, abd {3}, abc {4}, cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}

Enumerated Variants Matching Principle: Given two strings s and q, ED(s, q) ≤ τ iff there exist ⟨x, Dx⟩ ∈ V(s, τ) and ⟨y, Dy⟩ ∈ EnumV(q, τ) such that x = y and Dx = Dy.


ENCODE AND BUILD INDEX

We then encode ⟨x, Dx⟩ together as one string, marking each deleted position with #: s = “abcd” gives {abcd, #bcd, a#cd, ab#d, abc#, ##cd, #b#d, …}. For example, bd {1,2} is encoded as #b#d.

S:
SID  String
S1   “abcd”
S2   “abdc”
S3   “bcd”

Generate variants:
S1: abcd, #bcd, a#cd, ab#d, abc#, …
S2: abdc, #bdc, a#dc, ab#c, abd#, …
S3: bcd, #cd, b#d, bc#, …

[Figure: trie built over the #-marked variants of S1, S2 and S3, with each leaf labeled by the string it belongs to.]
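A hedged sketch of the encode-and-index step: each data string is expanded into its #-marked variants and all variants are inserted into one trie, so a marked query string can be answered by exact prefix lookup. The node layout and the names `marked_variants`, `build_variant_trie` and `exact_prefix_lookup` are assumptions for illustration; the compression steps evaluated in the experiments are not modeled.

```python
from itertools import combinations

MARK = "#"  # deletion marker, as on the slide


def marked_variants(s, tau):
    """All strings obtained from s by marking at most tau positions as deleted."""
    variants = set()
    for k in range(tau + 1):
        for positions in combinations(range(len(s)), k):
            chars = list(s)
            for p in positions:
                chars[p] = MARK
            variants.add("".join(chars))
    return variants


def build_variant_trie(strings, tau):
    """Trie over all marked variants; each node records the ids passing through it."""
    root = {"children": {}, "ids": set()}
    for sid, s in strings.items():
        for variant in marked_variants(s, tau):
            node = root
            node["ids"].add(sid)
            for ch in variant:
                node = node["children"].setdefault(
                    ch, {"children": {}, "ids": set()})
                node["ids"].add(sid)
    return root


def exact_prefix_lookup(root, prefix):
    """Ids of the strings that have a marked variant starting with prefix."""
    node = root
    for ch in prefix:
        node = node["children"].get(ch)
        if node is None:
            return set()
    return node["ids"]
```

With `strings = {"S1": "abcd", "S2": "abdc", "S3": "bcd"}` and τ = 2, looking up `"ab#"` returns S1 and S2 (through ab#d and ab#c), while `"#b#d"` returns only S1.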


ENUMERATED VARIANT SIZE VERY LARGE

τ = 2, q = abc

Deletion variants of q: abc, #bc, a#c, ab#, ##c, #b#, a##

Enumerated variants for each deletion variant:
abc → abc, #abc, a#bc, ab#c, abc#, ##abc, #a#bc, #ab#c, #abc#, a##bc, a#b#c, a#bc#, ab##c, ab#c#, abc##
#bc → bc, #bc, ##bc, #b#c, #bc#
a#c → ac, a#c, #a#c, a##c, a#c#
ab# → ab, ab#, #ab#, a#b#, ab##
##c → c, #c, ##c
a## → a, a#, a##
#b# → b, #b, b#, #b#
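To make the growth concrete, the sketch below reproduces only the first row of the slide, the 15 strings obtained from q = abc by inserting at most τ = 2 marks. The rows for the other deletion variants follow additional rules that are not spelled out here, so they are not modeled. The function name `insert_marks` is illustrative.

```python
from itertools import combinations_with_replacement


def insert_marks(q, tau, mark="#"):
    """All strings obtained from q by inserting at most tau marks anywhere."""
    out = []
    for k in range(tau + 1):
        # A gap index i means "immediately before q[i]"; len(q) means "at the end".
        for gaps in combinations_with_replacement(range(len(q) + 1), k):
            pieces = []
            for i in range(len(q) + 1):
                pieces.append(mark * gaps.count(i))
                if i < len(q):
                    pieces.append(q[i])
            out.append("".join(pieces))
    return out
```

`insert_marks("abc", 2)` returns 15 strings, from `abc` to `abc##`. With exactly k marks there are C(|q| + k, k) such strings, so this row alone grows polynomially in |q| with degree τ.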


ADAPTIVE ENUMERATION


ADAPTIVE ENUMERATION, FULL EXAMPLE T=1

[Figure: trie built over the #-marked variants of S1, S2 and S3.]

q = “”: EnumV = {ξ, #}
q = “a”: EnumV = {a, a#, #}
q = “ab”: EnumV = {ab, ab#, a#, #b}
q = “abc”: EnumV = {abc, abc#, ab#c, ab#, a#c, #bc}

O(τ · (|q| + τ)^τ)


EXPERIMENTS

Dataset: DBLP, 351,207 terms. Average length 8, |Σ| = 27. Prefix length is the query length. The time and size are all interval count.

Results are averaged over 1000 queries. Edit distance threshold τ = 3.

IncNGTrie: our algorithm.

ICAN and ICPAN: previous direct trie methods.


EXPERIMENTS

DirectTrie: original trie.
NoReduction: IncNGTrie before compression.
StringMerge: merge branches reaching the same string.
SubtreeMerge: merge subtrees with identical content.


CONCLUSION

• An alternative way to solve the edit prefix search problem.
• Our method is independent of the character set size.
• Gains up to 1000 times improvement in query performance.
• Data-adaptive enumeration method.


Q & A


PRELIMINARY: EDIT DISTANCE PREFIX SEARCH

The core component is the prefix edit similarity search.
• A string Q t-edit prefix matches another string S if there exists a prefix of S whose edit distance to Q is within t.
• R = {s | s ∈ S, ∃ s′ ∈ P(s) such that ed(s′, Q) ≤ t}, where P(s) denotes all the prefixes of s.

[Figure: the user client sends Q to the core, which consists of a fuzzy prefix searcher with an index over the target string set and a result ranker, and receives R.]

Example: if t = 1, Q = “abc” t-edit prefix matches “acdefghtijk”, as “ac” is a prefix of “acdefghtijk” and ed(Q, “ac”) ≤ 1.


PREVIOUS IDEAS

Q = “p”

S:
SID  String
S1   abcd
S2   abdc
S4   bade
S5   bcd

[Figure: trie built over S, with the root ξ as node 0 and the remaining nodes numbered 1 to 12.]


ADAPTIVE ENUMERATION, FULL EXAMPLE T=1

[Figure: trie built over the #-marked variants of S1, S2 and S3.]

q = “”: EnumV = {ξ1, #1, #2}
q = “a”: EnumV = {a2, a#2, a#3, #2}
q = “ab”: EnumV = {ab3, ab#3, ab#4, a#3, #b3}
q = “abc”: EnumV = {abc4, abc#4, abc#5, ab#c4, ab#4, a#c4, #bc4}


PREVIOUS IDEAS CONT’D

Index the data strings into a trie (radix tree). Keep the active nodes while traversing the tree. For each query character Q[i] entered, traverse the trie and incrementally maintain all the nodes n such that ed(n, Q[1..i]) ≤ t (also called active nodes or active states).

Id  String
s1  cab
s2  eat
s3  map

[Figure: the trie over s1, s2 and s3 with nodes numbered 0 to 9, shown once for Q = Ø and once for Q = “p”.]


OUR BASIC IDEAS

Embed the second condition into the first condition, so that both can be processed efficiently with the index.

s = “abcd”
0-variant list = {<abcd>}
1-variant list = {<#bcd>, <a#cd>, <ab#d>, <abc#>}
2-variant list = {<##cd>, <#b#d>, <#bc#>, …}

q = “abxd”
0-variant list = {<abxd, {}>}
1-variant list = {<bxd, {1}>, <axd, {2}>, <abd, {3}>, …}
2-variant list = {<xd, {1,1}>, <bd, {1,2}>, <bx, {1,3}>, …}


PREVIOUS IDEAS: EXACT PREFIX MATCH

S:
SID  String
S1   “abcd”
S2   “abdc”
S3   “bade”
S4   “bcd”

[Figure: direct indexing. Trie built over S, with nodes numbered 0 to 12 and leaves S1, S2, S3 and S4.]

Extended from exact prefix search methods: directly index the strings S into a trie, then find the node that exactly matches the query q.

Example: the user types q = “”, q = “a”, q = “ab”, q = “abc”.


PREVIOUS IDEAS: FUZZY PREFIX MATCH

S:
SID  String
S1   “abcd”
S2   “abdc”
S3   “bade”
S4   “bcd”

[Figure: direct indexing. Trie built over S, with nodes numbered 0 to 12 and leaves S1, S2, S3 and S4. Nodes are marked as ED = 0, ED = 1 or ED > 1 with respect to the current query.]

Directly index the strings S into a trie. Simulate the edit distance calculation while traversing the trie.

Example: t = 1. The user types q = “”, q = “a”, q = “ab”, q = “abc”.

Drawback: too many nodes are tracked during the process.


PREVIOUS IDEAS: EXACT PREFIX MATCH

S:
SID  String
S1   “abcd”
S2   “abdc”
S3   “bcd”

[Figure: direct indexing. Trie built over S, with nodes numbered 0 to 9.]

Extended from exact prefix search methods: directly index the strings S into a trie, then find the node that exactly matches the query q.

Example: the user types q = “”, q = “a”, q = “ab”, q = “abc”.


DATA DEPENDENT ENUMERATION CONT’D

[Figure: trie built over the #-marked variants of the data strings, with leaves labeled S1, S2 and S3.]


CONCLUSION


DATA DEPENDENT ENUMERATION CONT’D

k-matching variant: given two i-deletion-marked variants (0 ≤ i ≤ k) x and y, if y has the same string content as x (not counting the mark symbol) and the size of the union of their deletion position lists is ≤ k, then y is called a k-matching variant of x.

Problem transformation:

cb, c#b, #cb, cb#, c##b, #c#b, c#b#

c#b


RESULT FETCHING



MOTIVATION: FUZZY INSTANCE SEARCH
