VLDB 2013 Riva del Garda

Page 1: VLDB  2013 Riva del Garda

Database Group – CSE - UNSW 1

VLDB 2013 RIVA DEL GARDA

Efficient Error-tolerant Query Autocompletion

Chuan Xiao (1), Jianbin Qin (2), Wei Wang (2), Yoshiharu Ishikawa (1), Koji Tsuda (3), Kunihiko Sadakane (4)

(1) Nagoya University, Japan
(2) University of New South Wales, Australia
(3) AIST and JST ERATO, Japan
(4) NII, Japan

Presenter: Jianbin Qin

Page 2: VLDB  2013 Riva del Garda

MOTIVATION: ERROR TOLERANT QUERY SUGGESTION

Page 3: VLDB  2013 Riva del Garda

MOTIVATION: ERROR TOLERANT CODE COMPLETION

Page 4: VLDB  2013 Riva del Garda

PRELIMINARY: EDIT DISTANCE PREFIX SEARCH

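[Figure: system overview. A client (mobile phone or browser) sends the user query string q to the core component, the Edit Distance Prefix Searcher, which uses an index built over the target string set S.]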

Target string set S = {s1, s2, …, sn}.

Edit distance threshold τ.

Return the set of result strings R containing all strings s ∈ S such that ∃ s′ ≼ s with ed(s′, q) ≤ τ, where s′ ≼ s means that s′ is a prefix of s.

Example: τ = 1, q = “abc”, S = {“acdefg”, “cda”, …}. Then R = {“acdefg”}, as ED(“abc”, “ac”) = 1 ≤ τ.

Challenges: the string set S is usually very large, and query response time is critical.
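
The problem definition above can be checked directly with a brute-force scan. The following is only a minimal sketch of the definition, not the method proposed in the talk; the function names edit_distance and prefix_search are illustrative assumptions.

```python
def edit_distance(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        prev = cur
    return prev[-1]


def prefix_search(strings, q, tau):
    """Return every s in strings that has a prefix s' with ed(s', q) <= tau."""
    return {s for s in strings
            if any(edit_distance(s[:k], q) <= tau for k in range(len(s) + 1))}


# The example from the slide: tau = 1, q = "abc".
print(prefix_search({"acdefg", "cda"}, "abc", 1))  # prints {'acdefg'}
```

This scan computes one edit distance per prefix of every string, which is why it does not scale to a large S and an index is needed.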

Page 5: VLDB  2013 Riva del Garda

EXISTING METHODS (CK09, JLLF09)

Build trie index: directly index string set S into a trie.

Simulate edit distance calculation when traversing the trie.

S:
SID  String
S1   “abcd”
S2   “abdc”
S4   “bcd”

[Figure: trie index built over S, nodes numbered 0 to 9, leaves labelled S1, S2 and S4; legend: ED = 0, ED = 1, ED > 1.]

Example: τ = 1. When the user types in q = “”, q = “a”, q = “ab”, q = “abc”.

Drawback: too many nodes are tracked during the process, O(|Σ|^τ).
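
As a rough illustration of simulating the edit distance calculation on a trie, the sketch below carries one dynamic-programming row per trie node during a depth-first traversal. It is a simplified, non-incremental variant written for this transcript, not the CK09 or JLLF09 algorithm itself; TrieNode, build_trie and fuzzy_prefix_search are assumed names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # edge label -> child TrieNode
        self.ids = []        # IDs of the strings that end at this node


def build_trie(strings):
    root = TrieNode()
    for sid, s in strings.items():
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.ids.append(sid)
    return root


def collect(node, out):
    """Add the IDs of all strings in the subtree of node to out."""
    out.update(node.ids)
    for child in node.children.values():
        collect(child, out)


def fuzzy_prefix_search(root, q, tau):
    results = set()

    def walk(node, row):
        # row[j] = edit distance between this node's prefix and q[:j]
        if row[-1] <= tau:        # this prefix is within tau of q
            collect(node, results)
            return
        if min(row) > tau:        # no descendant can get back under tau
            return
        for ch, child in node.children.items():
            new = [row[0] + 1]
            for j in range(1, len(q) + 1):
                new.append(min(new[j - 1] + 1,
                               row[j] + 1,
                               row[j - 1] + (q[j - 1] != ch)))
            walk(child, new)

    walk(root, list(range(len(q) + 1)))
    return results


trie = build_trie({"S1": "abcd", "S2": "abdc", "S4": "bcd"})
print(sorted(fuzzy_prefix_search(trie, "abc", 1)))  # prints ['S1', 'S2', 'S4']
```

Every node whose row still contains a value within τ has to be visited, which is the node-tracking cost the slide names as the drawback.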

Page 6: VLDB  2013 Riva del Garda

CONTRIBUTION: TRADE SPACE FOR PERFORMANCE

We offer another option to trade space for runtime performance.

Build a Deletion Variants Trie: transform an edit prefix search problem into an exact prefix search problem.

Index: up to 20× larger.

Error-tolerant prefix searcher: up to 1000× faster.

One server can serve up to 1000 times more users simultaneously.

Page 7: VLDB  2013 Riva del Garda

DELETION VARIANT [T. BOCEK ET AL. 2007]

Deletion Neighborhood Generation.

⟨x, Dx⟩ is called a variant-list pair; Dx is the deletion list.

V(s,k) is the union of the 0- to k-variant-list pairs, called the k-variant family of s.

s = abcd

0-variants: abcd {}
1-variants: bcd {1}, acd {2}, abd {3}, abc {4}
2-variants: cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}

Together these form the 2-variant family of s, V(s,2).
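
The variant family above can be generated mechanically. The sketch below assumes the position convention that the example appears to use: each position in a deletion list refers to the string that remains after the earlier deletions, and positions never decrease. The function name variant_family is an assumption.

```python
def variant_family(s, k):
    """Return the list of (variant, deletion_list) pairs in V(s, k)."""
    family = [(s, ())]
    frontier = [(s, ())]
    for _ in range(k):
        next_frontier = []
        for x, dx in frontier:
            start = dx[-1] if dx else 1          # positions are 1-based
            for p in range(start, len(x) + 1):
                # delete the character at position p of the current variant
                next_frontier.append((x[:p - 1] + x[p:], dx + (p,)))
        family.extend(next_frontier)
        frontier = next_frontier
    return family


for variant, deletions in variant_family("abcd", 2):
    print(variant, list(deletions))
```

For s = abcd and k = 2 this yields 1 + 4 + 6 = 11 pairs, one per way of choosing at most two characters to delete.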

Page 8: VLDB  2013 Riva del Garda

DELETION VARIANTS MATCHING PRINCIPLE

Variants Matching Principle: given two strings s and t, ED(s,t) ≤ τ iff there exist ⟨x, Dx⟩ ∈ V(s,τ) and ⟨y, Dy⟩ ∈ V(t,τ) such that x = y and |Dx ∪ Dy| ≤ τ.

Two conditions need to be satisfied:
1. x = y: identical check. (Efficiently processed with an index.)
2. |Dx ∪ Dy| ≤ τ: deletion list union size check. (No efficient methods.)

s = abcd, V(s,2):
abcd {}, bcd {1}, acd {2}, abd {3}, abc {4}, cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}

q = abxd, V(q,2):
abxd {}, bxd {1}, axd {2}, abd {3}, abx {4}, xd {1,1}, bd {1,2}, bx {1,3}, ad {2,2}, ax {2,3}, ab {3,3}

Page 9: VLDB  2013 Riva del Garda

ONE MORE ENUMERATION

q = abxd, V(q,2):
abxd {}, bxd {1}, axd {2}, abd {3}, abx {4}, xd {1,1}, bd {1,2}, bx {1,3}, ad {2,2}, ax {2,3}, ab {3,3}

q = abxd, enumerated 2-variant family of q, EnumV(q,2):
abxd {}   →  abxd {}, abxd {1}, abxd {2}, abxd {3,4}, …, abxd {4,4}
abd {3}   →  abd {}, abd {1}, abd {2}, …, abd {3,3}
…
ab {3,3}  →  ab {}, ab {3}, ab {3,3}

s = abcd, V(s,2):
abcd {}, bcd {1}, acd {2}, abd {3}, abc {4}, cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}

Enumerated Variants Matching Principle: given two strings s and q, ED(s,q) ≤ τ iff there exist ⟨x, Dx⟩ ∈ V(s,τ) and ⟨y, Dy⟩ ∈ EnumV(q,τ) such that x = y and Dx = Dy.

Page 10: VLDB  2013 Riva del Garda

ENCODE AND BUILD INDEX

Then we encode ⟨x, Dx⟩ together, for example bd {1,2} becomes #b#d:
s = “abcd” → {abcd, #bcd, a#cd, ab#d, abc#, ##cd, #b#d, …}

S:
SID  String
S1   “abcd”
S2   “abdc”
S3   “bcd”

Generate variants:
S1: abcd, #bcd, a#cd, ab#d, abc#, …
S2: abdc, #bdc, a#dc, ab#c, abd#, …
S3: bcd, #cd, b#d, bc#, …

[Figure: trie built over the encoded variants of S1, S2 and S3, with '#' edges next to the character edges and the string IDs at the leaves.]
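
The encoding step can be sketched as follows: every variant is written as the original string with its deleted characters replaced by '#', and all encoded variants are indexed for exact prefix lookup. A plain dictionary from prefix to string IDs stands in for the trie here, which is a simplification assumed for brevity; encoded_variants and build_prefix_index are illustrative names.

```python
from itertools import combinations


def encoded_variants(s, tau):
    """All strings obtained from s by replacing at most tau characters with '#'."""
    out = set()
    for k in range(tau + 1):
        for positions in combinations(range(len(s)), k):
            chars = list(s)
            for p in positions:
                chars[p] = "#"
            out.add("".join(chars))
    return out


def build_prefix_index(strings, tau):
    """Map every prefix of every encoded variant to the IDs of its source strings."""
    index = {}
    for sid, s in strings.items():
        for variant in encoded_variants(s, tau):
            for end in range(len(variant) + 1):
                index.setdefault(variant[:end], set()).add(sid)
    return index


S = {"S1": "abcd", "S2": "abdc", "S3": "bcd"}
index = build_prefix_index(S, 1)
print(sorted(encoded_variants("bcd", 1)))  # prints ['#cd', 'b#d', 'bc#', 'bcd']
print(sorted(index["ab#"]))                # prints ['S1', 'S2']
```

With this index, looking up an encoded query variant such as ab# is an exact prefix match, at the price of storing many variants per string.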

Page 11: VLDB  2013 Riva del Garda

ENUMERATED VARIANT SIZE VERY LARGE

q = abc, τ = 2

Variant  Enumerated variants
abc      abc, #abc, a#bc, ab#c, abc#, ##abc, #a#bc, #ab#c, #abc#, a##bc, a#b#c, a#bc#, ab##c, ab#c#, abc##
#bc      bc, #bc, ##bc, #b#c, #bc#
a#c      ac, a#c, #a#c, a##c, a#c#
ab#      ab, ab#, #ab#, a#b#, ab##
##c      c, #c, ##c
a##      a, a#, a##
#b#      b, #b, b#, #b#
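
To get a feel for the growth, the sketch below reproduces only the first row of the listing (the variant abc with nothing deleted) by inserting up to τ '#' markers into q in every possible way. The rows for the other variants follow additional rules that are not modelled here, and insert_markers is an assumed name.

```python
from itertools import combinations_with_replacement


def insert_markers(q, tau):
    """All strings obtained by inserting at most tau '#' markers into q."""
    out = []
    for k in range(tau + 1):
        # choose k gaps (with repetition) out of the len(q) + 1 gaps of q
        for gaps in combinations_with_replacement(range(len(q) + 1), k):
            pieces = []
            for i in range(len(q) + 1):
                pieces.append("#" * gaps.count(i))
                if i < len(q):
                    pieces.append(q[i])
            out.append("".join(pieces))
    return out


variants = insert_markers("abc", 2)
print(len(variants))   # prints 15
print(variants[:5])    # prints ['abc', '#abc', 'a#bc', 'ab#c', 'abc#']
```

For a query of length n the count is the sum over i = 0..τ of C(n + i, i), so this row alone already grows polynomially in |q| with exponent τ.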

Page 12: VLDB  2013 Riva del Garda

ADAPTIVE ENUMERATION

Page 13: VLDB  2013 Riva del Garda

ADAPTIVE ENUMERATION, FULL EXAMPLE T=1

[Figure: the trie over the encoded variants of S1, S2 and S3 from the ENCODE AND BUILD INDEX slide, traversed as the query grows.]

q = “”     EnumV = {ξ, #}
q = “a”    EnumV = {a, a#, #}
q = “ab”   EnumV = {ab, ab#, a#, #b}
q = “abc”  EnumV = {abc, abc#, ab#c, ab#, a#c, #bc}

O(τ · (|q| + τ)^τ)

Page 14: VLDB  2013 Riva del Garda

EXPERIMENTS

Dataset: DBLP, 351,207 terms, average length 8, |Σ| = 27.

Prefix length is the query length. The time and size are all interval count.

Results are averaged over 1000 queries. Edit distance threshold τ = 3.

IncNGTrie: our algorithms.

ICAN and ICPAN: previous direct trie methods.

Page 15: VLDB  2013 Riva del Garda

EXPERIMENTS

DirectTrie: original trie.
NoReduction: IncNGTrie before compression.
StringMerge: merge branches reaching the same string.
SubtreeMerge: merge subtrees with identical content.

Page 16: VLDB  2013 Riva del Garda

CONCLUSION

• An alternative way to solve the edit prefix search problem.
• Our method is independent of the character set size.
• Gains up to 1000 times query performance improvement.
• Data-adaptive enumeration method.

Page 17: VLDB  2013 Riva del Garda

Q & A

Page 20: VLDB  2013 Riva del Garda

PRELIMINARY: EDIT DISTANCE PREFIX SEARCH

The core component is the prefix edit similarity search.
• A string Q t-edit-prefix matches another string s if there exists a prefix of s whose edit distance to Q is within t.
• R = {s | s ∈ S, ∃ s′ ∈ P(s) such that ed(s′, Q) ≤ t}, where P(s) denotes all the prefixes of s.

Example: if t = 1, Q = “abc” t-edit-prefix matches “acdefghtijk”, as “ac” is a prefix of “acdefghtijk” and ed(Q, “ac”) ≤ 1.

[Figure: the user client sends Q to the core component, a fuzzy prefix searcher backed by an index over the target string set; a result ranker returns R.]

Page 21: VLDB  2013 Riva del Garda

PREVIOUS IDEAS

Q = “p”

S:
SID  String
S1   abcd
S2   abdc
S4   bade
S5   bcd

[Figure: trie over S with labelled, numbered nodes ξ0, a1, b2, c3, d4, d5, c6, b7, a8, d9, e10, c11, d12.]

Page 22: VLDB  2013 Riva del Garda

ADAPTIVE ENUMERATION, FULL EXAMPLE T=1

[Figure: the trie over the encoded variants of S1, S2 and S3, traversed as the query grows.]

q = “”     EnumV = {ξ1, #1, #2}
q = “a”    EnumV = {a2, a#2, a#3, #2}
q = “ab”   EnumV = {ab3, ab#3, ab#4, a#3, #b3}
q = “abc”  EnumV = {abc4, abc#4, abc#5, ab#c4, ab#4, a#c4, #bc4}

Page 23: VLDB  2013 Riva del Garda

PREVIOUS IDEAS CONT’D

Index the data strings into a trie (radix tree) and keep a set of active nodes while traversing it. For each query character Q[i] entered, traverse the trie and incrementally maintain all the nodes n such that ed(n, Q[1..i]) ≤ t (also called active nodes or states).

[Figure: the trie over s1, s2 and s3, nodes numbered 0 to 9, shown twice: once for Q = Ø and once for Q = “p”.]

Id  String
s1  cab
s2  eat
s3  map
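
The incremental maintenance of active nodes can be sketched as below. This is a simplified illustration written for this transcript rather than the exact procedure of the earlier trie methods; Node, offer, initial_active, advance and results are assumed names, and the threshold t is passed as tau.

```python
class Node:
    def __init__(self):
        self.children = {}   # edge label -> child Node
        self.ids = []        # IDs of the strings that end at this node


def build_trie(strings):
    root = Node()
    for sid, s in strings.items():
        node = root
        for ch in s:
            node = node.children.setdefault(ch, Node())
        node.ids.append(sid)
    return root


def offer(active, node, d, tau):
    """Record distance d for node if it improves on the known value, then
    propagate to descendants (each extra trie character costs one edit)."""
    if d <= tau and d < active.get(node, tau + 1):
        active[node] = d
        for child in node.children.values():
            offer(active, child, d + 1, tau)


def initial_active(root, tau):
    """Active nodes for the empty query: every node of depth <= tau."""
    active = {}
    offer(active, root, 0, tau)
    return active


def advance(active, ch, tau):
    """Update the active set after the user types one more character ch."""
    new_active = {}
    for node, d in active.items():
        offer(new_active, node, d + 1, tau)       # ch has no counterpart in the trie
        for label, child in node.children.items():
            offer(new_active, child, d + (label != ch), tau)  # match or substitute
    return new_active


def results(active):
    """IDs of all strings below any active node."""
    found, stack = set(), list(active)
    while stack:
        node = stack.pop()
        found.update(node.ids)
        stack.extend(node.children.values())
    return found


root = build_trie({"s1": "cab", "s2": "eat", "s3": "map"})
active = initial_active(root, 1)
for ch in "cab":
    active = advance(active, ch, 1)
print(sorted(results(active)))  # prints ['s1']
```

Each keystroke only touches the current active set and its children, but that set can become large when t or the alphabet grows.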

Page 24: VLDB  2013 Riva del Garda

OUR BASIC IDEAS

Embed the second condition into the first condition and process it efficiently with the index.

s = “abcd”
0-variant list = {<abcd>}
1-variant list = {<#bcd>, <a#cd>, <ab#d>, <abc#>}
2-variant list = {<##cd>, <#b#d>, <#bc#>, …}

q = “abxd”
0-variant list = {<abxd, {}>}
1-variant list = {<bxd, {1}>, <axd, {2}>, <abd, {3}>, …}
2-variant list = {<xd, {1,1}>, <bd, {1,2}>, <bx, {1,3}>, …}

Page 25: VLDB  2013 Riva del Garda

PREVIOUS IDEAS: EXACT PREFIX MATCH

Extended from exact prefix search methods: directly index the strings S into a trie, then find the node that exactly matches the query q.

S:
SID  String
S1   “abcd”
S2   “abdc”
S3   “bade”
S4   “bcd”

[Figure: trie directly indexing S, nodes numbered 0 to 12, leaves labelled S1 to S4.]

Example: the user types q = “”, q = “a”, q = “ab”, q = “abc”.

Page 26: VLDB  2013 Riva del Garda

PREVIOUS IDEAS: FUZZY PREFIX MATCH

Directly index the strings S into a trie and simulate the edit distance calculation while traversing it.

S:
SID  String
S1   “abcd”
S2   “abdc”
S3   “bade”
S4   “bcd”

[Figure: trie directly indexing S, nodes numbered 0 to 12, leaves labelled S1 to S4; legend: ED = 0, ED = 1, ED > 1.]

Example: when t = 1, the user types q = “”, q = “a”, q = “ab”, q = “abc”.

Drawback: too many nodes are tracked during the process.

Page 27: VLDB  2013 Riva del Garda

PREVIOUS IDEAS: EXACT PREFIX MATCH

Extended from exact prefix search methods: directly index the strings S into a trie, then find the node that exactly matches the query q.

S:
SID  String
S1   “abcd”
S2   “abdc”
S3   “bcd”

[Figure: trie directly indexing S, nodes numbered 0 to 9, leaves labelled S1, S4 and S2.]

Example: the user types q = “”, q = “a”, q = “ab”, q = “abc”.

Page 28: VLDB  2013 Riva del Garda

DATA DEPENDENT ENUMERATION CONT’D

[Figure: trie over deletion-marked variants, with '#' edges next to the character edges and the string IDs S1, S2 and S3 at the leaves.]

Page 29: VLDB  2013 Riva del Garda

CONCLUSION

Page 30: VLDB  2013 Riva del Garda

DATA DEPENDENT ENUMERATION CONT’D

k-matching variant: given two i-deletion-marked variants (0 ≤ i ≤ k) x and y, if y has the same string content as x (not counting the mark symbol) and the size of the union of their deletion-position lists is ≤ k, then y is called a k-matching variant of x.

Problem transformation, for the variant c#b: cb, c#b, #cb, cb#, c##b, #c#b, c#b#

Page 31: VLDB  2013 Riva del Garda

RESULT FETCHING

Page 32: VLDB  2013 Riva del Garda

MOTIVATION: FUZZY INSTANCE SEARCH