VLDB 2013 Riva del Garda
Efficient Error-tolerant Query Autocompletion. Chuan Xiao¹, Jianbin Qin², Wei Wang², Yoshiharu Ishikawa¹, Koji Tsuda³, Kunihiko Sadakane⁴. ¹ Nagoya University, Japan; ² University of New South Wales, Australia; ³ AIST and JST ERATO, Japan.
Database Group – CSE - UNSW 1
VLDB 2013 RIVA DEL GARDA
Efficient Error-tolerant Query Autocompletion
Chuan Xiao¹, Jianbin Qin², Wei Wang², Yoshiharu Ishikawa¹, Koji Tsuda³, Kunihiko Sadakane⁴
¹ Nagoya University, Japan; ² University of New South Wales, Australia
³ AIST and JST ERATO, Japan; ⁴ NII, Japan
Presenter: Jianbin Qin
MOTIVATION: ERROR TOLERANT QUERY SUGGESTION
MOTIVATION: ERROR TOLERANT CODE COMPLETION
PRELIMINARY: EDIT DISTANCE PREFIX SEARCH
[Diagram: the user query string q (from a mobile phone or browser) goes to the core Edit Distance Prefix Searcher, which uses an index over the target string set S and returns R]
Target String set S = {s1, s2, …, sn}.
Edit distance threshold τ.
Return the set of result strings R containing all strings s ∈ S such that ∃ s′ ≼ s, ed(s′, q) ≤ τ, where s′ ≼ s means that s′ is a prefix of s.
Example: τ = 1, q = “abc”, S = {“acdefg”, “cda”, …}. Then R = {“acdefg”}, as ed(“abc”, “ac”) = 1 ≤ τ.
Challenges: the string set S is usually very large, and query response time is critical.
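The problem definition above can be checked with a brute-force reference, sketched here in Python. This is not the method of the talk, only a direct reading of the definition; the function names edit_distance and prefix_search are hypothetical.

```python
def edit_distance(a, b):
    """Classic dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def prefix_search(strings, q, tau):
    """Return every s that has some prefix s' with ed(s', q) <= tau."""
    return [s for s in strings
            if any(edit_distance(s[:i], q) <= tau for i in range(len(s) + 1))]


result = prefix_search(["acdefg", "cda"], "abc", 1)
```

Every prefix of every string is compared against q, so the cost grows with the size of S. That is the cost the index-based methods in the following slides aim to avoid.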
EXISTING METHODS (CK09, JLLF09)
[Figure: trie over S, root ξ, nodes numbered 0 to 9, leaves labelled S1, S2 and S4]
Build a trie index: directly index the string set S into a trie.
Simulate the edit distance calculation while traversing the trie.
Example: τ = 1, when the user types in q = “”, q = “a”, q = “ab”, q = “abc”.
Node legend: ED = 0, ED = 1, ED > 1.
Drawback: too many nodes are tracked during the process, O(|Σ|^τ).
S:
SID   String
S1    “abcd”
S2    “abdc”
S4    “bcd”
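The idea of tracking nodes within the threshold can be sketched as follows. This is a simplified, unoptimized illustration and not the CK09 or JLLF09 algorithms themselves: each trie node is represented by its prefix string, every node is revisited on each keystroke, and the names trie_nodes, initial_active and type_char are hypothetical.

```python
def trie_nodes(strings):
    """Every prefix of every string stands for one trie node; parents sort first."""
    return sorted({s[:i] for s in strings for i in range(len(s) + 1)}, key=len)


def initial_active(nodes, tau):
    """Before any keystroke, a node is active if its depth is at most tau."""
    return {n: len(n) for n in nodes if len(n) <= tau}


def type_char(nodes, active, ch, tau):
    """Advance the active-node set by one query character ch."""
    inf = tau + 1  # stands for "already above the threshold"
    new_active = {}
    for n in nodes:  # parents are visited before their children
        best = active.get(n, inf) + 1  # ch has no counterpart in the node prefix
        if n:
            parent = n[:-1]
            best = min(best,
                       active.get(parent, inf) + (n[-1] != ch),  # match or substitute
                       new_active.get(parent, inf) + 1)  # last node char unmatched
        if best <= tau:
            new_active[n] = best
    return new_active


S = {"S1": "abcd", "S2": "abdc", "S4": "bcd"}
nodes = trie_nodes(S.values())
active = initial_active(nodes, tau=1)
for ch in "abc":
    active = type_char(nodes, active, ch, tau=1)
results = sorted(sid for sid, s in S.items()
                 if any(s.startswith(n) for n in active))
```

With the three strings of the slide and τ = 1, six nodes remain active after typing “abc”, which illustrates how the tracked set grows with the threshold.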
CONTRIBUTION: TRADING SPACE FOR PERFORMANCE
We offer another option: trade space for runtime performance.
Build a deletion variants trie as the index of the error-tolerant prefix searcher. The index is up to 20 times larger.
Transform the edit prefix search problem into an exact prefix search problem. Queries are up to 1000 times faster.
One server can serve up to 1000 times more users simultaneously.
DELETION VARIANT [T. BOCEK ET AL. 2007]
Deletion neighborhood generation.
⟨x, Dx⟩ is called a variant-list pair; Dx is the deletion list.
V(s,k) is the union of the 0- to k-variant-list pairs, called the k-variant family of s.
s = abcd
0-variants: abcd {}
1-variants: bcd {1}, acd {2}, abd {3}, abc {4}
2-variants: cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}
Together these form the 2-variant family of s, V(s,2).
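A minimal sketch of how the k-variant family could be generated. It assumes the convention read off the example above: each entry of the deletion list is a 1-based position in the already shortened string. The function name variant_family is hypothetical.

```python
from itertools import combinations


def variant_family(s, k):
    """Return V(s, k) as a list of (variant, deletion_list) pairs."""
    family = []
    for i in range(min(k, len(s)) + 1):
        # choose i positions of the original string (1-based) to delete
        for positions in combinations(range(1, len(s) + 1), i):
            variant = "".join(c for p, c in enumerate(s, 1) if p not in positions)
            # after j earlier deletions, original position p has shifted left by j
            deletion_list = tuple(p - j for j, p in enumerate(positions))
            family.append((variant, deletion_list))
    return family


family = variant_family("abcd", 2)
```

For s = abcd and k = 2 this yields the 11 pairs listed above, for example ("bd", (1, 2)) and ("ab", (3, 3)).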
DELETION VARIANTS MATCHING PRINCIPLE
Variants Matching Principle: given two strings s and t, ED(s,t) ≤ τ iff there exist ⟨x,Dx⟩ ∈ V(s,τ) and ⟨y,Dy⟩ ∈ V(t,τ) such that x = y and |Dx ∪ Dy| ≤ τ.
Two conditions need to be satisfied:
1. x = y: identical check (efficiently processed with an index).
2. |Dx ∪ Dy| ≤ τ: deletion list union size check (no efficient methods).
s = abcd, V(s,2): abcd {}, bcd {1}, acd {2}, abd {3}, abc {4}, cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}
q = abxd, V(q,2): abxd {}, bxd {1}, axd {2}, abd {3}, abx {4}, xd {1,1}, bd {1,2}, bx {1,3}, ad {2,2}, ax {2,3}, ab {3,3}
ONE MORE ENUMERATION
q = abxd, V(q,2): abxd {}, bxd {1}, axd {2}, abd {3}, abx {4}, xd {1,1}, bd {1,2}, bx {1,3}, ad {2,2}, ax {2,3}, ab {3,3}
q = abxd, enumerated 2-variant family of q, EnumV(q,2):
abxd {} → abxd {}, abxd {1}, abxd {2}, abxd {3,4}, ……, abxd {4,4}
abd {3} → abd {}, abd {1}, abd {2}, ……, abd {3}, ……, abd {3,3}
……
ab {3,3} → ab {}, ab {3}, ab {3,3}
s = abcd, V(s,2): abcd {}, bcd {1}, acd {2}, abd {3}, abc {4}, cd {1,1}, bd {1,2}, bc {1,3}, ad {2,2}, ac {2,3}, ab {3,3}
Enumerated Variants Matching Principle: given two strings s and q, ED(s,q) ≤ τ iff there exist ⟨x,Dx⟩ ∈ V(s,τ) and ⟨y,Dy⟩ ∈ EnumV(q,τ) such that x = y and Dx = Dy.
ENCODE AND BUILD INDEX
Then we encode ⟨x, Dx⟩ together by marking each deleted position with #: s = “abcd” → {abcd, #bcd, a#cd, ab#d, abc#, ##cd, #b#d, …}
[Figure: trie built over the #-encoded variants of S1, S2 and S3, root ξ, leaves labelled with string IDs]
S:
SID   String
S1    “abcd”
S2    “abdc”
S3    “bcd”
Generate variants:
S1: abcd, #bcd, a#cd, ab#d, abc#, …
S2: abdc, #bdc, a#dc, ab#c, abd#, …
S3: bcd, #cd, b#d, bc#, …
Encoding example: bd {1,2} → #b#d
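The encoding and index construction can be sketched as below. This is an illustrative simplification, not the IncNGTrie implementation: it uses a nested-dict trie and stores string IDs at every node so that a prefix lookup is a single walk. The names marked_variants, build_index and exact_prefix_lookup are hypothetical.

```python
from itertools import combinations


def marked_variants(s, tau):
    """All strings obtained from s by replacing at most tau characters with '#'."""
    out = set()
    for i in range(min(tau, len(s)) + 1):
        for positions in combinations(range(len(s)), i):
            chars = list(s)
            for p in positions:
                chars[p] = "#"
            out.add("".join(chars))
    return out


def new_node():
    return {"ids": set(), "children": {}}


def build_index(strings, tau):
    """Insert every marked variant of every string into one trie."""
    root = new_node()
    for sid, s in strings.items():
        for variant in marked_variants(s, tau):
            node = root
            node["ids"].add(sid)
            for ch in variant:
                node = node["children"].setdefault(ch, new_node())
                node["ids"].add(sid)
    return root


def exact_prefix_lookup(root, prefix):
    """IDs of all strings that have a marked variant starting with prefix."""
    node = root
    for ch in prefix:
        node = node["children"].get(ch)
        if node is None:
            return set()
    return node["ids"]


index = build_index({"S1": "abcd", "S2": "abdc", "S3": "bcd"}, tau=1)
hits = exact_prefix_lookup(index, "ab#")
```

Here the lookup of “ab#” reaches S1 through ab#d and S2 through ab#c. An error-tolerant query is thus answered by exact walks over the variant trie.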
ENUMERATED VARIANT SIZE VERY LARGE
t = 2, q = abc
Variants of q: abc, #bc, a#c, ab#, ##c, #b#, a##
Enumerated variants for each:
abc → abc, #abc, a#bc, ab#c, abc#, ##abc, #a#bc, #ab#c, #abc#, a##bc, a#b#c, a#bc#, ab##c, ab#c#, abc##
#bc → bc, #bc, ##bc, #b#c, #bc#
a#c → ac, a#c, #a#c, a##c, a#c#
ab# → ab, ab#, #ab#, a#b#, ab##
##c → c, #c, ##c
a## → a, a#, a##
#b# → b, #b, b#, #b#
ADAPTIVE ENUMERATION
ADAPTIVE ENUMERATION, FULL EXAMPLE T=1
[Figure: trie built over the #-encoded variants of S1, S2 and S3, root ξ, leaves labelled with string IDs]
q = “”: EnumV = {ξ, #}
q = “a”: EnumV = {a, a#, #}
q = “ab”: EnumV = {ab, ab#, a#, #b}
q = “abc”: EnumV = {abc, abc#, ab#c, ab#, a#c, #bc}
O(τ · (|q| + τ)^τ)
EXPERIMENTS
Dataset: DBLP, 351,207 terms, average length 8, |Σ| = 27.
Prefix length is the query length. The time and size are all interval count.
Results are averaged over 1000 queries. Edit distance threshold τ = 3.
IncNGTrie: our algorithm.
ICAN and ICPAN: previous direct trie methods.
EXPERIMENTS
DirectTrie: original trie.
NoReduction: IncNGTrie before compression.
StringMerge: merge branches reaching the same string.
SubtreeMerge: merge subtrees with identical content.
CONCLUSION
• An alternative way to solve the edit prefix search problem.
• Our method is independent of the character set size.
• Up to 1000 times query performance improvement.
• Data-adaptive enumeration method.
Q & A
PRELIMINARY: EDIT DISTANCE PREFIX SEARCH
The core component is the prefix edit similarity search.
• A string Q t-edit prefix matches another string S if there exists a prefix of S whose edit distance to Q is within t.
• R = {s | s ∈ S, ∃ s′ ∈ P(s) such that ed(s′, Q) ≤ t}, where P(s) denotes all the prefixes of s.
Example: if t = 1, Q = “abc” t-edit prefix matches “acdefghtijk”, as “ac” is a prefix of “acdefghtijk” and ed(Q, “ac”) ≤ 1.
[Diagram: the user client sends Q to the core Fuzzy Prefix Searcher, which uses an index over the target string set; a Result Ranker returns R]
PREVIOUS IDEAS
Q = “p”
[Figure: trie over S, root ξ = 0, nodes numbered 0 to 12, leaves labelled with string IDs]
S:
SID   String
S1    abcd
S2    abdc
S4    bade
S5    bcd
ADAPTIVE ENUMERATION, FULL EXAMPLE T=1
[Figure: trie built over the #-encoded variants of S1, S2 and S3, root ξ, leaves labelled with string IDs]
q = “”: EnumV = {ξ1, #1, #2}
q = “a”: EnumV = {a2, a#2, a#3, #2}
q = “ab”: EnumV = {ab3, ab#3, ab#4, a#3, #b3}
q = “abc”: EnumV = {abc4, abc#4, abc#5, ab#c4, ab#4, a#c4, #bc4}
PREVIOUS IDEAS CONT’D
Index the data strings into a trie (radix tree). Keep active nodes while traversing the trie. For each query character Q[i] entered, traverse the trie and incrementally maintain all the nodes n such that ed(n, Q[1..i]) ≤ t (also called active nodes or states).
[Figure: the trie over s1, s2 and s3 (nodes numbered 0 to 9), shown once for Q = Ø and once for Q = “p”]
Id    String
s1    cab
s2    eat
s3    map
OUR BASIC IDEAS
Embed the second condition into the first condition so that both are processed efficiently with the index.
s = “abcd”
0-variant list = {<abcd>}
1-variant list = {<#bcd>, <a#cd>, <ab#d>, <abc#>}
2-variant list = {<##cd>, <#b#d>, <#bc#>, …}
q = “abxd”
0-variant list = {<abxd, {}>}
1-variant list = {<bxd, {1}>, <axd, {2}>, <abd, {3}>, …}
2-variant list = {<xd, {1,1}>, <bd, {1,2}>, <bx, {1,3}>, …}
PREVIOUS IDEAS: EXACT PREFIX MATCH
S:
SID   String
S1    “abcd”
S2    “abdc”
S3    “bade”
S4    “bcd”
[Figure: trie over S, root ξ, nodes numbered 0 to 12, leaves labelled S1, S2, S3 and S4]
Direct indexing, extended from exact prefix search methods: directly index the strings S into a trie, then find the node that exactly matches query q.
Example: the user types q = “”, q = “a”, q = “ab”, q = “abc”.
PREVIOUS IDEAS: FUZZY PREFIX MATCH
[Figure: trie over S, root ξ, nodes numbered 0 to 12, leaves labelled S1, S2, S3 and S4]
Direct indexing: directly index the strings S into a trie, then simulate the edit distance calculation while traversing the trie.
Example: when t = 1, the user types q = “”, q = “a”, q = “ab”, q = “abc”.
Node legend: ED = 0, ED = 1, ED > 1.
Drawback: too many nodes are tracked during the process.
S:
SID   String
S1    “abcd”
S2    “abdc”
S3    “bade”
S4    “bcd”
PREVIOUS IDEAS: EXACT PREFIX MATCH
S:
SID   String
S1    “abcd”
S2    “abdc”
S3    “bcd”
Direct indexing, extended from exact prefix search methods: directly index the strings S into a trie, then find the node that exactly matches query q.
Example: the user types q = “”, q = “a”, q = “ab”, q = “abc”.
[Figure: trie over S, root ξ, nodes numbered 0 to 9, leaves labelled with string IDs]
DATA DEPENDENT ENUMERATION CONT’D
[Figure: trie built over #-encoded variants, root ξ, leaves labelled S1, S2 and S3]
CONCLUSION
DATA DEPENDENT ENUMERATION CONT’D
K-matching variant: given two i-deletion-marked variants (0 ≤ i ≤ k) x and y, if y has the same string content as x (not counting the mark symbol) and the size of the union of their deletion-position lists is ≤ k, then y is called a k-matching variant of x.
Problem transformation: c#b → cb, c#b, #cb, cb#, c##b, #c#b, c#b#
RESULT FETCHING
MOTIVATION: FUZZY INSTANCE SEARCH