Survey on Spatial Keyword Search
Bin Yao
Department of Computer Science
Florida State University
Abstract
Many applications feature both text and location information, which leads to a novel type of
search: spatial keyword search (SKS). SKS has only recently begun to gain attention from the
database community, and a large variety of problems remain open. Spatial and text data have been
studied independently for decades. Spatial data have many unique features that are drastically
different from text data; as a result, most of the existing techniques for string processing are either
inapplicable or inefficient when adapted to spatial databases. My Ph.D. research is in the general
area of SKS. The areas related to my research include nearest neighbor search, range search,
approximate string search, and combined spatial keyword and approximate string search. This
document surveys these related areas.
TABLE OF CONTENTS
List of Figures
1 Introduction
2 Range and nearest neighbor queries
2.1 Range search in Euclidean space
2.2 NN search in Euclidean space
2.2.1 Exact NN search for low dimensional data
2.2.2 Exact NN search for high dimensional data
2.2.3 Approximate NN search for low dimensional data
2.2.4 Approximate NN search for high dimensional data
2.2.5 KNN search in relational databases
2.3 Range and NN queries on road networks
2.3.1 Range search on road networks
2.3.2 KNN search on road networks
3 Keyword and approximate string queries
3.1 Keyword search in text databases
3.1.1 Tree index
3.1.2 Signature file
3.2 Approximate string search in text databases
3.2.1 Preliminaries
3.2.2 Merging-list algorithms for approximate string search
3.2.3 Variable-length-gram algorithms for approximate string search
3.2.4 Bed-tree: index for string similarity search based on the edit distance
4 Spatial keyword and approximate string queries
4.1 Keyword search on spatial databases
4.1.1 Distance-first top-k spatial keyword query
4.1.2 The m-closest keywords query on spatial databases
4.2 Approximate string search on spatial databases
4.2.1 The min-wise signature
4.2.2 The construction of the MHR-tree
4.2.3 The query algorithm for the MHR-tree
5 Conclusion
Bibliography
LIST OF FIGURES
2.1 The R-tree
2.2 The iDistance
2.3 The interpretation of LSH
2.4 The zχ-kNN Algorithm
2.5 kNN box: definition and exact search
2.6 The z-kNN Algorithm
2.7 A map of Silver Spring, MD (From figure 1 in [30])
2.8 (a) An example of road network, (b) the shortest-path quadtree of vertex s, and (c) the shortest-path quadtree of vertex t (From figure 3 in [30])
2.9 The 30,000 shortest paths between all pairs of vertices in sets A and B in the spatial network of Silver Spring, MD (From figure 1 in [32])
3.1 An example of trie with inverted lists
3.2 An example of bit string computation with l = 4, m = 2
3.3 The MergeOpt Algorithm
3.4 The MergeSkip Algorithm
3.5 The DivideSkip Algorithm
3.6 Strings
3.7 Gram dictionary as a trie
3.8 Reversed-gram trie
3.9 NAG vectors
3.10 The VGEN Algorithm
3.11 Pruning a subtrie to select grams
3.12 The VerifyED function
3.13 The RangeQuery Algorithm
4.1 An example of an IR2-tree
4.2 An example of the node information of the bR∗-tree
4.3 The SubsetSearch function
4.4 The SearchInOneNode function
4.5 An example of the index of R∗-tree and inverted lists (from Figure 3 in [40])
4.6 An example of bottom-up construction of virtual bR∗-tree (from Figure 4 in [40])
4.7 The bottom-up search
4.8 Approximate string search with a range query
4.9 The construction of the MHR-tree
4.10 The range queries with the MHR-tree
CHAPTER 1
INTRODUCTION
With the popularity of geographic services such as GPS, Google Earth and Yahoo Maps, queries in
spatial databases have become increasingly important in recent years. Beyond spatial queries such
as nearest neighbor queries, range queries and spatial joins, queries on spatial objects associated
with textual information are beginning to receive significant attention from the spatial database
research community.
A spatial database manages geometric objects such as points, rectangles, and so on. In reality,
a spatial object often comes with a text description: for example, the list of services and amenities
of a hotel, the menu of a restaurant, the outpatient specialties of a hospital, and so on. In many
applications, users need to search with both spatial and textual predicates. The following are some
examples. Query 1: find the hospital nearest to my house among all hospitals that offer acupunctural
outpatient service. Query 2: find all pairs of hotels and restaurants such that (i) they are within
2 miles of each other, (ii) the hotel has a wireless connection and a gym, and (iii) the restaurant
serves Italian food and French wine. Query 3: similar to the previous query, but find the closest
pair of hotel and restaurant (as opposed to all pairs within 2 miles). The results of all these queries
must satisfy certain spatial predicates (i.e., nearest neighbor search, distance join, and closest pair
in Queries 1, 2, and 3, respectively), and must contain certain keywords in their descriptions.
On the other hand, keyword search, particularly on the Internet, has boomed since the 1990s.
Many of these queries allow the user to provide a list of keywords that the spatial objects should
contain in their descriptions or other attributes. For example, online yellow pages allow users
to specify a set of keywords together with an address, and return businesses whose descriptions
contain these keywords, ordered by their distance to the specified address. As another example,
real estate web sites allow users to search for properties with specific keywords in their descriptions
within a specified city. In the past, the address information was stored as text as well, which
sometimes makes lookups inefficient or even inapplicable. For example, given a specified location,
we cannot retrieve all the objects within 1 mile of that location using text-format address information
alone. A natural solution is to store the latitude and longitude of a location in a non-text format to
facilitate the search.
Associating objects with text is a highly flexible approach to capturing objects' information,
especially when it is difficult to "normalize" such information into a universal relational schema.
Therefore, the ability to search objects by their text is essential. Keyword search has been
extensively studied in document retrieval, web search, and, in recent years, relational databases.
However, spatial data have many unique features that are drastically different from plain documents,
web pages, and relational data. As a result, most of the existing techniques are either inapplicable
or inefficient when adapted to spatial databases. For example, current systems use ad-hoc combinations
of range queries and keyword search techniques to tackle the problem. An R-tree is used
to find the objects within the query region, and for each retrieved object an inverted index is used to
check whether the query keywords are contained. Such two-step approaches can suffer from unnecessary
node visits (higher I/O cost) and keyword comparisons (higher CPU cost). To understand this,
we denote the exact solution to a spatial keyword query as A and the set of candidate points that
have been visited by the R-tree solution as Ac. An intuitive observation is that it may be the case
that |Ac| ≫ |A|, where | · | denotes set cardinality. As an extreme example, consider a spatial
keyword query over a data set P whose query strings do not exist within its query range, so that
A = ∅. Ideally, this query should incur a minimal query cost. However, in the worst case, an R-tree
solution could visit all index nodes and data points of the R-tree that indexes P. The fundamental
issue here is that the pruning power of the string match predicate is completely ignored by the
R-tree solution. Similar arguments hold for the string index approach, where a query might retrieve
a very large number of strings only to prune everything based on the spatial predicate at
post-processing. Clearly, in practice neither of these solutions will work better than a combined
approach that prunes simultaneously based on both the string match predicate and the spatial
predicate. Spatial keyword search has only recently begun to gain attention from the database
community, and a large variety of problems remain open. Besides, the existing approaches for
spatial keyword search have drawbacks as well and need to be investigated further.
My research is in the general area of spatial keyword search. The areas related to my research
include nearest neighbor search, range search, approximate string search, and combined spatial
keyword and approximate string search. I will survey these areas in this document. In particular,
the survey is organized as follows. Chapter 2 discusses nearest neighbor and range queries in
spatial databases. Chapter 3 describes keyword search and approximate string search in text
databases. Chapter 4 introduces spatial keyword and approximate string queries. Chapter 5
concludes the survey.
CHAPTER 2
RANGE AND NEAREST NEIGHBOR QUERIES
2.1 Range search in Euclidean space
Range search and nearest neighbor (NN) search are fundamental problems in data management
and have been well studied in the literature [19, 27, 35, 38]. I will review them in this chapter.
In the context of spatial databases, the R-tree index [5, 18] provides an efficient algorithm for
range queries in Euclidean space. Intuitively, the R-tree is an extension of the B+-tree to higher
dimensions. Points are grouped into minimum bounding rectangles (MBRs), which are recursively
grouped into MBRs at higher levels of the tree. The grouping is based on data locality and bounded
by the page size. An example of the R-tree is illustrated in Figure 2.1. Given a query rectangle r
and an R-tree index on a data set P, we start from the root of the R-tree and check the MBR of each
of its children, then recursively visit any node u whose MBR intersects or falls inside r. When a leaf
node is reached, all the points that are inside r are returned.
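The traversal just described can be sketched as follows (a minimal in-memory sketch; the Node layout and all names are illustrative assumptions, not a paged R-tree implementation):

```python
class Node:
    def __init__(self, mbr, children=None, points=None):
        self.mbr = mbr              # (xmin, ymin, xmax, ymax)
        self.children = children    # internal node: list of child Nodes
        self.points = points        # leaf node: list of (x, y) points

def intersects(r, s):
    # True iff rectangles r and s overlap (or touch)
    return not (r[2] < s[0] or s[2] < r[0] or r[3] < s[1] or s[3] < r[1])

def inside(p, r):
    return r[0] <= p[0] <= r[2] and r[1] <= p[1] <= r[3]

def range_search(node, r):
    """Return all points indexed under `node` that fall inside rectangle r."""
    if not intersects(node.mbr, r):
        return []                       # prune: MBR disjoint from query rectangle
    if node.points is not None:         # leaf: report qualifying points
        return [p for p in node.points if inside(p, r)]
    out = []
    for c in node.children:             # internal: recurse into overlapping children
        out.extend(range_search(c, r))
    return out
```

Subtrees whose MBRs are disjoint from r are never visited, which is the source of the R-tree's pruning power.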
Figure 2.1: The R-tree. (The figure shows points p1-p12 grouped into MBRs N1-N6, with mindist(q, N1), minmaxdist(q, N1), and maxdist(q, N1) annotated for a query point q.)
2.2 NN search in Euclidean space
Compared to range search, NN search has many more variants in the literature. In this section,
I review NN search in Euclidean space in both its exact and approximate versions.
2.2.1 Exact NN search for low dimensional data
The R-tree index also provides efficient algorithms for NN search in Euclidean space, using
either the depth-first [27] or the best-first [19] approach. The main idea behind these algorithms
is to utilize branch-and-bound pruning techniques based on the relative distances between a query
point q and a given MBR N. Such distances include the mindist, the minmaxdist, and the maxdist.
The mindist measures the minimum possible distance from a point q to any point in an MBR N;
the minmaxdist gives an upper bound on the distance from q to its nearest point in N (N is
guaranteed to contain at least one point within minmaxdist of q); and finally, the maxdist simply
measures the maximum possible distance between q and any point in an MBR N. These distances
are easy to compute arithmetically given q and N. An example of these distances is provided in
Figure 2.1. The principle for utilizing them to prune the search space of an NN search in the R-tree
is straightforward: e.g., when an MBR Na's mindist to q is larger than another MBR Nb's minmaxdist,
we can safely prune Na entirely. Given q, we first initialize a priority queue PQ with the R-tree's
root node. The elements in PQ are kept in ascending order of their mindists. We also maintain a
global minmaxdist (GMM), which is the current upper bound on the distance within which the NN
must be found. Notice that any node with mindist > GMM will not be added to PQ. In each
iteration, we pop the first element of PQ. If it is a point, we have found the NN. Otherwise, we
look into its children: for each child c, we calculate its mindist (c.md) and minmaxdist (c.mmd),
compare c.md with GMM to see whether c can be added to PQ, and update GMM to
min{c.mmd, GMM}. NN search is a special case of the K nearest neighbors (KNN) query with
k = 1, and the above algorithm is easily extended to answer KNN queries: GMM then bounds the
distance within which the k nearest neighbors must be found, and the algorithm terminates when
k points have been found.
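The queue-based search can be sketched as follows (a minimal best-first sketch ordered by mindist only; the node layout and names are illustrative assumptions, and the minmaxdist-based pruning described above is omitted for brevity):

```python
import heapq

class Node:
    def __init__(self, mbr, children=None, points=None):
        self.mbr = mbr              # (xmin, ymin, xmax, ymax)
        self.children = children    # internal node: list of child Nodes
        self.points = points        # leaf node: list of (x, y) points

def mindist2(q, mbr):
    # squared minimum distance from point q to rectangle mbr
    dx = max(mbr[0] - q[0], 0.0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0.0, q[1] - mbr[3])
    return dx * dx + dy * dy

def best_first_knn(root, q, k):
    # Priority queue ordered by squared mindist; when a data point reaches
    # the front of the queue, no unexplored entry can be closer, so it is
    # the next nearest neighbor.
    heap = [(0.0, 0, root)]
    tie = 1                         # tie-breaker so heapq never compares Nodes
    result = []
    while heap and len(result) < k:
        _, _, item = heapq.heappop(heap)
        if isinstance(item, tuple):             # a data point: report it
            result.append(item)
        elif item.points is not None:           # leaf: enqueue its points
            for p in item.points:
                d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                heapq.heappush(heap, (d2, tie, p)); tie += 1
        else:                                   # internal: enqueue children
            for c in item.children:
                heapq.heappush(heap, (mindist2(q, c.mbr), tie, c)); tie += 1
    return result
```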
2.2.2 Exact NN search for high dimensional data
The R-tree based algorithms have relatively poor performance for data beyond six dimensions
[38]. The state-of-the-art technique for retrieving the exact KNN in high dimensions is the
iDistance method [20]. The design of iDistance is motivated by the following observations. First,
the similarity between data points can be derived with reference to a representative point. Second,
data points can be ordered based on their distances to a reference point, which are single-dimensional
values. So we can map high dimensional data into a one dimensional space and enable range
queries by reusing existing one dimensional indices such as the B+-tree. In particular, the data is
first partitioned into clusters by any popular clustering method; the authors of [20] suggest that a
good partitioning strategy should reduce the overlap between partitions and keep points with close
similarity together as much as possible. A point p in a cluster ci is mapped to a one-dimensional value, which is
Figure 2.2: The iDistance. (The figure shows clusters c1, c2, c3, a query ball of radius rq around q, and the corresponding search ranges over the leaf nodes of the B+-tree.)
the distance between p and ci's cluster center. The essence of the KNN search algorithm is similar
to the generalized search strategy: it begins by searching a small 'sphere', checks all partitions
intersecting the current query space, and incrementally enlarges the search space until all KNN are
found. Figure 2.2 illustrates the pruning technique used by iDistance: rq is the radius of the query
ball and the circle c3 defines a data cluster. We only have to examine the points in the annular
region between c2 and c1, which corresponds to the gray region in the B+-tree.
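A compact sketch of the iDistance mapping and range search follows, with a sorted list of keys standing in for the B+-tree; the constant c (which must exceed every cluster radius) and all names are assumptions of this sketch:

```python
import bisect
import math

def idistance_build(clusters, c=1000.0):
    """clusters: list of (center, points). Each point p in cluster i is keyed
    by i*c + dist(p, center_i); keys are kept sorted, standing in for a B+-tree."""
    keyed = []
    for i, (center, pts) in enumerate(clusters):
        for p in pts:
            keyed.append((i * c + math.dist(p, center), p))
    keyed.sort()
    return keyed

def idistance_range(keyed, clusters, q, r, c=1000.0):
    """Range query ball(q, r): for each cluster whose extent the ball reaches,
    scan the one-dimensional key interval it could occupy, then verify real
    distances (candidates in the annulus may still be false positives)."""
    out = []
    keys = [k for k, _ in keyed]
    for i, (center, pts) in enumerate(clusters):
        d = math.dist(q, center)
        radius = max(math.dist(p, center) for p in pts)  # cluster extent
        if d - r > radius:
            continue                     # query ball cannot reach this cluster
        lo = bisect.bisect_left(keys, i * c + max(0.0, d - r))
        hi = bisect.bisect_right(keys, i * c + min(radius, d + r))
        for _, p in keyed[lo:hi]:        # verify candidates by true distance
            if math.dist(p, q) <= r:
                out.append(p)
    return sorted(out)
```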
2.2.3 Approximate NN search for low dimensional data
Besides exact KNN search, approximate KNN has also been studied. In exact KNN search, a
majority of the query cost is spent on verifying points as true KNN; approximate methods improve
efficiency by relaxing the precision of this verification.
The computational geometry community has spent considerable effort on designing approximate
nearest neighbor algorithms [3, 7]. Arya et al. [3] designed a modification of the standard
kd-tree, called the Balanced Box Decomposition (BBD) tree, that can answer (1 + ǫ)-approximate
nearest neighbor queries in O((1/ǫ)^d log N) time. The BBD-tree takes O(N log N) time to build.
However, the performance of the BBD-tree degrades when the data set becomes huge (e.g., more
than 100,000 points) or the dimensionality goes beyond 20.

Figure 2.3: The interpretation of LSH. (The figure shows points o1, o2, o3 projected onto a vector a that is partitioned into intervals of width w.)
2.2.4 Approximate NN search for high dimensional data
The state-of-the-art method for approximate KNN retrieval in high dimensional spaces (e.g.,
d > 20) is the LSB-tree [35]. The LSB-tree is based on locality sensitive hashing (LSH), which
maps a d-dimensional point o to a one-dimensional value. Here, 'locality sensitive' means that the
chance of mapping two points o1, o2 to the same value grows as their distance |o1, o2| decreases. A
popular LSH function is defined as follows [13]:

h(o) = ⌊(a · o + b) / w⌋

Here, o denotes the d-dimensional vector representation of a point; a is another d-dimensional
vector, each component of which is drawn independently from a so-called p-stable distribution [13];
a · o denotes the dot product of the two vectors; w is a sufficiently large constant; and b is drawn
uniformly from [0, w).
The intuition behind such a hash function is that if two points are close to each other, their
shifted projections (onto the line defined by a) will have the same hash value with high probability,
while two faraway points are likely to be projected to different values. In Figure 2.3, points o1, o2,
and o3 are projected onto the vector a (in this case, b = 0). The projections of o1 and o2 end up in
the same interval, while the projection of o3 falls in a different interval, as o3 is far away from the
other two points.
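The hash function above can be sketched directly, assuming the 2-stable normal distribution N(0, 1) for the components of a (function names and defaults are illustrative):

```python
import math
import random

def make_lsh(d, w, seed=0):
    """Build one p-stable LSH function h(o) = floor((a . o + b) / w) [13]:
    components of a are drawn from N(0, 1), which is 2-stable, and b is
    drawn uniformly from [0, w)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)

    def h(o):
        # project o onto a, shift by b, and bucket into intervals of width w
        return math.floor((sum(ai * oi for ai, oi in zip(a, o)) + b) / w)

    return h

h = make_lsh(d=2, w=100.0, seed=42)
# Nearby points collide with high probability; far points usually do not.
```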
The construction of an LSB-tree is as follows. Given a d-dimensional data set S, we first map
each point o in S to an m-dimensional point M(o), and then calculate the Z-order value z(o) of M(o),
which is a one-dimensional value. The LSB-tree is just a conventional B+-tree indexing all the
Z-order values. In practice, a single LSB-tree can already produce query results of good quality;
to achieve better quality with a theoretical guarantee, the authors of [35] propose building multiple
LSB-trees. To retrieve the KNN, one issues range queries on all the LSB-trees around the Z-order
value of the query point. The LSB-trees consume O((dn/B)^1.5) space, and the LSB-tree based
algorithm returns a 4-approximate NN with at least constant probability in O(E · sqrt(dn/B)) I/Os,
where E is the height of an LSB-tree.
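The Z-order value used here is the standard bit-interleaved Morton code; a minimal sketch for non-negative integer coordinates follows (the function name and bit-width parameter are illustrative):

```python
def z_value(coords, bits=8):
    """Z-order (Morton) value of a point with non-negative integer
    coordinates, obtained by interleaving the bits of all dimensions,
    most significant bit first."""
    z = 0
    for bit in range(bits - 1, -1, -1):
        for c in coords:
            z = (z << 1) | ((c >> bit) & 1)
    return z
```

For example, with 2-bit coordinates, z_value((2, 3), bits=2) interleaves 10 and 11 into 1101, i.e., 13.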
2.2.5 KNN search in relational databases
We contributed to the KNN problem by proposing algorithms for both approximate and exact
KNN search in relational databases [38]. The motivation of our work is twofold. First, we want to
enable the relational database engine to automatically optimize the KNN search when additional
query conditions are specified. Second, the algorithms should require no changes to the database
zχ-kNN (point q, point sets {P^0, . . . , P^α})
1. Candidates C = ∅;
2. For i = 0, . . . , α
3.     Find z_p^i as the successor of z_{q+v_i} in P^i;
4.     Let C_i be the γ points up and down next to z_p^i in P^i;
5.     For each point p in C_i, let p = p − v_i;
6.     C = C ∪ C_i;
7. Let A_χ = kNN(q, C) and output A_χ.

Figure 2.4: The zχ-kNN Algorithm
Figure 2.5: kNN box: definition and exact search. (The figure shows k = 3: points p1-p5 with their z-values, the query q, the kth NN ball and the approximate kth NN ball (radii r_p and r*), and the box corner points δ_ℓ and δ_h with z-values z_ℓ and z_h; γ entries lie on either side of z_q.)
engine.
For approximate KNN retrieval, we utilize z-values to map points in a multi-dimensional
space into one dimension, and then translate the KNN search for a query point q into a one-dimensional
range search on the z-values around q's z-value. In most cases, z-values preserve
spatial locality and we can find q's KNN in a close neighborhood (say, γ positions up and down) of
its z-value. However, this is not always the case. In order to get a theoretical guarantee, we produce
α independent, randomly shifted copies of the input data set P and repeat the above procedure for
each randomly shifted version of P. Specifically, we define the "random shift" operation as shifting
all data points in P by a random vector v ∈ R^d; this operation is simply p + v for all p ∈ P,
and is denoted as P + v. We independently generate α random vectors {v_1, . . . , v_α}, where
v_i ∈ R^d for all i ∈ [1, α]. Let P^i = P + v_i, with P^0 = P and v_0 = 0. For each P^i, its points are
z-kNN (point q, point sets {P^0, . . . , P^α})
1. Let A_χ^i be kNN(q, C_i), where C_i is from Line 4 of the zχ-kNN algorithm;
2. Let z_ℓ^i and z_h^i be the z-values of the lower-left and upper-right corner points of the box M(A_χ^i);
3. Let z_γℓ^i and z_γh^i be the lower and upper bounds of the z-values that zχ-kNN searched to produce C_i;
4. If ∃i ∈ [0, α] s.t. z_γℓ^i ≤ z_ℓ^i and z_γh^i ≥ z_h^i
5.     Return A_χ from zχ-kNN as A;
6. Else
7.     Find j ∈ [0, α] such that the number of points in P^j with z-values in [z_ℓ^j, z_h^j] is minimized;
8.     Let C_e be those points and return A = kNN(q, C_e);

Figure 2.6: The z-kNN Algorithm
sorted by their z-values. Next, for a query point q and a data set P, let z_p be the successor z-value
of z_q among all z-values of points in P. The γ-neighborhood of q is defined as the γ points up
and down next to z_p. Our kNN query algorithm essentially finds the γ-neighborhood of the query
point q_i = q + v_i in P^i for each i ∈ [0, α] and selects the final top k from the points in the union
of the (α + 1) γ-neighborhoods, which contains at most (α + 1)(2γ + 1) distinct points. We denote
this algorithm as the zχ-kNN algorithm (Figure 2.4).
Once we retrieve the approximate KNN (A_χ), the exact results can be retrieved based on A_χ.
Figure 2.5 demonstrates the idea. In Figure 2.5, k = 3, the dotted circle is the kth NN ball, and
the solid circle is the approximate kth NN ball (denoted as B(A_χ)). The kNN box for A_χ is the
smallest box that fully encloses B(A_χ), denoted as M(A_χ), which is defined by the lower-left and
upper-right corner points δ_ℓ and δ_h, i.e., the ∆ points in Figure 2.5. Let z_ℓ and z_h be the z-values
of the points δ_ℓ and δ_h of M(A_χ). By the Corollary in [38], we guarantee that z_p ∈ [z_ℓ, z_h] for
all p ∈ A. The interval [z_ℓ, z_h] can be easily calculated once we retrieve the approximate kth NN.
After that, we need to check whether [z_ℓ, z_h] is fully contained in [z_γℓ, z_γh]. If so, zχ-kNN has
already retrieved all the exact answers, i.e., A_χ = A. If not, we can do a range search on any table
R^i with [z_ℓ^i, z_h^i]; ideally, we choose the table with the smallest [z_ℓ^i, z_h^i]. The algorithm is
shown in Figure 2.6.
2.3 Range and NN queries on road networks
The growing popularity of online mapping services such as Google Maps and Microsoft Map-
Point has led to an interest in responding to queries such as finding nearest objects from a set (e.g.,
gas stations and restaurants) where the distance is measured in terms of paths along the road net-
work (network distance). Objects are usually constrained to lie on the road network. The access
method to the road network is significantly different from that in the Euclidean space. The NN
search on road networks plays an important role in many applications. In this section, I will go over
the state-of-the-art range and NN search techniques on road networks in the literature.
2.3.1 Range search on road networks
Given a query point q and a distance threshold θ, the θ-range search on a road network G
retrieves all vertices of G that are within network distance θ of q. The solution is to adapt Dijkstra's
algorithm: we initialize Dijkstra's algorithm with q, visit the vertices in ascending order of network
distance, and stop when the distance of the currently visited vertex is greater than θ. In practice,
many applications instead require returning a certain number of results in that order; in those
scenarios, a KNN search is required. KNN search on road networks is more complicated than the
θ-range search, and I introduce it in detail in the following.
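The adapted Dijkstra procedure can be sketched as follows (the adjacency-list representation and all names are illustrative):

```python
import heapq

def theta_range(graph, q, theta):
    """θ-range search: settle vertices in ascending order of network distance
    from q, and stop once the next candidate's distance exceeds θ.
    graph: {u: [(v, edge_length), ...]}. Returns [(vertex, distance), ...]."""
    dist = {q: 0.0}
    heap = [(0.0, q)]
    result = []
    while heap:
        d, u = heapq.heappop(heap)
        if d > theta:
            break                    # every remaining vertex is farther than θ
        if d > dist.get(u, float("inf")):
            continue                 # stale queue entry
        result.append((u, d))        # u is settled within the θ range
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return result
```

Because Dijkstra settles vertices in nondecreasing distance order, the early break is safe: no vertex popped later can be within θ.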
2.3.2 KNN search on road networks
KNN retrieval on road networks has been studied for decades. Most solutions are graph-based
and usually incorporate Dijkstra's algorithm [14] in at least some part [11, 26]. Recently, a novel
algorithm for finding KNN on road networks was presented in [30]. The algorithm is based on
precomputing the shortest paths between all possible vertices in the network and then making use
of an encoding that takes advantage of the fact that the shortest paths from a vertex u to all of the
remaining vertices can be decomposed into subsets based on the first edges of the shortest paths
from u to them. Thus, in the worst case, the amount of work depends on the number of objects
that are examined and the number of links on the shortest paths to them from q, rather than on the
number of vertices in the network. The amount of storage required to keep track of the subsets is
reduced by taking advantage of their spatial coherence, which is captured with the aid of a
shortest-path quadtree. The theoretical analysis shows that the storage is thereby reduced from
O(N^3) to O(N^1.5).
The problem with an approach that uses Dijkstra's algorithm is that it must visit a very large
number of the vertices of the network to find the shortest path between two vertices. For example,
Figure 2.7(a) shows the vertices that would be visited when finding the shortest path from vertex X
to vertex V in a spatial network corresponding to Silver Spring, MD. In the process of obtaining the
shortest path from X to V, which is 75 edges long, 75.4% of the vertices in the network are visited.
To avoid this, the author of [30] proposes a shortest-path map for each vertex in the network. In
particular, given a vertex ui, the shortest-path map mui partitions the underlying space into Mui
regions, where Mui is the out-degree of ui and there is one region ruij for each vertex wuij
(1 ≤ j ≤ Mui) that is connected to ui by an edge euij. Region ruij spans the space occupied by
all vertices v such that the shortest path from ui to v contains edge euij, and is bounded by a subset
of the edges of the shortest paths from ui to the vertices within it. For example, Figure 2.7(b) shows
such a partition for vertex X in the road network of Figure 2.7(a), where different colors denote the
different regions.

Figure 2.7: A map of Silver Spring, MD (From figure 1 in [30]).
Sankaranarayanan et al. [31] propose to index the above partitions using a variant of the
region quadtree [29]. Each quadtree block records the identity of the region to which it belongs.
Once we locate the block containing the destination vertex, we know what region it is in, and hence
the edge from the source vertex that connects to the next vertex on the shortest path between
source and destination. We then move to the quadtree of that next vertex. By repeating this process,
we can obtain the whole path. For example, consider the road network in Figure 2.8(a), where we
want to find the shortest path from vertex s to vertex d; the shortest-path quadtree for s is given
by Figure 2.8(b). Looking up vertex d in the shortest-path quadtree of s determines that d is in
the region of the quadtree corresponding to the edge from vertex s to t. Therefore, the shortest path
from s to d passes through t. Next, we obtain the shortest-path quadtree of t, which is given by Figure
2.8(c). Looking up vertex d in the shortest-path quadtree of t determines that d is in the region of the
quadtree corresponding to the edge from vertex t to u. This process continues until encountering
an edge to vertex d.

Figure 2.8: (a) An example of road network, (b) the shortest-path quadtree of vertex s,
and (c) the shortest-path quadtree of vertex t (From figure 3 in [30]).
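The region-lookup loop described above can be sketched as follows, with each shortest-path quadtree simplified to a flat list of (rectangle, next-vertex) pairs; this flat-list stand-in and all names are assumptions of the sketch, not the actual quadtree structure of [30]:

```python
def build_path(spmaps, coords, s, d):
    """Path extraction with shortest-path maps: at each vertex u, look up the
    region containing destination d to learn the first edge of the shortest
    path from u to d, hop to that neighbor, and repeat.
    spmaps[u]: list of ((xmin, ymin, xmax, ymax), next_vertex) pairs.
    coords[v]: (x, y) position of vertex v."""
    path, u = [s], s
    while u != d:
        x, y = coords[d]
        for (x0, y0, x1, y1), nxt in spmaps[u]:
            if x0 <= x <= x1 and y0 <= y <= y1:   # region containing d
                u = nxt
                break
        else:
            raise ValueError("destination not covered by any region")
        path.append(u)
    return path
```

Each hop costs one region lookup, so the work depends on the path length rather than on the total number of vertices in the network.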
Based on the above representation of road networks, the authors of [30] propose the
KNEARESTSPATIALNETWORK algorithm for KNN retrieval. The actual mechanics of the algorithm
are similar to the best-first algorithm [31], with the difference that objects are associated with
network distance intervals instead of distances. When a nonleaf block b is removed from the queue,
the children of b are inserted into the queue with their corresponding minimum network distances.
When a leaf block b is removed from the queue, the objects in b are enqueued with their
corresponding initial network distance intervals, which can later be refined to exact network
distances. The whole algorithm terminates when the first k objects are found.
Figure 2.9: The 30,000 shortest paths between all pairs of vertices in sets A and B in the
spatial network of Silver Spring, MD (From figure 1 in [32]).
Later on, Sankaranarayanan et al. [32] propose an approximate distance oracle of size O(n/ǫ^d)
to obtain ǫ-approximate network distances. The observation is that, given a set of source vertices
A and a set of destination vertices B such that A and B are sufficiently far away from each other,
the shortest paths between them may share common vertices. Figure 2.9 illustrates such a case,
where all the 30,000 shortest paths between vertices in A and in B have many vertices in common,
and the network distances between them can be approximated by a single value.
CHAPTER 3
KEYWORD AND APPROXIMATE STRING QUERIES
3.1 Keyword search in text databases
Keyword search has been extensively studied in document retrieval, web search, and relational
databases. Given a collection of files, each storing a set of keywords, and a list of query keywords
w1, · · · , wm, we want to retrieve all the files that contain all the query keywords. The methods for
handling such a search fall into two categories: one is based on a tree index and the other uses
signatures. I explore both in the following.
3.1.1 Tree index
A prefix tree (a.k.a. trie) can be used to answer keyword search efficiently. Figure 3.1 illustrates
an example of a trie. Each word w corresponds to a unique path from the root of the trie to a leaf
node. Each leaf node has an inverted list of descriptors of files that contain the corresponding word.
For query keywords w1, · · · , wm, we search the trie, retrieve all the lists corresponding to the
query keywords and merge them to find the result.
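As a concrete illustration, the trie-with-inverted-lists scheme sketched above can be implemented as follows. This is a minimal sketch with class and method names of my own, not from the surveyed work; a conjunctive keyword query is answered by intersecting the leaf lists:

```python
from functools import reduce

class TrieNode:
    """A trie node; nodes at word ends carry an inverted list of file ids."""
    def __init__(self):
        self.children = {}
        self.file_ids = set()

class KeywordTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, file_id):
        """Walk (and create) the path for `word`, append `file_id` at its end."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.file_ids.add(file_id)

    def lookup(self, word):
        """Return the inverted list for `word`, or an empty set."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                return set()
            node = node.children[ch]
        return node.file_ids

    def query(self, keywords):
        """Files containing ALL query keywords: intersect the inverted lists."""
        lists = [self.lookup(w) for w in keywords]
        return reduce(set.intersection, lists) if lists else set()
```

A file id survives the intersection only if it appears on every keyword's list, which is exactly the merge step described in the text.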
Figure 3.1: An example of a trie with inverted lists.
word  hashed bit string
a     0101
b     1001
c     0011
d     0110
Figure 3.2: An example of bit string computation with l = 4, m = 2.
3.1.2 Signature file
Signature files have been successfully used in keyword search as it has not only small space
consumption but also fast string matching. A signature file is designed to perform membership
tests: determine whether a query wordw exists in a set of wordsW . If the test returns ’no’, thenw
is definitely not inW . On the other hand, if it returns ’yes’, the true answer can beeither way, in
which case the wholeW must be scanned to avoid a false hit. There are several implementations of
a signature file, and the one adopted in [16] is known as superimposed coding (SC), which has been
shown to be more effective than other variants [15]. Specifically, SC works as follows. It builds a
bit signature of lengthl from W by hashing each word inW to a string ofl bits, and then taking the
disjunction of all bit strings. The bit string of a wordw is denoted ash(w). Initially, all the l bits are
initialized to0. Then, SC repeats the followingm times: randomly choose a bit and set it to1. It’s
important that randomization must usew as its seed to ensure that the samew always ends up with
an identicalh(w). Furthermore, them choices are mutually independent, and may even happen to
be the same bit. The values of l and m affect the space cost and the false hit probability. Figure 3.2
gives an example of this process. For example, in the bit string h(b) of b, the first and last bits are
set to 1. The bit signature of a set of words simply ORs the bit strings of all the words in the set.
For instance, the signature of the set {a, b} is 1101.
Given a query keyword w, SC performs the membership test by checking if all the 1's of h(w)
appear at the same positions in the signature of W. If not, it is guaranteed that w cannot belong to
W; otherwise, it needs to scan W. For example, suppose we want to test whether c with signature
0011 is a member of the set {a, b} with signature 1101. Since 0011 has a 1 in a position where
1101 has a 0, SC immediately reports 'no'.
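A minimal sketch of superimposed coding follows. The parameter names l and m match the text; the seeded-RNG hashing scheme is illustrative, not the one used in [15, 16]:

```python
import random

def signature(word, l=4, m=2):
    """h(word): set m (pseudo)random bits out of l, seeding the RNG with the
    word itself so the same word always yields the same signature. The m
    choices are independent and may collide on the same bit."""
    rng = random.Random(word)
    bits = 0
    for _ in range(m):
        bits |= 1 << rng.randrange(l)
    return bits

def set_signature(words, l=4, m=2):
    """Signature of a word set: bitwise OR (disjunction) of member signatures."""
    sig = 0
    for w in words:
        sig |= signature(w, l, m)
    return sig

def maybe_member(word, set_sig, l=4, m=2):
    """Membership test: every 1-bit of h(word) must appear in the set
    signature. 'False' is definite; 'True' may be a false hit, in which
    case the actual word set must still be scanned."""
    h = signature(word, l, m)
    return (h & set_sig) == h
```

Note that `maybe_member` can only rule words out; a 'True' answer still requires verification against W, exactly as described above.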
3.2 Approximate string search in text databases
The approximate string search can be described as follows: given a set of strings, we want to
efficiently find strings that are similar to a query string. It has plenty of applications, such as spell
checking, data cleaning and so on. Before exploring the algorithms for the approximate string search,
I will introduce some preliminaries related to approximate string processing.
3.2.1 Preliminaries
Different string-similarity functions. A basic concept is how we define the similarity between
two strings. Different similarity functions have been studied in the literature, such as edit
distance [23], cosine similarity [4] and Jaccard coefficient [33]. Take edit distance as an example.
Let S be a set of strings. The edit distance between two strings s1 and s2 is the minimum number
of edit operations on single characters that are needed to transform s1 to s2. Edit operations include
insertion, deletion, and substitution. We denote the edit distance between two strings s1 and s2 as
ed(s1, s2). For instance, ed("toy", "boy") = 1. Using this function, an approximate string search
with a query string q and threshold τ finds all s ∈ S such that ed(s, q) ≤ τ.
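The definition above translates directly into the textbook dynamic program. As a sketch (my own code, shown for contrast with the indexed algorithms of Section 3.2.2), the naive approximate search simply scans all of S:

```python
def ed(s1, s2):
    """Edit (Levenshtein) distance via the standard row-by-row dynamic
    program; insertions, deletions, and substitutions each cost 1."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                              # deletion
                         cur[j - 1] + 1,                           # insertion
                         prev[j - 1] + (s1[i - 1] != s2[j - 1]))   # substitution
        prev = cur
    return prev[n]

def approx_search(S, q, tau):
    """Naive approximate string search: all s in S with ed(s, q) <= tau."""
    return [s for s in S if ed(s, q) <= tau]
```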
Q-grams based inverted lists. Let Σ be an alphabet. For a string s of the characters in Σ, we
use |s| to denote the length of s. We introduce two characters # and $ not in Σ. Given a string s and
a positive integer q, we extend s to a new string s' by prefixing q − 1 copies of # and suffixing q − 1
copies of $. A positional q-gram of s is a pair (i, g), where g is the substring of length q starting at
the i-th character of s'. The set of positional q-grams of s, denoted by G(s, q), is obtained by sliding
a window of length q over the characters of s'. For example, suppose q = 2 and s = theater. G(s, q)
is: (1, #t), (2, th), (3, he), (4, ea), (5, at), (6, te), (7, er), (8, r$). The number of positional q-grams of
the string s is |s| + q − 1. We construct inverted lists of q-grams as follows: for each gram g of the
strings in S, we have a list lg of the ids of the strings that include this gram.
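The positional q-gram extraction described above can be sketched in a few lines (my own helper, following the definition in the text):

```python
def positional_qgrams(s, q):
    """Positional q-grams of s: extend s with q-1 '#' prefixes and q-1 '$'
    suffixes, then slide a window of length q; positions are 1-based."""
    extended = '#' * (q - 1) + s + '$' * (q - 1)
    return [(i + 1, extended[i:i + q]) for i in range(len(extended) - q + 1)]
```

Running it on the example string "theater" with q = 2 reproduces the eight positional grams listed above.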
T-occurrence problem. It is observed in [33] that an approximate query with a string s can
be answered by solving the following problem:
T-occurrence Problem: Find the string ids that appear at least T times on the inverted lists of the
grams in G(s, q), where T is a constant related to the threshold τ in the query, the gram length q and
the similarity function. Assume that the similarity function is edit distance. For a string r ∈ S that
satisfies the condition ed(r, s) ≤ τ, it should share at least the following number of q-grams with s:

T = max{|s|, |r|} + q − 1 − τ · q (3.1)

Many algorithms [24, 33] have been proposed for answering approximate string queries based on the
T-occurrence problem. We will elaborate on them in Section 3.2.2.
String filters. Various filters have been proposed in the literature to eliminate strings that cannot
be similar enough to a query string. In this section, we will introduce three filtering techniques:
length filter, position filter [17] and prefix filter [9].
Length Filtering: If two strings s1 and s2 are within edit distance τ, the difference between their
lengths cannot exceed τ. Therefore, given a string s1, we only need to examine strings s2 ∈ S such
that the difference between |s1| and |s2| is no greater than τ.
Position Filtering: If two strings s1 and s2 are within edit distance τ, then a q-gram in s1 cannot
correspond to a q-gram in s2 that differs by more than τ positions. Therefore, given a positional
gram (i1, g1) in the query string, we only need to consider the corresponding grams (i2, g2) in
the string set such that |i1 − i2| ≤ τ.
Prefix Filtering: Given two q-gram sets G(s1) and G(s2) for strings s1 and s2, we can fix an
ordering O of the universe from which all set elements are drawn. Let p(n, s) denote the n-th prefix
element of G(s) as per the ordering O. For simplicity, p(1, s) is abbreviated as ps. An important
property is that if |G(s1) ∩ G(s2)| ≥ T, then ps2 ≤ p(n, s1), where n = |s1| − T + 1.
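The length and position filters are essentially one-liners; a sketch (function names are mine):

```python
def length_filter(s1, s2, tau):
    """Strings within edit distance tau differ in length by at most tau."""
    return abs(len(s1) - len(s2)) <= tau

def position_filter(g1, g2, tau):
    """Positional grams (i1, gram) and (i2, gram) can only correspond if
    the grams match and their positions differ by at most tau."""
    (i1, gram1), (i2, gram2) = g1, g2
    return gram1 == gram2 and abs(i1 - i2) <= tau
```

Both tests are cheap enough to apply before any list merging or edit-distance verification.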
3.2.2 Merging-list algorithms for approximate string search
Several existing algorithms [24, 33] assume an index of inverted lists for the grams of the strings
in the data set S to answer approximate string queries. Most of them focus on reducing the running
time to merge the string-id (SID) lists of the grams of the query string. We will briefly describe
them in the following sections. Notice that an optimization is to sort the string ids on each inverted
list in ascending order.
Heap algorithm. When merging the lists, we just need to maintain a frontier of the lists and
at each step advance the frontier to the next SID that appears on at least T lists. Such an SID can
be found efficiently by using a heap to maintain the frontiers of
all lists being merged. We then repeatedly pop the minimum SID from the heap, count its number
MergeOpt(Q, T, I)
1. Let A = l1, l2, · · · , lN be the lists of index I, in decreasing order of length,
   corresponding to the N grams of Q;
2. Let L = l1, l2, · · · , lT−1 be the T − 1 longest lists of A;
   Insert the frontiers of the lists S = A − L into a heap H;
3. while H not empty do
4.   pop from H the current minimum record m, count its occurrences as cm,
     and push into H the next records from the lists in S that were popped.
5.   for i = T − 1 down to 1 do
6.     search for m in li using a binary search method, and if found, cm++;
7.   if (cm ≥ T) output m;
Figure 3.3: The MergeOpt Algorithm
of occurrences as successively popped SIDs repeat, and push into the heap the next SID from the
frontier of the popped list. Let N = |G(Q, q)| denote the number of lists corresponding to the grams
of the query string, and let M denote the total size of these N lists. This algorithm requires
O(M log N) time and O(N) storage space (not including the size of the inverted lists) for storing the heap.
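The Heap algorithm can be sketched with Python's binary heap standing in for the frontier structure (the code is my own sketch of the idea, not the surveyed implementation):

```python
import heapq

def heap_merge(lists, T):
    """Heap algorithm: merge sorted SID lists, report SIDs appearing on at
    least T lists. The heap holds one frontier entry (sid, list index,
    position) per list."""
    heap = [(lst[0], idx, 0) for idx, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        sid, count = heap[0][0], 0
        # pop every frontier entry equal to the current minimum SID
        while heap and heap[0][0] == sid:
            _, idx, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[idx]):
                heapq.heappush(heap, (lists[idx][pos + 1], idx, pos + 1))
        if count >= T:
            result.append(sid)
    return result
```

Every heap operation costs O(log N), and each of the M list entries passes through the heap once, giving the O(M log N) bound stated above.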
MergeOpt algorithm. In [33], the authors propose the MergeOpt algorithm to improve the
efficiency of the Heap algorithm. It treats the T − 1 longest inverted lists of G(Q, q) separately. For the
remaining N − (T − 1) relatively short inverted lists, it uses the Heap algorithm to merge them with
a lower threshold, i.e., 1. For each candidate string, it does a binary search on each of the T − 1 long
lists to verify if the string appears at least T times among all the lists. This algorithm is based on
the observation that a record in the answer must appear on at least one of the short lists, and doing
a binary search on the long lists is much more efficient than scanning them. The algorithm is
formally described in Figure 3.3.
ScanCount algorithm. In [24], the authors propose the ScanCount algorithm, which improves the
Heap algorithm by eliminating the heap data structure and the corresponding operations on the
heap. Instead, it just maintains an array of counters for all the SIDs in S. It scans the N inverted
lists one by one. For each SID on each list, it increments the counter corresponding to the string by
1. It reports the SIDs that appear at least T times on the lists. The time complexity of this algorithm
is O(M). The space complexity is O(|S|), where |S| is the size of the string set, since it needs to
keep a counter for each SID. The authors argue that this higher space complexity is not a major
concern, since this extra space tends to be much smaller than that of the inverted lists. Despite its
simplicity, this algorithm can achieve good performance when combined with the various filtering
techniques in Section 3.2.1.
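ScanCount is even simpler to sketch; in this illustration a dictionary of counters stands in for the per-SID counter array:

```python
from collections import defaultdict

def scan_count(lists, T):
    """ScanCount: one counter per SID; scan each inverted list once and
    report SIDs whose count reaches T. Runs in O(M) over the total list
    size M, with O(|S|) counter space."""
    counters = defaultdict(int)
    for lst in lists:
        for sid in lst:
            counters[sid] += 1
    return sorted(sid for sid, c in counters.items() if c >= T)
```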
MergeSkip algorithm. To further improve the running time for merging lists, the authors of [24]
propose the MergeSkip algorithm (Figure 3.4). The intuition is to skip, on the lists, those SIDs that
cannot be in the answer to the query, by utilizing the threshold T. Similar to the Heap algorithm, it
also maintains a heap for the frontiers of the lists. A key difference is that, during each iteration, it
pops those records from the heap that have the same value as the top record t on the heap. Let the
number of popped records be n. If there are at least T such records, it adds t to the result set (line 7
in Figure 3.4) and adds their next records on the lists to the heap. Otherwise, record t cannot be in
MergeSkip(I, T)
1. Insert the frontier SIDs of the lists into a heap H;
2. Initialize a result set R to be empty;
3. while H not empty do
4.   Let t be the top record on the heap;
5.   Pop from H those records equal to t;
6.   Let n be the number of popped records;
7.   if (n ≥ T) { Add t to R;
8.     Push the next record on each popped list to H; }
9.   else { Pop T − 1 − n smallest records from H;
10.    Let t' be the current top record on H;
11.    for each of the T − 1 popped lists
12.      Locate its smallest record r ≥ t';
13.      Push this record to H; }
14. return R;
Figure 3.4: The MergeSkip Algorithm
DivideSkip(I, T)
1. Initialize a result set R to be empty;
2. Let Llong be the set of the L longest lists
   among the lists of I;
3. Let Lshort be the remaining short lists;
4. Use MergeSkip on Lshort to find SIDs that appear
   at least T − L times;
5. for each record r found {
6.   for each list in Llong
7.     Check if r appears on this list;
8.   if (r appears ≥ T times among all lists) {
9.     Add r to R;
   }
}
10. return R;
Figure 3.5: The DivideSkip Algorithm
the answer. In addition to popping these n records, it pops T − 1 − n additional records from the
heap (line 9). Therefore, it has popped T − 1 records from the heap. Let t' be the current top record
on the heap. For each of the T − 1 popped lists, it locates the smallest record r such that r ≥ t' by
using a binary search (line 12). It then pushes r to the heap (line 13).
DivideSkip algorithm. The DivideSkip algorithm [24] is a combination of MergeSkip and
MergeOpt. Figure 3.5 summarizes the algorithm. Given a set of SID lists, it first sorts these lists
based on their lengths and divides them into two groups: the L (a tunable parameter)
longest lists form a set Llong, and the remaining short lists form another set Lshort. It uses the
MergeSkip algorithm on Lshort to find records r that appear at least T − L times on the short lists. For
each such record r and each list llong in Llong, it checks if r appears on llong. If the total number of
occurrences of r among all these lists is at least T, we add it to the result set R. The parameter L can
greatly affect the performance of the algorithm and is discussed in detail in [24].
3.2.3 Variable-length-gram algorithms for approximate string search
The above algorithms are all based on fixed-length grams. A dilemma for these algorithms
is how to select a good gram length. If we increase the gram length, fewer strings are likely to
share a gram, causing the inverted lists to be shorter. Therefore, it may decrease the time to merge
the inverted lists. On the other hand, we will have a lower threshold on the number of common
grams shared by similar strings, making the count filter less selective at eliminating dissimilar string
pairs. The number of false positives after merging the lists will increase, causing more time to
compute their real edit distances (a costly computation) in order to verify if they are in the answer
to the query. Motivated by this dilemma, the authors of [25] propose a variable-length-gram index
(VGRAM) to improve the performance. At a high level, VGRAM works as follows: (1) It analyzes
the frequencies of variable-length grams in the strings, and selects a set of grams, called the gram
dictionary, such that no selected gram in the dictionary is too frequent in the strings; (2) For a string,
it generates a set of grams of variable lengths using the gram dictionary; (3) It can be shown that
if two strings are within edit distance τ, then their sets of vgrams also have enough similarity, which
is related to τ. This set similarity can be used to improve the performance of existing algorithms.
Figures 3.6, 3.7, 3.8 and 3.9 (from Figure 2 in [25]) show a VGRAM index for the strings. In the
following, I will explore each component of this index.
id string
0 stick
1 stich
2 such
3 stuck
Figure 3.6: Strings.
Figure 3.7: Gram dictionary as a trie.
Figure 3.8: Reversed-gram trie.
id NAG vector
0 2,3
1 2,3
2 2,3
3 3,4
Figure 3.9: NAG vectors.
Variable-length grams. A gram dictionary D is a set of grams of lengths between qmin and
qmax. A gram dictionary D can be stored as a trie. The trie is a tree, and each edge is labeled with a
VGEN(D, string s, qmin, qmax)
1. position p = 1; VG = empty set;
2. while p ≤ |s| − qmin + 1
3.   Find a longest gram in D, using the trie, that matches a substring t of s starting at position p;
4.   if (t is not found) t = s[p, p + qmin − 1];
5.   if (positional gram (p, t) is not subsumed by any positional gram in VG) Insert (p, t) into VG;
6.   p = p + 1;
7. return VG;
Figure 3.10: The VGEN Algorithm
character. To distinguish a gram from its extended grams, it preprocesses the grams in D by adding
to the end of each gram a special endmarker symbol, e.g., #. A path from the root node to a leaf
node corresponds to a gram in D. For example, Figure 3.7 shows a trie for a gram dictionary of the
four strings in Figure 3.6, where qmin = 2 and qmax = 3. The dictionary includes the following
grams: ch, ck, ic, sti, st, su, tu, uc. The path n1 − n4 − n10 − n17 − n22 corresponds to the gram sti.
To generate variable-length grams, it still uses a window to slide over s, but the window size
varies, depending on the string s and the grams in D. Intuitively, it generates a gram for the longest
substring (starting from the current position) that matches a gram in D. If no such gram exists in D,
it generates a gram of length qmin. In addition, for a positional gram (a, g) whose corresponding
substring s[a, b] has been subsumed by the substring s[a', b'] of an earlier positional gram (a', g'),
i.e., a' ≤ a ≤ b ≤ b', it ignores the positional gram (a, g). The algorithm (VGEN) is formally
described in Figure 3.10.
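Assuming a plain set stands in for the gram-dictionary trie, VGEN can be sketched as follows (positions are 1-based as in the text; since positions are scanned left to right, tracking the maximum end position seen so far suffices for the subsumption test):

```python
def vgen(dictionary, s, qmin, qmax):
    """VGEN sketch: at each position take the longest dictionary gram that
    matches; fall back to the qmin-gram; skip positional grams subsumed by
    an earlier one. `dictionary` is a plain set standing in for the trie."""
    vg = []
    last_end = 0  # end position (exclusive) of the furthest-reaching kept gram
    for p in range(len(s) - qmin + 1):
        gram = None
        for q in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + q] in dictionary:
                gram = s[p:p + q]   # longest matching dictionary gram
                break
        if gram is None:
            gram = s[p:p + qmin]    # fallback: the qmin-gram
        end = p + len(gram)
        if end > last_end:          # not subsumed by an earlier gram
            vg.append((p + 1, gram))  # 1-based position
            last_end = end
    return vg
```

With the dictionary of Figure 3.7 (ch, ck, ic, sti, st, su, tu, uc), "stick" decomposes into sti, ic, ck and "such" into su, uc, ch.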
Constructing the gram dictionary. The construction of the gram dictionary proceeds in two steps.
In the first step, it analyzes the frequencies of q-grams for the strings, where q is between qmin and
qmax. In the second step, it selects grams with small frequencies.
To collect the frequencies of grams, it uses a trie (called the frequency trie). The algorithm
avoids generating all the grams for the strings based on the following observation: given a string s,
for each integer q in [qmin, qmax − 1] and each positional q-gram (p, g), there is a positional gram
(p, g') for its extended qmax-gram g'. Therefore, we can generate only the qmax-grams of the strings
to do the counting on the trie, without generating the shorter grams.
To select high-quality grams, it judiciously prunes the frequency trie and uses the remaining
grams to form the gram dictionary. The intuition of the pruning process is the following: (1) Keep
short grams if possible: if a gram g has a low frequency, eliminate from the trie all the extended
grams of g. (2) If a gram is very frequent, keep some of its extended grams. As a simple example,
consider a gram ab. If its frequency is low, then we keep it in the gram dictionary. If its
frequency is very high, we consider keeping this gram and its extended grams, such as aba,
abb, abc, etc. The goal is that, by keeping these extended grams in the dictionary, the number of
strings that generate an ab gram by the VGEN algorithm could become smaller, since they may
generate the extended grams instead of ab. The details of the algorithm are illustrated in Figure 3.11.
Similarity of gram sets. Next, I will elaborate on the relationship between the similarity of two
strings and the similarity of their gram sets generated using the same gram dictionary. For each
string s in the set S, we want to know how many grams in VG(s) can be affected by τ edit operations.
We precompute an upper bound of this number for each possible τ value, and store the values (for
Prune(Node n, Threshold T)
1. if (each child of n is not a leaf node) // the root→n path is shorter than qmin
2.   for (each child c of n) call Prune(c, T); // recursive call
3.   return;
4. L = the (only) leaf-node child of n; // a gram corresponds to the leaf-node child of n
5. if (n.freq ≤ T)
6.   keep L, and remove the other children of n; L.freq = n.freq;
7. else
8.   select a maximal subset of children of n (excluding L),
     so that the summation of their freq values and L.freq is still not greater than T;
9.   add the freq values of these children to that of L, and remove these children from n;
10. for (each remaining child c of n excluding L) call Prune(c, T); // recursive call
Figure 3.11: Pruning a subtrie to select grams
different τ values) in a vector for s, called the vector of the number of affected grams (NAG vector)
of string s. The τ-th number in the vector is denoted by NAG(s, τ). Such upper bounds can be used
to improve the performance of existing algorithms. Figure 3.9 shows the NAG vectors for the strings.
From Lemma 1 in [25], we have the following equation:

T = max{|VG(s1)| − NAG(s1, τ), |VG(s2)| − NAG(s2, τ)} (3.2)
To adopt VGRAM in existing algorithms, e.g., ProbeCluster [33], we just need to make two
minor modifications: (1) call VGEN to convert a string to a set of variable-length grams; (2) use
Equation 3.2 instead of Equation 3.1 as the set-similarity threshold for the gram sets of two similar
strings.
3.2.4 Bed-tree: index for string similarity search based on the edit distance
Zhang et al. [41] recently propose a novel tree index (Bed-tree) for approximate string search.
The Bed-tree is a B+-tree based index for evaluating all types of similarity queries (top-k, range, etc.)
on edit distance and normalized edit distance.
To index the strings with the B+-tree, it is necessary to construct a mapping from the string
domain to the integer space. The insertion and deletion operations on the B+-tree rely only on the
comparability of the string order, which is defined as follows: a string order φ is efficiently
comparable if it takes linear time to verify whether φ(si) is larger than φ(sj) for any string pair
si and sj. To handle range queries and top-k queries, it also needs a lower-bounding property, which
is defined as: a string order φ is lower bounding if it efficiently returns the minimal edit distance
between a string q and any sl ∈ [si, sj].
With this property, the B+-tree can handle range queries using the algorithm in Figure 3.13. The
VerifyED function (Figure 3.12) tests whether the edit distance of two strings is within some
threshold θ. In the algorithms, LB(si, [sj−1, sj]) denotes the lower bound on the edit distance
between si and any string sl ∈ [sj−1, sj]. The algorithm in Figure 3.13 iteratively visits the nodes
whose lower-bound edit distance is no larger than θ and verifies the strings found at the leaf level
of the tree using the VerifyED function. The minimal and maximal strings smin and smax indicate
the boundaries of any string in a given subtree with respect to the string order φ. This information
can be retrieved from the parent node, as the algorithm implies.
VerifyED(string s1, string s2, threshold θ)
1. if ||s1| − |s2|| > θ return FALSE
2. Construct a table T of 2 rows and |s2| + 1 columns
3. for j = 1 to min(|s2| + 1, 1 + θ) T[1][j] = j − 1
4. Set m = θ + 1
5. for i = 2 to |s1| + 1
6.   for j = max(1, i − θ) to min(|s2| + 1, i + θ)
7.     d1 = (j < i + θ) ? T[1][j] + 1 : θ + 1
8.     d2 = (j > 1) ? T[2][j − 1] + 1 : θ + 1
9.     d3 = (j > 1) ? T[1][j − 1] + ((s1[i − 1] = s2[j − 1]) ? 0 : 1) : θ + 1
10.    T[2][j] = min(d1, d2, d3); m = min(m, T[2][j]);
11.  if m > θ return FALSE
12.  for j = 0 to |s2| + 1 T[1][j] = T[2][j]
13. return TRUE;
Figure 3.12: The VerifyED function
RangeQuery(string q, B+-tree node N, threshold θ,
           minimal string smin, maximal string smax)
1. if N is a leaf node
2.   for each sj ∈ N
3.     if VerifyED(q, sj, θ)
4.       Include sj in the query result
5. else
6.   if LB(q, [smin, s1]) ≤ θ
7.     RangeQuery(q, N1, θ, smin, s1)
8.   for j = 2 to m
9.     if LB(q, [sj−1, sj]) ≤ θ
10.      RangeQuery(q, Nj, θ, sj−1, sj)
11.  if LB(q, [sm, smax]) ≤ θ
12.    RangeQuery(q, Nm+1, θ, sm, smax)
Figure 3.13: The RangeQuery Algorithm
If a string order φ supports range queries, it also directly supports top-k selection queries on the
B+-tree structure. It simply uses a min-heap to keep the current top-k similar strings and updates
the threshold θ with the distance value of the top element in the heap. For details, refer to [41].
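A banded, early-terminating edit-distance test in the spirit of VerifyED can be sketched in Python as follows (this is my own reformulation of the idea, not the paper's code; entries outside the band of width 2θ + 1 are clamped to θ + 1):

```python
def verify_ed(s1, s2, theta):
    """Return True iff ed(s1, s2) <= theta. Computes the DP row by row
    inside a band of width 2*theta + 1 and stops as soon as every entry
    in the current row exceeds theta."""
    if abs(len(s1) - len(s2)) > theta:
        return False                      # length filter
    big = theta + 1                       # sentinel for "already too far"
    prev = [j if j <= theta else big for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        cur = [big] * (len(s2) + 1)
        lo, hi = max(0, i - theta), min(len(s2), i + theta)
        if lo == 0:
            cur[0] = i if i <= theta else big
        for j in range(max(1, lo), hi + 1):
            cur[j] = min(prev[j] + 1,                              # deletion
                         cur[j - 1] + 1,                           # insertion
                         prev[j - 1] + (s1[i - 1] != s2[j - 1]))   # substitution
        if min(cur) > theta:
            return False                  # early termination (line 11)
        prev = cur
    return prev[len(s2)] <= theta
```

Only O(θ) cells per row are computed, and the scan aborts as soon as no cell can still lead to a distance within θ, matching the two pruning steps in Figure 3.12.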
The authors of [41] also discuss three concrete string mappings. According to their experiments,
the most effective mapping is the gram counting order (GCO), which I will briefly describe in the following.
GCO is based on counting the number of q-grams within a string. It uses a hash function to map
q-grams to a set of L buckets, and counts the number of q-grams in each bucket. Thus, the q-gram
set is transformed into a vector of L non-negative integers. The edit distance between two strings si
and sj is no smaller than

max( |Q(si) \ Q(sj)| / q , |Q(sj) \ Q(si)| / q )
Here, Q(si) \ Q(sj) is the set of q-grams in Q(si) but not in Q(sj), and vice versa. After mapping
strings from the gram space to the bucket space, a new lower bound holds. If vi and vj are the
L-dimensional bucket vector representations of si and sj respectively, the edit distance between si
and sj is no smaller than
max( Σ_{vi[l] > vj[l]} (vi[l] − vj[l]) / q , Σ_{vj[l] > vi[l]} (vj[l] − vi[l]) / q )
To achieve a tighter lower bound, they apply the z-order on the q-gram counting vectors. Eventually,
GCO maps the strings to z-values, which can be indexed by the B+-tree.
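Under an illustrative hash-to-bucket scheme (the bucket assignment here uses Python's built-in hash and is my own choice), the GCO bucket-vector lower bound can be sketched as:

```python
def qgrams(s, q=2):
    """Plain (non-positional) q-grams of s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def bucket_vector(s, L=4, q=2):
    """Hash each q-gram into one of L buckets and count per bucket."""
    v = [0] * L
    for g in qgrams(s, q):
        v[hash(g) % L] += 1
    return v

def gco_lower_bound(vi, vj, q=2):
    """Lower bound on ed(si, sj) from bucket vectors: one edit operation
    changes at most q grams, so the positive and the negative per-bucket
    count differences each bound the number of edits from below."""
    pos = sum(a - b for a, b in zip(vi, vj) if a > b)
    neg = sum(b - a for a, b in zip(vi, vj) if b > a)
    return max(pos, neg) / q
```

Hash collisions can only merge differing grams into the same bucket, so the bucket-space bound never exceeds the gram-space bound and remains a valid lower bound.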
CHAPTER 4
SPATIAL KEYWORD AND APPROXIMATE STRING
QUERIES
4.1 Keyword search on spatial databases
Many applications need to find the objects closest to a specified location that contain a set of
keywords, which leads to a novel type of search: spatial keyword search (SKS). An SKS query
consists of a query point and a set of keywords. The answer is a list of objects ranked according to a
combination of their distance to the query point and the relevance of their text description to the
query keywords. Several variants of such a query have been studied since SKS emerged in the
database community [16]. I will briefly introduce them in the following.
4.1.1 Distance-first top-k spatial keyword query
Felipe et al. in [16] propose a variant of SKS: the distance-first top-k spatial keyword query.
Formally, an object O is defined as a pair (O.p, O.t), where O.p is a location descriptor in the
multidimensional space and O.t is a textual description. Let D be the universe of all objects in a
database. A top-k spatial query Qp retrieves the k nearest objects to the specified query point q. A keyword
Figure 4.1: The example of an IR2-tree.
query Qt consists of a set of keywords w1, · · · , wm. The result of Qt is a list of objects ordered
by the relevance IRscore(O.t, Qt) of their textual descriptions to the query keywords, as measured
by an IR ranking function [34]. In particular, the authors of [16] deal with the Boolean keyword
query, which returns the set of all objects whose text contains all of w1, · · · , wm. A distance-first
top-k spatial keyword query is a combination of a top-k spatial query and a Boolean keyword query,
i.e., it retrieves the k objects that contain all of w1, · · · , wm and are closest to Qp.
The state-of-the-art solution to the above problem is based on the information retrieval R-tree
(IR2-tree), which combines the R-tree [5, 18] with the signature file [15]. For a review of the
signature file, please refer to Section 3.1.2.
The IR2-tree is an R-tree where each entry N is augmented with a signature that summarizes
the union of the texts of the objects in the subtree of N. Figure 4.1 demonstrates an example. Note
that the signature of a nonleaf entry N can be conveniently obtained as the disjunction of all the
signatures in the child node of N.
To answer NN queries, Felipe et al. [16] adapt the best-first algorithm [19] to the IR2-tree. In
particular, given a query q and a keyword set Wq, the algorithm accesses the entries of an IR2-tree in
ascending order of the distance of their MBRs to q, pruning those entries whose signatures indicate
the absence of at least one word of Wq in their subtrees. If a leaf entry, say a point p, cannot be
pruned, it retrieves the text Wp. If Wq is a subset of Wp, then it returns p as an answer. Otherwise,
it continues until no more entries need to be processed.
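This best-first traversal with signature pruning can be sketched over a toy node layout (the tuple format here is mine, not the IR2-tree's actual page layout; distances are precomputed for brevity instead of being derived from MBRs):

```python
import heapq

def ir2_nn(root, query_sig, query_words, k=1):
    """Best-first kNN sketch over an IR2-tree-like structure. Each node is a
    tuple (dist, sig, children, words): internal nodes have children and
    words=None, leaves the reverse. Entries are popped in ascending distance
    order; a subtree is pruned unless its signature contains every 1-bit of
    the query signature."""
    tick = 0  # tie-breaker so the heap never compares node tuples
    heap = [(root[0], tick, root)]
    results = []
    while heap and len(results) < k:
        _, _, (dist, sig, children, words) = heapq.heappop(heap)
        if sig & query_sig != query_sig:
            continue              # signature says some query word is missing
        if children is None:      # leaf: check real words to rule out false hits
            if query_words <= words:
                results.append(words)
        else:
            for child in children:
                tick += 1
                heapq.heappush(heap, (child[0], tick, child))
    return results
```

Because entries are expanded in ascending distance order, the first k leaves that survive both the signature test and the verification are exactly the k nearest qualifying objects.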
4.1.2 The m-closest keywords query on spatial databases
Zhang et al. [39] deal with another variant of SKS, defined as follows. Let D be a set of points,
each of which is associated with a single keyword. Given a set Wq of query keywords, the goal
is to retrieve |Wq| points from D such that (1) each point has a distinct keyword in Wq and (2) the
maximum mutual distance of these points is minimized. This problem is called the m-closest
keywords (mCK) query.
To efficiently answer mCK queries, Zhang et al. [39] propose a novel index, the bR∗-tree, which
is an extension of the R∗-tree [5]. Besides the MBR, each node is associated with additional keyword
information. A straightforward extension is to summarize the keywords in the node. Assuming
there are N keywords in the database, the keywords of each node can be represented using a bitmap
of size N, with a '1' indicating a keyword's existence in the node and a '0' otherwise. For example,
a bitmap B = 1001 indicates that there are four keywords in the database and the current node is
associated with the keywords in the first and last positions of the bitmap. Given a query Q = 0110,
if we have B & Q = 0, it implies that the given node does not include any query keywords and can
be pruned from the search space. Besides the keyword bitmap, each node also stores keyword MBRs
to facilitate pruning. The keyword MBR of keyword wi is the MBR covering all the objects in the
node that are associated with wi. Using this information, we know the approximate area in the node
where each keyword is distributed. Figure 4.2 illustrates an example of an internal node containing
three keywords w1, w2 and w3, represented as 111. It also maintains the keyword MBRs of w1, w2
Figure 4.2: An example of the node information of the bR∗-tree.
and w3. The construction of the bR∗-tree follows the original R∗-tree algorithm [5], with the
additional operations of updating the keywords and keyword MBRs when AdjustTree is invoked.
The set of keywords of a parent node is the union of the sets of keywords of its child nodes, and
the keyword MBR of wi in the parent node is the minimum bounding rectangle of the corresponding
keyword MBRs in the child nodes.
The search algorithm starts from the root node. The result can be located within one child node
or across multiple child nodes of the root. If a child node contains all the m query keywords, it is
included in the candidate search space. Similarly, if multiple child nodes together can contribute all
the query keywords and they are close to each other, then they are also included in the search space.
After exploring the root node, it obtains a list of candidate subsets of its child nodes. Note that
during the whole search process, the number of nodes in a node set never exceeds m, because the
target m tuples can reside in at most m child nodes. This provides an additional constraint to
reduce the search space.
The first step is to find a relatively small diameter for branch-and-bound pruning before the
search starts. It starts from the root node, chooses the child node with the smallest MBR that
contains all the query keywords, and traverses down into that node. The process is repeated until it
reaches the leaf level or until it is unable to find any child node with all the query keywords. It then
performs an exhaustive search within the node it found and uses the diameter of the result as the
initial diameter for searching (denoted as δ∗).
SubsetSearch(current subset curSet)
1. if (curSet contains leaf nodes)
2.   δ = ExhaustiveSearch(curSet);
3.   if (δ < δ∗) δ∗ = δ;
4. else
5.   if (curSet has only one node)
6.     setList = SearchInOneNode(curSet);
7.     for each S ∈ setList
8.       δ∗ = SubsetSearch(S);
9.   if (curSet has multiple nodes)
10.    setList = SearchInMultiNode(curSet);
11.    for each S ∈ setList
12.      δ∗ = SubsetSearch(S);
13. output the distance of the m closest keywords.
Figure 4.3: The SubsetSearch function
SearchInOneNode(a node N in the bR∗-tree)
1. L1 = all the child nodes of N;
2. for i from 2 to m
3.   for each NodeSet C1 ∈ Li−1
4.     for each NodeSet C2 ∈ Li−1
5.       if C1 and C2 share the first i − 2 nodes
6.         C = NodeSet(C1, C2);
7.         if (C has a subset not appearing in Li−1) continue;
8.         if (C is neither distance mutex nor keyword
9.           mutex) Li = Li ∪ C;
10. for each NodeSet S ∈ ∪_{i=1}^{m} Li
11.   if (S contains all query keywords)
12.     add S to cList;
13. return cList;
Figure 4.4: The SearchInOneNode function
With this initial δ∗, the search starts from the root node. The function SubsetSearch in Figure
4.3 traverses the tree in a depth-first manner so as to visit the data objects in leaf entries as soon as
possible. An effective strategy for reducing the number of candidate subsets is important, as each
subset will later spawn an exponential number of new subsets. The authors utilize the apriori
algorithm proposed in [1], combined with two constraints called distance mutex and keyword mutex.
The definition of distance mutex is based on the observation that if the minimum distance between
the MBRs of two nodes N and N' is larger than δ∗, then the node set {N, N'} cannot give a result
with diameter better than δ∗. A similar property holds for keyword MBRs, which is defined as
keyword mutex. To demonstrate how to utilize these two constraints, Figure 4.4 illustrates the
SearchInOneNode function, which is the essential part of the whole algorithm.
First (in line 1 of Figure 4.4), it puts all the child nodes in the bottom level of the lattice. The lattice is built level by level with an increasing number of child nodes per NodeSet: in level i, each NodeSet contains exactly i child nodes. For a query with m keywords, we only need to check NodeSets with at most m nodes, leading to a lattice with at most m levels. Lines 5–6 show two sets C1 and C2 in level i − 1 being joined; they must have i − 2 nodes in common. Lines 7–9 check whether any of the new candidate's subsets in level i − 1 was pruned due to distance mutex or keyword mutex. If all the subsets are legal, we check whether the new candidate itself is distance mutex or keyword mutex; if it is not pruned, we add it to level i. In lines 10–12, after all the candidates have been generated, each one is checked to see if it contains all the query keywords, and those missing any keyword are eliminated. This constraint is not checked while building the lattice because a node that does not contain all the query keywords can still be combined with other nodes to cover the missing keywords. As long as a NodeSet is neither distance mutex nor keyword mutex, we keep it in the lattice.
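The level-wise candidate generation described above can be sketched as follows. This is a minimal illustration, not the implementation from [39]: the two mutex constraints are passed in as caller-supplied predicates, the final keyword-coverage filter (lines 10–12 of Figure 4.4) is omitted, and all names are assumptions.

```python
from itertools import combinations

def generate_candidates(children, m, is_distance_mutex, is_keyword_mutex):
    """Level-wise (a priori) lattice generation over the child nodes of one
    bR*-tree node.  Mutex predicates take a tuple of nodes and return True
    if the NodeSet can be pruned."""
    levels = [[(c,) for c in children]]            # level 1: singleton NodeSets
    for i in range(2, m + 1):                      # NodeSets of exactly i nodes
        prev = levels[-1]
        survived = set(prev)
        cur = []
        for a in range(len(prev)):
            for b in range(a + 1, len(prev)):
                c1, c2 = prev[a], prev[b]
                if c1[:-1] != c2[:-1]:             # must share the first i-2 nodes
                    continue
                cand = c1 + (c2[-1],)
                # a priori principle: every (i-1)-subset must have survived
                if any(s not in survived for s in combinations(cand, i - 1)):
                    continue
                if is_distance_mutex(cand) or is_keyword_mutex(cand):
                    continue                       # pruned by a mutex constraint
                cur.append(cand)
        levels.append(cur)
    return [s for level in levels for s in level]
```

With no pruning, three children and m = 3 yield all seven non-empty subsets; pruning one pair also removes every superset of that pair, which is exactly why each candidate spawning exponentially many subsets makes early pruning so effective.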
The drawback of the bR∗-tree is that it may suffer from high I/O cost if the total number of keywords is large. The root node has to maintain all keyword MBRs. The internal nodes become
Figure 4.5: An example of the index of R∗-tree and inverted lists (from Figure 3 in [40]).
overloaded and occupy a large amount of storage space. Such an index structure prevents the search algorithm from handling spatial databases with a massive number of keywords. Later on, Zhang et al. [40] remedy this drawback by proposing a novel index based on the R∗-tree and inverted lists. The new index scales well for mCK queries and is applied to locating mapped resources in Web 2.0 [40].
The new lightweight index structure is based on the R∗-tree and inverted lists. The R∗-tree is used to index all the spatial locations associated with keywords. It is constructed in the same manner as described in [5], except that each node is assigned a label indicating the path to the root node. As an example, Figure 4.5 (from Figure 3 in [40]) illustrates an index of the R∗-tree and inverted lists. Nodes R5 and R6 are both labeled 'b' because they share the same path through the root node's second entry. Given a node label, we can find where the node is located in the R∗-tree without explicitly accessing the tree. Two locations close to each other are likely to share the same node-label prefix.
Figure 4.6: An example of bottom-up construction of the virtual bR∗-tree (from Figure 4 in [40]).
Thus, the label can be used to approximate the distance between the points. An inverted index is built along with the R∗-tree. It maintains inverted lists for all the keywords. Each element in a list consists of the node label and the actual location. Note that the list of locations is ordered by the label to preserve the spatial proximity of nodes.
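A minimal sketch of such a label-ordered inverted index follows. The path-label encoding (short strings such as 'aa', 'ba'), the point format, and the function name are illustrative assumptions, not the structures of [40].

```python
from collections import defaultdict

def build_inverted_index(points):
    """points: iterable of (label, location, keywords), where `label` is a
    hypothetical string encoding of the point's root-to-leaf path in the
    R*-tree.  Each keyword's list holds (label, location) pairs kept in
    label order, so entries that are close in the tree stay adjacent."""
    index = defaultdict(list)
    for label, loc, keywords in points:
        for kw in keywords:
            index[kw].append((label, loc))
    for kw in index:
        index[kw].sort(key=lambda entry: entry[0])   # order by node label
    return index
```

Because labels share prefixes exactly when the underlying nodes share a path, sorting each list by label clusters spatially nearby points together, which is what later allows same-label runs to be cut out cheaply at query time.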
For the query processing, given m query keywords, it first retrieves m lists of data points from the inverted lists. The authors in [40] propose a solution that constructs a virtual bR∗-tree from the labels and locations of the data points. The virtual bR∗-tree is built level by level in a bottom-up manner, as illustrated in Figure 4.6 (from Figure 4 in [40]). First, the m inverted lists corresponding to the query tags are merged into one list ordered by the node label. Points with the same label are used to construct a new virtual node, which has a counterpart in the original R∗-tree. Note that the MBR of the virtual node is much smaller than that of its counterpart, as it is built only on the points relevant to the query. Each virtual node maintains the keyword bitmap and the keyword MBRs as in [5]. Then, the a priori-based search strategy can be applied to the virtual bR∗-tree. The detailed search algorithm is shown in Figure 4.7. When a virtual node is constructed, it is treated as a subtree and fed to the algorithm SubsetSearch proposed in [5].
Bottom-Up search (m query keywords, inverted index)
1.  Retrieve m inverted lists for all query keywords;
2.  Merge the lists into one list L ordered by the label;
3.  Initialize a virtual node curNode;
4.  while L.level < tree.height
5.    for each element vnode in L
6.      if (vnode has the same label as its previous element)
7.        add vnode into curNode;
8.      else
9.        SubsetSearch(curNode);
10.       add curNode to list L′;
11.   L = L′;
12. output distance and location of m closest keywords.
Figure 4.7: The bottom-up search
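The merge-and-group step that feeds the bottom-up search (steps 1–2 of Figure 4.7, plus forming one virtual node per label run) might be sketched as below. The entry format, the 2D MBR computation, and all names are illustrative assumptions.

```python
import heapq
from itertools import groupby

def merge_and_group(lists):
    """Merge label-ordered inverted lists and group entries that share a
    node label; each group becomes one virtual node whose MBR covers only
    the query-relevant points.  Entries are (label, (x, y)) pairs."""
    # heapq.merge keeps the combined stream ordered by label
    merged = heapq.merge(*lists, key=lambda entry: entry[0])
    virtual_nodes = []
    for label, group in groupby(merged, key=lambda entry: entry[0]):
        pts = [loc for _, loc in group]
        xs = [x for x, _ in pts]
        ys = [y for _, y in pts]
        # the virtual node's MBR is typically tighter than its R*-tree
        # counterpart, since only query-relevant points contribute
        virtual_nodes.append((label, (min(xs), min(ys), max(xs), max(ys)), pts))
    return virtual_nodes
```

Because every input list is already ordered by label, both the merge and the grouping run in a single linear pass over the retrieved points.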
4.2 Approximate string search on spatial databases
In this section, we consider another problem related to spatial keyword search. In reality, many scenarios require keyword search that retrieves approximate string matches [2, 8, 10, 21, 22, 24, 28, 36]. Approximate string search is necessary when users have a fuzzy search condition or simply make a spelling error when submitting the query, or when the strings in the database contain some
[Figure content: points p1–p10 with string labels, among them "theatre" (p7), "theaters" (p3), "theater" (p4), "grocery" (p5), "ymca club" (p2), "gym" (p10), "barnes noble" (p6), "shaw's market" (p9), "Commonwealth ave" (p8), and "Moe's" (p1); query Q with range r, string "theatre", and τ = 2.]
Figure 4.8: Approximate string search with a range query.
degree of uncertainty or error. In the context of spatial databases, approximate string search can be combined with any type of spatial query, including range and nearest neighbor queries. An example of an approximate string match range query is shown in Figure 4.8, depicting a common scenario in location-based services: find all objects within a spatial range r that have a description similar to 'theatre'. Similar examples can be constructed for KNN queries. We refer to these queries as Spatial Approximate String (SAS) queries. This problem was first addressed by Yao et al. in [37]. Approximate string search has been studied independently in the database community (see Chapter 3). The main challenge for the SAS problem is that the basic solution of simply integrating edit-distance evaluation techniques into a normal R-tree is expensive and impractical.

Yao et al. [37] propose the MHR-tree, which exploits the pruning capabilities of the string match predicate and the spatial predicate simultaneously. The MHR-tree is based on the R∗-tree. Each node stores a signature of the string information of its children. Before exploring how to construct the MHR-tree and answer SAS queries with it, I will introduce the min-wise signature, which is used to summarize the string information.
4.2.1 The min-wise signature
The min-wise independent families of permutations were first introduced in [6, 12] and can be used for estimating set resemblance. Broder et al. have shown in [6] that a min-wise independent permutation π can be used to construct an unbiased estimator for set resemblance. However, there is no efficient implementation of the min-wise families of permutations in practice. Thus, prior art often uses linear hash functions based on Rabin fingerprints to simulate the behavior of the min-wise independent permutations [6].
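A minimal sketch of this simulation follows, using random linear hash functions h_i(x) = (a_i·x + b_i) mod p in place of true min-wise independent permutations. It operates on integer set elements (in practice, q-grams would first be mapped to integers, e.g. via Rabin fingerprints); all names and parameter choices are assumptions for illustration.

```python
import random

PRIME = 2**31 - 1  # Mersenne prime used as the hash modulus

def make_hashes(ell, seed=7):
    """ell random linear hash functions standing in for ell min-wise
    independent permutations pi_1, ..., pi_ell."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(ell)]

def minwise_signature(items, hashes):
    """s(A)[i] = min over x in A of h_i(x)."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hashes]

def estimate_resemblance(s1, s2):
    """The fraction of agreeing positions is an estimator of the set
    resemblance |A ∩ B| / |A ∪ B|."""
    return sum(u == v for u, v in zip(s1, s2)) / len(s1)
```

The signature length ℓ trades accuracy for space: more hash functions sharpen the resemblance estimate, while the signature itself stays constant-size regardless of how large the underlying set is.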
4.2.2 The construction of the MHR-tree
To incorporate the pruning power of edit distances into the R-tree, we can utilize the result of Lemma 1 in [37]. The intuition is that if we store the q-grams of all strings in a subtree rooted at an R-tree node u, denoted Gu, then given a query string σ, we can extract the query q-grams Gσ and check the size of the intersection between Gu and Gσ, i.e., |Gu ∩ Gσ|. We can then possibly prune node u by Lemma 1 in [37], even if u intersects the query range r or needs to be explored based on the MinDist metric for the KNN query.

To estimate the intersection size between Gu and Gσ, we embed the min-wise signature of Gu in an R-tree node, instead of Gu itself. The min-wise signature s(Gu) has a constant size, which makes it suitable for storage in the R-tree node.

For a leaf-level node u, let the set of points contained in u be up. For every point p in up, we compute its q-grams Gp and the corresponding min-wise signature s(Gp). To compute the min-wise signature for the node, let s(A)[i] be the ith element of the min-wise signature of A, i.e., s(A)[i] = min{πi(A)}. Given the min-wise signatures {s(A1), . . . , s(Ak)} of k sets, one
Construct-MHR(Data set P, Hash functions {h1, . . . , hℓ})
1.  Use any existing bulk-loading algorithm A for the R-tree;
2.  Let u be an R-tree node produced by A over P;
3.  if (u is a leaf node)
4.    Compute Gp and s(Gp) for every point p ∈ up;
5.    Store s(Gp) together with p in u;
6.  else
7.    for (every child entry ci with child node wi)
8.      Store MBR(wi), s(Gwi), and a pointer to wi in ci;
9.    for (i = 1, . . . , ℓ)
10.     Let s(Gu)[i] = min(s(Gw1)[i], . . . , s(Gwf)[i]);
11. Store s(Gu) in the parent of u;
Figure 4.9: The construction of the MHR-tree.
can easily derive that:

s(A1 ∪ · · · ∪ Ak)[i] = min{s(A1)[i], . . . , s(Ak)[i]} (4.1)

for i = 1, . . . , ℓ, since each element in a min-wise signature always takes the smallest image of a set.

We can obtain s(Gu) directly using Equation 4.1 and the s(Gp)s of every point p ∈ up. Finally, we store all (p, s(Gp)) pairs in node u, and s(Gu) in the index entry that points to u in u's parent.
For an index-level node u, let its child entries be {c1, . . . , cf}, where f is the fan-out of the R-tree. Each entry ci points to a child node wi of u and contains the MBR of wi. We also store the min-wise signature of the node pointed to by ci, i.e., s(Gwi). Clearly, Gu = Gw1 ∪ · · · ∪ Gwf. Hence, s(Gu) = s(Gw1 ∪ · · · ∪ Gwf); based on Equation 4.1, we can compute s(Gu) using the s(Gwi)s.

This procedure is applied recursively in a bottom-up fashion until the root node of the R-tree has been processed. The complete construction algorithm is presented in Figure 4.9.
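Equation 4.1 makes the bottom-up aggregation in lines 9–10 of Figure 4.9 a one-liner: a parent's signature is the positionwise minimum of its children's signatures, with no access to the underlying q-gram sets. A sketch (the function name is illustrative):

```python
def signature_union(child_signatures):
    """Equation 4.1: s(A1 ∪ ... ∪ Ak)[i] = min_j s(Aj)[i], so s(Gu) is the
    positionwise minimum of the children's signatures s(Gwi)."""
    return [min(position) for position in zip(*child_signatures)]
```

This is exactly why the MHR-tree can be built bottom-up: each internal node's signature is derived from constant-size child signatures alone.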
4.2.3 The query algorithm for the MHR-tree
The query algorithms for the MHR-tree generally follow the same principles as the corresponding algorithms for the spatial query component. However, we would like to incorporate the pruning method based on q-grams and set-resemblance estimation without explicit knowledge of Gu for a given R-tree node u; we must achieve this with the help of s(Gu). Thus, the key issue boils down to estimating |Gu ∩ Gσ| using s(Gu) and σ. The details of estimating |Gu ∩ Gσ| are presented in Section IV of [37].

The SAS range query algorithm is presented in Figure 4.10. One can also revise the KNN algorithm for the normal R-tree to derive the KNN-MHR algorithm. The basic idea is to use a priority queue that orders objects with respect to the query point using the MinDist metric. However, only nodes or data points that pass the string pruning test are inserted into the queue. Whenever a point is removed from the head of the queue, it is inserted into A. The search terminates when A has k points or the priority queue becomes empty.
Range-MHR(MHR-tree R, Range r, String σ, int τ)
1.  Let B be a FIFO queue initialized to ∅; let A = ∅;
2.  Let u be the root node of R; insert u into B;
3.  while (B ≠ ∅)
4.    Let u be the head element of B; pop u;
5.    if (u is a leaf node)
6.      for (every point p ∈ up)
7.        if (p is contained in r)
8.          if (|Gp ∩ Gσ| ≥ max(|σp|, |σ|) − 1 − (τ − 1) · q)
9.            if (ε(σp, σ) ≤ τ) insert p into A;
10.   else
11.     for (every child entry ci of u)
12.       if (r and MBR(wi) intersect)
13.         Calculate s(Gwi ∪ Gσ) based on s(Gwi), s(Gσ), and Equation 4.1;
14.         Calculate the estimated |Gwi ∩ Gσ| using Equation 8 in [37];
15.         if (estimated |Gwi ∩ Gσ| ≥ |σ| − 1 − (τ − 1) · q)
16.           Read node wi and insert wi into B;
17. Return A.
Figure 4.10: The range queries with the MHR-tree.
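The string test in line 8 of Figure 4.10 is the q-gram count filter over padded q-grams: two strings within edit distance τ must share at least max(|σp|, |σ|) − 1 − (τ − 1)·q grams. A sketch follows; it uses set rather than multiset semantics for simplicity (the underlying lemma is stated over gram multisets), and the function names are illustrative.

```python
def qgrams(s, q=2, pad='#'):
    """Padded q-grams: extend s with q-1 copies of `pad` on each side,
    then take every window of length q (dropping duplicates)."""
    s = pad * (q - 1) + s + pad * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def count_filter_passes(s1, s2, tau, q=2):
    """The test of line 8 in Figure 4.10: if ed(s1, s2) <= tau, the padded
    q-gram sets must share at least max(|s1|, |s2|) - 1 - (tau - 1)*q
    grams; failing this prunes the pair without computing edit distance."""
    shared = len(qgrams(s1, q) & qgrams(s2, q))
    return shared >= max(len(s1), len(s2)) - 1 - (tau - 1) * q
```

For the running example of Figure 4.8 with τ = 2 and q = 2, 'theatre' and 'theater' share five padded 2-grams against a required four, so the pair survives to the edit-distance verification, while 'grocery' shares none and is pruned immediately.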
CHAPTER 5
CONCLUSION
In this document, I survey a type of query: spatial keyword search. Since it is a combination of spatial search and text search, I first review some critical issues in both fields. Range and nearest neighbor queries are fundamental problems in spatial databases; I go over them in both Euclidean space and on road networks. Next, I present the state-of-the-art techniques for string processing in the literature. Approximate string search is a hot research topic in string databases, and I introduce the most popular q-gram-based indices and algorithms for it. Lastly, I turn to the spatial keyword and approximate string queries. Only a few variants of these queries have been studied, and many problems remain open. I briefly describe three variants that have been explored.
BIBLIOGRAPHY
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB, 1994.
[2] A. Arasu, S. Chaudhuri, K. Ganjam, and R. Kaushik. Incorporating string transformations in record matching. In SIGMOD, 2008.
[3] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM, 45(6):891–923, 1998.
[4] R. Bayardo, Y. Ma, and R. Srikant. Scaling up all-pairs similarity search. In WWW, 2007.
[5] N. Beckmann, H. P. Kriegel, R. Schneider, and B. Seeger. The R∗-tree: an efficient and robust access method for points and rectangles. In SIGMOD, 1990.
[6] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, 1998.
[7] T. M. Chan. Approximate nearest neighbor queries revisited. In SoCG, 1997.
[8] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003.
[9] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
[10] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
[11] H.-J. Cho and C.-W. Chung. An efficient and scalable approach to CNN queries in a road network. In VLDB, 2005.
[12] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 55(3):441–453, 1997.
[13] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SoCG, 2004.
[14] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959.
[15] C. Faloutsos and S. Christodoulakis. Signature files: an access method for documents and its analytical performance evaluation. TOIS, 2:267–288, 1984.
[16] I. D. Felipe, V. Hristidis, and N. Rishe. Keyword search on spatial databases. In ICDE, 2008.
[17] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.
[18] A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, 1984.
[19] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. TODS, 24(2):265–318, 1999.
[20] H. V. Jagadish, B. C. Ooi, K. Tan, C. Yu, and R. Zhang. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst., 30(2):364–397, 2005.
[21] N. Koudas, A. Marathe, and D. Srivastava. Flexible string matching against large databases in practice. In VLDB, 2004.
[22] H. Lee, R. Ng, and K. Shim. Extending q-grams to estimate selectivity of string matching with low edit distance. In VLDB, 2007.
[23] V. Levenshtein. Binary codes capable of correcting spurious insertions and deletions of ones. Inf. Transmission, 1:8–17, 1965.
[24] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.
[25] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007.
[26] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao. Query processing in spatial network databases. In VLDB, 2003.
[27] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, 1995.
[28] S. Sahinalp, M. Tasan, J. Macker, and Z. Ozsoyoglu. Distance based indexing for string proximity search. In ICDE, 2003.
[29] H. Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, San Francisco, 2006.
[30] H. Samet, J. Sankaranarayanan, and H. Alborzi. Scalable network distance browsing in spatial databases. In SIGMOD, 2008.
[31] J. Sankaranarayanan, H. Alborzi, and H. Samet. Efficient query processing on spatial networks. In ACM GIS, 2005.
[32] J. Sankaranarayanan and H. Samet. Distance oracles for spatial networks. In ICDE, 2009.
[33] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
[34] A. Singhal. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 2001.
[35] Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high dimensional nearest neighbor search. In SIGMOD, 2009.
[36] X. Yang, B. Wang, and C. Li. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In SIGMOD, 2008.
[37] B. Yao, F. Li, M. Hadjieleftheriou, and K. Hou. Approximate string search in spatial databases. In ICDE, 2010.
[38] B. Yao, F. Li, and P. Kumar. K nearest neighbor queries and kNN-joins in large relational databases (almost) for free. In ICDE, 2010.
[39] D. Zhang, Y. M. Chee, A. Mondal, A. K. H. Tung, and M. Kitsuregawa. Keyword search in spatial databases: Towards searching by document. In ICDE, 2009.
[40] D. Zhang, B. C. Ooi, and A. K. H. Tung. Locating mapped resources in Web 2.0. In ICDE, 2010.
[41] Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD, 2010.