Survey on Spatial Keyword Search
Bin Yao
Department of Computer Science
Florida State University
Abstract
Many applications feature both text and location information, which leads to a novel type of
search: spatial keyword search (SKS). SKS has only recently begun to gain attention from the
database community, and a large variety of problems remain open. Spatial and text data have been
studied independently for decades. Spatial data have many unique features that are drastically
different from text data; as a result, most of the existing techniques for string processing are either
inapplicable or inefficient when adapted to spatial databases. My Ph.D. research is in the general
area of SKS. The areas related to my research include nearest neighbor search, range search,
approximate string search, and combined spatial keyword and approximate string search. This
document surveys these related areas.
TABLE OF CONTENTS
List of Figures
1 Introduction
2 Range and nearest neighbor queries
2.1 Range search in Euclidean space
2.2 NN search in Euclidean space
2.2.1 Exact NN search for low dimensional data
2.2.2 Exact NN search for high dimensional data
2.2.3 Approximate NN search for low dimensional data
2.2.4 Approximate NN search for high dimensional data
2.2.5 KNN search in relational databases
2.3 Range and NN queries on road networks
2.3.1 Range search on road networks
2.3.2 KNN search on road networks
3 Keyword and approximate string queries
3.1 Keyword search in text databases
3.1.1 Tree index
3.1.2 Signature file
3.2 Approximate string search in text databases
3.2.1 Preliminaries
3.2.2 Merging-list algorithms for approximate string search
3.2.3 Variable-length-gram algorithms for approximate string search
3.2.4 Bed-tree: index for string similarity search based on the edit distance
4 Spatial keyword and approximate string queries
4.1 Keyword search on spatial databases
4.1.1 Distance-first top-k spatial keyword query
4.1.2 The m-closest keywords query on spatial databases
4.2 Approximate string search on spatial databases
4.2.1 The min-wise signature
4.2.2 The construction of the MHR-tree
4.2.3 The query algorithm for the MHR-tree
5 Conclusion
Bibliography
LIST OF FIGURES
2.1 The R-tree
2.2 The iDistance
2.3 The interpretation of LSH
2.4 The zχ-kNN Algorithm
2.5 kNN box: definition and exact search
2.6 The z-kNN Algorithm
2.7 A map of Silver Spring, MD (From figure 1 in [30])
2.8 (a) An example of road network, (b) the shortest-path quadtree of vertex s, and (c) the shortest-path quadtree of vertex t (From figure 3 in [30])
2.9 The 30,000 shortest paths between all pairs of vertices in sets A and B in the spatial network of Silver Spring, MD (From figure 1 in [32])
3.1 An example of trie with inverted lists
3.2 An example of bit string computation with l = 4, m = 2
3.3 The MergeOpt Algorithm
3.4 The MergeSkip Algorithm
3.5 The DivideSkip Algorithm
3.6 Strings
3.7 Gram dictionary as a trie
3.8 Reversed-gram trie
3.9 NAG vectors
3.10 The VGEN Algorithm
3.11 Pruning a subtrie to select grams
3.12 The VerifyED function
3.13 The RangeQuery Algorithm
4.1 An example of an IR2-tree
4.2 An example of the node information of the bR∗-tree
4.3 The SubsetSearch function
4.4 The SearchInOneNode function
4.5 An example of the index of R∗-tree and inverted lists (from Figure 3 in [40])
4.6 An example of bottom-up construction of virtual bR∗-tree (from Figure 4 in [40])
4.7 The bottom-up search
4.8 Approximate string search with a range query
4.9 The construction of the MHR-tree
4.10 The range queries with the MHR-tree
CHAPTER 1
INTRODUCTION
With the popularity of geographic services such as GPS, Google Earth and Yahoo Maps, queries in
spatial databases have become increasingly important in recent years. Beyond spatial queries such
as nearest neighbor queries, range queries and spatial joins, queries on spatial objects associated
with textual information are beginning to receive significant attention from the spatial database
research community.
A spatial database manages geometric objects such as points, rectangles, and so on. In reality,
a spatial object often comes with a text description: for example, the list of services and amenities
of a hotel, the menu of a restaurant, the outpatient specialties of a hospital, and so on. In many
applications, users need to search with both spatial and textual predicates. The following are some
examples. Query 1: find the hospital nearest to my house among all hospitals that offer acupunctural
outpatient service. Query 2: find all pairs of hotels and restaurants such that (i) they are within
2 miles of each other, (ii) the hotel has a wireless connection and a gym, and (iii) the restaurant
serves Italian food and French wine. Query 3: similar to the previous query, but find the closest
pair of hotel and restaurant (as opposed to all pairs within 2 miles). The results of all these queries
must satisfy certain spatial predicates (i.e., nearest neighbor search, distance join, and closest pair
in Queries 1, 2, and 3, respectively), and must contain certain keywords in their descriptions.
On the other hand, keyword search, particularly on the Internet, has boomed since the 1990s.
Many of these queries allow the user to provide a list of keywords that the spatial objects should
contain in their descriptions or other attributes. For example, online yellow pages allow users
to specify a set of keywords together with an address, and return businesses whose descriptions
contain these keywords, ordered by their distance to the specified address. As another example,
real estate web sites allow users to search for properties with specific keywords in their descriptions
within a specified city. In the past, the address information was stored as text as well, which
sometimes makes lookups inefficient or even inapplicable. For example, given a specified location,
we cannot retrieve all the objects within 1 mile of that location using text-format address information
alone. A natural solution is to store the latitude and longitude of a location in a non-text format to
facilitate the search.
Associating objects with text is a highly flexible approach to capturing objects' information,
especially when it is difficult to "normalize" such information into a universal relational schema.
Therefore, the ability to search objects by their text is essential. Keyword search has been
extensively studied in document retrieval, web search, and, in recent years, relational databases.
However, spatial data have many unique features that are drastically different from plain documents,
web pages, and relational data. As a result, most of the existing techniques are either inapplicable
or inefficient when adapted to spatial databases. For example, current systems use ad-hoc combinations
of range queries and keyword search techniques to tackle the problem. An R-tree is used
to find the objects within the query region, and for each retrieved object an inverted index is used to
check whether the query keywords are contained. Such two-step approaches can suffer from unnecessary
node visits (higher I/O cost) and keyword comparisons (higher CPU cost). To understand this,
we denote the exact solution to a spatial keyword query as A and the set of candidate points that
have been visited by the R-tree solution as Ac. An intuitive observation is that it may be the case
that |Ac| ≫ |A|, where | · | denotes set cardinality. As an extreme example, consider a spatial
keyword query over a data set P whose query strings do not exist within its query range, so that
A = ∅. Ideally, this query should incur a minimal query cost. However, in the worst case, an R-tree
solution could visit all index nodes and data points of the R-tree that indexes P. The fundamental
issue here is that the pruning power of the string match predicate is completely ignored by the
R-tree solution. Similar arguments hold for the string index approach, where a query might retrieve
a very large number of strings only to prune everything based on the spatial predicate at
post-processing. Clearly, in practice neither of these solutions will work better than a combined
approach that prunes simultaneously based on both the string match predicate and the spatial
predicate. Spatial keyword search has only recently begun to gain attention from the database
community, and a large variety of problems remain open. Besides, the existing approaches for
spatial keyword search have drawbacks as well and need to be investigated further.
My research is in the general area of spatial keyword search. The areas related to my research
include nearest neighbor search, range search, approximate string search, and combined spatial
keyword and approximate string search. I will survey these areas in this document. In particular,
the survey is organized as follows. Chapter 2 discusses nearest neighbor and range queries in
spatial databases. Chapter 3 describes keyword search and approximate string search in text
databases. Chapter 4 introduces spatial keyword and approximate string queries. Chapter 5
concludes the survey.
CHAPTER 2
RANGE AND NEAREST NEIGHBOR QUERIES
2.1 Range search in Euclidean space
Range search and nearest neighbor (NN) search are fundamental problems in data management
and have been well studied in the literature [19, 27, 35, 38]. I will review them in this chapter.
In the context of spatial databases, the R-tree index [5, 18] provides an efficient algorithm for
range queries in Euclidean space. Intuitively, the R-tree is an extension of the B+-tree to higher
dimensions. Points are grouped into minimum bounding rectangles (MBRs), which are recursively
grouped into MBRs at higher levels of the tree. The grouping is based on data locality and bounded
by the page size. An example of the R-tree is illustrated in Figure 2.1. Given a query rectangle r
and an R-tree index on a data set P, we start from the root of the R-tree and check the MBR of each
of its children, then recursively visit any node u whose MBR intersects or falls inside r. When a leaf
node is reached, all the points that are inside r are returned.
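The traversal just described can be sketched as follows (a minimal in-memory sketch; the Node layout and all names are illustrative assumptions, not a paged R-tree implementation):

```python
class Node:
    def __init__(self, mbr, children=None, points=None):
        self.mbr = mbr              # (xmin, ymin, xmax, ymax)
        self.children = children    # internal node: list of child Nodes
        self.points = points        # leaf node: list of (x, y) points

def intersects(r, s):
    # True iff rectangles r and s overlap (or touch)
    return not (r[2] < s[0] or s[2] < r[0] or r[3] < s[1] or s[3] < r[1])

def inside(p, r):
    return r[0] <= p[0] <= r[2] and r[1] <= p[1] <= r[3]

def range_search(node, r):
    """Return all points indexed under `node` that fall inside rectangle r."""
    if not intersects(node.mbr, r):
        return []                       # prune: MBR disjoint from query rectangle
    if node.points is not None:         # leaf: report qualifying points
        return [p for p in node.points if inside(p, r)]
    out = []
    for c in node.children:             # internal: recurse into overlapping children
        out.extend(range_search(c, r))
    return out
```

Subtrees whose MBRs are disjoint from r are never visited, which is the source of the R-tree's pruning power.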
Figure 2.1: The R-tree. (The figure shows points p1-p12 grouped into MBRs N1-N6, with mindist(q, N1), minmaxdist(q, N1), and maxdist(q, N1) annotated for a query point q.)
2.2 NN search in Euclidean space
Compared to range search, NN search has many more variants in the literature. In this section,
I review NN search in Euclidean space in both its exact and approximate versions.
2.2.1 Exact NN search for low dimensional data
The R-tree index also provides efficient algorithms for NN search in Euclidean space, using
either the depth-first [27] or the best-first [19] approach. The main idea behind these algorithms
is to utilize branch-and-bound pruning techniques based on the relative distances between a query
point q and a given MBR N. Such distances include the mindist, the minmaxdist, and the maxdist.
The mindist measures the minimum possible distance from a point q to any point in an MBR N;
the minmaxdist gives an upper bound on the distance from q to its nearest point in N (N is
guaranteed to contain at least one point within minmaxdist of q); and finally, the maxdist simply
measures the maximum possible distance between q and any point in an MBR N. These distances
are easy to compute arithmetically given q and N. An example of these distances is provided in
Figure 2.1. The principle for utilizing them to prune the search space of an NN search in the R-tree
is straightforward: e.g., when an MBR Na's mindist to q is larger than another MBR Nb's minmaxdist,
we can safely prune Na entirely. Given q, we first initialize a priority queue PQ with the R-tree's
root node. The elements in PQ are kept in ascending order of their mindists. We also maintain a
global minmaxdist (GMM), which is the current upper bound on the distance within which the NN
must be found. Notice that any node with mindist > GMM will not be added to PQ. In each
iteration, we pop the first element of PQ. If it is a point, we have found the NN. Otherwise, we
look into its children: for each child c, we calculate its mindist (c.md) and minmaxdist (c.mmd),
compare c.md with GMM to see whether c can be added to PQ, and update GMM to
min{c.mmd, GMM}. NN search is a special case of the K nearest neighbors (KNN) query with
k = 1, and the above algorithm is easily extended to answer KNN queries: GMM then bounds the
distance within which the k nearest neighbors must be found, and the algorithm terminates when
k points have been found.
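The queue-based search can be sketched as follows (a minimal best-first sketch ordered by mindist only; the node layout and names are illustrative assumptions, and the minmaxdist-based pruning described above is omitted for brevity):

```python
import heapq

class Node:
    def __init__(self, mbr, children=None, points=None):
        self.mbr = mbr              # (xmin, ymin, xmax, ymax)
        self.children = children    # internal node: list of child Nodes
        self.points = points        # leaf node: list of (x, y) points

def mindist2(q, mbr):
    # squared minimum distance from point q to rectangle mbr
    dx = max(mbr[0] - q[0], 0.0, q[0] - mbr[2])
    dy = max(mbr[1] - q[1], 0.0, q[1] - mbr[3])
    return dx * dx + dy * dy

def best_first_knn(root, q, k):
    # Priority queue ordered by squared mindist; when a data point reaches
    # the front of the queue, no unexplored entry can be closer, so it is
    # the next nearest neighbor.
    heap = [(0.0, 0, root)]
    tie = 1                         # tie-breaker so heapq never compares Nodes
    result = []
    while heap and len(result) < k:
        _, _, item = heapq.heappop(heap)
        if isinstance(item, tuple):             # a data point: report it
            result.append(item)
        elif item.points is not None:           # leaf: enqueue its points
            for p in item.points:
                d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
                heapq.heappush(heap, (d2, tie, p)); tie += 1
        else:                                   # internal: enqueue children
            for c in item.children:
                heapq.heappush(heap, (mindist2(q, c.mbr), tie, c)); tie += 1
    return result
```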
2.2.2 Exact NN search for high dimensional data
The R-tree based algorithms have relatively poor performance for data beyond six dimensions
[38]. The state-of-the-art technique for retrieving the exact KNN in high dimensions is the
iDistance method [20]. The design of iDistance is motivated by the following observations. First,
the similarity between data points can be derived with reference to a representative point. Second,
data points can be ordered based on their distances to a reference point, which are single-dimensional
values. So we can map high dimensional data into a one dimensional space and enable range
queries by reusing existing one dimensional indices such as the B+-tree. In particular, the data is
first partitioned into clusters by any popular clustering method; the authors of [20] suggest that a
good partitioning strategy should reduce the overlap between partitions and keep points with close
similarity together as much as possible. A point p in a cluster ci is mapped to a one-dimensional value, which is
Figure 2.2: The iDistance. (The figure shows clusters c1, c2, c3, a query ball of radius rq around q, and the corresponding search ranges over the leaf nodes of the B+-tree.)
the distance between p and ci's cluster center. The essence of the KNN search algorithm is similar
to the generalized search strategy: it begins by searching a small 'sphere', checks all partitions
intersecting the current query space, and incrementally enlarges the search space until all KNN are
found. Figure 2.2 illustrates the pruning technique used by iDistance: rq is the radius of the query
ball and the circle c3 defines a data cluster. We only have to examine the points in the annular
region between c2 and c1, which corresponds to the gray region in the B+-tree.
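A compact sketch of the iDistance mapping and range search follows, with a sorted list of keys standing in for the B+-tree; the constant c (which must exceed every cluster radius) and all names are assumptions of this sketch:

```python
import bisect
import math

def idistance_build(clusters, c=1000.0):
    """clusters: list of (center, points). Each point p in cluster i is keyed
    by i*c + dist(p, center_i); keys are kept sorted, standing in for a B+-tree."""
    keyed = []
    for i, (center, pts) in enumerate(clusters):
        for p in pts:
            keyed.append((i * c + math.dist(p, center), p))
    keyed.sort()
    return keyed

def idistance_range(keyed, clusters, q, r, c=1000.0):
    """Range query ball(q, r): for each cluster whose extent the ball reaches,
    scan the one-dimensional key interval it could occupy, then verify real
    distances (candidates in the annulus may still be false positives)."""
    out = []
    keys = [k for k, _ in keyed]
    for i, (center, pts) in enumerate(clusters):
        d = math.dist(q, center)
        radius = max(math.dist(p, center) for p in pts)  # cluster extent
        if d - r > radius:
            continue                     # query ball cannot reach this cluster
        lo = bisect.bisect_left(keys, i * c + max(0.0, d - r))
        hi = bisect.bisect_right(keys, i * c + min(radius, d + r))
        for _, p in keyed[lo:hi]:        # verify candidates by true distance
            if math.dist(p, q) <= r:
                out.append(p)
    return sorted(out)
```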
2.2.3 Approximate NN search for low dimensional data
Besides exact KNN search, approximate KNN has also been studied. In exact KNN search, a
majority of the query cost is spent on verifying points as true KNN; approximate methods improve
efficiency by relaxing the precision of this verification.
The computational geometry community has spent considerable effort on designing approximate
nearest neighbor algorithms [3, 7]. Arya et al. [3] designed a modification of the standard
kd-tree, called the Balanced Box Decomposition (BBD) tree, that can answer (1 + ǫ)-approximate
nearest neighbor queries in O((1/ǫ)^d log N) time. The BBD-tree takes O(N log N) time to build.
However, the performance of the BBD-tree degrades when the data set becomes huge (e.g., more
than 100,000 points) or the dimensionality goes beyond 20.

Figure 2.3: The interpretation of LSH. (The figure shows points o1, o2, o3 projected onto a vector a that is partitioned into intervals of width w.)
2.2.4 Approximate NN search for high dimensional data
The state-of-the-art method for approximate KNN retrieval in high dimensional spaces (e.g.,
d > 20) is the LSB-tree [35]. The LSB-tree is based on locality sensitive hashing (LSH), which
maps a d-dimensional point o to a one-dimensional value. Here, 'locality sensitive' means that the
chance of mapping two points o1, o2 to the same value grows as their distance |o1, o2| decreases. A
popular LSH function is defined as follows [13]:

h(o) = ⌊(a · o + b) / w⌋

Here, o denotes the d-dimensional vector representation of a point; a is another d-dimensional
vector, each component of which is drawn independently from a so-called p-stable distribution [13];
a · o denotes the dot product of the two vectors; w is a sufficiently large constant; and b is drawn
uniformly from [0, w).
The intuition behind such a hash function is that if two points are close to each other, their
shifted projections (onto the line defined by a) will have the same hash value with high probability,
while two faraway points are likely to be projected to different values. In Figure 2.3, points o1, o2,
and o3 are projected onto the vector a (in this case, b = 0). The projections of o1 and o2 end up in
the same interval, while the projection of o3 falls in a different interval, as o3 is far away from the
other two points.
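The hash function above can be sketched directly, assuming the 2-stable normal distribution N(0, 1) for the components of a (function names and defaults are illustrative):

```python
import math
import random

def make_lsh(d, w, seed=0):
    """Build one p-stable LSH function h(o) = floor((a . o + b) / w) [13]:
    components of a are drawn from N(0, 1), which is 2-stable, and b is
    drawn uniformly from [0, w)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)

    def h(o):
        # project o onto a, shift by b, and bucket into intervals of width w
        return math.floor((sum(ai * oi for ai, oi in zip(a, o)) + b) / w)

    return h

h = make_lsh(d=2, w=100.0, seed=42)
# Nearby points collide with high probability; far points usually do not.
```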
The construction of an LSB-tree is as follows. Given a d-dimensional data set S, we first map
each point o in S to an m-dimensional point M(o), and then calculate the Z-order value z(o) of M(o),
which is a one-dimensional value. The LSB-tree is just a conventional B+-tree indexing all the
Z-order values. In practice, a single LSB-tree can already produce query results of good quality;
to achieve better quality with a theoretical guarantee, the authors of [35] propose building multiple
LSB-trees. To retrieve the KNN, one issues range queries on all the LSB-trees around the Z-order
value of the query point. The LSB-trees consume O((dn/B)^1.5) space, and the LSB-tree based
algorithm returns a 4-approximate NN with at least constant probability in O(E · sqrt(dn/B)) I/Os,
where E is the height of an LSB-tree.
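The Z-order value used here is the standard bit-interleaved Morton code; a minimal sketch for non-negative integer coordinates follows (the function name and bit-width parameter are illustrative):

```python
def z_value(coords, bits=8):
    """Z-order (Morton) value of a point with non-negative integer
    coordinates, obtained by interleaving the bits of all dimensions,
    most significant bit first."""
    z = 0
    for bit in range(bits - 1, -1, -1):
        for c in coords:
            z = (z << 1) | ((c >> bit) & 1)
    return z
```

For example, with 2-bit coordinates, z_value((2, 3), bits=2) interleaves 10 and 11 into 1101, i.e., 13.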
2.2.5 KNN search in relational databases
We contributed to the KNN problem by proposing algorithms for both approximate and exact
KNN search in relational databases [38]. The motivation of our work is twofold. First, we want to
enable the relational database engine to automatically optimize the KNN search when additional
query conditions are specified. Second, the algorithms should require no changes to the database
zχ-kNN (point q, point sets {P^0, . . . , P^α})
1. Candidates C = ∅;
2. For i = 0, . . . , α
3.     Find z_p^i as the successor of z_{q+v_i} in P^i;
4.     Let C_i be the γ points up and down next to z_p^i in P^i;
5.     For each point p in C_i, let p = p − v_i;
6.     C = C ∪ C_i;
7. Let A_χ = kNN(q, C) and output A_χ.

Figure 2.4: The zχ-kNN Algorithm
Figure 2.5: kNN box: definition and exact search. (The figure shows k = 3: points p1-p5 with their z-values, the query q, the kth NN ball and the approximate kth NN ball (radii r_p and r*), and the box corner points δ_ℓ and δ_h with z-values z_ℓ and z_h; γ entries lie on either side of z_q.)
engine.
For approximate KNN retrieval, we utilize z-values to map points in a multi-dimensional
space into one dimension, and then translate the KNN search for a query point q into a one-dimensional
range search on the z-values around q's z-value. In most cases, z-values preserve
spatial locality and we can find q's KNN in a close neighborhood (say, γ positions up and down) of
its z-value. However, this is not always the case. In order to get a theoretical guarantee, we produce
α independent, randomly shifted copies of the input data set P and repeat the above procedure for
each randomly shifted version of P. Specifically, we define the "random shift" operation as shifting
all data points in P by a random vector v ∈ R^d; this operation is simply p + v for all p ∈ P,
and is denoted as P + v. We independently generate α random vectors {v_1, . . . , v_α}, where
v_i ∈ R^d for all i ∈ [1, α]. Let P^i = P + v_i, with P^0 = P and v_0 = 0. For each P^i, its points are
z-kNN (point q, point sets {P^0, . . . , P^α})
1. Let A_χ^i be kNN(q, C_i), where C_i is from Line 4 of the zχ-kNN algorithm;
2. Let z_ℓ^i and z_h^i be the z-values of the lower-left and upper-right corner points of the box M(A_χ^i);
3. Let z_γℓ^i and z_γh^i be the lower and upper bounds of the z-values that zχ-kNN searched to produce C_i;
4. If ∃i ∈ [0, α] s.t. z_γℓ^i ≤ z_ℓ^i and z_γh^i ≥ z_h^i
5.     Return A_χ from zχ-kNN as A;
6. Else
7.     Find j ∈ [0, α] such that the number of points in P^j with z-values in [z_ℓ^j, z_h^j] is minimized;
8.     Let C_e be those points and return A = kNN(q, C_e);

Figure 2.6: The z-kNN Algorithm
sorted by their z-values. Next, for a query point q and a data set P, let z_p be the successor z-value
of z_q among all z-values of points in P. The γ-neighborhood of q is defined as the γ points up
and down next to z_p. Our kNN query algorithm essentially finds the γ-neighborhood of the query
point q_i = q + v_i in P^i for each i ∈ [0, α] and selects the final top k from the points in the union
of the (α + 1) γ-neighborhoods, which contains at most (α + 1)(2γ + 1) distinct points. We denote
this algorithm as the zχ-kNN algorithm (Figure 2.4).
Once we retrieve the approximate KNN (A_χ), the exact results can be retrieved based on A_χ.
Figure 2.5 demonstrates the idea. In Figure 2.5, k = 3, the dotted circle is the kth NN ball, and
the solid circle is the approximate kth NN ball (denoted as B(A_χ)). The kNN box for A_χ is the
smallest box that fully encloses B(A_χ), denoted as M(A_χ), which is defined by the lower-left and
upper-right corner points δ_ℓ and δ_h, i.e., the ∆ points in Figure 2.5. Let z_ℓ and z_h be the z-values
of the points δ_ℓ and δ_h of M(A_χ). By the Corollary in [38], we guarantee that z_p ∈ [z_ℓ, z_h] for
all p ∈ A. The interval [z_ℓ, z_h] can be easily calculated once we retrieve the approximate kth NN.
After that, we need to check whether [z_ℓ, z_h] is fully contained in [z_γℓ, z_γh]. If so, zχ-kNN has
already retrieved all the exact answers, i.e., A_χ = A. If not, we can do a range search on any table
R^i with [z_ℓ^i, z_h^i]; ideally, we choose the table with the smallest [z_ℓ^i, z_h^i]. The algorithm is
shown in Figure 2.6.
2.3 Range and NN queries on road networks
The growing popularity of online mapping services such as Google Maps and Microsoft Map-
Point has led to an interest in responding to queries such as finding nearest objects from a set (e.g.,
gas stations and restaurants) where the distance is measured in terms of paths along the road net-
work (network distance). Objects are usually constrained to lie on the road network. The access
method to the road network is significantly different from that in the Euclidean space. The NN
search on road networks plays an important role in many applications. In this section, I will go over
the state-of-the-art range and NN search techniques on road networks in the literature.
2.3.1 Range search on road networks
Given a query point q and a distance threshold θ, the θ-range search on a road network G
retrieves all vertices of G that are within network distance θ of q. The solution is to adapt Dijkstra's
algorithm: we initialize Dijkstra's algorithm with q, visit the vertices in ascending order of network
distance, and stop when the distance of the currently visited vertex is greater than θ. In practice,
many applications instead require returning a certain number of results in that order; in those
scenarios, a KNN search is required. KNN search on road networks is more complicated than the
θ-range search, and I introduce it in detail in the following.
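The adapted Dijkstra procedure can be sketched as follows (the adjacency-list representation and all names are illustrative):

```python
import heapq

def theta_range(graph, q, theta):
    """θ-range search: settle vertices in ascending order of network distance
    from q, and stop once the next candidate's distance exceeds θ.
    graph: {u: [(v, edge_length), ...]}. Returns [(vertex, distance), ...]."""
    dist = {q: 0.0}
    heap = [(0.0, q)]
    result = []
    while heap:
        d, u = heapq.heappop(heap)
        if d > theta:
            break                    # every remaining vertex is farther than θ
        if d > dist.get(u, float("inf")):
            continue                 # stale queue entry
        result.append((u, d))        # u is settled within the θ range
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return result
```

Because Dijkstra settles vertices in nondecreasing distance order, the early break is safe: no vertex popped later can be within θ.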
2.3.2 KNN search on road networks
KNN retrieval on road networks has been studied for decades. Most solutions are graph-based
and usually incorporate Dijkstra's algorithm [14] in at least some part [11, 26]. Recently, a novel
algorithm for finding KNN on road networks was presented in [30]. The algorithm is based on
precomputing the shortest paths between all possible vertices in the network and then making use
of an encoding that takes advantage of the fact that the shortest paths from a vertex u to all of the
remaining vertices can be decomposed into subsets based on the first edges of the shortest paths
from u to them. Thus, in the worst case, the amount of work depends on the number of objects
that are examined and the number of links on the shortest paths to them from q, rather than on the
number of vertices in the network. The amount of storage required to keep track of the subsets is
reduced by taking advantage of their spatial coherence, which is captured with the aid of a
shortest-path quadtree. The theoretical analysis shows that the storage is thereby reduced from
O(N^3) to O(N^1.5).
The problem with an approach that uses Dijkstra's algorithm is that it must visit a very large
number of the vertices of the network to find the shortest path between two vertices. For example,
Figure 2.7(a) shows the vertices that would be visited when finding the shortest path from vertex X
to vertex V in a spatial network corresponding to Silver Spring, MD. In the process of obtaining the
shortest path from X to V, which is 75 edges long, 75.4% of the vertices in the network are visited.
To avoid this, the author of [30] proposes a shortest-path map for each vertex in the network. In
particular, given a vertex ui, the shortest-path map mui partitions the underlying space into Mui
regions, where Mui is the out-degree of ui and there is one region ruij for each vertex wuij
(1 ≤ j ≤ Mui) that is connected to ui by an edge euij. Region ruij spans the space occupied by
all vertices v such that the shortest path from ui to v contains edge euij, and is bounded by a subset
of the edges of the shortest paths from ui to the vertices within it. For example, Figure 2.7(b) shows
such a partition for vertex X in the road network of Figure 2.7(a), where different colors denote the
different regions.

Figure 2.7: A map of Silver Spring, MD (From figure 1 in [30]).
Sankaranarayanan et al. [31] propose to index the above partitions using a variant of the
region quadtree [29]. Each quadtree block records the identity of the region to which it belongs.
Once we locate the block containing the destination vertex, we know what region it is in, and hence
the edge from the source vertex that connects to the next vertex on the shortest path between
source and destination. We then move to the quadtree of that next vertex. By repeating this process,
we can obtain the whole path. For example, consider the road network in Figure 2.8(a), where we
want to find the shortest path from vertex s to vertex d; the shortest-path quadtree for s is given
by Figure 2.8(b). Looking up vertex d in the shortest-path quadtree of s determines that d is in
the region of the quadtree corresponding to the edge from vertex s to t. Therefore, the shortest path
from s to d passes through t. Next, we obtain the shortest-path quadtree of t, which is given by Figure
2.8(c). Looking up vertex d in the shortest-path quadtree of t determines that d is in the region of the
quadtree corresponding to the edge from vertex t to u. This process continues until encountering
an edge to vertex d.

Figure 2.8: (a) An example of road network, (b) the shortest-path quadtree of vertex s,
and (c) the shortest-path quadtree of vertex t (From figure 3 in [30]).
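The region-lookup loop described above can be sketched as follows, with each shortest-path quadtree simplified to a flat list of (rectangle, next-vertex) pairs; this flat-list stand-in and all names are assumptions of the sketch, not the actual quadtree structure of [30]:

```python
def build_path(spmaps, coords, s, d):
    """Path extraction with shortest-path maps: at each vertex u, look up the
    region containing destination d to learn the first edge of the shortest
    path from u to d, hop to that neighbor, and repeat.
    spmaps[u]: list of ((xmin, ymin, xmax, ymax), next_vertex) pairs.
    coords[v]: (x, y) position of vertex v."""
    path, u = [s], s
    while u != d:
        x, y = coords[d]
        for (x0, y0, x1, y1), nxt in spmaps[u]:
            if x0 <= x <= x1 and y0 <= y <= y1:   # region containing d
                u = nxt
                break
        else:
            raise ValueError("destination not covered by any region")
        path.append(u)
    return path
```

Each hop costs one region lookup, so the work depends on the path length rather than on the total number of vertices in the network.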
Based on the above representation of road networks, the authors of [30] propose the
KNEARESTSPATIALNETWORK algorithm for KNN retrieval. The actual mechanics of the algorithm
are similar to the best-first algorithm [31], with the difference that objects are associated with
network distance intervals instead of distances. When a nonleaf block b is removed from the queue,
the children of b are inserted into the queue with their corresponding minimum network distances.
When a leaf block b is removed from the queue, the objects in b are enqueued with their
corresponding initial network distance intervals, which can later be refined to exact network
distances. The whole algorithm terminates when the first k objects are found.
Figure 2.9: The 30,000 shortest paths between all pairs of vertices in sets A and B in the
spatial network of Silver Spring, MD (From figure 1 in [32]).
Later on, Sankaranarayanan et al. [32] propose an approximate distance oracle of size O(n/ǫ^d)
to obtain ǫ-approximate network distances. The observation is that, given a set of source vertices
A and a set of destination vertices B such that A and B are sufficiently far away from each other,
the shortest paths between them may share common vertices. Figure 2.9 illustrates such a case,
where all the 30,000 shortest paths between vertices in A and in B have many vertices in common,
and the network distances between them can be approximated by a single value.
CHAPTER 3
KEYWORD AND APPROXIMATE STRING QUERIES
3.1 Keyword search in text databases
Keyword search has been extensively studied in document retrieval, web search, and relational
databases. Given a collection of files, each storing a set of keywords, and a list of query keywords
w1, · · · , wm, we want to retrieve all the files that contain all the query keywords. The methods for
handling such a search fall into two categories: one is based on a tree index and the other uses
signatures. I explore both in the following.
3.1.1 Tree index
A prefix tree (a.k.a. trie) can be used to answer keyword search efficiently. Figure 3.1 illustrates
an example of a trie. Each word w corresponds to a unique path from the root of the trie to a leaf
node. Each leaf node has an inverted list of descriptors of files that contain the corresponding word.
For query keywords w1, · · · , wm, we search the trie, retrieve all the lists corresponding to the
query keywords and merge them to find the result.
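As a concrete illustration, the trie-with-inverted-lists scheme sketched above can be implemented as follows. This is a minimal sketch with class and method names of my own, not from the surveyed work; a conjunctive keyword query is answered by intersecting the leaf lists:

```python
from functools import reduce

class TrieNode:
    """A trie node; nodes at word ends carry an inverted list of file ids."""
    def __init__(self):
        self.children = {}
        self.file_ids = set()

class KeywordTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, file_id):
        """Walk (and create) the path for `word`, append `file_id` at its end."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.file_ids.add(file_id)

    def lookup(self, word):
        """Return the inverted list for `word`, or an empty set."""
        node = self.root
        for ch in word:
            if ch not in node.children:
                return set()
            node = node.children[ch]
        return node.file_ids

    def query(self, keywords):
        """Files containing ALL query keywords: intersect the inverted lists."""
        lists = [self.lookup(w) for w in keywords]
        return reduce(set.intersection, lists) if lists else set()
```

A file id survives the intersection only if it appears on every keyword's list, which is exactly the merge step described in the text.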
Figure 3.1: An example of a trie with inverted lists.
word  hashed bit string
a     0101
b     1001
c     0011
d     0110
Figure 3.2: An example of bit string computation with l = 4, m = 2.
3.1.2 Signature file
Signature files have been successfully used in keyword search as it has not only small space
consumption but also fast string matching. A signature file is designed to perform membership
tests: determine whether a query wordw exists in a set of wordsW . If the test returns ’no’, thenw
is definitely not inW . On the other hand, if it returns ’yes’, the true answer can beeither way, in
which case the wholeW must be scanned to avoid a false hit. There are several implementations of
a signature file, and the one adopted in [16] is known as superimposed coding (SC), which has been
shown to be more effective than other variants [15]. Specifically, SC works as follows. It builds a
bit signature of lengthl from W by hashing each word inW to a string ofl bits, and then taking the
disjunction of all bit strings. The bit string of a wordw is denoted ash(w). Initially, all the l bits are
initialized to0. Then, SC repeats the followingm times: randomly choose a bit and set it to1. It’s
important that randomization must usew as its seed to ensure that the samew always ends up with
an identicalh(w). Furthermore, them choices are mutually independent, and may even happen to
be the same bit. The values of l and m affect the space cost and the false hit probability. Figure 3.2
gives an example of this process. For example, in the bit string h(b) of b, the first and last bits are
set to 1. The bit signature of a set of words simply ORs the bit strings of all the words in the set.
For instance, the signature of the set {a, b} is 1101.
Given a query keyword w, SC performs the membership test by checking if all the 1's of h(w)
appear at the same positions in the signature of W. If not, it is guaranteed that w cannot belong to
W; otherwise, it needs to scan W. For example, suppose we want to test whether c with signature
0011 is a member of the set {a, b} with signature 1101. Since 0011 has a 1 in a position where
1101 has a 0, SC immediately reports 'no'.
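A minimal sketch of superimposed coding follows. The parameter names l and m match the text; the seeded-RNG hashing scheme is illustrative, not the one used in [15, 16]:

```python
import random

def signature(word, l=4, m=2):
    """h(word): set m (pseudo)random bits out of l, seeding the RNG with the
    word itself so the same word always yields the same signature. The m
    choices are independent and may collide on the same bit."""
    rng = random.Random(word)
    bits = 0
    for _ in range(m):
        bits |= 1 << rng.randrange(l)
    return bits

def set_signature(words, l=4, m=2):
    """Signature of a word set: bitwise OR (disjunction) of member signatures."""
    sig = 0
    for w in words:
        sig |= signature(w, l, m)
    return sig

def maybe_member(word, set_sig, l=4, m=2):
    """Membership test: every 1-bit of h(word) must appear in the set
    signature. 'False' is definite; 'True' may be a false hit, in which
    case the actual word set must still be scanned."""
    h = signature(word, l, m)
    return (h & set_sig) == h
```

Note that `maybe_member` can only rule words out; a 'True' answer still requires verification against W, exactly as described above.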
3.2 Approximate string search in text databases
The approximate string search can be described as follows: given a set of strings, we want to
efficiently find strings that are similar to a query string. It has plenty of applications, such as spell
checking, data cleaning and so on. Before exploring the algorithms for the approximate string search,
I will introduce some preliminaries related to approximate string processing.
3.2.1 Preliminaries
Different string-similarity functions. A basic concept is how we define the similarity between
two strings. Different similarity functions have been studied in the literature, such as edit
distance [23], cosine similarity [4] and Jaccard coefficient [33]. Take edit distance as an example.
Let S be a set of strings. The edit distance between two strings s1 and s2 is the minimum number
of edit operations on single characters that are needed to transform s1 to s2. Edit operations include
insertion, deletion, and substitution. We denote the edit distance between two strings s1 and s2 as
ed(s1, s2). For instance, ed("toy", "boy") = 1. Using this function, an approximate string search
with a query string q and threshold τ finds all s ∈ S such that ed(s, q) ≤ τ.
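The definition above translates directly into the textbook dynamic program. As a sketch (my own code, shown for contrast with the indexed algorithms of Section 3.2.2), the naive approximate search simply scans all of S:

```python
def ed(s1, s2):
    """Edit (Levenshtein) distance via the standard row-by-row dynamic
    program; insertions, deletions, and substitutions each cost 1."""
    m, n = len(s1), len(s2)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                              # deletion
                         cur[j - 1] + 1,                           # insertion
                         prev[j - 1] + (s1[i - 1] != s2[j - 1]))   # substitution
        prev = cur
    return prev[n]

def approx_search(S, q, tau):
    """Naive approximate string search: all s in S with ed(s, q) <= tau."""
    return [s for s in S if ed(s, q) <= tau]
```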
Q-grams based inverted lists. Let Σ be an alphabet. For a string s of the characters in Σ, we
use |s| to denote the length of s. We introduce two characters # and $ not in Σ. Given a string s and
a positive integer q, we extend s to a new string s' by prefixing q − 1 copies of # and suffixing q − 1
copies of $. A positional q-gram of s is a pair (i, g), where g is the substring of length q starting at
the i-th character of s'. The set of positional q-grams of s, denoted by G(s, q), is obtained by sliding
a window of length q over the characters of s'. For example, suppose q = 2 and s = theater. G(s, q)
is: (1, #t), (2, th), (3, he), (4, ea), (5, at), (6, te), (7, er), (8, r$). The number of positional q-grams of
the string s is |s| + q − 1. We construct inverted lists of q-grams as follows: for each gram g of the
strings in S, we have a list lg of the ids of the strings that include this gram.
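The positional q-gram extraction described above can be sketched in a few lines (my own helper, following the definition in the text):

```python
def positional_qgrams(s, q):
    """Positional q-grams of s: extend s with q-1 '#' prefixes and q-1 '$'
    suffixes, then slide a window of length q; positions are 1-based."""
    extended = '#' * (q - 1) + s + '$' * (q - 1)
    return [(i + 1, extended[i:i + q]) for i in range(len(extended) - q + 1)]
```

Running it on the example string "theater" with q = 2 reproduces the eight positional grams listed above.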
T-occurrence problem. It is observed in [33] that an approximate query with a string s can
be answered by solving the following problem:
T-occurrence Problem: Find the string ids that appear at least T times on the inverted lists of the
grams in G(s, q), where T is a constant related to the threshold τ in the query, the gram length q and
the similarity function. Assume that the similarity function is edit distance. For a string r ∈ S that
satisfies the condition ed(r, s) ≤ τ, it should share at least the following number of q-grams with s:

T = max{|s|, |r|} + q − 1 − τ · q (3.1)

Many algorithms [24, 33] have been proposed for answering approximate string queries based on the
T-occurrence problem. We will elaborate on them in Section 3.2.2.
String filters. Various filters have been proposed in the literature to eliminate strings that cannot
be similar enough to a query string. In this section, we will introduce three filtering techniques:
length filter, position filter [17] and prefix filter [9].
Length Filtering: If two strings s1 and s2 are within edit distance τ, the difference between their
lengths cannot exceed τ. Therefore, given a string s1, we only need to examine strings s2 ∈ S such
that the difference between |s1| and |s2| is no greater than τ.
Position Filtering: If two strings s1 and s2 are within edit distance τ, then a q-gram in s1 cannot
correspond to a q-gram in s2 that differs by more than τ positions. Therefore, given a positional
gram (i1, g1) in the query string, we only need to consider the corresponding grams (i2, g2) in
the string set such that |i1 − i2| ≤ τ.
Prefix Filtering: Given two q-gram sets G(s1) and G(s2) for strings s1 and s2, we can fix an
ordering O of the universe from which all set elements are drawn. Let p(n, s) denote the n-th prefix
element of G(s) as per the ordering O. For simplicity, p(1, s) is abbreviated as ps. An important
property is that if |G(s1) ∩ G(s2)| ≥ T, then ps2 ≤ p(n, s1), where n = |s1| − T + 1.
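The length and position filters are essentially one-liners; a sketch (function names are mine):

```python
def length_filter(s1, s2, tau):
    """Strings within edit distance tau differ in length by at most tau."""
    return abs(len(s1) - len(s2)) <= tau

def position_filter(g1, g2, tau):
    """Positional grams (i1, gram) and (i2, gram) can only correspond if
    the grams match and their positions differ by at most tau."""
    (i1, gram1), (i2, gram2) = g1, g2
    return gram1 == gram2 and abs(i1 - i2) <= tau
```

Both tests are cheap enough to apply before any list merging or edit-distance verification.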
3.2.2 Merging-list algorithms for approximate string search
Several existing algorithms [24, 33] assume an index of inverted lists for the grams of the strings
in the data set S to answer approximate string queries. Most of them focus on reducing the running
time to merge the string-id (SID) lists of the grams of the query string. We will briefly describe
them in the following sections. Notice that an optimization is to sort the string ids on each inverted
list in ascending order.
Heap algorithm. When merging the lists, we just need to maintain a frontier of the lists and
at each step advance the frontier to the next SID that appears on at least T lists. Such an SID can
be found efficiently by using a heap to maintain the frontiers of
all lists being merged. We then repeatedly pop the minimum SID from the heap, count its number
MergeOpt(Q, T, I)
1. Let A = l1, l2, · · · , lN be the lists of index I, in decreasing order of length,
   corresponding to the N grams of Q;
2. Let L = l1, l2, · · · , lT−1 be the T − 1 longest lists of A;
   Insert the frontiers of the lists S = A − L into a heap H;
3. while H not empty do
4.   pop from H the current minimum record m, count its occurrences as cm,
     and push into H the next records from the lists in S that were popped.
5.   for i = T − 1 down to 1 do
6.     search for m in li using a binary search method, and if found, cm++;
7.   if (cm ≥ T) output m;
Figure 3.3: The MergeOpt Algorithm
of occurrences as successively popped SIDs repeat, and push into the heap the next SID from the
frontier of the popped list. Let N = |G(Q, q)| denote the number of lists corresponding to the grams
of the query string, and let M denote the total size of these N lists. This algorithm requires
O(M log N) time and O(N) storage space (not including the size of the inverted lists) for storing the heap.
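The Heap algorithm can be sketched with Python's binary heap standing in for the frontier structure (the code is my own sketch of the idea, not the surveyed implementation):

```python
import heapq

def heap_merge(lists, T):
    """Heap algorithm: merge sorted SID lists, report SIDs appearing on at
    least T lists. The heap holds one frontier entry (sid, list index,
    position) per list."""
    heap = [(lst[0], idx, 0) for idx, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        sid, count = heap[0][0], 0
        # pop every frontier entry equal to the current minimum SID
        while heap and heap[0][0] == sid:
            _, idx, pos = heapq.heappop(heap)
            count += 1
            if pos + 1 < len(lists[idx]):
                heapq.heappush(heap, (lists[idx][pos + 1], idx, pos + 1))
        if count >= T:
            result.append(sid)
    return result
```

Every heap operation costs O(log N), and each of the M list entries passes through the heap once, giving the O(M log N) bound stated above.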
MergeOpt algorithm. In [33], the authors propose the MergeOpt algorithm to improve the
efficiency of the Heap algorithm. It treats the T − 1 longest inverted lists of G(Q, q) separately. For the
remaining N − (T − 1) relatively short inverted lists, it uses the Heap algorithm to merge them with
a lower threshold, i.e., 1. For each candidate string, it does a binary search on each of the T − 1 long
lists to verify if the string appears at least T times among all the lists. This algorithm is based on
the observation that a record in the answer must appear on at least one of the short lists, and doing
a binary search on the long lists is much more efficient than scanning them. The algorithm is
formally described in Figure 3.3.
ScanCount algorithm. In [24], the authors propose the ScanCount algorithm, which improves the
Heap algorithm by eliminating the heap data structure and the corresponding operations on the
heap. Instead, it just maintains an array of counters for all the SIDs in S. It scans the N inverted
lists one by one. For each SID on each list, it increments the counter corresponding to the string by
1. It reports the SIDs that appear at least T times on the lists. The time complexity of this algorithm
is O(M). The space complexity is O(|S|), where |S| is the size of the string set, since it needs to
keep a counter for each SID. The authors argue that this higher space complexity is not a major
concern, since this extra space tends to be much smaller than that of the inverted lists. Despite its
simplicity, this algorithm can achieve good performance when combined with the various filtering
techniques in Section 3.2.1.
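ScanCount is even simpler to sketch; in this illustration a dictionary of counters stands in for the per-SID counter array:

```python
from collections import defaultdict

def scan_count(lists, T):
    """ScanCount: one counter per SID; scan each inverted list once and
    report SIDs whose count reaches T. Runs in O(M) over the total list
    size M, with O(|S|) counter space."""
    counters = defaultdict(int)
    for lst in lists:
        for sid in lst:
            counters[sid] += 1
    return sorted(sid for sid, c in counters.items() if c >= T)
```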
MergeSkip algorithm. To further improve the running time for merging lists, the authors of [24]
propose the MergeSkip algorithm (Figure 3.4). The intuition is to skip, on the lists, those SIDs that
cannot be in the answer to the query, by utilizing the threshold T. Similar to the Heap algorithm, it
also maintains a heap for the frontiers of the lists. A key difference is that, during each iteration, it
pops those records from the heap that have the same value as the top record t on the heap. Let the
number of popped records be n. If there are at least T such records, it adds t to the result set (line 7
in Figure 3.4) and adds their next records on the lists to the heap. Otherwise, record t cannot be in
MergeSkip(I, T)
1. Insert the frontier SIDs of the lists into a heap H;
2. Initialize a result set R to be empty;
3. while H not empty do
4.   Let t be the top record on the heap;
5.   Pop from H those records equal to t;
6.   Let n be the number of popped records;
7.   if (n ≥ T) { Add t to R;
8.     Push the next record on each popped list to H; }
9.   else { Pop T − 1 − n smallest records from H;
10.    Let t' be the current top record on H;
11.    for each of the T − 1 popped lists
12.      Locate its smallest record r ≥ t';
13.      Push this record to H; }
14. return R;
Figure 3.4: The MergeSkip Algorithm
DivideSkip(I, T)
1. Initialize a result set R to be empty;
2. Let Llong be the set of the L longest lists
   among the lists of I;
3. Let Lshort be the remaining short lists;
4. Use MergeSkip on Lshort to find SIDs that appear
   at least T − L times;
5. for each record r found {
6.   for each list in Llong
7.     Check if r appears on this list;
8.   if (r appears ≥ T times among all lists) {
9.     Add r to R;
   }
}
10. return R;
Figure 3.5: The DivideSkip Algorithm
the answer. In addition to popping these n records, it pops T − 1 − n additional records from the
heap (line 9). Therefore, it has popped T − 1 records from the heap. Let t' be the current top record
on the heap. For each of the T − 1 popped lists, it locates the smallest record r such that r ≥ t' by
using a binary search (line 12). It then pushes r to the heap (line 13).
DivideSkip algorithm. The DivideSkip algorithm [24] is a combination of MergeSkip and
MergeOpt. Figure 3.5 summarizes the algorithm. Given a set of SID lists, it first sorts these lists
based on their lengths and divides them into two groups: the L (a tunable parameter)
longest lists form a set Llong, and the remaining short lists form another set Lshort. It uses the
MergeSkip algorithm on Lshort to find records r that appear at least T − L times on the short lists. For
each such record r and each list llong in Llong, it checks if r appears on llong. If the total number of
occurrences of r among all these lists is at least T, we add it to the result set R. The parameter L can
greatly affect the performance of the algorithm and is discussed in detail in [24].
3.2.3 Variable-length-gram algorithms for approximate string search
The above algorithms are all based on fixed-length grams. A dilemma for these algorithms
is how to select a good gram length. If we increase the gram length, fewer strings are likely to
share a gram, causing the inverted lists to be shorter. Therefore, it may decrease the time to merge
the inverted lists. On the other hand, we will have a lower threshold on the number of common
grams shared by similar strings, making the count filter less selective at eliminating dissimilar string
pairs. The number of false positives after merging the lists will increase, causing more time to
compute their real edit distances (a costly computation) in order to verify if they are in the answer
to the query. Motivated by this dilemma, the authors of [25] propose a variable-length-gram index
(VGRAM) to improve the performance. At a high level, VGRAM works as follows: (1) It analyzes
the frequencies of variable-length grams in the strings, and selects a set of grams, called the gram
dictionary, such that no selected gram in the dictionary is too frequent in the strings; (2) For a string,
it generates a set of grams of variable lengths using the gram dictionary; (3) It can be shown that
if two strings are within edit distance τ, then their sets of vgrams also have enough similarity, which
is related to τ. This set similarity can be used to improve the performance of existing algorithms.
Figures 3.6, 3.7, 3.8 and 3.9 (from Figure 2 in [25]) show a VGRAM index for the strings. In the
following, I will explore each component of this index.
id string
0 stick
1 stich
2 such
3 stuck
Figure 3.6: Strings.
Figure 3.7: Gram dictionary as a trie.
Figure 3.8: Reversed-gram trie.
id NAG vector
0 2,3
1 2,3
2 2,3
3 3,4
Figure 3.9: NAG vectors.
Variable-length grams. A gram dictionary D is a set of grams of lengths between qmin and
qmax. A gram dictionary D can be stored as a trie. The trie is a tree, and each edge is labeled with a
VGEN(D, string s, qmin, qmax)
1. position p = 1; VG = empty set;
2. while p ≤ |s| − qmin + 1
3.   Find a longest gram in D, using the trie, that matches a substring t of s starting at position p;
4.   if (t is not found) t = s[p, p + qmin − 1];
5.   if (positional gram (p, t) is not subsumed by any positional gram in VG) Insert (p, t) into VG;
6.   p = p + 1;
7. return VG;
Figure 3.10: The VGEN Algorithm
character. To distinguish a gram from its extended grams, it preprocesses the grams in D by adding
to the end of each gram a special endmarker symbol, e.g., #. A path from the root node to a leaf
node corresponds to a gram in D. For example, Figure 3.7 shows a trie for a gram dictionary of the
four strings in Figure 3.6, where qmin = 2 and qmax = 3. The dictionary includes the following
grams: ch, ck, ic, sti, st, su, tu, uc. The path n1 − n4 − n10 − n17 − n22 corresponds to the gram sti.
To generate variable-length grams, it still uses a window to slide over s, but the window size
varies, depending on the string s and the grams in D. Intuitively, it generates a gram for the longest
substring (starting from the current position) that matches a gram in D. If no such gram exists in D,
it generates a gram of length qmin. In addition, for a positional gram (a, g) whose corresponding
substring s[a, b] has been subsumed by the substring s[a', b'] of an earlier positional gram (a', g'),
i.e., a' ≤ a ≤ b ≤ b', it ignores the positional gram (a, g). The algorithm (VGEN) is formally
described in Figure 3.10.
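Assuming a plain set stands in for the gram-dictionary trie, VGEN can be sketched as follows (positions are 1-based as in the text; since positions are scanned left to right, tracking the maximum end position seen so far suffices for the subsumption test):

```python
def vgen(dictionary, s, qmin, qmax):
    """VGEN sketch: at each position take the longest dictionary gram that
    matches; fall back to the qmin-gram; skip positional grams subsumed by
    an earlier one. `dictionary` is a plain set standing in for the trie."""
    vg = []
    last_end = 0  # end position (exclusive) of the furthest-reaching kept gram
    for p in range(len(s) - qmin + 1):
        gram = None
        for q in range(min(qmax, len(s) - p), qmin - 1, -1):
            if s[p:p + q] in dictionary:
                gram = s[p:p + q]   # longest matching dictionary gram
                break
        if gram is None:
            gram = s[p:p + qmin]    # fallback: the qmin-gram
        end = p + len(gram)
        if end > last_end:          # not subsumed by an earlier gram
            vg.append((p + 1, gram))  # 1-based position
            last_end = end
    return vg
```

With the dictionary of Figure 3.7 (ch, ck, ic, sti, st, su, tu, uc), "stick" decomposes into sti, ic, ck and "such" into su, uc, ch.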
Constructing the gram dictionary. The construction of the gram dictionary proceeds in two steps.
In the first step, it analyzes the frequencies of q-grams for the strings, where q is between qmin and
qmax. In the second step, it selects grams with small frequencies.
To collect the frequencies of grams, it uses a trie (called the frequency trie). The algorithm
avoids generating all the grams for the strings based on the following observation: given a string s,
for each integer q in [qmin, qmax − 1] and each positional q-gram (p, g), there is a positional gram
(p, g') for its extended qmax-gram g'. Therefore, we can generate only the qmax-grams of the strings
to do the counting on the trie, without generating the shorter grams.
To select high-quality grams, it judiciously prunes the frequency trie and uses the remaining
grams to form the gram dictionary. The intuition of the pruning process is the following: (1) Keep
short grams if possible: if a gram g has a low frequency, eliminate from the trie all the extended
grams of g. (2) If a gram is very frequent, keep some of its extended grams. As a simple example,
consider a gram ab. If its frequency is low, then we keep it in the gram dictionary. If its
frequency is very high, we consider keeping this gram and its extended grams, such as aba,
abb, abc, etc. The goal is that, by keeping these extended grams in the dictionary, the number of
strings that generate an ab gram by the VGEN algorithm could become smaller, since they may
generate the extended grams instead of ab. The details of the algorithm are illustrated in Figure 3.11.
Similarity of gram sets. Next, I will elaborate on the relationship between the similarity of two
strings and the similarity of their gram sets generated using the same gram dictionary. For each
string s in the set S, we want to know how many grams in VG(s) can be affected by τ edit operations.
We precompute an upper bound of this number for each possible τ value, and store the values (for
Prune(Node n, Threshold T)
1. if (each child of n is not a leaf node) // the root→n path is shorter than qmin
2.   for (each child c of n) call Prune(c, T); // recursive call
3.   return;
4. L = the (only) leaf-node child of n; // a gram corresponds to the leaf-node child of n
5. if (n.freq ≤ T)
6.   keep L, and remove the other children of n; L.freq = n.freq;
7. else
8.   select a maximal subset of children of n (excluding L),
     so that the summation of their freq values and L.freq is still not greater than T;
9.   add the freq values of these children to that of L, and remove these children from n;
10. for (each remaining child c of n excluding L) call Prune(c, T); // recursive call
Figure 3.11: Pruning a subtrie to select grams
different τ values) in a vector for s, called the vector of the number of affected grams (NAG vector)
of string s. The τ-th number in the vector is denoted by NAG(s, τ). Such upper bounds can be used
to improve the performance of existing algorithms. Figure 3.9 shows the NAG vectors for the strings.
From Lemma 1 in [25], we have the following equation:

T = max{|VG(s1)| − NAG(s1, τ), |VG(s2)| − NAG(s2, τ)} (3.2)
To adopt VGRAM in existing algorithms, e.g., ProbeCluster [33], we just need to make two
minor modifications: (1) call VGEN to convert a string to a set of variable-length grams; (2) use
Equation 3.2 instead of Equation 3.1 as the set-similarity threshold for the gram sets of two similar
strings.
3.2.4 Bed-tree: index for string similarity search based on the edit distance
Zhang et al. [41] recently propose a novel tree index (Bed-tree) for approximate string search.
The Bed-tree is a B+-tree based index for evaluating all types of similarity queries (top-k, range, etc.)
on edit distance and normalized edit distance.
To index the strings with the B+-tree, it is necessary to construct a mapping from the string
domain to the integer space. The insertion and deletion operations on the B+-tree rely only on the
comparability of the string order, which is defined as follows: a string order φ is efficiently
comparable if it takes linear time to verify whether φ(si) is larger than φ(sj) for any string pair
si and sj. To handle range queries and top-k queries, it also needs a lower-bounding property, which
is defined as: a string order φ is lower bounding if it efficiently returns the minimal edit distance
between a string q and any sl ∈ [si, sj].
With this property, the B+-tree can handle range queries using the algorithm in Figure 3.13. The
VerifyED function (Figure 3.12) tests whether the edit distance of two strings is within some
threshold θ. In the algorithms, LB(si, [sj−1, sj]) denotes the lower bound on the edit distance
between si and any string sl ∈ [sj−1, sj]. The algorithm in Figure 3.13 iteratively visits the nodes
whose lower-bound edit distance is no larger than θ and verifies the strings found at the leaf level
of the tree using the VerifyED function. The minimal and maximal strings smin and smax indicate
the boundaries of any string in a given subtree with respect to the string order φ. This information
can be retrieved from the parent node, as the algorithm implies.
VerifyED(string s1, string s2, threshold θ)
1. if ||s1| − |s2|| > θ return FALSE
2. Construct a table T of 2 rows and |s2| + 1 columns
3. for j = 1 to min(|s2| + 1, 1 + θ) T[1][j] = j − 1
4. Set m = θ + 1
5. for i = 2 to |s1| + 1
6.   for j = max(1, i − θ) to min(|s2| + 1, i + θ)
7.     d1 = (j < i + θ) ? T[1][j] + 1 : θ + 1
8.     d2 = (j > 1) ? T[2][j − 1] + 1 : θ + 1
9.     d3 = (j > 1) ? T[1][j − 1] + ((s1[i − 1] = s2[j − 1]) ? 0 : 1) : θ + 1
10.    T[2][j] = min(d1, d2, d3); m = min(m, T[2][j]);
11.  if m > θ return FALSE
12.  for j = 0 to |s2| + 1 T[1][j] = T[2][j]
13. return TRUE;
Figure 3.12: The VerifyED function
RangeQuery(string q, B+-tree node N, threshold θ,
           minimal string smin, maximal string smax)
1. if N is a leaf node
2.   for each sj ∈ N
3.     if VerifyED(q, sj, θ)
4.       Include sj in the query result
5. else
6.   if LB(q, [smin, s1]) ≤ θ
7.     RangeQuery(q, N1, θ, smin, s1)
8.   for j = 2 to m
9.     if LB(q, [sj−1, sj]) ≤ θ
10.      RangeQuery(q, Nj, θ, sj−1, sj)
11.  if LB(q, [sm, smax]) ≤ θ
12.    RangeQuery(q, Nm+1, θ, sm, smax)
Figure 3.13: The RangeQuery Algorithm
If a string order φ supports range queries, it also directly supports top-k selection queries on the
B+-tree structure. It simply uses a min-heap to keep the current top-k similar strings and updates
the threshold θ with the distance value of the top element in the heap. For details, refer to [41].
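A banded, early-terminating edit-distance test in the spirit of VerifyED can be sketched in Python as follows (this is my own reformulation of the idea, not the paper's code; entries outside the band of width 2θ + 1 are clamped to θ + 1):

```python
def verify_ed(s1, s2, theta):
    """Return True iff ed(s1, s2) <= theta. Computes the DP row by row
    inside a band of width 2*theta + 1 and stops as soon as every entry
    in the current row exceeds theta."""
    if abs(len(s1) - len(s2)) > theta:
        return False                      # length filter
    big = theta + 1                       # sentinel for "already too far"
    prev = [j if j <= theta else big for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        cur = [big] * (len(s2) + 1)
        lo, hi = max(0, i - theta), min(len(s2), i + theta)
        if lo == 0:
            cur[0] = i if i <= theta else big
        for j in range(max(1, lo), hi + 1):
            cur[j] = min(prev[j] + 1,                              # deletion
                         cur[j - 1] + 1,                           # insertion
                         prev[j - 1] + (s1[i - 1] != s2[j - 1]))   # substitution
        if min(cur) > theta:
            return False                  # early termination (line 11)
        prev = cur
    return prev[len(s2)] <= theta
```

Only O(θ) cells per row are computed, and the scan aborts as soon as no cell can still lead to a distance within θ, matching the two pruning steps in Figure 3.12.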
The authors of [41] also discuss three concrete string mappings. According to their experiments,
the most effective mapping is the gram counting order (GCO), which I will briefly describe in the following.
GCO is based on counting the number of q-grams within a string. It uses a hash function to map
q-grams to a set of L buckets, and counts the number of q-grams in each bucket. Thus, the q-gram
set is transformed into a vector of L non-negative integers. The edit distance between two strings si
and sj is no smaller than

max( |Q(si) \ Q(sj)| / q , |Q(sj) \ Q(si)| / q )
Here, Q(si) \ Q(sj) is the set of q-grams in Q(si) but not in Q(sj), and vice versa. After mapping
strings from the gram space to the bucket space, a new lower bound holds. If vi and vj are the
L-dimensional bucket vector representations of si and sj respectively, the edit distance between si
and sj is no smaller than
max( Σ_{vi[l] > vj[l]} (vi[l] − vj[l]) / q , Σ_{vj[l] > vi[l]} (vj[l] − vi[l]) / q )
To achieve a tighter lower bound, they apply the z-order on the q-gram counting vectors. Eventually,
GCO maps the strings to z-values, which can be indexed by the B+-tree.
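Under an illustrative hash-to-bucket scheme (the bucket assignment here uses Python's built-in hash and is my own choice), the GCO bucket-vector lower bound can be sketched as:

```python
def qgrams(s, q=2):
    """Plain (non-positional) q-grams of s."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def bucket_vector(s, L=4, q=2):
    """Hash each q-gram into one of L buckets and count per bucket."""
    v = [0] * L
    for g in qgrams(s, q):
        v[hash(g) % L] += 1
    return v

def gco_lower_bound(vi, vj, q=2):
    """Lower bound on ed(si, sj) from bucket vectors: one edit operation
    changes at most q grams, so the positive and the negative per-bucket
    count differences each bound the number of edits from below."""
    pos = sum(a - b for a, b in zip(vi, vj) if a > b)
    neg = sum(b - a for a, b in zip(vi, vj) if b > a)
    return max(pos, neg) / q
```

Hash collisions can only merge differing grams into the same bucket, so the bucket-space bound never exceeds the gram-space bound and remains a valid lower bound.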
CHAPTER 4
SPATIAL KEYWORD AND APPROXIMATE STRING
QUERIES
4.1 Keyword search on spatial databases
Many applications need to find the objects closest to a specified location that contain a set of
keywords, which leads to a novel type of search: spatial keyword search (SKS). An SKS query
consists of a query point and a set of keywords. The answer is a list of objects ranked according to a
combination of their distance to the query point and the relevance of their text description to the
query keywords. Several variants of such a query have been studied since SKS emerged in the
database community [16]. I will briefly introduce them in the following.
4.1.1 Distance-first top-k spatial keyword query
Felipe et al. in [16] propose a variant of SKS: the distance-first top-k spatial keyword query.
Formally, an object O is defined as a pair (O.p, O.t), where O.p is a location descriptor in the
multidimensional space and O.t is a textual description. Let D be the universe of all objects in a
database. A top-k spatial query Qp retrieves the k nearest objects to the specified query point q. A keyword
Figure 4.1: The example of an IR2-tree.
query Qt consists of a set of keywords w1, · · · , wm. The result of Qt is a list of objects ordered
by the relevance IRscore(O.t, Qt) of their textual descriptions to the query keywords, as measured
by an IR ranking function [34]. In particular, the authors of [16] deal with the Boolean keyword
query, which returns the set of all objects whose text contains all of w1, · · · , wm. A distance-first
top-k spatial keyword query is a combination of a top-k spatial query and a Boolean keyword query,
i.e., it retrieves the k objects that contain all of w1, · · · , wm and are closest to Qp.
The state-of-the-art solution to the above problem is based on the information retrieval R-tree
(IR2-tree), which combines the R-tree [5, 18] with the signature file [15]. For a review of the
signature file, please refer to Section 3.1.2.
The IR2-tree is an R-tree where each entry N is augmented with a signature that summarizes
the union of the texts of the objects in the subtree of N. Figure 4.1 demonstrates an example. Note
that the signature of a nonleaf entry N can be conveniently obtained as the disjunction of all the
signatures in the child node of N.
To answer NN queries, Felipe et al. [16] adapt the best-first algorithm [19] to the IR2-tree. In
particular, given a query q and a keyword set Wq, the algorithm accesses the entries of an IR2-tree in
ascending order of the distance of their MBRs to q, pruning those entries whose signatures indicate
the absence of at least one word of Wq in their subtrees. If a leaf entry, say a point p, cannot be
pruned, it retrieves the text Wp. If Wq is a subset of Wp, then it returns p as an answer. Otherwise,
it continues until no more entries need to be processed.
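This best-first traversal with signature pruning can be sketched over a toy node layout (the tuple format here is mine, not the IR2-tree's actual page layout; distances are precomputed for brevity instead of being derived from MBRs):

```python
import heapq

def ir2_nn(root, query_sig, query_words, k=1):
    """Best-first kNN sketch over an IR2-tree-like structure. Each node is a
    tuple (dist, sig, children, words): internal nodes have children and
    words=None, leaves the reverse. Entries are popped in ascending distance
    order; a subtree is pruned unless its signature contains every 1-bit of
    the query signature."""
    tick = 0  # tie-breaker so the heap never compares node tuples
    heap = [(root[0], tick, root)]
    results = []
    while heap and len(results) < k:
        _, _, (dist, sig, children, words) = heapq.heappop(heap)
        if sig & query_sig != query_sig:
            continue              # signature says some query word is missing
        if children is None:      # leaf: check real words to rule out false hits
            if query_words <= words:
                results.append(words)
        else:
            for child in children:
                tick += 1
                heapq.heappush(heap, (child[0], tick, child))
    return results
```

Because entries are expanded in ascending distance order, the first k leaves that survive both the signature test and the verification are exactly the k nearest qualifying objects.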
4.1.2 The m-closest keywords query on spatial databases
Zhang et al. [39] deal with another variant of SKS, defined as follows. Let D be a set of points,
each of which is associated with a single keyword. Given a set Wq of query keywords, the goal
is to retrieve |Wq| points from D such that (1) each point has a distinct keyword in Wq and (2) the
maximum mutual distance of these points is minimized. This problem is called the m-closest
keywords (mCK) query.
To efficiently answer mCK queries, Zhang et al. [39] propose a novel index, the bR∗-tree, which
is an extension of the R∗-tree [5]. Besides the MBR, each node is associated with additional keyword
information. A straightforward extension is to summarize the keywords in the node. Assuming
there are N keywords in the database, the keywords of each node can be represented using a bitmap
of size N, with a '1' indicating a keyword's existence in the node and a '0' otherwise. For example,
a bitmap B = 1001 indicates that there are four keywords in the database and the current node is
associated with the keywords in the first and last positions of the bitmap. Given a query Q = 0110,
if we have B & Q = 0, it implies that the given node does not include any query keywords and can
be pruned from the search space. Besides the keyword bitmap, each node also stores keyword MBRs
to facilitate pruning. The keyword MBR of keyword wi is the MBR covering all the objects in the
node that are associated with wi. Using this information, we know the approximate area in the node
where each keyword is distributed. Figure 4.2 illustrates an example of an internal node containing
three keywords w1, w2 and w3, represented as 111. It also maintains the keyword MBRs of w1, w2
Figure 4.2: An example of the node information of the bR∗-tree.
and w3. The construction of the bR∗-tree follows the original R∗-tree algorithm [5], with the
additional operations of updating the keywords and keyword MBRs when AdjustTree is invoked.
The set of keywords of a parent node is the union of the sets of keywords of its child nodes, and
the keyword MBR of wi in the parent node is the minimum bounding rectangle of the corresponding
keyword MBRs in the child nodes.
The search algorithm starts from the root node. The result can be located within one child node
or across multiple child nodes of the root. If a child node contains all the m query keywords, it is
included in the candidate search space. Similarly, if multiple child nodes together can contribute all
the query keywords and they are close to each other, then they are also included in the search space.
After exploring the root node, it obtains a list of candidate subsets of its child nodes. Note that
during the whole search process, the number of nodes in a node set never exceeds m, because the
target m tuples can reside in at most m child nodes. This provides an additional constraint to
reduce the search space.
The first step is to find a relatively small diameter for branch-and-bound pruning before the
search starts. It starts from the root node, chooses the child node with the smallest MBR that
contains all the query keywords, and traverses down into that node. The process is repeated until it
reaches the leaf level or until it is unable to find any child node with all the query keywords. It then
performs an exhaustive search within the node it found and uses the diameter of the result as the
initial diameter for searching (denoted as δ∗).
SubsetSearch(current subset curSet)
1. if (curSet contains leaf nodes)
2.   δ = ExhaustiveSearch(curSet);
3.   if (δ < δ∗) δ∗ = δ;
4. else
5.   if (curSet has only one node)
6.     setList = SearchInOneNode(curSet);
7.     for each S ∈ setList
8.       δ∗ = SubsetSearch(S);
9.   if (curSet has multiple nodes)
10.    setList = SearchInMultiNode(curSet);
11.    for each S ∈ setList
12.      δ∗ = SubsetSearch(S);
13. output the distance of the m closest keywords.
Figure 4.3: The SubsetSearch function
SearchInOneNode(a node N in the bR∗-tree)
1. L1 = all the child nodes of N;
2. for i from 2 to m
3.   for each NodeSet C1 ∈ Li−1
4.     for each NodeSet C2 ∈ Li−1
5.       if C1 and C2 share the first i − 2 nodes
6.         C = NodeSet(C1, C2);
7.         if (C has a subset not appearing in Li−1) continue;
8.         if (C is neither distance mutex nor keyword
9.           mutex) Li = Li ∪ C;
10. for each NodeSet S ∈ ∪_{i=1}^{m} Li
11.   if (S contains all query keywords)
12.     add S to cList;
13. return cList;
Figure 4.4: The SearchInOneNode function
With this initial δ∗, the search starts from the root node. The function SubsetSearch in Figure
4.3 traverses the tree in a depth-first manner so as to visit the data objects in leaf entries as soon as
possible. An effective strategy for reducing the number of candidate subsets is important, as each
subset will later spawn an exponential number of new subsets. The authors utilize the apriori
algorithm proposed in [1], combined with two constraints called distance mutex and keyword mutex.
The definition of distance mutex is based on the observation that if the minimum distance between
the MBRs of two nodes N and N' is larger than δ∗, then the node set {N, N'} cannot give a result
with diameter better than δ∗. A similar property holds for keyword MBRs, which is defined as
keyword mutex. To demonstrate how to utilize these two constraints, Figure 4.4 illustrates the
SearchInOneNode function, which is the essential part of the whole algorithm.
First (in line 1 of Figure 4.4), it puts all the child nodes in the bottom level of the lattice. The lattice is built level by level with an increasing number of child nodes per NodeSet: in level i, each NodeSet contains exactly i child nodes. For a query with m keywords, we only need to check NodeSets with at most m nodes, leading to a lattice with at most m levels. Lines 5–6 show two sets C1 and C2 in level i − 1 being joined; they must have i − 2 nodes in common. Lines 7–9 check whether any of the new candidate's subsets in level i − 1 was pruned due to distance mutex or keyword mutex. If all the subsets are legal, we check whether the new candidate itself is distance mutex or keyword mutex; if it is not pruned, we add it to level i. In lines 10–12, after all the candidates have been generated, each one is checked to see if it contains all the query keywords, and those missing any keyword are eliminated. This constraint is not checked while building the lattice because a node that does not contain all the query keywords can still be combined with other nodes to cover the missing keywords. As long as a NodeSet is neither distance mutex nor keyword mutex, we keep it in the lattice.
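The level-wise candidate generation described above can be sketched as follows. This is a minimal illustration, not the implementation from [39]: the two mutex constraints are passed in as caller-supplied predicates, the final keyword-coverage filter (lines 10–12 of Figure 4.4) is omitted, and all names are assumptions.

```python
from itertools import combinations

def generate_candidates(children, m, is_distance_mutex, is_keyword_mutex):
    """Level-wise (a priori) lattice generation over the child nodes of one
    bR*-tree node.  Mutex predicates take a tuple of nodes and return True
    if the NodeSet can be pruned."""
    levels = [[(c,) for c in children]]            # level 1: singleton NodeSets
    for i in range(2, m + 1):                      # NodeSets of exactly i nodes
        prev = levels[-1]
        survived = set(prev)
        cur = []
        for a in range(len(prev)):
            for b in range(a + 1, len(prev)):
                c1, c2 = prev[a], prev[b]
                if c1[:-1] != c2[:-1]:             # must share the first i-2 nodes
                    continue
                cand = c1 + (c2[-1],)
                # a priori principle: every (i-1)-subset must have survived
                if any(s not in survived for s in combinations(cand, i - 1)):
                    continue
                if is_distance_mutex(cand) or is_keyword_mutex(cand):
                    continue                       # pruned by a mutex constraint
                cur.append(cand)
        levels.append(cur)
    return [s for level in levels for s in level]
```

With no pruning, three children and m = 3 yield all seven non-empty subsets; pruning one pair also removes every superset of that pair, which is exactly why each candidate spawning exponentially many subsets makes early pruning so effective.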
The drawback of the bR∗-tree is that it may suffer from high I/O cost if the total number of keywords is large. The root node has to maintain all keyword MBRs. The internal nodes become
Figure 4.5: An example of the index of R∗-tree and inverted lists (from Figure 3 in [40]).
overloaded and occupy a large amount of storage space. Such an index structure prevents the search algorithm from handling spatial databases with a massive number of keywords. Later on, Zhang et al. [40] remedy this drawback by proposing a novel index based on the R∗-tree and inverted lists. The new index scales well for mCK queries and is applied to locating mapped resources in Web 2.0 [40].
The new lightweight index structure is based on the R∗-tree and inverted lists. The R∗-tree is used to index all the spatial locations associated with keywords. It is constructed in the same manner as described in [5], except that each node is assigned a label indicating the path to the root node. As an example, Figure 4.5 (from Figure 3 in [40]) illustrates an index of the R∗-tree and inverted lists. Nodes R5 and R6 are both labeled 'b' because they share the same path through the root node's second entry. Given a node label, we can find where the node is located in the R∗-tree without explicitly accessing the tree. Two locations close to each other are likely to share the same node-label prefix.
Figure 4.6: An example of bottom-up construction of the virtual bR∗-tree (from Figure 4 in [40]).
Thus, the label can be used to approximate the distance between the points. An inverted index is built along with the R∗-tree. It maintains inverted lists for all the keywords. Each element in a list consists of the node label and the actual location. Note that the list of locations is ordered by the label to preserve the spatial proximity of nodes.
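A minimal sketch of such a label-ordered inverted index follows. The path-label encoding (short strings such as 'aa', 'ba'), the point format, and the function name are illustrative assumptions, not the structures of [40].

```python
from collections import defaultdict

def build_inverted_index(points):
    """points: iterable of (label, location, keywords), where `label` is a
    hypothetical string encoding of the point's root-to-leaf path in the
    R*-tree.  Each keyword's list holds (label, location) pairs kept in
    label order, so entries that are close in the tree stay adjacent."""
    index = defaultdict(list)
    for label, loc, keywords in points:
        for kw in keywords:
            index[kw].append((label, loc))
    for kw in index:
        index[kw].sort(key=lambda entry: entry[0])   # order by node label
    return index
```

Because labels share prefixes exactly when the underlying nodes share a path, sorting each list by label clusters spatially nearby points together, which is what later allows same-label runs to be cut out cheaply at query time.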
For the query processing, given m query keywords, it first retrieves m lists of data points from the inverted lists. The authors in [40] propose a solution that constructs a virtual bR∗-tree from the labels and locations of the data points. The virtual bR∗-tree is built level by level in a bottom-up manner, as illustrated in Figure 4.6 (from Figure 4 in [40]). First, the m inverted lists corresponding to the query tags are merged into one list ordered by the node label. Points with the same label are used to construct a new virtual node, which has a counterpart in the original R∗-tree. Note that the MBR of the virtual node is much smaller than that of its counterpart, as it is built only on the points relevant to the query. Each virtual node maintains the keyword bitmap and the keyword MBRs as in [5]. Then, the a priori-based search strategy can be applied to the virtual bR∗-tree. The detailed search algorithm is shown in Figure 4.7. When a virtual node is constructed, it is treated as a subtree and fed to the algorithm SubsetSearch proposed in [5].
Bottom-Up search (m query keywords, inverted index)
1.  Retrieve m inverted lists for all query keywords;
2.  Merge the lists into one list L ordered by the label;
3.  Initialize a virtual node curNode;
4.  while L.level < tree.height
5.    for each element vnode in L
6.      if (vnode has the same label as its previous element)
7.        add vnode into curNode;
8.      else
9.        SubsetSearch(curNode);
10.       add curNode to list L′;
11.   L = L′;
12. output distance and location of m closest keywords.
Figure 4.7: The bottom-up search
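The merge-and-group step that feeds the bottom-up search (steps 1–2 of Figure 4.7, plus forming one virtual node per label run) might be sketched as below. The entry format, the 2D MBR computation, and all names are illustrative assumptions.

```python
import heapq
from itertools import groupby

def merge_and_group(lists):
    """Merge label-ordered inverted lists and group entries that share a
    node label; each group becomes one virtual node whose MBR covers only
    the query-relevant points.  Entries are (label, (x, y)) pairs."""
    # heapq.merge keeps the combined stream ordered by label
    merged = heapq.merge(*lists, key=lambda entry: entry[0])
    virtual_nodes = []
    for label, group in groupby(merged, key=lambda entry: entry[0]):
        pts = [loc for _, loc in group]
        xs = [x for x, _ in pts]
        ys = [y for _, y in pts]
        # the virtual node's MBR is typically tighter than its R*-tree
        # counterpart, since only query-relevant points contribute
        virtual_nodes.append((label, (min(xs), min(ys), max(xs), max(ys)), pts))
    return virtual_nodes
```

Because every input list is already ordered by label, both the merge and the grouping run in a single linear pass over the retrieved points.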
4.2 Approximate string search on spatial databases
In this section, we consider another problem related to spatial keyword search. In reality, many scenarios require keyword search that retrieves approximate string matches [2, 8, 10, 21, 22, 24, 28, 36]. Approximate string search is necessary when users have a fuzzy search condition or simply make a spelling error when submitting the query, or when the strings in the database contain some
[Figure content: points p1–p10 with string labels, among them "theatre" (p7), "theaters" (p3), "theater" (p4), "grocery" (p5), "ymca club" (p2), "gym" (p10), "barnes noble" (p6), "shaw's market" (p9), "Commonwealth ave" (p8), and "Moe's" (p1); query Q with range r, string "theatre", and τ = 2.]
Figure 4.8: Approximate string search with a range query.
degree of uncertainty or error. In the context of spatial databases, approximate string search can be combined with any type of spatial query, including range and nearest neighbor queries. An example of an approximate string match range query is shown in Figure 4.8, depicting a common scenario in location-based services: find all objects within a spatial range r that have a description similar to 'theatre'. Similar examples can be constructed for KNN queries. We refer to these queries as Spatial Approximate String (SAS) queries. This problem was first addressed by Yao et al. in [37]. Approximate string search has been studied independently in the database community (see Chapter 3). The main challenge for the SAS problem is that the basic solution of simply integrating edit-distance evaluation techniques into a normal R-tree is expensive and impractical.

Yao et al. [37] propose the MHR-tree, which exploits the pruning capabilities of the string match predicate and the spatial predicate simultaneously. The MHR-tree is based on the R∗-tree. Each node stores a signature of the string information of its children. Before exploring how to construct the MHR-tree and answer SAS queries with it, I will introduce the min-wise signature, which is used to summarize the string information.
4.2.1 The min-wise signature
The min-wise independent families of permutations were first introduced in [6, 12] and can be used for estimating set resemblance. Broder et al. have shown in [6] that a min-wise independent permutation π can be used to construct an unbiased estimator for set resemblance. However, there is no efficient implementation of the min-wise families of permutations in practice. Thus, prior art often uses linear hash functions based on Rabin fingerprints to simulate the behavior of the min-wise independent permutations [6].
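A minimal sketch of this simulation follows, using random linear hash functions h_i(x) = (a_i·x + b_i) mod p in place of true min-wise independent permutations. It operates on integer set elements (in practice, q-grams would first be mapped to integers, e.g. via Rabin fingerprints); all names and parameter choices are assumptions for illustration.

```python
import random

PRIME = 2**31 - 1  # Mersenne prime used as the hash modulus

def make_hashes(ell, seed=7):
    """ell random linear hash functions standing in for ell min-wise
    independent permutations pi_1, ..., pi_ell."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(ell)]

def minwise_signature(items, hashes):
    """s(A)[i] = min over x in A of h_i(x)."""
    return [min((a * x + b) % PRIME for x in items) for a, b in hashes]

def estimate_resemblance(s1, s2):
    """The fraction of agreeing positions is an estimator of the set
    resemblance |A ∩ B| / |A ∪ B|."""
    return sum(u == v for u, v in zip(s1, s2)) / len(s1)
```

The signature length ℓ trades accuracy for space: more hash functions sharpen the resemblance estimate, while the signature itself stays constant-size regardless of how large the underlying set is.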
4.2.2 The construction of the MHR-tree
To incorporate the pruning power of edit distances into the R-tree, we can utilize the result of Lemma 1 in [37]. The intuition is that if we store the q-grams of all strings in a subtree rooted at an R-tree node u, denoted Gu, then given a query string σ, we can extract the query q-grams Gσ and check the size of the intersection between Gu and Gσ, i.e., |Gu ∩ Gσ|. We can then possibly prune node u by Lemma 1 in [37], even if u intersects the query range r or needs to be explored based on the MinDist metric for the KNN query.

To estimate the intersection size between Gu and Gσ, we embed the min-wise signature of Gu in an R-tree node, instead of Gu itself. The min-wise signature s(Gu) has a constant size, which makes it suitable for storage in the R-tree node.

For a leaf-level node u, let the set of points contained in u be up. For every point p in up, we compute its q-grams Gp and the corresponding min-wise signature s(Gp). To compute the min-wise signature for the node, let s(A)[i] be the ith element of the min-wise signature of A, i.e., s(A)[i] = min{πi(A)}. Given the min-wise signatures {s(A1), . . . , s(Ak)} of k sets, one
Construct-MHR(Data set P, Hash functions {h1, . . . , hℓ})
1.  Use any existing bulk-loading algorithm A for the R-tree;
2.  Let u be an R-tree node produced by A over P;
3.  if (u is a leaf node)
4.    Compute Gp and s(Gp) for every point p ∈ up;
5.    Store s(Gp) together with p in u;
6.  else
7.    for (every child entry ci with child node wi)
8.      Store MBR(wi), s(Gwi), and a pointer to wi in ci;
9.    for (i = 1, . . . , ℓ)
10.     Let s(Gu)[i] = min(s(Gw1)[i], . . . , s(Gwf)[i]);
11. Store s(Gu) in the parent of u;
Figure 4.9: The construction of the MHR-tree.
can easily derive that:

s(A1 ∪ · · · ∪ Ak)[i] = min{s(A1)[i], . . . , s(Ak)[i]} (4.1)

for i = 1, . . . , ℓ, since each element in a min-wise signature always takes the smallest image of a set.

We can obtain s(Gu) directly using Equation 4.1 and the s(Gp)s of every point p ∈ up. Finally, we store all (p, s(Gp)) pairs in node u, and s(Gu) in the index entry that points to u in u's parent.
For an index-level node u, let its child entries be {c1, . . . , cf}, where f is the fan-out of the R-tree. Each entry ci points to a child node wi of u and contains the MBR of wi. We also store the min-wise signature of the node pointed to by ci, i.e., s(Gwi). Clearly, Gu = Gw1 ∪ · · · ∪ Gwf. Hence, s(Gu) = s(Gw1 ∪ · · · ∪ Gwf); based on Equation 4.1, we can compute s(Gu) using the s(Gwi)s.

This procedure is applied recursively in a bottom-up fashion until the root node of the R-tree has been processed. The complete construction algorithm is presented in Figure 4.9.
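Equation 4.1 makes the bottom-up aggregation in lines 9–10 of Figure 4.9 a one-liner: a parent's signature is the positionwise minimum of its children's signatures, with no access to the underlying q-gram sets. A sketch (the function name is illustrative):

```python
def signature_union(child_signatures):
    """Equation 4.1: s(A1 ∪ ... ∪ Ak)[i] = min_j s(Aj)[i], so s(Gu) is the
    positionwise minimum of the children's signatures s(Gwi)."""
    return [min(position) for position in zip(*child_signatures)]
```

This is exactly why the MHR-tree can be built bottom-up: each internal node's signature is derived from constant-size child signatures alone.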
4.2.3 The query algorithm for the MHR-tree
The query algorithms for the MHR-tree generally follow the same principles as the corresponding algorithms for the spatial query component. However, we would like to incorporate the pruning method based on q-grams and set-resemblance estimation without explicit knowledge of Gu for a given R-tree node u; we must achieve this with the help of s(Gu). Thus, the key issue boils down to estimating |Gu ∩ Gσ| using s(Gu) and σ. The details of estimating |Gu ∩ Gσ| are presented in Section IV of [37].

The SAS range query algorithm is presented in Figure 4.10. One can also revise the KNN algorithm for the normal R-tree to derive the KNN-MHR algorithm. The basic idea is to use a priority queue that orders objects with respect to the query point using the MinDist metric. However, only nodes or data points that pass the string pruning test are inserted into the queue. Whenever a point is removed from the head of the queue, it is inserted into A. The search terminates when A has k points or the priority queue becomes empty.
Range-MHR(MHR-tree R, Range r, String σ, int τ)
1.  Let B be a FIFO queue initialized to ∅; let A = ∅;
2.  Let u be the root node of R; insert u into B;
3.  while (B ≠ ∅)
4.    Let u be the head element of B; pop u;
5.    if (u is a leaf node)
6.      for (every point p ∈ up)
7.        if (p is contained in r)
8.          if (|Gp ∩ Gσ| ≥ max(|σp|, |σ|) − 1 − (τ − 1) · q)
9.            if (ε(σp, σ) ≤ τ) insert p into A;
10.   else
11.     for (every child entry ci of u)
12.       if (r and MBR(wi) intersect)
13.         Calculate s(Gwi ∪ Gσ) based on s(Gwi), s(Gσ), and Equation 4.1;
14.         Calculate the estimated |Gwi ∩ Gσ| using Equation 8 in [37];
15.         if (estimated |Gwi ∩ Gσ| ≥ |σ| − 1 − (τ − 1) · q)
16.           Read node wi and insert wi into B;
17. Return A.
Figure 4.10: The range queries with the MHR-tree.
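The string test in line 8 of Figure 4.10 is the q-gram count filter over padded q-grams: two strings within edit distance τ must share at least max(|σp|, |σ|) − 1 − (τ − 1)·q grams. A sketch follows; it uses set rather than multiset semantics for simplicity (the underlying lemma is stated over gram multisets), and the function names are illustrative.

```python
def qgrams(s, q=2, pad='#'):
    """Padded q-grams: extend s with q-1 copies of `pad` on each side,
    then take every window of length q (dropping duplicates)."""
    s = pad * (q - 1) + s + pad * (q - 1)
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def count_filter_passes(s1, s2, tau, q=2):
    """The test of line 8 in Figure 4.10: if ed(s1, s2) <= tau, the padded
    q-gram sets must share at least max(|s1|, |s2|) - 1 - (tau - 1)*q
    grams; failing this prunes the pair without computing edit distance."""
    shared = len(qgrams(s1, q) & qgrams(s2, q))
    return shared >= max(len(s1), len(s2)) - 1 - (tau - 1) * q
```

For the running example of Figure 4.8 with τ = 2 and q = 2, 'theatre' and 'theater' share five padded 2-grams against a required four, so the pair survives to the edit-distance verification, while 'grocery' shares none and is pruned immediately.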
CHAPTER 5
CONCLUSION
In this document, I survey a type of query: spatial keyword search. Since it is a combination of spatial search and text search, I first review some critical issues in both fields. Range and nearest neighbor queries are fundamental problems in spatial databases; I go over them in both Euclidean space and on road networks. Next, I present the state-of-the-art techniques for string processing in the literature. Approximate string search is a hot research topic in string databases, and I introduce the most popular q-gram-based indices and algorithms for it. Lastly, I turn to the spatial keyword and approximate string queries. Only a few variants of these queries have been studied, and many problems remain open. I briefly describe three variants that have been explored.
BIBLIOGRAPHY
[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB, 1994.
[2] A. Arasu, S. Chaudhuri, K. Ganjam, and R. Kaushik. Incorporating string transformations in record matching. In SIGMOD, 2008.
[3] S. Arya, D. M. Mount, N. S. Netanyahu, R. Silverman, and A. Y. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM, 45(6):891–923, 1998.
[4] R. Bayardo, Y. Ma, and R. Srikant. Scaling up all-pairs similarity search. In WWW, 2007.
[5] N. Beckmann, H. P. Kriegel, R. Schneider, and B. Seeger. The R∗-tree: an efficient and robust access method for points and rectangles. In SIGMOD, 1990.
[6] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations (extended abstract). In STOC, 1998.
[7] T. M. Chan. Approximate nearest neighbor queries revisited. In SoCG, 1997.
[8] S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. Robust and efficient fuzzy match for online data cleaning. In SIGMOD, 2003.
[9] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
[10] S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006.
[11] H.-J. Cho and C.-W. Chung. An efficient and scalable approach to CNN queries in a road network. In VLDB, 2005.
[12] E. Cohen. Size-estimation framework with applications to transitive closure and reachability. Journal of Computer and System Sciences, 55(3):441–453, 1997.
[13] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SoCG, 2004.
[14] E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269–271, 1959.
[15] C. Faloutsos and S. Christodoulakis. Signature files: an access method for documents and its analytical performance evaluation. TOIS, 2:267–288, 1984.
[16] I. D. Felipe, V. Hristidis, and N. Rishe. Keyword search on spatial databases. In ICDE, 2008.
[17] L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001.
[18] A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD, 1984.
[19] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. TODS, 24(2):265–318, 1999.
[20] H. V. Jagadish, B. C. Ooi, K. Tan, C. Yu, and R. Zhang. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Trans. Database Syst., 30(2):364–397, 2005.
[21] N. Koudas, A. Marathe, and D. Srivastava. Flexible string matching against large databases in practice. In VLDB, 2004.
[22] H. Lee, R. Ng, and K. Shim. Extending q-grams to estimate selectivity of string matching with low edit distance. In VLDB, 2007.
[23] V. Levenshtein. Binary codes capable of correcting spurious insertions and deletions of ones. Inf. Transmission, 1:8–17, 1965.
[24] C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, 2008.
[25] C. Li, B. Wang, and X. Yang. VGRAM: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, 2007.
[26] D. Papadias, J. Zhang, N. Mamoulis, and Y. Tao. Query processing in spatial network databases. In VLDB, 2003.
[27] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD, 1995.
[28] S. Sahinalp, M. Tasan, J. Macker, and Z. Ozsoyoglu. Distance based indexing for string proximity search. In ICDE, 2003.
[29] H. Samet. Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, San Francisco, 2006.
[30] H. Samet, J. Sankaranarayanan, and H. Alborzi. Scalable network distance browsing in spatial databases. In SIGMOD, 2008.
[31] J. Sankaranarayanan, H. Alborzi, and H. Samet. Efficient query processing on spatial networks. In ACM GIS, 2005.
[32] J. Sankaranarayanan and H. Samet. Distance oracles for spatial networks. In ICDE, 2009.
[33] S. Sarawagi and A. Kirpal. Efficient set joins on similarity predicates. In SIGMOD, 2004.
[34] A. Singhal. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 2001.
[35] Y. Tao, K. Yi, C. Sheng, and P. Kalnis. Quality and efficiency in high dimensional nearest neighbor search. In SIGMOD, 2009.
[36] X. Yang, B. Wang, and C. Li. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In SIGMOD, 2008.
[37] B. Yao, F. Li, M. Hadjieleftheriou, and K. Hou. Approximate string search in spatial databases. In ICDE, 2010.
[38] B. Yao, F. Li, and P. Kumar. K nearest neighbor queries and kNN-joins in large relational databases (almost) for free. In ICDE, 2010.
[39] D. Zhang, Y. M. Chee, A. Mondal, A. K. H. Tung, and M. Kitsuregawa. Keyword search in spatial databases: Towards searching by document. In ICDE, 2009.
[40] D. Zhang, B. C. Ooi, and A. K. H. Tung. Locating mapped resources in Web 2.0. In ICDE, 2010.
[41] Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD, 2010.