
1041-4347 (c) 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. Seehttp://www.ieee.org/publications_standards/publications/rights/index.html for more information.

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citationinformation: DOI 10.1109/TKDE.2014.2324897, IEEE Transactions on Knowledge and Data Engineering

JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007

Best Keyword Cover Search

Ke Deng, Xin Li, Jiaheng Lu, and Xiaofang Zhou

Abstract—It is common for the objects in a spatial database (e.g., restaurants/hotels) to be associated with keyword(s) that indicate their businesses/services/features. An interesting problem known as Closest Keywords search is to query objects, called a keyword cover, which together cover a set of query keywords and have the minimum inter-object distance. In recent years, we observe the increasing availability and importance of keyword rating in object evaluation for better decision making. This motivates us to investigate a generic version of Closest Keywords search, called Best Keyword Cover, which considers inter-object distance as well as the keyword ratings of objects. The baseline algorithm is inspired by the methods for Closest Keywords search, which exhaustively combine objects from different query keywords to generate candidate keyword covers. When the number of query keywords increases, the performance of the baseline algorithm drops dramatically because of the massive number of candidate keyword covers generated. To address this drawback, this work proposes a much more scalable algorithm called keyword nearest neighbor expansion (keyword-NNE). Compared to the baseline algorithm, the keyword-NNE algorithm significantly reduces the number of candidate keyword covers generated. In-depth analysis and extensive experiments on real data sets have justified the superiority of our keyword-NNE algorithm.

Index Terms—Spatial database, Points of Interest, Keywords, Keyword Rating, Keyword Cover


1 INTRODUCTION

Driven by mobile computing, location-based services and the wide availability of extensive digital maps and satellite imagery (e.g., Google Maps and Microsoft Virtual Earth services), the spatial keyword search problem has attracted much attention recently [3, 4, 5, 6, 8, 10, 15, 16, 18].

In a spatial database, each tuple represents a spatial object which is associated with keyword(s) to indicate information such as its businesses/services/features. Given a set of query keywords, an essential task of spatial keyword search is to identify spatial object(s) which are associated with keywords relevant to the query keywords and have desirable spatial relationships (e.g., close to each other and/or close to a query location). This problem has unique value in various applications because users' requirements are often expressed as multiple keywords. For example, a tourist who plans to visit a city may have particular shopping, dining and accommodation needs. It is desirable that all these needs can be satisfied without long-distance traveling.

Due to its remarkable value in practice, several variants of the spatial keyword search problem have been studied. The works [5, 6, 8, 15] aim to find a number of individual objects, each of which is close to a query location and whose associated keywords (called a document) are very relevant to the set of query keywords (called the query document). Document similarity (e.g., [14]) is applied to measure the relevance between two sets of keywords. Since it is

• Ke Deng is with Huawei Noah's Ark Research Lab, Hong Kong. This work was completed while he was with the University of Queensland. E-mail: [email protected]

• Xin Li and Jiaheng Lu are with Renmin University, China. E-mail: lixin2007,[email protected]

• Xiaofang Zhou is with the University of Queensland, Australia. E-mail: [email protected]

[Figure: objects of keywords Hotel, Restaurant and Bar plotted with their keyword ratings; BKC returns {t1, s1, c1} while mCK returns {t2, s2, c2}.]

Fig. 1. BKC vs. mCK

likely that none of the individual objects is associated with all query keywords, which motivates the studies [4, 17, 18] to retrieve multiple objects, called a keyword cover, which together cover (i.e., are associated with) all query keywords and are close to each other. This problem is known as the m Closest Keywords (mCK) query in [17, 18]. The problem studied in [4] additionally requires the retrieved objects to be close to a query location.

This paper investigates a generic version of the mCK query, called the Best Keyword Cover (BKC) query, which considers inter-object distance as well as keyword rating. It is motivated by the observed increasing availability and importance of keyword rating in decision making. Millions of businesses/services/features around the world have been rated by customers through online business review sites such as Yelp, Citysearch, ZAGAT and Dianping. For example, a restaurant is rated 65 out of 100 (ZAGAT.com) and a hotel is rated 3.9 out of 5 (hotels.com). According to a survey conducted in 2013 by Dimensional Research 1,

1. www.dimensionalresearch.com


an overwhelming 90 percent of respondents claimed that buying decisions are influenced by online business reviews/ratings. Due to the consideration of keyword rating, the solution of a BKC query can be very different from that of an mCK query. Figure 1 shows an example. Suppose the query keywords are "Hotel", "Restaurant" and "Bar". The mCK query returns {t2, s2, c2} since it considers only the distance between the returned objects. The BKC query returns {t1, s1, c1} since the keyword ratings of objects are considered in addition to the inter-object distance. Compared to the mCK query, the BKC query supports more robust object evaluation and thus underpins better decision making.

This work develops two BKC query processing algorithms, baseline and keyword-NNE. The baseline algorithm is inspired by the mCK query processing methods [17, 18]. Both the baseline algorithm and the keyword-NNE algorithm are supported by indexing the objects with an R*-tree-like index, called the KRR*-tree. In the baseline algorithm, the idea is to combine nodes in the higher hierarchical levels of the KRR*-trees to generate candidate keyword covers. Then the most promising candidate is assessed first by combining its child nodes to generate new candidates. Even though the BKC query can be effectively resolved this way, when the number of query keywords increases the performance drops dramatically as a result of the massive number of candidate keyword covers generated.

To overcome this critical drawback, we developed the much more scalable keyword nearest neighbor expansion (keyword-NNE) algorithm, which applies a different strategy. Keyword-NNE selects one query keyword as the principal query keyword. The objects associated with the principal query keyword are principal objects. For each principal object, the local best solution (known as its local best keyword cover (lbkc)) is computed. Among them, the lbkc with the highest evaluation is the solution of the BKC query. Given a principal object, its lbkc can be identified by simply retrieving a few nearby and highly rated objects in each non-principal query keyword (2-4 objects on average, as illustrated in the experiments). Compared to the baseline algorithm, the number of candidate keyword covers generated in the keyword-NNE algorithm is significantly reduced. The in-depth analysis reveals that the number of candidate keyword covers further processed in the keyword-NNE algorithm is optimal, and that processing each candidate keyword cover generates many fewer new candidate keyword covers than in the baseline algorithm.

The remainder of this paper is organized as follows. The problem is formally defined in section 2 and the related work is reviewed in section 3. After that, section 4 discusses the keyword rating R*-tree (KRR*-tree). The baseline algorithm is introduced in section 5 and the keyword-NNE algorithm is proposed in section 6. Section 7 discusses the situation of a weighted average of keyword ratings. An in-depth analysis is given in section 8. Then section 9 reports the experimental results. The paper is concluded in section 10.

2 PRELIMINARY

Given a spatial database, each object may be associated with one or multiple keywords. Without loss of generality, an object with multiple keywords is transformed into multiple objects located at the same location, each with a distinct single keyword. So, an object is in the form ⟨id, x, y, keyword, rating⟩ where x, y define the location of the object in a 2-dimensional geographical space. We assume no data quality problem such as misspelling exists in keywords.

Definition 1 (Diameter): Let O be a set of objects {o1, · · · , on}. For oi, oj ∈ O, dist(oi, oj) is the Euclidean distance between oi and oj in the 2-dimensional geographical space. The diameter of O is:

diam(O) = max_{oi,oj∈O} dist(oi, oj).    (1)
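Treating each object as an (x, y) point, the diameter of Eq. (1) is simply the largest pairwise Euclidean distance. A direct sketch (the function name is illustrative, not from the paper):

```python
from itertools import combinations
from math import hypot

def diam(objects):
    """Diameter of a set of (x, y) points per Eq. (1):
    the maximum pairwise Euclidean distance."""
    if len(objects) < 2:
        return 0.0
    return max(hypot(a[0] - b[0], a[1] - b[1])
               for a, b in combinations(objects, 2))
```

The pairwise scan is O(n^2), which is acceptable here since a keyword cover contains only one object per query keyword.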

The score of O is a function of not only the diameter of O but also the keyword ratings of the objects in O. Users may have different interests in the keyword ratings of objects. We first discuss the situation where a user expects to maximize the minimum keyword rating of the objects in a BKC query. We then discuss another situation in section 7, where a user expects to maximize the weighted average of keyword ratings.

The linear interpolation function [5, 16] is used to obtain the score of O, such that the score is a linear interpolation of the individually normalized diameter and minimum keyword rating of O.

O.score = score(A, B)
        = α(1 − A/max_dist) + (1 − α) B/max_rating.    (2)

A = diam(O).
B = min_{o∈O}(o.rating).

where B is the minimum keyword rating of the objects in O and α (0 ≤ α ≤ 1) is an application-specific parameter. If α = 1, the score of O is solely determined by the diameter of O; in this case, the BKC query degrades to the mCK query. If α = 0, the score of O considers only the minimum keyword rating of the objects in O. max_dist and max_rating are used to normalize the diameter and keyword rating into [0,1] respectively: max_dist is the maximum distance between any two objects in the spatial database D, and max_rating is the maximum keyword rating of objects.

Lemma 1: The score has the monotone property.

Proof: Given a set of objects Oi, suppose Oj is a subset of Oi. The diameter of Oi must be no less than that of Oj, and the minimum keyword rating of objects in Oi must be no greater than that of objects in Oj. Therefore, Oi.score ≤ Oj.score.

Definition 2 (Keyword Cover): Let T be a set of keywords {k1, · · · , kn} and O a set of objects {o1, · · · , on};


O is a keyword cover of T if each object in O is associated with one and only one keyword in T.

Definition 3 (Best Keyword Cover Query (BKC)): Given a spatial database D and a set of query keywords T, the BKC query returns a keyword cover O of T (O ⊂ D) such that O.score ≥ O′.score for any keyword cover O′ of T (O′ ⊂ D).

The notations used in this work are summarized in Table 1.

TABLE 1
Summary of Notations

Notation        Interpretation
D               A spatial database.
T               A set of query keywords.
Ok              The set of objects associated with keyword k.
ok              An object in Ok.
KCo             The set of keyword covers in each of which o is a member.
kco             A keyword cover in KCo.
lbkco           The local best keyword cover of o, i.e., the keyword cover in KCo with the highest score.
ok.NN^n_ki      ok's nth keyword nearest neighbor in query keyword ki.
KRR*k-tree      The keyword rating R*-tree of Ok.
Nk              A node of KRR*k-tree.

3 RELATED WORK

Spatial Keyword Search

Recently, spatial keyword search has received considerable attention from the research community. Some existing works focus on retrieving individual objects by specifying a query consisting of a query location and a set of query keywords (also known as a document in some contexts). Each retrieved object is associated with keywords relevant to the query keywords and is close to the query location [3, 5, 6, 8, 10, 15, 16]. Document similarity (e.g., [14]) is applied to measure the relevance between two sets of keywords.

Since it is likely that no individual object is associated with all query keywords, some other works aim to retrieve multiple objects which together cover all query keywords [4, 17, 18]. While potentially a large number of object combinations satisfy this requirement, the research problem is that the retrieved objects must have a desirable spatial relationship. In [4], the authors put forward the problem of retrieving objects which 1) cover all query keywords, 2) have minimum inter-object distance and 3) are close to a query location. The works [17, 18] study a similar problem called m Closest Keywords (mCK). mCK aims to find objects which cover all query keywords and have the minimum inter-object distance. Since no query location is given in mCK, the search space in mCK is not constrained by a query location. The problem studied in this paper is a generic version of the mCK query which also considers the keyword ratings of objects.

Access Methods

The approaches proposed by Cong et al. [5] and Li et al. [10] employ a hybrid index that augments the non-leaf nodes of an R/R*-tree with inverted indexes. The inverted index at each node refers to a pseudo-document that represents the keywords under the node. Therefore, in order to verify whether a node is relevant to a set of query keywords, the inverted index is accessed at each node to evaluate the match between the query keywords and the pseudo-document associated with the node.

In [18], the bR*-tree was proposed, where a bitmap is kept for each node instead of a pseudo-document. Each bit corresponds to a keyword. If a bit is "1", it indicates that some object(s) under the node are associated with the corresponding keyword; "0" otherwise. A bR*-tree example is shown in Figure 2(a), where a non-leaf node N has four child nodes N1, N2, N3, and N4. The bitmaps of N1, N2, N4 are 111 and the bitmap of N3 is 101. Specifically, the bitmap 101 indicates that some objects under N3 are associated with the keywords "hotel" and "restaurant" respectively, and no object under N3 is associated with the keyword "bar". The bitmap allows nodes to be combined to generate candidate keyword covers. If a node contains all query keywords, this node is a candidate keyword cover. If multiple nodes together cover all query keywords, they constitute a candidate keyword cover. Suppose the query keywords are 111. When N is visited, its child nodes N1, N2, N3, N4 are processed. N1, N2, N4 are associated with all query keywords and N3 is associated with two query keywords. The candidate keyword covers generated are {N1}, {N2}, {N4}, {N1, N2}, {N1, N3}, {N1, N4}, {N2, N3}, {N2, N4}, {N3, N4}, {N1, N2, N3}, {N1, N3, N4} and {N2, N3, N4}. Among the candidate keyword covers, the one with the best evaluation is processed by combining its child nodes to generate more candidates. However, the number of candidates generated can be very large. Thus, the depth-first bR*-tree browsing strategy is applied in order to access the objects in leaf nodes as soon as possible. The purpose is to obtain the current best solution as soon as possible. The current best solution is used to prune candidate keyword covers. In the same way, the remaining candidates are processed and the current best solution is updated once a better solution is identified. When all candidates have been pruned, the current best solution is returned to the mCK query.
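Representing each node's bitmap as an integer, the cover test is a single bitwise check. The sketch below (names are illustrative, not from the paper) enumerates every node subset whose combined bitmap covers the query; the actual bR*-tree algorithm generates such combinations incrementally with pruning rather than exhaustively:

```python
from itertools import combinations

def candidate_covers(node_bitmaps, query):
    """Enumerate node subsets whose OR'ed keyword bitmaps cover `query`.
    node_bitmaps: dict mapping node name -> int bitmap (one bit per keyword).
    query: int bitmap of the query keywords."""
    names = sorted(node_bitmaps)
    covers = []
    for r in range(1, len(names) + 1):
        for combo in combinations(names, r):
            mask = 0
            for n in combo:
                mask |= node_bitmaps[n]
            if mask & query == query:  # all query bits present
                covers.append(set(combo))
    return covers
```

For the example above (N1, N2, N4 with bitmap 111, N3 with 101, query 111), every subset containing at least one of N1, N2, N4 is a cover, while {N3} alone is not.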

In [17], a virtual bR*-tree based method is introduced to handle the mCK query, with the aim of handling data sets with massive numbers of keywords. Compared to the method in [18], a different index structure is utilized. In the virtual bR*-tree based method, an R*-tree is used to index the locations of objects and an inverted index is used to label the leaf nodes in the R*-tree associated with each keyword. Since only leaf nodes have keyword information, the mCK query is processed by browsing the index bottom-up. At first, m inverted lists corresponding to the query keywords are retrieved, and all objects from the same leaf node are fetched to construct a virtual node in memory. Clearly, it has a counterpart in the original R*-tree. Each time a virtual node is constructed, it is treated as a subtree which is browsed in the same way as in [18]. Compared to the bR*-tree, the number of nodes in the R*-tree is greatly reduced such that I/O cost is saved.


[Figure: (a) a non-leaf node N with child nodes N1-N4 carrying keyword bitmaps 111, 111, 101, 111; (b) the corresponding nodes plotted in the X-Y-Rating space for objects of keywords Hotel, Restaurant and Bar.]

Fig. 2. (a) A bR*-tree. (b) The KRR*-tree for keyword "restaurant".

As opposed to employing a single R*-tree embedded with keyword information, multiple R*-trees have been used to process multiway spatial joins (MWSJ), which involve data of different keywords (or types). Given a number of R*-trees, one for each keyword, the MWSJ technique of Papadias et al. [13] (later extended by Mamoulis and Papadias [11]) uses the synchronous R*-tree (SRT) approach [2] and the window reduction (WR) approach [12]. Given two R*-tree indexed relations, SRT performs a two-way spatial join via synchronous traversal of the two R*-trees, based on the property that if two intermediate R*-tree nodes do not satisfy the spatial join predicate, then the MBRs below them will not satisfy the spatial predicate either. WR uses window queries to identify spatial regions which may contribute to MWSJ results.

4 INDEXING KEYWORD RATINGS

To process the BKC query, we augment the R*-tree with one additional dimension to index keyword ratings. The keyword rating dimension and the spatial dimensions are inherently different measures with different ranges, so it is necessary to make an adjustment. In this work, a 3-dimensional R*-tree called the keyword rating R*-tree (KRR*-tree) is used. The ranges of both the spatial and keyword rating dimensions are normalized into [0,1]. Suppose we need to construct a KRR*-tree over a set of objects D. Each object o ∈ D is mapped into a new space using the following mapping function:

f : o(x, y, rating) → o(x/max_x, y/max_y, rating/max_rating).    (3)

where max_x, max_y and max_rating are the maximum values of the objects in D on the x, y and keyword rating dimensions respectively. In the new space, a KRR*-tree can be constructed in the same way as a conventional 3-dimensional R*-tree. Each node N in a KRR*-tree is defined as N(x, y, r, lx, ly, lr), where x is the value of N on the x axis closest to the origin (0,0,0,0,0,0) and lx is the width of N on the x axis; similarly for y, ly and r, lr. Figure 2(b) gives an example illustrating the nodes of the KRR*-tree indexing the objects with keyword "restaurant".

In [17, 18], a single tree structure is used to index the objects of different keywords. In a similar way as discussed above, the single tree can be extended with an additional dimension to index keyword ratings. A single tree structure suits the situation where most keywords are query keywords. For the above-mentioned example, all keywords, i.e., "hotel", "restaurant" and "bar", are query keywords. However, it is more common that only a small fraction of keywords are query keywords. For example, in the experiments, fewer than 5% of keywords are query keywords. In this situation, a single tree poorly approximates the spatial relationship between the objects of a few specific keywords. Therefore, multiple KRR*-trees are used in this work, one for each keyword 2. The KRR*-tree for keyword ki is denoted KRR*ki-tree.

Given an object, the rating of an associated keyword is typically the mean of the ratings given by a number of customers over a period of time. Change does happen, but slowly. Even when a dramatic change occurs, the KRR*-tree is updated in the standard way of R*-tree updates.

5 BASELINE ALGORITHM

The baseline algorithm is inspired by the mCK query processing methods [17, 18]. For mCK query processing, the method in [18] browses the index in a top-down manner while the method in [17] does so bottom-up. Given the same hierarchical index structure, top-down browsing typically performs better than bottom-up since the search in lower hierarchical levels is always guided by the search results in the higher hierarchical levels. However, a significant advantage of the method in [17] over the method in [18] has been reported. This is because of the different index structures applied. Both of them use a single tree structure to index data objects of different keywords, but the number of nodes of the index in [17] is greatly reduced, saving I/O cost, by keeping keyword information in a separate inverted index. Since only leaf nodes and their keyword information are maintained in the inverted index, the bottom-up index browsing manner is used. When designing the baseline algorithm for BKC query processing, we take advantage of both methods [17, 18]. First, we apply multiple KRR*-trees which contain no keyword information in nodes, such that the number of nodes of the index is no more than that of the index in [17]; second, the top-down index browsing method can be applied since each keyword has its own index.

Suppose the KRR*-trees, one for each keyword, have been constructed. Given a set of query keywords T = {k1, .., kn}, the child nodes of the root of each KRR*ki-tree (1 ≤ i ≤ n) are retrieved and combined to generate candidate keyword covers. Given a candidate keyword cover O = {Nk1, · · · , Nkn} where Nki is a node of

2. If the total number of objects associated with a keyword is verysmall, no index is needed for this keyword and these objects are simplyprocessed one by one.


KRR*ki-tree.

O.score = score(A, B).    (4)

A = max_{Ni,Nj∈O} dist(Ni, Nj).
B = min_{N∈O}(N.maxrating).

where N.maxrating is the maximum value of the objects under N in the keyword rating dimension, and dist(Ni, Nj) is the minimum Euclidean distance between Ni and Nj in the 2-dimensional geographical space defined by the x and y dimensions.
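Equation (4) can be sketched in Python, assuming each node carries its 2-D MBR as (x, y, lx, ly) (lower-left corner plus widths) together with its maxrating; `mbr_min_dist` and `node_score` are illustrative names:

```python
from itertools import combinations

def mbr_min_dist(a, b):
    """Minimum Euclidean distance between two MBRs given as
    (x, y, lx, ly); zero if they overlap."""
    dx = max(a[0] - (b[0] + b[2]), b[0] - (a[0] + a[2]), 0.0)
    dy = max(a[1] - (b[1] + b[3]), b[1] - (a[1] + a[3]), 0.0)
    return (dx * dx + dy * dy) ** 0.5

def node_score(nodes, alpha, max_dist, max_rating):
    """Eq. (4): node-level score; each node is ((x, y, lx, ly), maxrating).
    A = max pairwise minimum MBR distance, B = min of the maxratings."""
    A = max((mbr_min_dist(m, n)
             for (m, _), (n, _) in combinations(nodes, 2)), default=0.0)
    B = min(r for _, r in nodes)
    return alpha * (1 - A / max_dist) + (1 - alpha) * B / max_rating
```

Because A lower-bounds the diameter of any object cover drawn from these nodes and B upper-bounds its minimum rating, this node-level score upper-bounds the object-level score of Eq. (2), which is what Lemma 2 relies on for pruning.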

Lemma 2: Given two keyword covers O and O′, where O′ consists of objects {ok1, .., okn} and O consists of nodes {Nk1, .., Nkn}: if oki is under Nki in the KRR*ki-tree for 1 ≤ i ≤ n, then O′.score ≤ O.score.

Algorithm 1: Baseline(T, Root)
Input: A set of query keywords T, the root nodes of all KRR*-trees Root.
Output: Best Keyword Cover.
1  bkc ← ∅;
2  H ← Generate_Candidate(T, Root, bkc);
3  while H is not empty do
4      can ← the candidate in H with the highest score;
5      Remove can from H;
6      Depth_First_Tree_Browsing(H, T, can, bkc);
7      foreach candidate ∈ H do
8          if candidate.score ≤ bkc.score then
9              remove candidate from H;
10 return bkc;

Algorithm 2: Depth_First_Tree_Browsing(H, T, can, bkc)
Input: A set of query keywords T, a candidate can, the candidate set H, and the current best solution bkc.
1  if can consists of leaf nodes then
2      S ← objects in can;
3      bkc′ ← the keyword cover with the highest score identified in S;
4      if bkc.score < bkc′.score then
5          bkc ← bkc′;
6  else
7      New_Cans ← Generate_Candidate(T, can, bkc);
8      Replace can by New_Cans in H;
9      can ← the candidate in New_Cans with the highest score;
10     Depth_First_Tree_Browsing(H, T, can, bkc);

Algorithm 1 shows the pseudo-code of the baseline algorithm. Given a set of query keywords T, it first generates

Algorithm 3: Generate_Candidate(T, can, bkc)
Input: A set of query keywords T, a candidate can, the current best solution bkc.
Output: A set of new candidates.
1  New_Cans ← ∅;
2  COM ← combine child nodes of can to generate keyword covers;
3  foreach com ∈ COM do
4      if com.score > bkc.score then
5          New_Cans ← New_Cans ∪ {com};
6  return New_Cans;

candidate keyword covers using Generate Candidatefunction which combines the child nodes of the roots ofKRR*ki-trees for all ki ∈ T (line 2). These candidatesare maintained in a heap H . Then, the candidate withthe highest score in H is selected and its child nodes arecombined using Generate Candidate function to gener-ate more candidates. Since the number of candidates can bevery large, the depth-first KRR*ki-tree browsing strategyis applied to access the leaf nodes as soon as possible(line 6). The first candidate consisting of objects (notnodes of KRR*-tree) is the current best solution, denotedas bkc, which is an intermediate solution. According toLemma 2, the candidates in H are pruned if they have scoreless than bkc.score (line 8). The remaining candidatesare processed in the same way and bkc is updated if thebetter intermediate solution is found. Once no candidateis remained in H , the algorithm terminates by returningcurrent bkc to BKC query.

In the Generate_Candidate function, it is unnecessary to actually generate all possible keyword covers of the input nodes (or objects). In practice, the keyword covers are generated by incrementally combining individual nodes (or objects). The example in Figure 3 shows all possible combinations of input nodes generated incrementally, bottom up. There are three keywords k1, k2 and k3, and each keyword has two nodes. Due to the monotonic property in Lemma 1, the idea of the Apriori algorithm [1] can be applied. Initially, each node is a combination with score = ∞. The combination with the highest score is always processed first, by combining one more input node with it in order to cover a keyword which is not covered yet. If a combination has a score less than bkc.score, any superset of it must have a score less than bkc.score, so it is unnecessary to generate the superset. For example, if {N2k2, N2k3}.score < bkc.score, any superset of {N2k2, N2k3} must have a score less than bkc.score, so it is not necessary to generate {N2k2, N2k3, N1k1} or {N2k2, N2k3, N2k1}.
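This Apriori-style generation can be sketched as follows. This is a hypothetical illustration, not the authors' code: combo_score is an assumed monotonic scoring oracle (extending a combination never raises its score, per Lemma 1), and partial combinations are extended one query keyword at a time, best first, so that pruning a partial combination discards all of its supersets.

```python
# Sketch of Apriori-style candidate generation: partial combinations are
# kept in a max-heap (simulated by negating scores in heapq's min-heap) and
# extended one query keyword at a time. Any partial combination scoring at
# or below bkc_score is dropped together with all its unseen supersets.
import heapq

def generate_candidates(nodes_by_keyword, combo_score, bkc_score):
    keywords = list(nodes_by_keyword)
    full, heap = [], []
    counter = 0  # tie-breaker so the heap never compares combinations
    heapq.heappush(heap, (-float("inf"), counter, ()))
    while heap:
        neg_s, _, combo = heapq.heappop(heap)
        if -neg_s <= bkc_score:          # prune combo and all supersets
            continue
        if len(combo) == len(keywords):  # covers every query keyword
            full.append((combo, -neg_s))
            continue
        kw = keywords[len(combo)]        # next uncovered keyword
        for node in nodes_by_keyword[kw]:
            ext = combo + (node,)
            counter += 1
            heapq.heappush(heap, (-combo_score(ext), counter, ext))
    return full
```

With a toy monotonic score such as the minimum element of a combination, a prefix that falls below bkc_score is never extended, exactly as in the {N2k2, N2k3} example above.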

6 KEYWORD NEAREST NEIGHBOR EXPANSION (KEYWORD-NNE)

Using the baseline algorithm, the BKC query can be effectively resolved. However, it is based on exhaustively


Fig. 3. Generating candidates. (Figure: the lattice of combinations of the six input nodes N1k1, N2k1, N1k2, N2k2, N1k3 and N2k3, built incrementally bottom up.)

combining objects (or their MBRs). Even though pruning techniques have been explored, the performance has been observed to drop dramatically when the number of query keywords increases, because of the fast growth in the number of candidate keyword covers generated. This motivates us to develop a different algorithm called keyword nearest neighbor expansion (keyword-NNE).

We focus on a particular query keyword, called the principal query keyword. The objects associated with the principal query keyword are called principal objects. Let k be the principal query keyword. The set of principal objects is denoted as Ok.

Definition 4 (Local Best Keyword Cover): Given a set of query keywords T and the principal query keyword k ∈ T, the local best keyword cover of a principal object ok is

    lbkcok = {kcok | kcok ∈ KCok, kcok.score = max_{kc ∈ KCok} kc.score}.   (5)

where KCok is the set of keyword covers in each of which the principal object ok is a member.

For each principal object ok ∈ Ok, lbkcok is identified. Among all principal objects, the lbkcok with the highest score is called the global best keyword cover (GBKCk).

Lemma 3: GBKCk is the solution of the BKC query.

Proof: Assume the solution of the BKC query is a keyword cover kc other than GBKCk, i.e., kc.score > GBKCk.score. Let ok be the principal object in kc. By definition, lbkcok.score ≥ kc.score and GBKCk.score ≥ lbkcok.score. So GBKCk.score ≥ kc.score must be true. This contradicts the assumption that the solution is a keyword cover kc other than GBKCk.

The sketch of the keyword-NNE algorithm is as follows:

Sketch of Keyword-NNE Algorithm
Step 1. One query keyword k ∈ T is selected as the principal query keyword;
Step 2. For each principal object ok ∈ Ok, lbkcok is computed;
Step 3. In Ok, GBKCk is identified;
Step 4. Return GBKCk.

Conceptually, any query keyword can be selected as theprincipal query keyword. Since computing lbkc is requiredfor each principal object, the query keyword with theminimum number of objects is selected as the principalquery keyword in order to achieve high performance.

6.1 LBKC Computation

Given a principal object ok, lbkcok consists of ok and the objects in each non-principal query keyword which are close to ok and have high keyword ratings. This motivates us to compute lbkcok by incrementally retrieving the keyword nearest neighbors of ok.

6.1.1 Keyword Nearest Neighbor

Definition 5 (Keyword Nearest Neighbor (Keyword-NN)): Given a set of query keywords T, let the principal query keyword be k ∈ T and a non-principal query keyword be ki ∈ T/{k}. Ok is the set of principal objects and Oki is the set of objects of keyword ki. The keyword nearest neighbor of a principal object ok ∈ Ok in keyword ki is oki ∈ Oki iff {ok, oki}.score ≥ {ok, o′ki}.score for all o′ki ∈ Oki.

The first keyword-NN of ok in keyword ki is denoted as ok.nn^1_ki, the second keyword-NN as ok.nn^2_ki, and so on. These keyword-NNs can be retrieved by browsing the KRR*ki-tree. Let Nki be a node in the KRR*ki-tree.

    {ok, Nki}.score = score(A, B),                       (6)
    A = dist(ok, Nki),
    B = min(Nki.maxrating, ok.rating).

where dist(ok, Nki) is the minimum distance between ok and Nki in the 2-dimensional geographical space defined by the x and y dimensions, and Nki.maxrating is the maximum value of Nki in the keyword rating dimension.

Lemma 4: For any object oki under node Nki in the KRR*ki-tree,

    {ok, Nki}.score ≥ {ok, oki}.score.                   (7)

Proof: It is a special case of Lemma 2.

To retrieve the keyword-NNs of a principal object ok in keyword ki, the KRR*ki-tree is browsed in the best-first strategy [9]. The root node of the KRR*ki-tree is visited first by keeping its child nodes in a heap H. For each node Nki ∈ H, {ok, Nki}.score is computed. The node in H with the highest score is replaced by its child nodes. This operation is repeated until an object oki (not a KRR*ki-tree node) is visited. {ok, oki}.score is denoted as current_best and oki is the current best object. According to Lemma 4, any node Nki ∈ H is pruned if {ok, Nki}.score ≤ current_best. When H is empty, the current best object is ok.nn^1_ki. In a similar way, ok.nn^j_ki (j > 1) can be identified.
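The best-first retrieval can be sketched as follows, under illustrative assumptions: each tree entry is a tuple (score, kind, payload), where an internal node's score plays the role of the upper bound {ok, Nki}.score of Lemma 4. One simplification relative to the text: with a max-heap, the first object popped already beats the upper bounds of everything left in the heap, so it is the keyword-NN directly and the explicit current_best bookkeeping can be dropped.

```python
# Sketch of best-first keyword-NN retrieval over a KRR*-tree. Entries are
# (score, "node", [children]) or (score, "obj", object); a node's score
# upper-bounds the pair score of every object beneath it (Lemma 4).
import heapq

def keyword_nn(root):
    counter = 0  # tie-breaker so heapq never compares payloads
    heap = [(-root[0], counter, root)]
    while heap:
        _, _, (s, kind, payload) = heapq.heappop(heap)
        if kind == "obj":
            return payload, s            # first object popped is ok.nn1
        for child in payload:            # replace the node by its children
            counter += 1
            heapq.heappush(heap, (-child[0], counter, child))
    return None
```

Subsequent keyword-NNs (ok.nn^2_ki, ...) would be obtained by continuing to pop from the same heap instead of returning.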

6.1.2 lbkc Computing Algorithm

Computing lbkcok is to incrementally retrieve the keyword-NNs of ok in each non-principal query keyword. An example is shown in Figure 4 where the query keywords


Fig. 4. An example of lbkc computation. (Figure: a map of "hotel", "restaurant" and "bar" objects with their keyword ratings, and the trace of computing lbkct3 by incrementally retrieving the keyword-NNs of t3:

step | keyword-NN retrieved                        | S                      | kc.score
1    | {t3, s3}.score = 0.8, {t3, c2}.score = 0.7  | t3, s3, c2             | 0.3
2    | {t3, s2}.score = 0.28                       | t3, s3, s2, c2         | 0.3
3    | {t3, c3}.score = 0.6                        | t3, s3, s2, c2, c3     | 0.4
4    | {t3, c4}.score = 0.35                       | t3, s3, s2, c2, c3, c4 | 0.4
)

are "bar", "restaurant" and "hotel". The principal query keyword is "bar". Suppose we are computing lbkct3. The first keyword-NNs of t3 in "restaurant" and "hotel" are c2 and s3 respectively. A set S is used to keep {t3, s3, c2}. Let kc be the keyword cover in S which has the highest score (the idea of the Apriori algorithm can be used, see Section 5). After step 1, kc.score = 0.3. In step 2, "hotel" is selected and the second keyword-NN of t3 in "hotel" is retrieved, i.e., s2. Since {t3, s2}.score < kc.score, s2 can be pruned and, more importantly, all objects not yet accessed in "hotel" can be pruned according to Lemma 5. In step 3, the second keyword-NN of t3 in "restaurant" is retrieved, i.e., c3. Since {t3, c3}.score > kc.score, c3 is inserted into S. As a result, kc.score is updated to 0.4. Then the third keyword-NN of t3 in "restaurant" is retrieved, i.e., c4. Since {t3, c4}.score < kc.score, c4 and all objects not yet accessed in "restaurant" can be pruned according to Lemma 5. At this point, the current kc is lbkct3.

Lemma 5: If kc.score > {ok, ok.nn^t_ki}.score, then ok.nn^t_ki and ok.nn^t′_ki (t′ > t) must not be in lbkcok.

Proof: By definition, kc.score ≤ lbkcok.score. Since {ok, ok.nn^t_ki}.score < kc.score, we have {ok, ok.nn^t_ki}.score < lbkcok.score and in turn {ok, ok.nn^t′_ki}.score < lbkcok.score. If ok.nn^t_ki were in lbkcok, then {ok, ok.nn^t_ki}.score ≥ lbkcok.score according to Lemma 1, a contradiction; the same holds for ok.nn^t′_ki. Thus, ok.nn^t_ki and ok.nn^t′_ki must not be in lbkcok.

For each non-principal query keyword ki, after retrieving the first t keyword-NNs of ok in keyword ki, we use ki.score to denote {ok, ok.nn^t_ki}.score. For example, in Figure 4, "restaurant".score is 0.7, 0.6 and 0.35 after retrieving the 1st, 2nd and 3rd keyword-NN of t3 in "restaurant", respectively. From Lemma 5, we obtain:

Lemma 6: lbkcok ≡ kc once kc.score > max_{ki ∈ T/{k}}(ki.score).

Algorithm 4: Local_Best_Keyword_Cover(ok, T)
Input: A set of query keywords T, a principal object ok
Output: lbkcok
1  foreach non-principal query keyword ki ∈ T do
2      S ← retrieve ok.nn^1_ki;
3      ki.score ← {ok, ok.nn^1_ki}.score;
4      ki.n ← 1;
5  kc ← the keyword cover in S;
6  while T ≠ ∅ do
7      Find ki ∈ T/{k} such that ki.score = max_{kj ∈ T/{k}}(kj.score);
8      ki.n ← ki.n + 1;
9      S ← S ∪ retrieve ok.nn^{ki.n}_ki;
10     ki.score ← {ok, ok.nn^{ki.n}_ki}.score;
11     temp_kc ← the keyword cover in S;
12     if temp_kc.score > kc.score then
13         kc ← temp_kc;
14     foreach ki ∈ T/{k} do
15         if ki.score ≤ kc.score then
16             remove ki from T;
17 return kc;

Algorithm 4 shows the pseudo-code of the lbkcok computing algorithm. For each non-principal query keyword ki, the first keyword-NN of ok is retrieved and ki.score = {ok, ok.nn^1_ki}.score. These are kept in S, and the best keyword cover kc in S is identified using the Generate_Candidate function in Algorithm 3: the objects in different keywords are combined, and each time the most promising combination is selected for further combination until the best keyword cover is identified. When the second keyword-NN of ok in ki is retrieved, ki.score is updated to {ok, ok.nn^2_ki}.score, and so on. Each time, one non-principal query keyword is selected in which to search for the next keyword-NN. Note that we always select the keyword ki ∈ T/{k} where ki.score = max_{kj ∈ T/{k}}(kj.score) in order to minimize the number of keyword-NNs retrieved (line 7). After the next keyword-NN of ok in this keyword is retrieved, it is inserted into S and kc is updated. If ki.score ≤ kc.score, all remaining objects in ki can be pruned by deleting ki from T according to Lemma 5. When T is empty, kc is returned as lbkcok according to Lemma 6.
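A toy sketch of Algorithm 4's selection and termination logic follows (hypothetical code, not the paper's): nn_streams replaces incremental keyword-NN retrieval with precomputed per-keyword lists of (pair score, object) in descending order, and best_cover_score stands in for the Generate_Candidate step; only the lbkc score is returned.

```python
# Sketch of Algorithm 4: keyword-NNs of ok are drawn from per-keyword
# streams, always from the keyword whose last retrieved pair score is the
# largest (line 7), until Lemma 5 prunes every remaining keyword, i.e.
# until each ki.score drops to kc.score or below (Lemma 6).
def local_best_keyword_cover(ok, nn_streams, best_cover_score):
    S = {ok}
    ki_score, pos = {}, {}
    for ki, stream in nn_streams.items():   # first keyword-NN per keyword
        score, obj = stream[0]
        S.add(obj)
        ki_score[ki], pos[ki] = score, 1
    kc_score = best_cover_score(S)
    active = {ki for ki in nn_streams if ki_score[ki] > kc_score}
    while active:
        ki = max(active, key=lambda k: ki_score[k])
        if pos[ki] < len(nn_streams[ki]):
            score, obj = nn_streams[ki][pos[ki]]
            pos[ki] += 1
            ki_score[ki] = score
            S.add(obj)
            kc_score = max(kc_score, best_cover_score(S))
        else:
            ki_score[ki] = float("-inf")    # stream exhausted
        # Lemma 5: drop every keyword whose last pair score is beaten
        for kj in list(active):
            if ki_score[kj] <= kc_score:
                active.discard(kj)
    return kc_score
```

Run on streams mimicking the Figure 4 example (pair scores 0.8, 0.28 for "hotel" and 0.7, 0.6, 0.35 for "restaurant"), the sketch terminates with kc.score = 0.4, matching the worked trace.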

6.2 Keyword-NNE Algorithm

In the keyword-NNE algorithm, the principal objects are processed in blocks instead of individually. Let k be the principal query keyword. The principal objects are indexed using the KRR*k-tree. Given a node Nk in the KRR*k-tree, also known as a principal node, the local best keyword cover of


Nk, lbkcNk, consists of Nk and the corresponding nodes of Nk in each non-principal query keyword.

Definition 6 (Corresponding Node): Let Nk be a node of the KRR*k-tree at hierarchical level i. Given a non-principal query keyword ki, the corresponding nodes of Nk are the nodes in the KRR*ki-tree at hierarchical level i.

The root of a KRR*-tree is at hierarchical level 1, its child nodes are at hierarchical level 2, and so on. For example, if Nk is a node at hierarchical level 4 in the KRR*k-tree, the corresponding nodes of Nk in keyword ki are the nodes at hierarchical level 4 in the KRR*ki-tree. From the corresponding nodes, the keyword-NNs of Nk are retrieved incrementally for computing lbkcNk.

Lemma 7: If a principal object ok is an object under a principal node Nk in the KRR*k-tree, then

    lbkcNk.score ≥ lbkcok.score.

Proof: Suppose lbkcNk = {Nk, Nk1, .., Nkn} and lbkcok = {ok, ok1, .., okn}. For each non-principal query keyword ki, oki is under a corresponding node of Nk, say N′ki. Note that N′ki may or may not be in lbkcNk. By definition,

    lbkcNk.score ≥ {Nk, N′k1, .., N′kn}.score.

According to Lemma 2,

    {Nk, N′k1, .., N′kn}.score ≥ lbkcok.score.

So, we have

    lbkcNk.score ≥ lbkcok.score.

The lemma is proved.

The pseudo-code of the keyword-NNE algorithm is presented in Algorithm 5. The keyword-NNE algorithm starts by selecting a principal query keyword k ∈ T (line 2). Then the root node of the KRR*k-tree is visited by keeping its child nodes in a heap H. For each node Nk in H, lbkcNk.score is computed (line 5). In H, the entry with the maximum score, denoted H.head, is processed. If H.head is a node of the KRR*k-tree (lines 8-14), it is replaced in H by its child nodes; for each child node Nk, lbkcNk.score is computed, and H.head is updated correspondingly. If H.head is a principal object ok rather than a node in the KRR*k-tree (lines 15-21), lbkcok is computed. If lbkcok.score is greater than the score of the current best solution bkc (bkc.score = 0 initially), bkc is updated to lbkcok. Any Nk ∈ H is pruned if lbkcNk.score ≤ bkc.score, since lbkcok.score ≤ lbkcNk.score for every ok under Nk in the KRR*k-tree according to Lemma 7. Once H is empty, bkc is returned as the answer to the BKC query.
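The outer loop of Algorithm 5 can be condensed as below; this is a sketch under simplifying assumptions: lbkc scores are embedded in the entries rather than computed from corresponding nodes, and a node's embedded score upper-bounds the lbkc scores of the principal objects beneath it, per Lemma 7.

```python
# Sketch of the Algorithm 5 loop: a max-heap of principal entries ordered
# by lbkc score. Entries are ("node", score, [children]) for internal
# KRR*k-tree nodes and ("obj", score, ok) for principal objects; popped
# entries scoring at or below bkc are pruned (Lemma 7).
import heapq

def keyword_nne(root_children):
    heap, counter, bkc = [], 0, (None, 0.0)
    for e in root_children:
        counter += 1
        heapq.heappush(heap, (-e[1], counter, e))
    while heap:
        _, _, (kind, s, payload) = heapq.heappop(heap)
        if s <= bkc[1]:
            continue                      # Lemma 7 pruning
        if kind == "obj":
            bkc = (payload, s)            # lbkc of this principal object
        else:
            for child in payload:         # expand the principal node
                counter += 1
                heapq.heappush(heap, (-child[1], counter, child))
    return bkc
```

Because the heap is score-ordered, a subtree is only ever expanded while its lbkc upper bound still exceeds the current bkc, which is what keeps the number of further-processed candidates small.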

7 WEIGHTED AVERAGE OF KEYWORD RATINGS

To this point, the minimum keyword rating of the objects in O is used in O.score. However, it is unsurprising that a user may prefer the weighted average of the keyword ratings of the objects in O to measure O.score:

    O.score = α × (1 − diam(O)/max_dist) + (1 − α) × W_Average(O)/max_rating.   (8)

Algorithm 5: Keyword-NNE(T, D)
Input: A set of query keywords T, a spatial database D
Output: Best Keyword Cover
1  bkc.score ← 0;
2  k ← select the principal query keyword from T;
3  H ← child nodes of the root in KRR*k-tree;
4  foreach Nk ∈ H do
5      Compute lbkcNk.score;
6  H.head ← Nk ∈ H with max_{Nk∈H} lbkcNk.score;
7  while H ≠ ∅ do
8      while H.head is a node in KRR*k-tree do
9          N ← child nodes of H.head;
10         foreach Nk in N do
11             Compute lbkcNk.score;
12             Insert Nk into H;
13         Remove H.head from H;
14         H.head ← Nk ∈ H with max_{Nk∈H} lbkcNk.score;
       /* H.head is a principal object (i.e., not a node in KRR*k-tree) */
15     ok ← H.head;
16     Compute lbkcok.score;
17     if bkc.score < lbkcok.score then
18         bkc ← lbkcok;
19     foreach Nk in H do
20         if lbkcNk.score ≤ bkc.score then
21             Remove Nk from H;
22 return bkc;

where W_Average(O) is defined as:

    W_Average(O) = (Σ_{oki ∈ O} wki · oki.rating) / |O|.   (9)

where wki is the weight associated with the query keyword ki and Σ_{ki ∈ T} wki = 1. For example, a user may give a higher weight to "hotel" but a lower weight to "restaurant" in a BKC query. Given the score function in Equation (8), the baseline algorithm and the keyword-NNE algorithm can be used to process the BKC query with minor modification. The core is to maintain the properties in Lemma 1 and Lemma 2, which are the foundation of the pruning techniques in the baseline algorithm and the keyword-NNE algorithm.

However, the property in Lemma 1 is invalid given the score function defined in Equation (9). To maintain this property, if a combination does not cover a query keyword ki, the combination is modified by inserting a virtual object associated with ki. This virtual object does not change the diameter of the combination, but it has the maximum rating of ki (for a combination of nodes, a virtual node


Fig. 5. BF-baseline vs. keyword-NNE. (Figure: for a principal node NK of the KRR*k-tree and a child node cNK, the candidate keyword covers sharing the same principal node form a group, e.g., GNk = {kc1, kc2, kc3, kc4}; the covers with score > BKC.score form G1_Nk and those with score ≤ BKC.score form G2_Nk, and similarly G1_cNk and G2_cNk for cNK. The BF-baseline algorithm further processes every cover in G1_Nk, while the keyword-NNE algorithm further processes only lbkcNk.)

is inserted in a similar way). W_Average(O) is redefined as W_Average*(O):

    W_Average*(O) = (E + F) / |T|.                       (10)
    E = Σ_{oki ∈ O} wki · oki.rating,
    F = Σ_{kj ∈ T/O.T} wkj · Okj.maxrating.

where T/O.T is the set of query keywords not covered by O, and Okj.maxrating is the maximum rating of the objects in Okj. For example, in Figure 1, suppose the query keywords are "restaurant", "hotel" and "bar". For a combination O = {t1, c1}, W_Average(O) = (wt · t1.rating + wc · c1.rating)/|O| while W_Average*(O) = (wt · t1.rating + wc · c1.rating + ws · max_ratings)/|T|, where wt, wc and ws are the weights assigned to "bar", "restaurant" and "hotel" respectively, and max_ratings is the highest keyword rating of the objects in "hotel".
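Equations (9) and (10) can be put side by side as a short hypothetical sketch (the weights and ratings below are illustrative; O maps each covered query keyword to the rating of its chosen object):

```python
# Sketch of Equation (9) and Equation (10): W_Average over a (possibly
# partial) combination O, and W_Average*, which pads each uncovered query
# keyword with a virtual object carrying that keyword's maximum rating so
# that the monotonicity of Lemma 1 needed for pruning is restored.
def w_average(O, weights):
    """Equation (9): O maps covered keyword -> rating of the chosen object."""
    return sum(weights[k] * r for k, r in O.items()) / len(O)

def w_average_star(O, weights, query_keywords, max_rating_of):
    """Equation (10): uncovered keywords contribute their maximum rating."""
    covered = sum(weights[k] * r for k, r in O.items())
    virtual = sum(weights[k] * max_rating_of[k]
                  for k in query_keywords if k not in O)
    return (covered + virtual) / len(query_keywords)
```

Because the virtual contribution of an uncovered keyword is the largest any real object of that keyword could provide, extending O with a real object can only keep W_Average* the same or lower it, which is the point of the construction.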

Given O.score with W_Average*(O), it is easy to prove that the property in Lemma 1 is valid. Note that the purpose of W_Average*(O) is to enable the pruning techniques in the baseline algorithm and the keyword-NNE algorithm; it does not affect the correctness of the algorithms. In addition, the property in Lemma 2 is valid no matter which of W_Average*(O) and W_Average(O) is used in O.score.

8 ANALYSIS

To help the analysis, we consider a special baseline algorithm, BF-baseline, which is similar to the baseline algorithm but applies the best-first KRR*-tree browsing strategy. For each query keyword, the child nodes of the KRR*-tree root are retrieved. The child nodes from different query keywords are combined to generate candidate keyword covers (in the same way as in the baseline algorithm, see Section 5), which are stored in a heap H. The candidate kc ∈ H with the maximum score is processed by retrieving the child nodes of kc. Then the child nodes of kc are combined to generate more candidates, which replace kc in H. This process continues until a keyword cover consisting of objects only is obtained. This keyword cover is the current best solution bkc. Any candidate kc ∈ H is pruned if kc.score ≤ bkc.score. The remaining candidates in H are processed in the same way. Once H is empty, the current bkc is returned to the BKC query. In the BF-baseline algorithm, a candidate keyword cover kc is further processed, by retrieving its child nodes and combining them to generate more candidates, only if kc.score > BKC.score.
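The BF-baseline loop just described can be condensed as a sketch (not the authors' code): expand(can) is an assumed helper that returns ('solution', score) when can consists of objects only, and ('children', [(upper-bound score, child), ...]) otherwise, with child bounds never exceeding the parent's (Lemma 2).

```python
# Sketch of the BF-baseline loop: a single global best-first heap over
# candidate keyword covers; candidates scoring at or below the current
# best object-level solution bkc are pruned (Lemma 2).
import heapq

def bf_baseline(initial_candidates, expand):
    heap, counter, bkc = [], 0, 0.0
    for s, can in initial_candidates:
        counter += 1
        heapq.heappush(heap, (-s, counter, can))
    while heap:
        neg_s, _, can = heapq.heappop(heap)
        if -neg_s <= bkc:
            continue                       # Lemma 2 pruning
        kind, payload = expand(can)
        if kind == "solution":
            bkc = max(bkc, payload)        # object-level keyword cover
        else:
            for s, child in payload:
                counter += 1
                heapq.heappush(heap, (-s, counter, child))
    return bkc
```

The memory issue discussed next is visible here: until the first object-level solution is popped, nothing is pruned and every generated candidate sits in the heap.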

8.1 Baseline

However, the BF-baseline algorithm is not feasible in practice. The main reason is that it must maintain H in memory, and the peak size of H can be very large because of the exhaustive combination performed before the first current best solution bkc is obtained. To relieve this memory bottleneck, the depth-first browsing strategy is applied in the baseline algorithm so that the current best solution is obtained as soon as possible (see Section 5). Compared to the best-first browsing strategy, which is globally optimal, the depth-first browsing strategy is a kind of greedy algorithm which is locally optimal. As a consequence, a candidate keyword cover kc is further processed, by retrieving its child nodes and combining them to generate more candidates, whenever kc.score > bkc.score. Note that bkc.score increases from 0 to BKC.score during the baseline algorithm. Therefore, the candidate keyword covers which are further processed in the baseline algorithm can be many more than those in the BF-baseline algorithm.

Given a candidate keyword cover kc, it is further processed in the same way in both the baseline algorithm and the BF-baseline algorithm, i.e., by retrieving the child nodes of kc and combining them to generate more candidates using the Generate_Candidate function in Algorithm 3. Since the candidate keyword covers further processed in the baseline algorithm can be many more than those in the BF-baseline algorithm, the total number of candidate keyword covers generated in the baseline algorithm can be much larger than that in the BF-baseline algorithm.

Note that this analysis captures the key characteristics of the baseline algorithm in BKC query processing, which are inherited from the methods for mCK query processing [17, 18]. The analysis remains valid when the methods of [17, 18] are directly extended to process the BKC query as introduced in Section 4.

8.2 Keyword-NNE

In the keyword-NNE algorithm, the best-first browsing strategy is applied as in BF-baseline, but the large memory requirement is avoided. For a better explanation, we can imagine


all candidate keyword covers generated in the BF-baseline algorithm being grouped into independent groups, each associated with one principal node (or object): candidate keyword covers fall in the same group if they have the same principal node (or object). Given a principal node Nk, let GNk be the associated group. The example in Figure 5 shows GNk, where some keyword covers such as kc1 and kc2 have score greater than BKC.score, denoted as G1_Nk, and some keyword covers such as kc3 and kc4 have score not greater than BKC.score, denoted as G2_Nk. In the BF-baseline algorithm, GNk is maintained in H before the first current best solution is obtained, and every keyword cover in G1_Nk needs to be further processed.

In the keyword-NNE algorithm, the keyword cover in GNk with the highest score, i.e., lbkcNk, is identified and maintained in memory. That is, each principal node (or object) keeps its lbkc only. The total number of principal nodes (or objects) is O(n log n), where n is the number of principal objects. So, the memory requirement for maintaining H is O(n log n). The (almost) linear memory requirement makes the best-first browsing strategy practical in the keyword-NNE algorithm. Due to the best-first browsing strategy, lbkcNk is further processed in the keyword-NNE algorithm only if lbkcNk.score > BKC.score.

8.2.1 Instance Optimality

Instance optimality [7] corresponds to optimality in every instance, as opposed to just the worst case or the average case. There are many algorithms that are optimal in a worst-case sense but are not instance optimal. An example is binary search. In the worst case, binary search is guaranteed to require no more than log N probes for N data items. Linear search, which scans through the sequence of data items, requires N probes in the worst case. However, binary search is not better than linear search in all instances. When the search item is in the very first position of the sequence, linear search obtains a positive answer in one probe and a negative answer in two probes, while binary search may still require log N probes.

Instance optimality can be formally defined as follows: for a class of correct algorithms A and a class of valid inputs D to the algorithms, cost(A, D) represents the amount of a resource consumed by running algorithm A ∈ A on input D ∈ D. An algorithm B ∈ A is instance optimal over A and D if cost(B, D) = O(cost(A, D)) for all A ∈ A and all D ∈ D. This cost could be the running time of algorithm A over input D.

Theorem 1: Let D be the class of all possible spatial databases where each tuple is a spatial object associated with a keyword, and let A be the class of correct BKC processing algorithms over D ∈ D in which multiple KRR*-trees, one per keyword, are explored by combining nodes at the same hierarchical level down to the leaf level, and no combination of objects (or nodes of KRR*-trees) has been pre-processed. Among all algorithms in A, the keyword-NNE algorithm is optimal in terms of the number of candidate keyword covers which are further processed.

Proof: Due to the best-first browsing strategy, lbkcNk is further processed in the keyword-NNE algorithm only if lbkcNk.score > BKC.score. In any algorithm A ∈ A, a number of candidate keyword covers need to be generated and assessed, since no combination of objects (or nodes of KRR*-trees) has been pre-processed. Given a node (or object) N, the candidate keyword covers generated can be organized in a group if they contain N. In this group, if one keyword cover has score greater than BKC.score, the possibility exists that the solution of the BKC query is related to this group. In this case, A needs to process at least one keyword cover in this group; if A fails to do so, it may return an incorrect solution. That is, no algorithm in A can process fewer candidate keyword covers than the keyword-NNE algorithm.

8.2.2 Candidate Keyword Covers Processing

Every candidate keyword cover in G1_Nk is further processed in the BF-baseline algorithm. In the example in Figure 5, kc1 is further processed, as is every kc ∈ G1_Nk. Let us look closer at the processing of kc1 = {Nk, Nk1, Nk2}. As introduced in Section 4, each node N in a KRR*-tree is defined as N(x, y, r, lx, ly, lr), which can be represented with 48 bytes. If the disk pagesize is 4096 bytes, the reasonable fan-out of a KRR*-tree is 40-50. That is, each node in kc1 (i.e., Nk, Nk1 and Nk2) has 40-50 child nodes. When kc1 is processed in the BF-baseline algorithm, these child nodes are combined to generate candidate keyword covers using Algorithm 3.

In the keyword-NNE algorithm, one and only one keyword cover in G1_Nk, i.e., lbkcNk, is further processed. For each child node cNk of Nk, lbkccNk is computed. For computing lbkccNk, a number of keyword-NNs of cNk are retrieved and combined to generate more candidate keyword covers using Algorithm 3. The experiments on real data sets illustrate that only 2-4 keyword-NNs on average in each non-principal query keyword are retrieved in the lbkccNk computation.

When further processing a candidate keyword cover, the keyword-NNE algorithm typically generates far fewer new candidate keyword covers than the BF-baseline algorithm. Since the number of candidate keyword covers further processed in the keyword-NNE algorithm is optimal (Theorem 1), the number of keyword covers generated in the BF-baseline algorithm is much larger than that in the keyword-NNE algorithm. In turn, we conclude that the number of keyword covers generated in the baseline algorithm is much larger than that in the keyword-NNE algorithm. This conclusion is independent of the principal query keyword, since the analysis does not apply any constraint on the strategy for selecting the principal query keyword.

9 EXPERIMENT

In this section we experimentally evaluate the keyword-NNE algorithm and the baseline algorithm. We use four real data sets, namely Yelp, Yellow Page, AU, and US. Specifically, Yelp is a data set extracted from the Yelp Academic Dataset (www.yelp.com) which contains 7707 POIs (i.e., points


of interest, which are equivalent to the objects in this work) with 27 keywords, where the average, maximum and minimum numbers of POIs per keyword are 285, 1353 and 120 respectively. Yellow Page is a data set obtained from yellowpage.com.au in Sydney which contains 30444 POIs with 26 keywords, where the average, maximum and minimum numbers of POIs per keyword are 1170, 10284 and 154 respectively. All POIs in Yelp have been rated by customers from 1 to 10. About half of the POIs in Yellow Page have been rated by Yelp; the unrated POIs are assigned the average rating 5. AU and US are extracted from a public source 3. AU contains 678581 POIs in Australia with 187 keywords, where the average, maximum and minimum numbers of POIs per keyword are 3728, 53956 and 403 respectively. US contains 1541124 POIs with 237 keywords, where the average, maximum and minimum numbers of POIs per keyword are 6502, 122669 and 400 respectively. In AU and US, keyword ratings from 1 to 10 are randomly assigned to POIs. The ratings follow a normal distribution with mean µ = 5 and standard deviation σ = 1. The distributions of POIs over keywords are illustrated in Figure 6. For each data set, the POIs of each keyword are indexed using a KRR*-tree.

Fig. 6. The distribution of keyword size in test data sets. (Panels show the keyword-size distributions of Yelp, Yellow Page, AU and US, with averages 285, 1170, 3728 and 6502 respectively.)

Fig. 7. Baseline, Virtual bR*-tree and bR*-tree: (a) Yellow Page and Yelp; (b) AU and US.

We are interested in 1) the number of candidate keywordcovers generated, 2) BKC query response time, 3) the

3. http://s3.amazonaws.com/simplegeo-public/places_dump_20110628.zip

maximum memory consumed, and 4) the average number of keyword-NNs of each principal node (or object) retrieved for computing lbkc, and the number of lbkcs computed for answering the BKC query. In addition, we test the performance when the weighted average of keyword ratings is applied, as discussed in Section 7. All algorithms are implemented in Java 1.7.0, and all experiments have been performed on a Windows XP PC with a 3.0 GHz CPU and 3 GB main memory.

In Figure 7, the number of keyword covers generated by the baseline algorithm is compared to that of the algorithms directly extended from [17, 18] when the number of query keywords m changes from 2 to 9. It shows that the baseline algorithm has better performance in all settings. This is consistent with the analysis in section 5. The test results on the Yellow Page and Yelp data sets, which represent data sets with a small number of keywords, are shown in Figure 7 (a). The test results on the AU and US data sets, which represent data sets with a large number of keywords, are shown in Figure 7 (b). As observed, when the number of keywords in a data set is small, the difference between the baseline algorithm and the directly extended algorithms is reduced. The reason is that the single tree index in the directly extended algorithms has more pruning power in this case (as discussed in section 4).

9.1 Effect of m

The number of query keywords m has a significant impact on query processing efficiency. In this test, m is changed from 2 to 9 with α = 0.4. Each BKC query is generated by randomly selecting m keywords from all keywords as the query keywords. For each setting, we generate and perform 100 BKC queries, and the averaged results are reported 4. Figure 8 shows the number of candidate keyword covers generated during BKC query processing. When m increases, the number of candidate keyword covers generated increases dramatically in the baseline algorithm. In contrast, the keyword-NNE algorithm shows much better scalability. The reason has been explained in section 8.
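The query-generation procedure described above can be sketched as follows; `average_over_queries` and the `run_query` callback are hypothetical names introduced for illustration.

```python
import random

def average_over_queries(all_keywords, m, num_queries, run_query):
    """Hypothetical experiment driver: issue num_queries BKC queries,
    each over m keywords sampled uniformly without replacement, and
    average the measurement (e.g., candidate count or response time)
    returned by run_query."""
    total = 0.0
    for _ in range(num_queries):
        query_keywords = random.sample(all_keywords, m)
        total += run_query(query_keywords)
    return total / num_queries
```

For example, passing a `run_query` that returns the number of candidate keyword covers generated would reproduce one data point of Figure 8.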

Figure 9 reports the average response time of BKC queries when m changes. The response time is closely related to the number of candidate keyword covers generated during query processing. In the baseline algorithm, the response time increases very fast when m increases. This is consistent with the fast increase of the number of candidate keyword covers generated. Compared to the baseline algorithm, the keyword-NNE algorithm shows a much slower increase when m increases.

When processing BKC queries with the same settings of m and α, the performance differs on the datasets US, AU, Yelp and Yellow Page, as shown in Figure 8 and Figure 9. The reason is that the average numbers of objects per keyword in the datasets US, AU, Yelp and Yellow Page are 6502, 3728, 1170 and 285 respectively, as shown in Figure 6; in turn, the average numbers of objects in one query keyword in the BKC queries on the datasets US,

4. In this work, all experimental results are obtained in the same way.


(a) (b) (c) (d)

Fig. 8. Number of candidate keyword covers generated vs. m (α=0.4).

(a) (b) (c) (d)

Fig. 9. Response time vs. m (α=0.4).

(a) (b) (c) (d)

Fig. 10. Number of candidate keyword covers generated vs. α (m=4).

(a) (b) (c) (d)

Fig. 11. Response time vs. α (m=4).

AU, Yelp and Yellow Page are expected to be the same. The experimental results show that a higher average number, such as on dataset US, leads to more candidate keyword covers and more processing time.

9.2 Effect of α

This test shows the impact of α on performance. As shown in Equation (2), α is an application-specific parameter that balances the weights of keyword rating and diameter in the score function. Compared to m, the impact of α on performance is limited. When α = 1, the BKC query degenerates to the mCK query, where the distance between objects is the sole factor and keyword rating is ignored. When α changes from 1 to 0, more weight is assigned to keyword rating. In Figure 10, an interesting observation is that, as α decreases, the number of keyword covers generated by both the baseline algorithm and the keyword-NNE algorithm shows a constant trend of slight decrease. The reason is that the KRR*-tree has a keyword rating dimension. Objects close to each other geographically may have very different ratings, and thus they fall in different nodes of the KRR*-tree. If more weight is assigned to keyword ratings, the KRR*-tree tends to have more pruning power by distinguishing objects that are close to each other but have different keyword ratings. As a result, fewer candidate keyword covers are generated. Figure 11 presents


the average response time of the queries, which is consistent with the number of candidate keyword covers generated.

BKC query provides robust solutions to meet various practical requirements while mCK query cannot. Suppose we have three query keywords in the Yelp dataset, namely, "bars", "hotels & travel", and "fast food". When α = 1, the BKC query (equivalent to the mCK query) returns Pita House, Scottsdale Neighborhood Trolley, and Schlotzskys (the names of the selected objects in keywords "bars", "hotels & travel", and "fast food" respectively), where the lowest keyword rating is 2.5 and the maximum distance is 0.045km. When α = 0.4, the BKC query returns The Attic, Enterprise Rent-A-Car and Chick-Fil-A, where the lowest keyword rating is 4.5 and the maximum distance is 0.662km.
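Equation (2) itself is given earlier in the paper; as a loose illustration of how α trades the diameter off against the keyword rating, one could write a normalized linear combination. This is an assumed sketch, not the paper's exact score function, and the normalization constants are hypothetical.

```python
def score(diameter, min_rating, alpha,
          max_diameter=1.0, max_rating=10.0):
    """Illustrative linear trade-off (NOT necessarily Equation (2)):
    alpha weights the normalized diameter, (1 - alpha) weights the
    normalized keyword rating.  Lower scores are better here, so a
    high rating reduces the score.  At alpha = 1 only distance
    matters, matching the degeneration of BKC to mCK in the text."""
    dist_term = diameter / max_diameter
    rating_term = 1.0 - min_rating / max_rating
    return alpha * dist_term + (1.0 - alpha) * rating_term
```

Under this sketch the Yelp example behaves as expected: with α = 1 only the 0.045km diameter matters, while with α = 0.4 a cover with rating 4.5 can beat a tighter cover with rating 2.5.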

9.3 Maximum Memory Consumption

The maximum memory consumed by the baseline algorithm and the keyword-NNE algorithm is reported in Figure 12 (the average results of 100 BKC queries on each of the four data sets). It shows that the maximum memory consumed by the keyword-NNE algorithm is up to 0.5MB in all settings of m, while it increases very fast with m in the baseline algorithm. As discussed in section 8, the keyword-NNE algorithm only maintains the principal nodes (or objects) in memory while the baseline algorithm maintains candidate keyword covers in memory.

Fig. 12. Maximum memory consumed vs. m (α=0.4).

9.4 Keyword-NNE

The high performance of the keyword-NNE algorithm is due to the fact that each principal node (or object) only retrieves a few keyword-NNs in each non-principal query keyword. Suppose all retrieved keyword-NNs in the keyword-NNE algorithm are kept in a set S. Figure 13 (a) shows the average size of S. The data sets are randomly sampled so that the number of objects in each query keyword of a BKC query ranges from 100 to 3000. It illustrates that the impact of the number of objects in the query keywords on the size of S is limited. In contrast, the size of S is clearly influenced by m: when m increases from 2 to 9, S grows linearly. On average, a principal node (or object) only retrieves 2-4 keyword-NNs in each non-principal query keyword. Figure 13 (b) shows the number of lbkcs computed during query processing. We can see that less than 10% of principal nodes (or objects) need to compute their lbkcs across different sizes of data sets. In other words, 90% of all principal nodes (or objects) are pruned during query processing.

(a) Average size of S vs. m. (b) Number of lbkcs computed vs. m.

Fig. 13. Features of keyword-NNE (α=0.4).

9.5 Weighted Average of Keyword Ratings

These tests compare the impact of the weighted average of keyword ratings and the minimum keyword rating on performance. The average experimental results of 100 BKC queries on each of the four data sets are reported in Figure 14. We can see that the difference between the two situations is trivial. This is because the score computation with the minimum keyword rating is fundamentally equivalent to that with the weighted average. In the former situation, if a combination O of objects (or their MBRs) does not cover a keyword, the rating of this keyword used for computing O.score is 0, while it is the maximum rating of this keyword in the latter situation.
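The two rating semantics above can be sketched as follows. The function names, the weight dictionary, and the normalization are illustrative assumptions; what the sketch preserves is that an uncovered keyword contributes rating 0 under the minimum-rating semantics and the keyword's maximum rating under the weighted-average semantics.

```python
def rating_under_min_semantics(covered, keyword_max_rating):
    """Minimum-rating variant: an uncovered keyword contributes 0,
    so a combination missing any query keyword gets the worst rating."""
    return min(covered.get(k, 0.0) for k in keyword_max_rating)

def rating_under_weighted_semantics(covered, keyword_max_rating, weights):
    """Weighted-average variant (sketch of the section-7 discussion):
    an uncovered keyword optimistically contributes that keyword's
    maximum rating."""
    total = sum(weights[k] * covered.get(k, keyword_max_rating[k])
                for k in keyword_max_rating)
    return total / sum(weights.values())
```

Here `covered` maps each covered query keyword to the rating of the selected object, and `keyword_max_rating` maps every query keyword to its maximum rating in the data set.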

(a) Response time vs. m. (b) Number of nodes visited vs. m.

Fig. 14. Weighted average vs. minimum (α=0.4).

10 CONCLUSION

Compared to the most relevant mCK query, the BKC query provides an additional dimension to support more sensible decision making. The introduced baseline algorithm is inspired by the methods for processing the mCK query. The baseline algorithm generates a large number of candidate keyword covers, which leads to a dramatic performance drop when more query keywords are given. The proposed keyword-NNE algorithm applies a different processing strategy, i.e., searching for the local best solution for each object in a certain query keyword. As a consequence, the number of candidate keyword covers generated is significantly reduced. The analysis reveals that the number of candidate


keyword covers which need to be further processed in the keyword-NNE algorithm is optimal, and processing each candidate keyword cover typically generates far fewer new candidate keyword covers in the keyword-NNE algorithm than in the baseline algorithm.

REFERENCES

[1] Rakesh Agrawal and Ramakrishnan Srikant. "Fast algorithms for mining association rules in large databases". In: VLDB. 1994, pp. 487–499.

[2] T. Brinkhoff, H. Kriegel, and B. Seeger. "Efficient processing of spatial joins using R-trees". In: SIGMOD (1993), pp. 237–246.

[3] Xin Cao, Gao Cong, and Christian S. Jensen. "Retrieving top-k prestige-based relevant spatial web objects". In: Proc. VLDB Endow. 3.1-2 (2010), pp. 373–384.

[4] Xin Cao et al. "Collective spatial keyword querying". In: ACM SIGMOD. 2011.

[5] G. Cong, C. Jensen, and D. Wu. "Efficient retrieval of the top-k most relevant spatial web objects". In: Proc. VLDB Endow. 2.1 (2009), pp. 337–348.

[6] Ian De Felipe, Vagelis Hristidis, and Naphtali Rishe. "Keyword Search on Spatial Databases". In: ICDE. 2008, pp. 656–665.

[7] R. Fagin, A. Lotem, and M. Naor. "Optimal Aggregation Algorithms for Middleware". In: Journal of Computer and System Sciences 66 (2003), pp. 614–656.

[8] Ramaswamy Hariharan et al. "Processing Spatial-Keyword (SK) Queries in Geographic Information Retrieval (GIR) Systems". In: Proceedings of the 19th International Conference on Scientific and Statistical Database Management. 2007, pp. 16–23.

[9] G. R. Hjaltason and H. Samet. "Distance browsing in spatial databases". In: TODS 2 (1999), pp. 256–318.

[10] Z. Li et al. "IR-tree: An efficient index for geographic document search". In: TKDE 99.4 (2010), pp. 585–599.

[11] N. Mamoulis and D. Papadias. "Multiway spatial joins". In: TODS 26.4 (2001), pp. 424–475.

[12] D. Papadias, N. Mamoulis, and B. Delis. "Algorithms for querying by spatial structure". In: VLDB (1998), p. 546.

[13] D. Papadias, N. Mamoulis, and Y. Theodoridis. "Processing and optimization of multiway spatial joins using R-trees". In: PODS (1999), pp. 44–55.

[14] J. M. Ponte and W. B. Croft. "A language modeling approach to information retrieval". In: SIGIR (1998), pp. 275–281.

[15] Joao B. Rocha-Junior et al. "Efficient processing of top-k spatial keyword queries". In: Proceedings of the 12th International Conference on Advances in Spatial and Temporal Databases. 2011, pp. 205–222.

[16] S. B. Roy and K. Chakrabarti. "Location-Aware Type Ahead Search on Spatial Databases: Semantics and Efficiency". In: SIGMOD (2011).

[17] Dongxiang Zhang, Beng Chin Ooi, and Anthony K. H. Tung. "Locating mapped resources in web 2.0". In: ICDE (2010).

[18] Dongxiang Zhang et al. "Keyword Search in Spatial Databases: Towards Searching by Document". In: ICDE. 2009, pp. 688–699.


Ke Deng was awarded a PhD degree in computer science from The University of Queensland, Australia, in 2007 and a Master degree in Information and Communication Technology from Griffith University, Australia, in 2001. His research background includes high performance database systems, spatiotemporal data management, data quality control and business information systems. His current research interest focuses on big spatiotemporal data management and mining.


Xin Li is a Master candidate in Computer Science at Renmin University of China. He received the Bachelor degree from the School of Information at Renmin University of China in 2007. His current research focuses on spatial databases and geo-positioning.


Jiaheng Lu is a professor in Computer Science at the Renmin University of China. He received the Ph.D. degree from the National University of Singapore in 2007 and the M.S. degree from Shanghai Jiaotong University in 2001. His research interests span many aspects of data management. His current research focuses on developing an academic search engine, XML data management, and big data management. He has served in the organization and program committees for various conferences, including SIGMOD, VLDB, ICDE and CIKM.


Xiaofang Zhou received the BS and MS degrees in computer science from Nanjing University, China, in 1984 and 1987, respectively, and the PhD degree in computer science from The University of Queensland, Australia, in 1994. He is a professor of computer science with The University of Queensland. He is the head of the Data and Knowledge Engineering Research Division, School of Information Technology and Electrical Engineering. He is also a specially appointed Adjunct Professor at Soochow University, China. From 1994 to 1999, he was a senior research scientist and project leader at CSIRO. His research is focused on finding effective and efficient solutions to managing, integrating and analyzing very large amounts of complex data for business and scientific applications. His research interests include spatial and multimedia databases, high performance query processing, web information systems, data mining, and data quality management. He is a senior member of the IEEE.