Computer Science and Engineering

Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search

Chengyuan Zhang 1, Ying Zhang 1, Wenjie Zhang 1, Xuemin Lin 2,1

1 The University of New South Wales, Australia

2 East China Normal University


Post on 14-Dec-2015



Background

An enormous number of spatio-textual objects are available in many applications

online local search, e.g., online yellow pages

social network services, e.g., Facebook, Flickr

[Figure: example objects p1 (pizza, coffee, sushi), p2 (pizza, coffee, steak), p3 (pizza, sushi), p4 (coffee, sushi), p5 (pizza, steak, seafood); query keywords: pizza, coffee]

Top k spatial keyword search (TOPK-SK)

Data: a set of spatio-textual objects

Each object is represented by a location and a set of keywords

Query: a query location (q.loc)

A set of query keywords (q.T)

Answer: the closest k objects, each of which contains all query keywords
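The TOPK-SK definition above can be read as a brute-force reference procedure: keep the objects that contain every query keyword, then take the k nearest. The sketch below is only an illustration of the definition, not the indexing technique of the talk. The class name `SpatialObject`, the function `topk_sk`, and all coordinates are assumptions made up for the example; the keyword sets reuse the pizza/coffee example.

```python
import heapq
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialObject:
    oid: str
    loc: tuple            # (x, y) location
    keywords: frozenset   # set of keywords attached to the object


def topk_sk(objects, q_loc, q_terms, k):
    """Return the ids of the k closest objects that contain all query keywords."""
    q_terms = frozenset(q_terms)
    # AND semantics: an object qualifies only if it contains every query keyword.
    matches = [o for o in objects if q_terms <= o.keywords]
    nearest = heapq.nsmallest(k, matches, key=lambda o: math.dist(o.loc, q_loc))
    return [o.oid for o in nearest]


# Coordinates are hypothetical; only the keyword sets come from the example.
objects = [
    SpatialObject("p1", (1.0, 1.0), frozenset({"pizza", "coffee", "sushi"})),
    SpatialObject("p2", (4.0, 4.0), frozenset({"pizza", "coffee", "steak"})),
    SpatialObject("p3", (0.5, 0.5), frozenset({"pizza", "sushi"})),
    SpatialObject("p4", (2.0, 0.0), frozenset({"coffee", "sushi"})),
    SpatialObject("p5", (3.0, 1.0), frozenset({"pizza", "steak", "seafood"})),
]

result = topk_sk(objects, (0.0, 0.0), {"pizza", "coffee"}, k=1)
```

With these made-up coordinates p3 is the nearest object overall, but it lacks "coffee", so the answer is p1.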

Naïve Approach

Running Example

11 spatio-textual objects

Vocabulary {t1, t2, t3}

Query q with q.T = {t1, t2} and k = 1

Objects: p1 (t1, t2), p2 (t1, t2), p3 (t1, t3), p4 (t1), p5 (t2, t3), p6 (t2, t3), p7 (t3), p8 (t3), p9 (t2), p10 (t1), p11 (t2)

Distance Order: P3, P4, P7, P8, P5, P1, P10, P9, P6, P2, P11
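As a rough illustration of the naïve approach on the running example, the sketch below visits objects in the listed distance order and stops once k matches are found. It assumes the Distance Order lists the nearest object first; the function name `naive_scan` and the access counter are introduced here only for illustration.

```python
# Keyword sets and distance order copied from the running example.
RUNNING_EXAMPLE = {
    "p1": {"t1", "t2"}, "p2": {"t1", "t2"}, "p3": {"t1", "t3"},
    "p4": {"t1"}, "p5": {"t2", "t3"}, "p6": {"t2", "t3"},
    "p7": {"t3"}, "p8": {"t3"}, "p9": {"t2"}, "p10": {"t1"}, "p11": {"t2"},
}
# Assumption: the order is nearest-first with respect to the query location.
DISTANCE_ORDER = ["p3", "p4", "p7", "p8", "p5", "p1",
                  "p10", "p9", "p6", "p2", "p11"]


def naive_scan(distance_order, keywords_of, q_terms, k):
    """Visit objects nearest-first; return (results, number of objects accessed)."""
    q_terms = set(q_terms)
    results, accessed = [], 0
    for oid in distance_order:
        accessed += 1                      # every visited object costs an access
        if q_terms <= keywords_of[oid]:    # contains all query keywords
            results.append(oid)
            if len(results) == k:
                break
    return results, accessed


results, accessed = naive_scan(DISTANCE_ORDER, RUNNING_EXAMPLE, {"t1", "t2"}, k=1)
```

Under that ordering assumption the scan returns p1 after touching six objects, five of which do not satisfy the query.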

Inverted R-tree [Y. Zhou, et al., CIKM 2005]

For each keyword t, construct an R-tree for the objects containing t

[Figure: one R-tree per keyword over the running example, each with entries E1 and E2. R1 (t1): P1, P2, P3, P4, P10. R2 (t2): P1, P2, P5, P6, P11. R3 (t3): P3, P5, P6, P7, P8, P9.]
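The per-keyword organisation can be sketched with plain Python lists standing in for the per-keyword R-trees; this is a simplification and an assumption of this example, since no R-tree is actually built. Keyword sets follow the object list of the running example, and `build_inverted_index`, `search`, and the `touched` set are hypothetical names. The point shown is that only the structures of the query keywords are consulted, while each of them is still read on its own.

```python
from collections import defaultdict

KEYWORDS_OF = {
    "p1": {"t1", "t2"}, "p2": {"t1", "t2"}, "p3": {"t1", "t3"},
    "p4": {"t1"}, "p5": {"t2", "t3"}, "p6": {"t2", "t3"},
    "p7": {"t3"}, "p8": {"t3"}, "p9": {"t2"}, "p10": {"t1"}, "p11": {"t2"},
}
DISTANCE_ORDER = ["p3", "p4", "p7", "p8", "p5", "p1",
                  "p10", "p9", "p6", "p2", "p11"]
# Rank of each object in the distance order (assumed nearest-first).
DISTANCE_RANK = {oid: rank for rank, oid in enumerate(DISTANCE_ORDER)}


def build_inverted_index(keywords_of):
    """Map each keyword to the objects containing it (stand-in for one R-tree per keyword)."""
    index = defaultdict(list)
    for oid, terms in keywords_of.items():
        for term in terms:
            index[term].append(oid)
    return index


def search(index, distance_rank, q_terms, k):
    """Read the structure of each query keyword separately, then intersect."""
    touched = set()
    candidates = None
    for term in q_terms:
        members = set(index.get(term, []))
        touched |= members
        candidates = members if candidates is None else candidates & members
    ranked = sorted(candidates or [], key=distance_rank.__getitem__)
    return ranked[:k], touched


index = build_inverted_index(KEYWORDS_OF)
result, touched = search(index, DISTANCE_RANK, ["t1", "t2"], k=1)
```

Objects that carry only t3 (p7 and p8) are never touched, but every object holding t1 or t2 is, including those that hold just one of the two.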

IR2-tree [I. D. Felipe, et al., ICDE 2008]

Index Structure

Combination of an R-tree and the signature technique

Each node contains a rectangle and a signature (a fixed-length bitmap)

Each word is hashed to a particular bit

The signature of a node is the bitwise OR of the signatures of all its child nodes
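A minimal sketch of the signature idea follows. It uses the 2-bit word-to-bit assignment that appears in the example figure of the slides (t1 to 10, t2 and t3 to 01); the function names and the grouping of objects into leaves are assumptions for illustration, and a real IR2-tree would pair each signature with a bounding rectangle.

```python
# Word-to-bit assignment as shown in the slides' example: t1 -> 10, t2 and t3 -> 01.
WORD_BITS = {"t1": 0b10, "t2": 0b01, "t3": 0b01}


def signature(words):
    """Signature of an object: OR of the bits of its words."""
    sig = 0
    for word in words:
        sig |= WORD_BITS[word]
    return sig


def node_signature(child_signatures):
    """Signature of a node: bitwise OR of the signatures of its children."""
    sig = 0
    for child in child_signatures:
        sig |= child
    return sig


def may_contain_all(node_sig, query_words):
    """False means no object below the node can contain all query words."""
    query_sig = signature(query_words)
    return (node_sig & query_sig) == query_sig


query = {"t1", "t2"}
# Illustrative leaves: one holding p8 (t3) and p5 (t2, t3), one holding p4 (t1) and p7 (t3).
leaf_a = node_signature([signature({"t3"}), signature({"t2", "t3"})])
leaf_b = node_signature([signature({"t1"}), signature({"t3"})])
pruned_a = not may_contain_all(leaf_a, query)
false_positive_b = may_contain_all(leaf_b, query)
```

The first leaf is pruned because no bit for t1 is set. The second leaf passes the test although neither of its objects contains both keywords, because t2 and t3 share a bit and the OR mixes words from different objects.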

Example

[Figure: IR2-tree over the running example with nodes E1 to E12, each annotated with a 2-bit signature. Keyword t1 is hashed to 10; keywords t2 and t3 are hashed to 01.]

Observations

Naïve approach

Disadvantages: all objects in the search region are accessed (large s and p = 1)

Inverted R-tree

Advantages: excludes unrelated objects (small s)

Disadvantages: cannot take advantage of the AND semantics (p = 1)

IR2-tree

Advantages: has a filtering technique to reduce p

Disadvantages: large s, and p is affected by unrelated objects

Other single augmented R-trees

Other spatial keyword search: KR tree [R. Hariharan, et al., SSDBM 2007], WIR tree [D. Wu, et al., TKDE 2011]

Spatial keyword ranking query: IR tree [G. Cong, et al., PVLDB 2009], CM-CDIR tree [D. Wu, et al., VLDBJ 2012]

Their shortcomings: same as the IR2-tree

Motivation

Index structure

– has a small number of objects within the search region

– can prune objects within the search region

Properties

– falls in the category of inverted index

– exploits the AND semantics

– is adaptive to the distribution of the objects for each keyword

Linear Quadtree Structure

Regular space partition based indexing

Each node can be identified by its split sequence (Morton code, a.k.a. Z-order)

A circle denotes a non-leaf node and a square denotes a leaf node

A leaf node is black if it is not empty; otherwise it is a white leaf node

Keep only the black leaf nodes (in a B+ tree)

[Figure: space partition with non-empty cells marked 1 and empty cells marked 0. The split sequence SW, SE has code 0001; the node NE has code 1100.]
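The two codes in the figure are consistent with giving each quadrant two bits and padding shorter split sequences with zeros. The sketch below follows that reading. The bits for SW, SE and NE are taken from the figure; the bits for NW, the unit-square extent, and the function names are assumptions of this example.

```python
# SW=00, SE=01, NE=11 match the codes in the figure; NW=10 is an assumption.
QUADRANT_BITS = {"SW": "00", "SE": "01", "NW": "10", "NE": "11"}


def morton_code(split_sequence, max_depth):
    """Concatenate two bits per split and pad with zeros up to 2 * max_depth bits."""
    if len(split_sequence) > max_depth:
        raise ValueError("split sequence is deeper than max_depth")
    bits = "".join(QUADRANT_BITS[q] for q in split_sequence)
    return bits.ljust(2 * max_depth, "0")


def cell_of_point(x, y, depth, extent=1.0):
    """Split sequence of the depth-level cell containing (x, y) in [0, extent)^2."""
    sequence = []
    x0 = y0 = 0.0
    size = extent
    for _ in range(depth):
        size /= 2
        east = x >= x0 + size
        north = y >= y0 + size
        sequence.append(("N" if north else "S") + ("E" if east else "W"))
        if east:
            x0 += size
        if north:
            y0 += size
    return sequence


code_a = morton_code(["SW", "SE"], max_depth=2)   # "0001"
code_b = morton_code(["NE"], max_depth=2)         # "1100"
```

Because of the zero padding, the node NE and its child NE, SW map to the same bit string, so a key built this way would also need the node depth to tell them apart.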

IL-Quadtree

For each keyword ti ∈ V, we build a linear quadtree, denoted by LQi, for the objects which contain the keyword ti

Besides the black leaf nodes, we also keep the quadtree node information (signature)

The signature bit is 1 for black leaf nodes and non-leaf nodes, and 0 otherwise
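One way to picture how per-keyword signatures support the AND semantics is sketched below. It is a simplified stand-in, not the structure from the talk: it uses a fixed-depth quadtree over the unit square instead of an adaptive one, stores the set of nodes whose bit is 1 in a Python set per keyword, and all names and coordinates are hypothetical. A node is explored only if its bit is 1 for every query keyword.

```python
from collections import defaultdict


def cell_path(x, y, depth):
    """Quadrant digits (0=SW, 1=SE, 2=NW, 3=NE) from the root to the depth-level cell."""
    path = []
    x0 = y0 = 0.0
    size = 1.0
    for _ in range(depth):
        size /= 2
        east = x >= x0 + size
        north = y >= y0 + size
        path.append(2 * int(north) + int(east))
        if east:
            x0 += size
        if north:
            y0 += size
    return tuple(path)


def build_signatures(objects, depth):
    """For each keyword, collect every node (path prefix) whose signature bit is 1."""
    signatures = defaultdict(set)
    for (x, y), terms in objects.values():
        path = cell_path(x, y, depth)
        for term in terms:
            for level in range(depth + 1):
                signatures[term].add(path[:level])
    return signatures


def candidate_leaves(signatures, q_terms, depth):
    """Leaf cells whose bit is 1 in the signature of every query keyword."""
    def alive(node):
        return all(node in signatures.get(t, ()) for t in q_terms)

    frontier = [()] if alive(()) else []
    for _ in range(depth):
        frontier = [node + (quad,) for node in frontier
                    for quad in range(4) if alive(node + (quad,))]
    return frontier


# Hypothetical objects: id -> ((x, y), keywords)
objects = {
    "a": ((0.1, 0.1), {"t1"}),
    "b": ((0.9, 0.9), {"t2"}),
    "c": ((0.6, 0.1), {"t1", "t2"}),
}
signatures = build_signatures(objects, depth=2)
leaves = candidate_leaves(signatures, {"t1", "t2"}, depth=2)
```

Only the cell of object c survives for the query {t1, t2}; the cells of a and b are cut off at the first level because one of the two keywords has a 0 bit there. A surviving cell can still be a false positive when it holds different objects for different keywords, so the objects in it must be checked.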

Search Algorithm

[Figure: search on the running example. Distance Order: P3, P4, P7, P8, P5, P1, P10, P9, P6, P2, P11]

Direction-aware spatial keyword search [G. Li, et al., ICDE 2012]

Data: a set of spatio-textual objects

Each object has a location and a set of keywords

Query: a location (q.loc)

A set of query keywords (q.T)

A direction [α, β]

Answer: the closest k objects, each of which contains all keywords in q.T and lies in the search direction
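The direction-aware variant adds one more filter to the TOPK-SK definition. The brute-force sketch below is an illustration of the query semantics only, not the DESKS method or the ILQ extension. It assumes the direction is an angular interval, here called `alpha` and `beta`, in radians measured counter-clockwise from the positive x axis and allowed to wrap past 2π; the function names and coordinates are hypothetical.

```python
import math

TWO_PI = 2 * math.pi


def in_direction(q_loc, p_loc, alpha, beta):
    """True if the bearing from q to p lies in the interval that sweeps from alpha to beta."""
    if beta - alpha >= TWO_PI:
        return True                                   # full circle
    bearing = math.atan2(p_loc[1] - q_loc[1], p_loc[0] - q_loc[0]) % TWO_PI
    width = (beta - alpha) % TWO_PI                   # handles wrap-around intervals
    return (bearing - alpha) % TWO_PI <= width


def direction_aware_topk(objects, q_loc, q_terms, alpha, beta, k):
    """k closest objects that contain all query keywords and lie in the direction."""
    q_terms = set(q_terms)
    hits = [
        (math.dist(loc, q_loc), oid)
        for oid, (loc, terms) in objects.items()
        if q_terms <= terms and in_direction(q_loc, loc, alpha, beta)
    ]
    return [oid for _, oid in sorted(hits)[:k]]


# Hypothetical objects: id -> ((x, y), keywords)
objects = {
    "a": ((1.0, 1.0), {"t1", "t2"}),     # north-east of q
    "b": ((-0.5, 0.0), {"t1", "t2"}),    # west of q, closer than a
    "c": ((2.0, 1.0), {"t1"}),           # north-east of q, missing t2
}
q = (0.0, 0.0)
north_east = direction_aware_topk(objects, q, {"t1", "t2"}, 0.0, math.pi / 2, k=1)
any_direction = direction_aware_topk(objects, q, {"t1", "t2"}, 0.0, TWO_PI, k=1)
```

Without the direction constraint the nearest match is b; restricting the search to the north-east quadrant returns a instead.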

Spatial Keyword Based Ranking [G. Cong, et al., PVLDB 2009, VLDBJ 2012]

Query

– Spatial location

– Query keywords

Returns the k best objects ranked by

– Spatial distance to the query location

– Textual relevance to the query keywords

Spatio-textual ranking score

score(p, q) = α · δ(p, q) + (1 − α) · θ(p, q)

The spatial proximity δ(p, q) is the normalized Euclidean distance between p and q

The textual relevance θ(p, q) is the tf-idf based textual similarity between the description of p and the query keywords

Our Solution

– The maximal keyword weight replaces the bit signature: aggregate inverted linear quadtree

– The spatial distance ranking function is replaced by the spatio-textual ranking score function

– Score based pruning uses the weight and region of the quadtree node

Experimental Setting

Implemented in Java

Debian Linux

o Intel Xeon 2.40GHz dual CPU

o 4 GB memory

Dataset

GN: US Board on Geographic Names

Tigers, Cars:

o Spatial datasets from Rtree-Portal

o Textual content from 20 Newsgroups

SYN: synthetic dataset

Query (1000): location, l query keywords

Evaluate response time and # I/O

Parameters evaluated

Definition | Notation | Default Value
Number of required results | k | 10
Number of query keywords | l | 3
Term frequency of vocabulary | z | 1.1
Number of objects | n | 1,000,000
Vocabulary size | v | 100,000
Avg. keywords per object | m | 15

Important Statistics

Tuning

w’ : Minimal depth of the black leaf node

c: The split threshold

Best performance:

– w’ = 8 and c = 64


l: the number of query keywords

Grid: [M. Christoforaki, et al., CIKM 2011]

Grid+SIG: the extension of Grid, utilizing the signature technique

Algorithms Evaluated

ILQ: Inverted Linear Quadtree based techniques

IVR: inverted R-tree [Y. Zhou, et al., CIKM 2005]

MIR2: [I. D. Felipe, et al., ICDE 2008]

KR: [R. Hariharan, et al., SSDBM 2007]

WIR: [D. Wu, et al., TKDE 2011]

IR: [G. Cong, et al., PVLDB 2009]

CM-CDIR: [D. Wu, et al., VLDBJ 2012]

Evaluation on different datasets

Comparison – Varying l


Comparison – Varying k

Comparison – Varying Parameters


Conclusion

Identify important properties of indexing techniques to support top k spatial keyword search

Propose the inverted linear quadtree structure to efficiently support top k spatial keyword search

Extensive experiments on both real and synthetic data

Future work

Enhance the region based signature technique: group objects to reduce false positives

Support top k spatial keyword search on other metric spaces

Spatial Keyword Ranking Query

Our Algorithm: Aggregate ILQ

Compare with: IR [G. Cong, et al., PVLDB 2009], CM-CDIR [D. Wu, et al., VLDBJ 2012]

Dataset: Tiger

Direction-Aware TOPK-SK Query

Our Algorithm: ILQ

Compare with: DESKS [G. Li, et al., ICDE 2012]

Comparison – Varying k

IR-Tree


KR* Tree