Supporting Location- based Approximate- Keyword Queries ACM International conference on Geographical Information Systems 2010 S Alsubaiee, A Behm, C Li – University of California, Irvine Presenter: Raghav Karumur Date: 3/30/2011 Course: [CSCI 8735] Advanced Database Systems Department of Computer Science and Engineering University of Minnesota, Twin Cities Spring 2011

Supporting Location-based Approximate-Keyword Queries

ACM International conference on Geographical Information Systems 2010

S Alsubaiee, A Behm, C Li – University of California, Irvine

Presenter: Raghav Karumur
Date: 3/30/2011

Course: [CSCI 8735] Advanced Database Systems
Department of Computer Science and Engineering
University of Minnesota, Twin Cities
Spring 2011

Lunch Time!

I’ll go for Chinese food! What was the restaurant’s name??? Uh… Ch-o-chi???

Let me Find It!

Errr… Just one typo!

Outline

• Overview
• Problem Formulation and Preliminaries
• Contributions
• Algorithms
• Index Construction
• Experiments and Analysis
• Conclusion
• References

Overview

Terminology – Clear?

A location-based keyword search consists of a set of keywords plus a spatial location.

Goal: Find objects that have these keywords and are close to the location.
Ex: A user is looking for a restaurant named Chaochi close to San Jose. Consider the query Q1: (Chaochi) near (San Jose). The website returns listings close to San Jose that contain the keyword Chaochi.

Problem: Inconsistencies can exist in the user queries, in the data, or in both. Users and uploaders may misspell a keyword:
Q1’: (Chochi) near (San Jose)

Therefore, Q1’ may not find the restaurant, because the mistyped keyword does not match its title.

Hence, support for approximate keyword search is necessary!

Overview

Approach used so far:

Build a collection of keywords similar to the mistyped keyword, then either suggest another query or find objects with these keywords.

Drawback of this approach:

No support for handling spatial and textual information simultaneously.

Problem Formulation

Object Collection

chaochi restaurant   <37.39, -121.87>
starbucks            <37.79, -122.40>
starbucks            <40.72, -73.99>
apple store          <44.59, -92.99>
sam’s club           <43.59, -116.47>
…

Find objects in “San Jose” with keywords similar to “chochi” & “resturant”

Problem Formulation

Location-Based Keyword Search:

• Approximate string search: given a collection of strings, find those that are similar to a given query string.
• Consider a collection of spatial objects o1, …, on, each having a set of keywords and a location.
• A spatial approximate-keyword query Q = <Qs, Qt> consists of two conditions:
- a spatial condition Qs, such as a rectangle or a circle, and
- an approximate keyword condition Qt having a set of k pairs {<w1, δ1>, <w2, δ2>, …, <wk, δk>}, each representing a keyword wi with an associated similarity threshold δi
• Goal: Find all objects in the collection within Qs that satisfy Qt.
• An object satisfies Qt if, for each keyword wi in Qt, the object has a keyword in its description whose similarity to wi is within the corresponding threshold δi.
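
The query definition can be made concrete with a brute-force check that uses no index at all. This is only a sketch under assumptions: edit distance is taken as the similarity measure, Qs is a rectangle, the names `SpatialObject` and `satisfies` are hypothetical, and the San Jose bounding box is made up for illustration.

```python
from dataclasses import dataclass

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

@dataclass
class SpatialObject:
    keywords: list
    lat: float
    lon: float

def satisfies(obj, rect, pairs):
    """rect = (min_lat, min_lon, max_lat, max_lon) plays the role of Qs;
    pairs = [(w_i, delta_i), ...] plays the role of Qt."""
    min_lat, min_lon, max_lat, max_lon = rect
    in_qs = min_lat <= obj.lat <= max_lat and min_lon <= obj.lon <= max_lon
    # Every query keyword needs at least one object keyword within its threshold.
    in_qt = all(any(edit_distance(w, kw) <= delta for kw in obj.keywords)
                for w, delta in pairs)
    return in_qs and in_qt

objects = [
    SpatialObject(["chaochi", "restaurant"], 37.39, -121.87),
    SpatialObject(["starbucks"], 37.79, -122.40),
    SpatialObject(["starbucks"], 40.72, -73.99),
    SpatialObject(["apple", "store"], 44.59, -92.99),
    SpatialObject(["sam's", "club"], 43.59, -116.47),
]
san_jose = (37.2, -122.1, 37.5, -121.7)     # hypothetical bounding box for Qs
query = [("chochi", 1), ("resturant", 1)]   # Qt with edit-distance thresholds
answers = [o for o in objects if satisfies(o, san_jose, query)]
```

Scanning every object like this is what the indexes discussed in the following slides are meant to avoid.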

Problem Formulation

Approaches to approximate string search:

- Trie-based method
- Inverted-index method

• Combine the spatial index with an approximate index
• Search the resulting index, called the LBAK-tree, to find answers

Preliminaries: Location-Based Keyword Search

Find objects within a given spatial region that have a given set of keywords

Augment a hierarchical spatial index with textual information

Preliminaries: Approximate String Search

Query q: chochi

Collection of strings s: chaochi, chucho, church, …

Search output: strings s that satisfy Sim(q, s) ≤ δ
Sim functions: edit distance, Jaccard, cosine, etc.
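
Edit distance is a distance, so a match means Sim(q, s) ≤ δ. Jaccard is a similarity, so the comparison flips to ≥. The sketch below illustrates this with Jaccard over sets of character 2-grams. The choice of 2-gram sets and the threshold 0.5 are assumptions made for illustration, not settings taken from the paper.

```python
def gram_set(s, q=2):
    """Set of distinct substrings of length q in s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b, q=2):
    """|G(a) ∩ G(b)| / |G(a) ∪ G(b)| over the q-gram sets of a and b."""
    ga, gb = gram_set(a, q), gram_set(b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0

query, threshold = "chochi", 0.5          # illustrative threshold
collection = ["chaochi", "chucho", "church"]
# For a similarity measure, a match means the score is at least the threshold.
matches = [s for s in collection if jaccard(query, s) >= threshold]
```

Here chochi and chaochi share the grams ch, oc and hi out of six distinct grams in total, so their Jaccard similarity is 0.5 and chaochi is the only match.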

Preliminaries: Approximate String Search

chaochi → 2-grams (sliding window): {ch, ha, ao, oc, ch, hi}

Intuition: similar strings share a certain number of grams

Gram-based inverted index
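
A minimal sketch of a gram-based inverted index follows. It assumes the standard count filter for edit distance: one edit destroys at most q grams, so a string within k edits of the query must share at least (|query| − q + 1) − k·q grams with it. The function names are hypothetical and are not the Flamingo API.

```python
from collections import Counter, defaultdict

def qgrams(s, q=2):
    """Slide a window of length q over s and collect every gram, repeats included."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def build_index(strings, q=2):
    """Map each gram to {string id: number of occurrences in that string}."""
    index = defaultdict(dict)
    for sid, s in enumerate(strings):
        for gram, n in Counter(qgrams(s, q)).items():
            index[gram][sid] = n
    return index

def candidates(query, strings, index, k, q=2):
    """Ids of strings that share enough grams with query to possibly be within edit distance k."""
    needed = (len(query) - q + 1) - k * q      # count-filter lower bound
    if needed <= 0:
        return set(range(len(strings)))        # the filter cannot prune anything
    shared = Counter()
    for gram, n in Counter(qgrams(query, q)).items():
        for sid, occurrences in index.get(gram, {}).items():
            shared[sid] += min(n, occurrences)
    return {sid for sid, n in shared.items() if n >= needed}

strings = ["chaochi", "chucho", "church"]
index = build_index(strings)
grams = qgrams("chaochi")
cands = candidates("chochi", strings, index, k=1)
```

For the query chochi with k = 1, at least 3 shared grams are needed. chaochi shares 4 and chucho shares 3, so both become candidates, while church shares only 2 and is filtered out. chucho is a false positive (its edit distance to chochi is 2), which is why candidates still need a final verification step.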

Solution

Tree-based spatial index + keyword search capability + approximate string search capability = LBAK-tree

Contributions

• How to combine those indexes
• Three algorithms:
1) Simple fixed-level solution
2) Utilizing local spatial distribution of objects
3) Exploiting frequency distribution of keywords

What is used:

• Queries with a spatial condition are typically supported by a tree-based index such as an R*-tree, KD-tree, or Quad-tree.
• The R*-tree is used in this paper.
• Most trie-based indexes are specific to edit distance and its variants, and do not support other similarity measures such as Jaccard.
• However, inverted indexes usually support a family of similarity metrics such as edit distance, Jaccard, etc.
• An inverted index is therefore used in this paper.
• In this paper, the R*-tree is augmented with capabilities for approximate keyword search; the resulting index is the LBAK-tree.
• A gram-based inverted index is used to perform approximate string search.

The LBAK tree

LBAK-tree nodes are classified into three categories:

• S-Nodes:
- Do not store any textual information.
- Used only for pruning based on the spatial condition.

• SA-Nodes:
- Store the union of the keywords of their subtree.
- Store an approximate index on these keywords.
- Used for finding similar keywords.
- Used for pruning based on the spatial and approximate conditions.

• SK-Nodes:
- Store the union of the keywords of their subtree.
- Used for pruning based on the spatial condition and keywords.
- The relevant similar keywords must already have been identified by the time we reach such a node.

Alg 1: Simple Fixed Level Solution


Query: objects in “San Jose” with keywords similar to “chochi” & “resturant”
– Based on an edit distance of 1
– Expressed as Q: <{San Jose}; {<chochi, 1>, <resturant, 1>}>

• The query clearly has typos.
• Assume nodes A, B, C, D satisfy the spatial condition San Jose.
• Throughout the traversal of the tree, we always check the spatial condition.
• At the S-Node A, we rely only on the spatial condition for pruning.

• When we reach the SA-Node B, we search its approximate index to find keywords similar to chochi and resturant within the edit-distance threshold of 1.
• We find two keywords similar to chochi (namely, chaochi and choochi) and one keyword similar to resturant (namely, restaurant).
• Once we visit the SK-Nodes C and D, we intersect their stored keywords with {chaochi, choochi} and {restaurant}, respectively.
• Node C can be pruned because it does not have the keyword restaurant.
• Since node D has the keywords chaochi and restaurant, we traverse its children.
• We repeat the process until we find the answers.
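
The traversal in this walkthrough can be sketched as a recursive search. This is an illustration under assumptions, not the paper's implementation: the `Node` layout is hypothetical, a linear scan with edit distance stands in for the approximate index at SA-Nodes, and every root-to-leaf path is assumed to pass through an SA-Node before any SK-Node or object is reached.

```python
from dataclasses import dataclass, field

def edit_distance(a, b):
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

@dataclass
class Node:
    kind: str                                     # "S", "SA", or "SK"
    mbr: tuple                                    # (min_lat, min_lon, max_lat, max_lon)
    keywords: set = field(default_factory=set)    # union of subtree keywords (SA and SK)
    children: list = field(default_factory=list)
    objects: list = field(default_factory=list)   # (keywords, lat, lon) tuples, leaves only

def overlaps(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def search(node, rect, pairs, similar=None):
    if not overlaps(node.mbr, rect):              # spatial pruning at every node
        return []
    if node.kind == "SA":                         # look up similar keywords once
        similar = [{kw for kw in node.keywords if edit_distance(w, kw) <= d}
                   for w, d in pairs]
    elif node.kind == "SK":                       # exact intersection, no approximate lookup
        similar = [s & node.keywords for s in similar]
    if similar is not None and any(not s for s in similar):
        return []                                 # some query keyword has no match below
    results = [(kws, lat, lon) for kws, lat, lon in node.objects
               if rect[0] <= lat <= rect[2] and rect[1] <= lon <= rect[3]
               and all(s & set(kws) for s in similar)]
    for child in node.children:
        results += search(child, rect, pairs, similar)
    return results

area = (37.0, -122.5, 38.0, -121.5)               # hypothetical MBR shared by all four nodes
C = Node("SK", area, {"choochi", "bakery"}, objects=[(["choochi", "bakery"], 37.30, -121.90)])
D = Node("SK", area, {"chaochi", "restaurant"}, objects=[(["chaochi", "restaurant"], 37.39, -121.87)])
B = Node("SA", area, C.keywords | D.keywords, children=[C, D])
A = Node("S", area, children=[B])
found = search(A, (37.2, -122.1, 37.5, -121.7), [("chochi", 1), ("resturant", 1)])
```

With this toy tree, node C is pruned because nothing in it matches restaurant, and the search returns only the object stored under D.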

How to Choose Level L?

• Trade-off between space and time up to “some” level; beyond it, both increase.
• Usually, about 90% of the query time is spent in approximate-index lookups.
• Therefore, choosing a good level L for placing the approximate indexes can greatly improve the average query time.

Observations

• Query time and index size are sensitive to the locations of the approximate indexes.
• The fixed-level solution ignores the local spatial distribution of objects.
• If a node is sparse, we might consider placing the indexes at its descendants.
• If a node is dense, we build the index at the node itself, because a query region is likely to overlap with many of its children.

Dense node: prefer to build the approximate index at the parent
Sparse node: prefer to build approximate indexes at the children

Algorithm 2: Placing Approximate Indexes at Variable Levels

Figure legend: S-Nodes (Spatial Nodes), SA-Nodes (Spatial-Approximate Nodes), SK-Nodes (Spatial-Keyword Nodes)

Selecting Nodes for Approximate Indexes

• Goal: Find the optimal set of nodes that should have approximate indexes.
• Optimization problem: “Given an R*-tree and a space budget, choose nodes from the tree to store approximate indexes, such that the average query time of a given workload is minimized.”
-- NP-hard problem!

Greedy Algorithm: Selecting Nodes for Approximate Indexes

Figure: example tree with nodes N1 to N15; check marks indicate selected nodes.

• A greedy algorithm, SelectSANodes, traverses the tree top-down and tries to push approximate indexes down the most promising paths.

Selecting Nodes for Approximate Indexes

• The algorithm maintains a priority queue of nodes to be traversed.
• The priority of a node n is defined as the benefit of storing multiple approximate indexes at its children, as compared to building a single index at n.
• For each visited node n, if the benefit of building multiple approximate indexes at n’s children is negative, then the algorithm selects n to be an SA-Node and does not traverse its children.
• If the algorithm reaches a leaf node, it immediately selects the leaf to be an SA-Node.
• The algorithm terminates when the space budget is exhausted or there is no more benefit in pushing approximate indexes down the tree.
• Let pTime(n) denote the average query time of probing an approximate index at the parent n, cTime(n) this time if the indexes were built at n’s children, and pSpace(n) and cSpace(n) the corresponding space costs of the indexes. Then the benefit of node n is

$$ b(n) = \frac{\mathit{pTime}(n) - \mathit{cTime}(n)}{|\mathit{pSpace}(n) - \mathit{cSpace}(n)|} $$

• Wn denotes the set of keywords stored at node n.
• With n1, …, nm denoting the children of n, the benefit of a node can also be written as

$$ b(n) = \frac{p(n) \cdot t(W_n) - \sum_{i=1}^{m} p(n_i) \cdot t(W_{n_i})}{\left| s(W_n) - \sum_{i=1}^{m} s(W_{n_i}) \right|} $$

where p(n) is the probability that n satisfies the spatial condition of a query in the workload, t(W) is the average lookup time of an approximate index on the keyword set W, and s(W) is its space cost.
• The algorithm starts at the root r and traverses the tree by repeatedly popping the (node, benefit) pair with the highest benefit.
• The cost of building approximate indexes is called the space cost. For an index on a keyword set W it is computed by

$$ s(W) = |W| \cdot (\mathit{avgLen} - q + 1) \cdot \mathit{elemSize} $$

where q is the gram length (so avgLen − q + 1 is the number of grams per keyword), W is the set of keywords, avgLen is the average keyword length of a particular data set, and elemSize is the size of each inverted-list element.
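
The greedy selection can be sketched with a priority queue. This is one reading of the slides, not the paper's exact procedure. The constants, the `Node` layout, the rule of starting with a single index at the root, and treating a zero benefit like a negative one are all assumptions.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

SLOPE, INTERCEPT = 0.00002, 0.02     # assumed lookup-time constants for t(W)
AVG_LEN, Q, ELEM_SIZE = 8, 2, 4      # assumed avg keyword length, gram length, element size

@dataclass
class Node:
    name: str
    keywords: set                    # W_n
    p: float                         # probability that a query overlaps this node
    children: list = field(default_factory=list)

def t(W):
    return SLOPE * len(W) + INTERCEPT

def s(W):
    return len(W) * (AVG_LEN - Q + 1) * ELEM_SIZE

def extra_space(n):
    """Additional space needed to replace n's index by indexes at its children."""
    return sum(s(c.keywords) for c in n.children) - s(n.keywords)

def benefit(n):
    saved = n.p * t(n.keywords) - sum(c.p * t(c.keywords) for c in n.children)
    spent = abs(extra_space(n))
    if spent == 0:
        return float("inf") if saved > 0 else float("-inf")
    return saved / spent

def priority(n):
    return benefit(n) if n.children else float("-inf")   # leaves are never pushed further

def select_sa_nodes(root, budget):
    used = s(root.keywords)          # start with one approximate index at the root
    selected, tie = [], count()      # tie-breaker keeps heap entries comparable
    heap = [(-priority(root), next(tie), root)]
    while heap:
        neg_b, _, n = heapq.heappop(heap)   # node with the highest benefit first
        if not n.children or -neg_b <= 0 or used + extra_space(n) > budget:
            selected.append(n)       # n keeps its index and becomes an SA-Node
            continue
        used += extra_space(n)
        for c in n.children:
            heapq.heappush(heap, (-priority(c), next(tie), c))
    return selected

n2 = Node("N2", {f"w{i}" for i in range(0, 6000)}, p=0.3)
n3 = Node("N3", {f"w{i}" for i in range(4000, 10000)}, p=0.3)
n1 = Node("N1", n2.keywords | n3.keywords, p=1.0, children=[n2, n3])
roomy = [n.name for n in select_sa_nodes(n1, budget=400_000)]
tight = [n.name for n in select_sa_nodes(n1, budget=300_000)]
```

In this toy tree the children are probed far less often than the root, so pushing the index down pays off when the budget allows the extra 56,000 units of space (`roomy`). With the smaller budget the index stays at the root (`tight`).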

Cost/Benefit Estimation

• Effects of pushing an index down:
– Increases space cost
– May increase or decrease average query time

• Typically:
– Higher levels: good to push the index down
– Intermediate levels: unclear whether to push it down

Lookup time of an approx. index

• Clearly depends on the size of the index.
• Experimentally determined to be roughly linear in the number of keywords.
• Thus the average lookup time of an approximate index on a keyword set W is estimated to be

$$ t(W) = \mathit{slope} \cdot |W| + \mathit{intercept} $$

where the slope and intercept are implementation dependent and can be determined experimentally.

Size     Time      Slope
1        0.02      -
10000    0.207     0.000019
1M       22.253    0.000022
10M      210.152   0.000021
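
The slope column can be reproduced from the size and time columns. The sketch below assumes each slope is measured relative to the first row (size 1, time 0.02) and that the time unit is whatever the slide used. Averaging the three slopes into one estimate is an illustrative choice, not something the slides state.

```python
measurements = [(1, 0.02), (10_000, 0.207), (1_000_000, 22.253), (10_000_000, 210.152)]

base_size, base_time = measurements[0]
# Slope of the line from the first measurement to each later one.
slopes = [(time - base_time) / (size - base_size) for size, time in measurements[1:]]

slope = sum(slopes) / len(slopes)           # single estimate (assumed: plain average)
intercept = base_time - slope * base_size   # makes the line pass through the first row

def lookup_time(num_keywords):
    """Estimated lookup time t(W) for an approximate index on num_keywords keywords."""
    return slope * num_keywords + intercept

formatted = [f"{x:.6f}" for x in slopes]
```

The computed slopes round to 0.000019, 0.000022 and 0.000021, matching the table, and the intercept comes out close to 0.02.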

Algorithm 3: Exploiting Frequency Distribution of Keywords

• The frequency distribution of keywords is, in general, skewed. Ex: In a business listings dataset, a keyword such as restaurant occurs more frequently than consulate.
• To reduce the number of keywords in the approximate indexes, we remove frequent keywords from sibling nodes and place them in their common parent instead.
• As a result, approximate indexes now appear even in S-Nodes.
• Thus, S-Nodes now contain approximate indexes for frequent keywords, whereas SA-Nodes contain approximate indexes for infrequent keywords.

Index Construction

• A keyword is said to be frequent at a node n if the fraction of n’s children having that keyword is greater than a certain threshold value.
• A small threshold decreases the space cost of the approximate indexes.
• On the other hand, the average query time may increase because we could visit false-positive nodes, since not all of n’s children actually contain the frequent keywords.
• Those false positives will be pruned at SK-Nodes.
• Updated benefit of a node, where Wn and Fn are the infrequent and frequent keyword sets of node n, and n1, …, nm are its children:

$$ \mathit{pTime}(n) = p(n) \cdot t(W_n \cup F_n) $$

$$ \mathit{cTime}(n) = p(n) \cdot t(F_n) + \sum_{i=1}^{m} p(n_i) \cdot t\big((W_{n_i} \cup F_{n_i}) \setminus F_n\big) $$

$$ \mathit{pSpace}(n) = s(W_n \cup F_n) $$

$$ \mathit{cSpace}(n) = s(F_n) + \sum_{i=1}^{m} s\big((W_{n_i} \cup F_{n_i}) \setminus F_n\big) $$

Index Construction

Updated SelectSANodes Algorithm:

• To discover frequent keywords in the tree, two sets of keywords are maintained for each node n: a set of infrequent keywords Wn and a set of frequent keywords Fn.
• The frequent and infrequent keywords of a node are identified by examining its children.
• Also, it is ensured that popular keywords appear only at the root of a subtree, i.e., if a keyword w is frequent at node n, then w is removed from the approximate-keyword sets of all of n’s children.
• The propagation of frequent and infrequent keywords is performed bottom-up until the keyword sets of all nodes have been filled.
• The next step is to choose the nodes on which to build approximate indexes.
• We use the updated benefit of a node instead of the original benefit.
• p(n) denotes the probability of n satisfying the spatial condition of any query in a workload.
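
The bottom-up propagation of frequent and infrequent keywords can be sketched as follows. The `Node` layout and the function name are hypothetical, leaves are assumed to start with all of their keywords in W, and the threshold value in the example is illustrative.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Node:
    children: list = field(default_factory=list)
    W: set = field(default_factory=set)   # infrequent keywords
    F: set = field(default_factory=set)   # frequent keywords

def propagate(node, threshold):
    """Fill W and F bottom-up; a keyword is frequent at a node when the fraction
    of the node's children that have it exceeds threshold."""
    if not node.children:
        return                             # leaves keep the W they were given
    for child in node.children:
        propagate(child, threshold)
    # Count, for each keyword, how many children carry it.
    counts = Counter(kw for child in node.children for kw in child.W | child.F)
    node.F = {kw for kw, c in counts.items() if c / len(node.children) > threshold}
    node.W = set(counts) - node.F
    for child in node.children:            # frequent keywords stay only at the subtree root
        child.W -= node.F
        child.F -= node.F

leaves = [Node(W={"restaurant", "chaochi"}),
          Node(W={"restaurant", "starbucks"}),
          Node(W={"restaurant", "consulate"})]
root = Node(children=leaves)
propagate(root, threshold=0.5)
```

After the call, restaurant (present in all three children) is the only frequent keyword at the root and has been removed from every leaf, while the other keywords remain infrequent.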

Incremental Maintenance of Indexes

• First, insert the object into a leaf according to the standard R*-tree procedure.

If the insertion causes no split in the R*-tree:

• The keywords of the new object are propagated bottom-up.
• At an SK-Node, we add the new keywords to its stored set of keywords.
• At an SA-Node, we add the keywords to its approximate index.
• At an S-Node, we check its children for new frequent keywords and add them to its approximate index.

If the insertion causes a split in the R*-tree:

• For the two new nodes generated by the split, recompute the stored sets of keywords (frequent and infrequent) by examining their children.
• Propagate all the new keywords up to the root, then retraverse the tree and rebuild the approximate indexes at the places where a split has occurred (identified by a split marker).

Experiments and Analysis

• Datasets used:
– CoPhIR Test Collection (Flickr)
– Business listings data (Florida International University)
• Packages used: Flamingo
• Approaches evaluated:
– Fixed-level approach (FL)
– Variable-level approach (VL)
• Processed the dataset to extract photos taken in the US, based on their latitude and longitude values.
• Used the keywords in the title, description, and tags of a photo as its textual attribute.
• Compared with the MHR-tree (a contemporary paper).
• Used edit distance with a threshold of 2 for both approaches.
• Since the MHR-tree is probabilistic, it can miss answers, whereas the LBAK-tree does not.
• However, the MHR-tree has a comparatively small index size, which the LBAK-tree does not.

Experiments and Analysis

• Recall of the MHR-tree: consistently below 50%.
• Fig. (b): signature size increased to achieve higher recall.
• Query time also increased as the number of edit-distance computations increased, because the approximate keyword condition is validated at level.

• Comparing VLF with the MHR-tree:
– The MHR-tree has a smaller index size.
– But VLF has a smaller query time.

Experiments and Analysis

Size of index components for various construction algorithms.

• As the approximate indexes are pushed down the tree, the space requirement increases because of redundant keywords in adjacent nodes.
• Query time decreases because smaller indexes are searched instead of one big index.

Experiments and Analysis

• Effect of the index construction method on query performance.
• The VL and VLF curves are smoother because these methods are more flexible than FL.
• The curves intersect at some point because of redundant keywords.
• At the points of intersection, VLF performs better.

• How frequent are keywords? Decided by the keyword frequency threshold.
• Threshold = 0: every keyword is frequent.
• Threshold > 1: no keyword is frequent.
• The whole range of values in [0, 1] is plotted.
• Clear space-time trade-off with the keyword frequency threshold.
• Increasing the threshold pushes more keywords to lower levels, which adds space overhead because infrequent keywords are duplicated at multiple nodes.

Conclusion

• Spatial index + approximate index = LBAK-tree
– Simple fixed-level solution
– Utilizing local spatial distribution of objects
– Exploiting frequency distribution of keywords

• Developed a cost-based model with reduced index size and query times.
• Conducted experiments and compared the results with contemporary techniques.
• Can be improved further by minimizing the index size.

References

[1] http://ir.iit.edu/~dagr/cs529/files/ir_book/CHAP%204%20Inverted%20Index.PDF
[2] http://en.wikipedia.org/wiki/N-gram
[3] http://en.wikipedia.org/wiki/R*-tree
[4] www.cs.fsu.edu/~lifeifei/papers/icde10_sas.pdf
[5] http://flamingo.ics.uci.edu/releases/4.0/

Thank You!

Questions?
