index structures and top-k joins for native keyword search databases

KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association Institute of Applied Informatics and Formal Description Methods (AIFB) www.kit.edu Index Structures and Top-k Joins for Native Keyword Search Databases Günter Ladwig , Thanh Tran Conference on Information and Knowledge Management (CIKM2011)


Page 1: Index Structures and Top-k Joins for Native Keyword Search Databases

KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association

Institute of Applied Informatics and Formal Description Methods (AIFB)

www.kit.edu

Index Structures and Top-k Joins for Native Keyword Search Databases

Günter Ladwig, Thanh Tran

Conference on Information and Knowledge Management (CIKM 2011)

Page 2: Index Structures and Top-k Joins for Native Keyword Search Databases

Institute of Applied Informatics and Formal Description Methods (AIFB), October 25th, 2011

Contents

Introduction: Native keyword search

Contributions

Index Structures

  d-length 2-Hop Cover

  Path indexes

Keyword Query Processing

  Integrated Query Plan

  Operator Ranking

Evaluation

Conclusion

CIKM 2011, Glasgow

Page 3: Index Structures and Top-k Joins for Native Keyword Search Databases


Keyword Search on Graph-Structured Data

Keyword queries over structured data

Approaches:

Query translation (based on schema exploration)

Native keyword search (based on data graph exploration)

Example queries: “steve 2009”, “john steve alice”

[Figure: example data graph with nodes labeled “2009”, “john”, “steve”, “mary”, “alice” and “acme”]

Page 4: Index Structures and Top-k Joins for Native Keyword Search Databases


Native Keyword Search

Match keywords to elements of the data graphs

Find structures connecting these elements (Steiner graphs)

More expensive than query translation approaches

Preprocess data to reduce online effort

Example queries: “steve 2009”, “john steve alice”

[Figure: example data graph with nodes labeled “2009”, “john”, “steve”, “mary”, “alice” and “acme”, together with result subgraphs that connect the matched nodes]

Page 5: Index Structures and Top-k Joins for Native Keyword Search Databases


Native Keyword Search: EASE

Indexes at the level of r-maximal subgraphs

Given a keyword query, find relevant subgraphs using the index

Explore subgraphs to construct Steiner graphs

High redundancy

Requires special operations: exploration, pruning

[Figure: overlapping indexed subgraphs over the nodes “2009”, “john”, “steve”, “mary”, “alice” and “acme”; for the query “steve 2009” the relevant subgraphs are explored to construct results]

Page 6: Index Structures and Top-k Joins for Native Keyword Search Databases


Native Keyword Search using Top-k Joins

Fine-grained indexing at the level of paths

More pruning, less redundancy: less storage required

Enables use of database query processing concepts: data access and top-k joins

Keyword search is now a “traditional” query processing problem

[Figure: path-level index entries over the nodes “steve”, “john”, “mary” and “2009”, combined by joins to answer the query “steve 2009”]

Page 7: Index Structures and Top-k Joins for Native Keyword Search Databases


Contributions

We propose a new processing strategy for the keyword search problem based on two standard database operations: data access and join

For efficient data access, we extend the 2-hop cover to pre-compute and materialize neighborhoods of data elements, indexing the data at the level of paths

Keyword search requires consideration of a large number of query plans; a push-based top-k join procedure ranks query plans during processing

Page 8: Index Structures and Top-k Joins for Native Keyword Search Databases


INDEX STRUCTURES


Page 9: Index Structures and Top-k Joins for Native Keyword Search Databases


d-length 2-Hop Cover

Compact representation of connections in a graph

Used to find paths between two nodes

Extension of 2-Hop Cover to store only paths of length d or less

2-Hop Cover labels every node u with a neighborhood NBu

If two nodes u, v are connected via paths of length d or less, then NBu ∩ NBv ≠ ∅

All paths of length d or less between center nodes u and v are of the form u ⇝ w ⇝ v, with w ∈ NBu ∩ NBv

w is called a hop node

Construction prunes redundant entries from neighborhoods to reduce the size of the cover

Page 10: Index Structures and Top-k Joins for Native Keyword Search Databases


Finding Paths Using Joins

To find paths between two nodes u and v:

Retrieve neighborhoods NBu and NBv

Intersect NBu and NBv to obtain all hop nodes

Reconstruct paths between u and v through hop nodes

Intersection is performed as rank join

Rank join requires input to be sorted
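
The steps above can be sketched in Python. This is a minimal illustration, not the authors' implementation: it assumes each neighborhood is a list of (hop node, distance) pairs sorted by ascending distance, with one entry per hop node, and that a connection through a hop node is scored by the sum of both distances (smaller is better). The name `rank_join` and the threshold rule are illustrative choices in the style of a standard rank join.

```python
import heapq

INF = float("inf")

def rank_join(nb_u, nb_v, k):
    """Return up to k (combined distance, hop node) pairs, best first.

    nb_u, nb_v: lists of (hop_node, distance) sorted by ascending distance.
    """
    seen_u, seen_v = {}, {}          # hop node -> distance read so far
    candidates = []                  # min-heap of joined but not yet emitted results
    results = []
    i = j = 0
    min_u = nb_u[0][1] if nb_u else INF
    min_v = nb_v[0][1] if nb_v else INF
    while len(results) < k and (i < len(nb_u) or j < len(nb_v)):
        # Alternate between the inputs; use whichever still has entries.
        read_u = j >= len(nb_v) or (i < len(nb_u) and i <= j)
        if read_u:
            hop, dist = nb_u[i]
            i += 1
            seen_u[hop] = dist
            if hop in seen_v:
                heapq.heappush(candidates, (dist + seen_v[hop], hop))
        else:
            hop, dist = nb_v[j]
            j += 1
            seen_v[hop] = dist
            if hop in seen_u:
                heapq.heappush(candidates, (dist + seen_u[hop], hop))
        # Lower bound on the score of any result that needs an unread entry.
        next_u = nb_u[i][1] if i < len(nb_u) else INF
        next_v = nb_v[j][1] if j < len(nb_v) else INF
        threshold = min(next_u + min_v, next_v + min_u)
        # A candidate at or below the bound cannot be beaten by unseen results.
        while candidates and len(results) < k and candidates[0][0] <= threshold:
            results.append(heapq.heappop(candidates))
    return results

nb_u = [("w1", 1.0), ("w2", 2.0), ("w3", 2.0)]
nb_v = [("w3", 0.5), ("w1", 2.0), ("w4", 3.0)]
top2 = rank_join(nb_u, nb_v, 2)
```

Because both inputs are sorted, the join can stop as soon as k results are known to be final, which is why the sort order of the inputs matters.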

[Figure: neighborhoods of two center nodes sharing hop nodes; node labels “john”, “steve”, “mary”, “alice”, “2009” and “acme”; center nodes and hop nodes are marked]

Page 11: Index Structures and Top-k Joins for Native Keyword Search Databases


Index Storage

Pruned neighborhoods are stored as path entries

Path entry (w,s) for each hop node w in NBu

Path entry index maps each node to its path entries (sorted)

Path index:

Stores paths for all center nodes and their path entries

Used to reconstruct paths

Node    Path entries
u1      (w1, 1.0), (w2, 2.0), (w3, 2.0)
u2      (w5, 1.0)
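
A path entry index of this shape could be held in memory as a mapping from each node to a sorted list. The following Python sketch assumes input triples of (center node, hop node, score); the function name `build_path_entry_index` and the tie-breaking by hop node name are illustrative assumptions.

```python
from collections import defaultdict

def build_path_entry_index(entries):
    """entries: iterable of (center_node, hop_node, score) triples."""
    index = defaultdict(list)
    for center, hop, score in entries:
        index[center].append((hop, score))
    for path_entries in index.values():
        # Sort by score (then hop name) so entries can be consumed in rank order.
        path_entries.sort(key=lambda entry: (entry[1], entry[0]))
    return dict(index)

# The entries from the table above, supplied in arbitrary order.
index = build_path_entry_index([
    ("u1", "w3", 2.0),
    ("u1", "w1", 1.0),
    ("u2", "w5", 1.0),
    ("u1", "w2", 2.0),
])
```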

Page 12: Index Structures and Top-k Joins for Native Keyword Search Databases


KEYWORD QUERY PROCESSING


Page 13: Index Structures and Top-k Joins for Native Keyword Search Databases


Keyword Query Processing

Use joins to find connections between matching elements for all keywords

Base inputs: keyword neighborhood for each keyword

Union of matching elements’ neighborhoods

Process:

Data access to retrieve keyword neighborhoods

Joins to connect keyword matching elements

Are all possible plans valid?
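
The base input for one keyword can be sketched as a merge of the sorted neighborhoods of all elements that match it. This Python sketch is an assumption about representation, not the authors' code: `path_entry_index` maps a node to (hop node, score) pairs sorted by score, and each merged entry keeps its center node so that paths can still be reconstructed later.

```python
import heapq

def keyword_neighborhood(matching_elements, path_entry_index):
    """Union of the neighborhoods of all elements matching one keyword.

    Returns (score, hop_node, center_node) triples in ascending score order.
    """
    streams = [
        [(score, hop, center) for hop, score in path_entry_index.get(center, [])]
        for center in matching_elements
    ]
    # Each stream is already sorted by score, so a k-way merge keeps the order.
    return list(heapq.merge(*streams, key=lambda entry: entry[0]))

example_index = {
    "u1": [("w1", 1.0), ("w2", 2.0)],
    "u2": [("w2", 1.0), ("w5", 3.0)],
}
merged = keyword_neighborhood(["u1", "u2"], example_index)
```

Keeping the merged neighborhood sorted means it can be fed directly into a rank join.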

[Figure: query plan joining the keyword neighborhoods of “alice”, “john” and “steve”]

Page 14: Index Structures and Top-k Joins for Native Keyword Search Databases


Query Plans

Join order matters

No single join order delivers all results (some might even be empty)

We do not know in advance which orders deliver results

Consider all possible join orders
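
To make the size of this search space concrete, the following Python sketch enumerates every join order for a three-keyword query. It assumes left-deep plans, one per permutation of the keywords; the function name and the textual plan notation are illustrative only.

```python
from functools import reduce
from itertools import permutations

def left_deep_plans(keywords):
    """Return one left-deep join plan (as a string) per join order."""
    plans = []
    for order in permutations(keywords):
        # Fold the order into nested binary joins: ((a JOIN b) JOIN c) ...
        plans.append(reduce(lambda left, kw: f"({left} JOIN {kw})", order))
    return plans

plans = left_deep_plans(["steve", "john", "alice"])
for plan in plans:
    print(plan)
```

Three keywords already give six orders, and the count grows factorially with the number of keywords.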

[Figure: example with d = 2 in which one join order over the keyword neighborhoods of “steve”, “john” and “alice” produces no results]

Page 15: Index Structures and Top-k Joins for Native Keyword Search Databases


Integrated Query Plan

Join operators in all query plans:

Query plans for different join orders overlap

Share as many operators as possible

Join operators with sharing:

|K|    N’(K) (all query plans)    N(|K|, K) (with sharing)
2      2                          1
3      12                         6
4      72                         24
5      480                        100
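
The counts without sharing are consistent with a simple reading: if each of the |K|! join orders is evaluated as its own plan, and each plan needs |K| − 1 binary joins, the total is |K|! · (|K| − 1). That reading is an assumption, since the formula itself did not survive in this transcript. The Python sketch below only checks it against the tabulated values and makes no assumption about a closed form for the shared count.

```python
from math import factorial

def operators_without_sharing(num_keywords):
    # Assumption: one plan per join order, each with num_keywords - 1 binary joins.
    return factorial(num_keywords) * (num_keywords - 1)

# (without sharing, with sharing) as listed on the slide
slide_counts = {2: (2, 1), 3: (12, 6), 4: (72, 24), 5: (480, 100)}

matches = all(
    operators_without_sharing(k) == unshared
    for k, (unshared, _shared) in slide_counts.items()
)
savings = {k: unshared / shared for k, (unshared, shared) in slide_counts.items()}
```

For five keywords, sharing reduces the operator count from 480 to 100, a factor of 4.8.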

Page 16: Index Structures and Top-k Joins for Native Keyword Search Databases


Top-k Keyword-Join Processing

High number of operators

Terminate early after computing top-k instead of all results

Rank join operators

Top-k union operator

Integrated Query Plan is a composition of many sub-plans

Some sub-plans might produce no results

Pull-based operators will block until a result can be produced

Use push-based operators: execution driven by inputs instead of results

Some sub-plans might produce results earlier than others

Rank not only results, but also operators
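
The difference between pull-based and push-based execution can be illustrated with a small Python sketch. It uses a symmetric hash join as a stand-in (an assumption; the slides do not specify the join algorithm, and this sketch does no ranking): each arriving input tuple is pushed into the operator, and any resulting matches are pushed on to a consumer. The class and method names are hypothetical.

```python
class PushJoin:
    """Binary join driven by arriving inputs rather than by requests for results."""

    def __init__(self, consumer):
        self.consumer = consumer            # callable that receives each join result
        self.left, self.right = {}, {}      # join key -> values seen so far

    def push_left(self, key, value):
        self.left.setdefault(key, []).append(value)
        for other in self.right.get(key, []):
            self.consumer((key, value, other))

    def push_right(self, key, value):
        self.right.setdefault(key, []).append(value)
        for other in self.left.get(key, []):
            self.consumer((key, other, value))

results = []
join = PushJoin(results.append)
join.push_left("w1", "steve")
join.push_right("w2", "2009")   # no partner: nothing is emitted and nothing blocks
join.push_right("w1", "2009")   # partner present: the result is pushed downstream
```

An operator that never finds a partner simply stays silent, so an empty sub-plan does not stall the rest of the integrated plan.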


Page 17: Index Structures and Top-k Joins for Native Keyword Search Databases


Operator Ranking

Prefer operators that have “promising” results

Global score of rank join operator, based on current results and upper bounds for subsequent join operations

R: intermediate results

NBK: keyword neighborhoods not yet covered

Global score defined as

Join operators have a global score when they have results ready

Only the operator with the highest global score can push results to subsequent operators

Otherwise, lower level data access operators are activated
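
The scheduling rule can be sketched as follows in Python. The definition of the global score is not preserved in this transcript, so scores are treated here as opaque numbers supplied by the caller, with `None` marking an operator that has no results ready. That representation, the function name, and the reading that data access is activated when no join operator is ready are all assumptions.

```python
def next_action(join_operators):
    """Decide what the plan executes next.

    join_operators: dict mapping operator name -> global score,
    or None if the operator currently has no results ready.
    """
    ready = {name: score for name, score in join_operators.items() if score is not None}
    if not ready:
        # No join operator can push: fall back to reading more input.
        return ("activate_data_access", None)
    best = max(ready, key=ready.get)
    return ("push_results", best)

step = next_action({"join_a": 0.4, "join_b": 0.9, "join_c": None})
idle = next_action({"join_a": None})
```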


Page 18: Index Structures and Top-k Joins for Native Keyword Search Databases


EVALUATION


Page 19: Index Structures and Top-k Joins for Native Keyword Search Databases


Evaluation

Four approaches:

EASE: indexing at the level of graphs

KJ: keyword join approach

KJU: keyword join approach without operator ranking

Datasets:

BTC: 10M triples

DBLP1/5/10: 1M, 5M, 10M triples (from SP2Bench)

9 keyword queries for each dataset

Reduction of index storage size: 50% (DBLP1) – 79% (DBLP10)

Page 20: Index Structures and Top-k Joins for Native Keyword Search Databases


Results

KJ, KJU outperform EASE

Operator ranking is beneficial


Page 21: Index Structures and Top-k Joins for Native Keyword Search Databases


Results

Benefit of operator ranking more pronounced for larger queries as these need more join operators


Page 22: Index Structures and Top-k Joins for Native Keyword Search Databases


Conclusion

Native keyword search based on data access and join

d-length 2-Hop Cover: index at the level of paths, instead of graphs

Top-k Keyword Join: exploration transformed into a series of join operators

Operator ranking

Reduces storage requirement and increases performance

Page 23: Index Structures and Top-k Joins for Native Keyword Search Databases


Thank you for your attention! Questions?


Günter Ladwig, [email protected]

Page 24: Index Structures and Top-k Joins for Native Keyword Search Databases


BACKUP SLIDES


Page 25: Index Structures and Top-k Joins for Native Keyword Search Databases


Introduction

Keyword search on graph-structured data (RDF)

Query Translation: translate keywords into a structured query using schema knowledge

Native Keyword Search:

No translation

Match keywords to elements of the data graph

Find structures connecting these elements (Steiner graphs)

More expensive than query translation approaches

Preprocess data and create special indexes:

Reduces search space during online query processing

Requires offline preprocessing and storage

Page 26: Index Structures and Top-k Joins for Native Keyword Search Databases


Example

Match keyword elements

Find connections between keyword elements

[Figure: example data graph with nodes p1–p4, o1, o2 and l1, carrying the labels Alice, Richard, Mary, Peter, ABC Corp and Malta, and edges labeled knows, worksAt and locatedIn]

Query: “alice malta peter”

Page 27: Index Structures and Top-k Joins for Native Keyword Search Databases


Problem Definition

Given a graph GE=(NE,ER)

Find Steiner graphs connecting keyword elements

Page 28: Index Structures and Top-k Joins for Native Keyword Search Databases


Scoring

Assumption: more compact Steiner graphs are more relevant

Scoring function:

GS: Steiner graph

P: set of paths connecting its keyword elements

Other functions possible, but not part of this work
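
The scoring formula itself is not preserved in this transcript. As an illustration of the compactness assumption only, the Python sketch below scores a Steiner graph by the total number of edges over its connecting paths P, so that a smaller value means a more compact graph. Both this particular function and its name are assumptions, not the function used in the work.

```python
def compactness_score(paths):
    """Total edge count over all connecting paths; smaller means more compact.

    paths: list of paths, each given as a list of node identifiers.
    """
    return sum(len(path) - 1 for path in paths)

# Two hypothetical Steiner graphs connecting the same keyword elements.
tight = [["p4", "o1", "p2"]]
loose = [["p4", "p3", "p1", "p2"], ["p4", "o1", "p2"]]
ranked = sorted([loose, tight], key=compactness_score)
```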


Page 29: Index Structures and Top-k Joins for Native Keyword Search Databases


Approaches

Bidirectional Search:

Explore graph from keyword elements to find connections

Does not scale well

EASE: indexes neighborhood graphs to restrict search space for exploration

Our approach:

Use database operations: data access and join

Transform graph exploration into a series of join operations

Improves storage requirements and performance

Page 30: Index Structures and Top-k Joins for Native Keyword Search Databases


d-Length 2-Hop Cover

Preliminaries

Compact representation of connections in a graph

Used to find paths between two nodes in a graph

Page 31: Index Structures and Top-k Joins for Native Keyword Search Databases


Construction

Trivial d-length 2-hop cover is the set of all d-neighborhoods of GE, but contains redundancies

Finding a minimal 2-hop cover is NP-hard (Minimum Set Cover)

Approximation algorithm:

Select a “best” node covering a large number of paths

Use its neighborhood to prune redundant paths from all other neighborhoods
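
The selection step resembles the classic greedy heuristic for Minimum Set Cover. The Python sketch below shows that generic heuristic, not the authors' construction algorithm: it repeatedly picks the node whose candidate set covers the most still-uncovered paths. The function name and the data layout (a dict from node to the set of paths it covers) are assumptions.

```python
def greedy_cover(universe, candidates):
    """Generic greedy heuristic for Minimum Set Cover.

    universe: set of items to cover (here: path identifiers).
    candidates: dict mapping a node to the set of paths it covers.
    Returns (chosen nodes in selection order, items left uncovered).
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # "Best" node: the one covering the most paths that are still uncovered.
        best = max(candidates, key=lambda node: len(candidates[node] & uncovered))
        gained = candidates[best] & uncovered
        if not gained:
            break  # the remaining paths are not covered by any candidate
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered

paths = {1, 2, 3, 4, 5}
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5}, "d": {5}}
chosen, uncovered = greedy_cover(paths, covers)
```

In the pruning setting, the paths covered by an already chosen node are the ones that become redundant in the other neighborhoods.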


Page 32: Index Structures and Top-k Joins for Native Keyword Search Databases


Example: Pruning

Pruned paths between two nodes can be reconstructed by intersecting their neighborhoods

Store each pruned neighborhood as a list of path entries

[Figure: pruning example with d = 2; redundant paths are pruned from one neighborhood using another. Nodes p1–p4, o1, o2, l1 and l2; edges labeled knows, worksAt and locatedIn; center nodes and hop nodes are marked]

Page 33: Index Structures and Top-k Joins for Native Keyword Search Databases


Neighborhood Join

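[Figure: join of the neighborhoods of p4 and p2 on their shared hop nodes; node labels p2, p3, p4, l1, l2, o1 and o3; center nodes and hop nodes are marked]

Result: Keyword Graphs, e.g. (p4, o1, p2) and (p4, p3, p2)

(p4, o1, p2) stands for all paths of length d between p4 and p2 through o1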

Page 34: Index Structures and Top-k Joins for Native Keyword Search Databases


Graph Join

Expand keyword graphs to keyword graph neighborhoods

Graph Join: joins keyword graph neighborhood with keyword neighborhood

[Figure: the keyword graph (p4, o1, p2) is expanded to its keyword graph neighborhood, with entries such as (p4, o1, p2, o3), (p4, o1, p2, l2) and (p4, o1, p2, l1)]

Page 35: Index Structures and Top-k Joins for Native Keyword Search Databases


Integrated Query Plan

Number of join operators without operator sharing

Number of join operators with operator sharing

|K|    N’(K) (without sharing)    N(|K|, K) (with sharing)
2      2                          1
3      12                         6
4      72                         24
5      480                        100