index structures and top-k joins for native keyword search databases

KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association

Institute of Applied Informatics and Formal Description Methods (AIFB)

www.kit.edu

Index Structures and Top-k Joins for Native Keyword Search Databases

Günter Ladwig, Thanh Tran

Conference on Information and Knowledge Management (CIKM 2011)

October 25th, 2011

Contents

Introduction: Native keyword search

Contributions

Index Structures
  d-length 2-Hop Cover
  Path indexes

Keyword Query Processing
  Integrated Query Plan
  Operator Ranking

Evaluation

Conclusion

CIKM 2011, Glasgow


Keyword Search on Graph-Structured Data

Keyword queries over structured data

Approaches
  Query translation (based on schema exploration)
  Native keyword search (based on data graph exploration)

[Figure: example data graph with keyword elements “2009”, “john”, “steve”, “mary”, “alice”, “acme”. Example queries: “steve 2009”, “john steve alice”]


Native Keyword Search

Match keywords to elements of the data graphs

Find structures connecting these elements (Steiner graphs)

More expensive than query translation approaches

Preprocess data to reduce online effort

[Figure: example data graph with keyword elements for the queries “steve 2009” and “john steve alice”]



Native Keyword Search: EASE

Indexes at the level of r-maximal subgraphs
  Given a keyword query, find relevant subgraphs using the index
  Explore subgraphs to construct Steiner graphs

High redundancy

Requires special operations: exploration, pruning

[Figure: EASE example for the query “steve 2009”: indexed subgraphs containing the keyword elements, followed by an exploration step]

Native Keyword Search using Top-k Joins

Fine-grained indexing at the level of paths

More pruning, less redundancy: less storage required

Enables use of database query processing concepts
  Data access and top-k joins

Keyword search is now a “traditional” query processing problem

[Figure: keyword-join example for the query “steve 2009”: path-level index entries for the keyword elements, combined by joins]


Contributions

We propose a new processing strategy for the keyword search problem based on two standard database operations: data access and join

For efficient data access, we extend the 2-hop cover to pre-compute and materialize neighborhoods of data elements, indexing the data at the level of paths

Keyword search requires consideration of a large number of query plans: a push-based top-k join procedure ranks query plans during processing


INDEX STRUCTURES



d-length 2-Hop Cover

Compact representation of connections in a graph
  Used to find paths between two nodes

Extension of 2-Hop Cover to store only paths of length d or less

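2-Hop Cover labels every node u with a neighborhood $NB_u$

If two nodes u, v are connected via paths of length d or less, then $NB_u \cap NB_v \neq \emptyset$

All paths of length d or less between center nodes u and v are of the form $u \rightsquigarrow w \rightsquigarrow v$ with $w \in NB_u \cap NB_v$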

w is called a hop node

Construction prunes redundant entries from neighborhoods to reduce size of the cover
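To make the labeling idea concrete, here is a minimal sketch of the unpruned case, in which every node is simply labeled with its full d-neighborhood. The toy graph, the function names, and the treatment of edges as undirected are assumptions for illustration and do not come from the slides.

```python
from collections import deque

def neighborhood(graph, start, d):
    """Return {node: distance} for every node within d hops of start (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d:
            continue  # do not expand beyond the length limit
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def hop_nodes(graph, u, v, d):
    """Hop nodes w shared by both neighborhoods whose combined path length is at most d."""
    nb_u = neighborhood(graph, u, d)
    nb_v = neighborhood(graph, v, d)
    return {
        w: nb_u[w] + nb_v[w]
        for w in nb_u.keys() & nb_v.keys()
        if nb_u[w] + nb_v[w] <= d
    }

# Hypothetical undirected toy graph: p1 - o1 - p2 - p3
graph = {"p1": ["o1"], "o1": ["p1", "p2"], "p2": ["o1", "p3"], "p3": ["p2"]}
print(hop_nodes(graph, "p1", "p2", 2))  # p1 and p2 are within distance 2
print(hop_nodes(graph, "p1", "p3", 2))  # p1 and p3 are 3 hops apart
```

The unpruned labels are highly redundant, since every node on a connecting path qualifies as a hop node; this is the redundancy that the construction step aims to remove.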



Finding Paths Using Joins

To find paths between two nodes u and v
  Retrieve neighborhoods $NB_u$ and $NB_v$
  Intersect $NB_u$ and $NB_v$ to obtain all hop nodes
  Reconstruct paths between u and v through hop nodes

Intersection is performed as rank join

Rank join requires input to be sorted
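The intersection step can be sketched as a simplified threshold-style rank join over two score-sorted neighborhoods. This is an illustration under stated assumptions, not necessarily the operator used by the authors: scores are taken to be path lengths (lower is better), the joined score is the sum of both scores, and the name `rank_join` is hypothetical.

```python
import heapq

def rank_join(left, right, k):
    """Top-k join of two lists of (hop_node, score), each sorted by ascending score."""
    if not left or not right:
        return []
    seen_left, seen_right = {}, {}
    candidates = []  # min-heap of (joined_score, hop_node)
    results = []
    i = j = 0
    inf = float("inf")
    while len(results) < k and (i < len(left) or j < len(right)):
        # Alternate between the inputs; fall back to whichever still has entries.
        take_left = j >= len(right) or (i < len(left) and i <= j)
        if take_left:
            node, score = left[i]
            i += 1
            seen_left[node] = score
            if node in seen_right:
                heapq.heappush(candidates, (score + seen_right[node], node))
        else:
            node, score = right[j]
            j += 1
            seen_right[node] = score
            if node in seen_left:
                heapq.heappush(candidates, (seen_left[node] + score, node))
        # Lower bound on the score of any join result that has not been formed yet:
        # it must involve an unread entry, paired at best with the other input's top entry.
        bound_left = left[i][1] + right[0][1] if i < len(left) else inf
        bound_right = left[0][1] + right[j][1] if j < len(right) else inf
        threshold = min(bound_left, bound_right)
        # Buffered results at or below the threshold can no longer be beaten.
        while candidates and len(results) < k and candidates[0][0] <= threshold:
            total, node = heapq.heappop(candidates)
            results.append((node, total))
    return results

nb_u = [("w1", 1.0), ("w2", 2.0), ("w3", 2.0)]
nb_v = [("w3", 1.0), ("w1", 2.0), ("w4", 3.0)]
print(rank_join(nb_u, nb_v, 1))
```

Because both inputs are sorted, the join for `k = 1` stops before reading the remaining entries, which is why sorted input is a prerequisite.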

[Figure: neighborhoods of center nodes joined on shared hop nodes, over the keyword elements “john”, “steve”, “mary”, “alice”, “2009”, “acme”; legend: center node, hop node]


Index Storage

Pruned neighborhoods are stored as path entries
  Path entry (w, s) for each hop node w in $NB_u$
  Path entry index maps each node to its path entries (sorted)

Path index
  Stores paths for all center nodes and their path entries
  Used to reconstruct paths

Node    Path Entries
u1      (w1, 1.0), (w2, 2.0), (w3, 2.0)
u2      (w5, 1.0)
…
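The two indexes can be pictured as plain lookup tables. The sketch below is only an illustration: the dictionary layout, the node names, and the helper `reconstruct_paths` are assumptions, and the entries are made up rather than taken from the evaluation data.

```python
# Path entry index: node -> path entries (hop node, score), sorted by score.
path_entry_index = {
    "p4": [("o1", 1.0), ("p3", 1.0)],
    "p2": [("o1", 1.0), ("p3", 1.0)],
}

# Path index: (center node, hop node) -> stored paths from the center to the hop node.
path_index = {
    ("p4", "o1"): [["p4", "o1"]],
    ("p2", "o1"): [["p2", "o1"]],
}

def reconstruct_paths(path_index, u, v, w):
    """Combine every stored path u..w with every stored path v..w into a path u..w..v."""
    paths = []
    for left in path_index.get((u, w), []):
        for right in path_index.get((v, w), []):
            # Drop the shared hop node from the right half and walk it backwards.
            paths.append(left + list(reversed(right[:-1])))
    return paths

print(reconstruct_paths(path_index, "p4", "p2", "o1"))
```

Keeping the path entries sorted is what allows them to be fed directly into a rank join, while the path index is consulted only for hop nodes that survive the join.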


KEYWORD QUERY PROCESSING



Keyword Query Processing

Use joins to find connections between matching elements for all keywords

Base inputs: keyword neighborhood for each keyword
  Union of matching elements’ neighborhoods

Process
  Data access to retrieve keyword neighborhoods
  Joins to connect keyword-matching elements

Are all possible plans valid?

[Figure: query plan joining the keyword neighborhoods for “alice”, “john”, “steve”]
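A keyword neighborhood, the union of the neighborhoods of all elements matching a keyword, can be sketched as a merge of sorted path-entry lists. The index layout and the names `keyword_index`, `path_entries`, and `keyword_neighborhood` are hypothetical; scores are again assumed to sort ascending.

```python
import heapq

def keyword_neighborhood(keyword, keyword_index, path_entries):
    """Merge the sorted path entries of all elements matching the keyword.

    Returns (score, hop_node, center_node) tuples in ascending score order,
    so the result can serve directly as a sorted join input.
    """
    per_element = [
        [(score, hop, center) for hop, score in path_entries.get(center, [])]
        for center in keyword_index.get(keyword, [])
    ]
    return list(heapq.merge(*per_element, key=lambda entry: entry[0]))

# Made-up example: two data elements match the keyword "steve".
keyword_index = {"steve": ["n1", "n2"]}
path_entries = {
    "n1": [("w1", 1.0), ("w2", 2.0)],
    "n2": [("w1", 1.0), ("w3", 3.0)],
}
steve_nb = keyword_neighborhood("steve", keyword_index, path_entries)
for entry in steve_nb:
    print(entry)
```

Each tuple keeps its center node, so a later join on the hop node can still tell which matching element a connection starts from.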


Query Plans

Join order matters
  No single join order delivers all results (some might even be empty)

We do not know in advance which orders deliver results

Consider all possible join orders

[Figure: join orders over “steve”, “alice”, “john” for d = 2; one order is marked “No results!”]


Integrated Query Plan

Join operators in all query plans:

Query plans for different join orders overlap
  Share as many operators as possible

Join operators with sharing:

|K|   N’(K)   N(|K|, K)
2     2       1
3     12      6
4     72      24
5     480     100
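The N’(K) column is consistent with giving each of the |K|! join orders its own left-deep plan of |K| − 1 join operators. That reading is an inference from the table values, not a formula taken from the slides, and the sketch below only checks the arithmetic; it does not attempt to reproduce the count with sharing.

```python
from itertools import permutations
from math import factorial

def joins_without_sharing(keywords):
    """Count join operators when every join order gets its own left-deep plan."""
    total = 0
    for order in permutations(keywords):
        total += len(order) - 1  # a left-deep plan over n inputs has n - 1 joins
    return total

for n in range(2, 6):
    keywords = [f"k{i}" for i in range(n)]
    print(n, joins_without_sharing(keywords), factorial(n) * (n - 1))
```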


Top-k Keyword-Join Processing

High number of operators

Terminate early after computing top-k instead of all results
  Rank join operators
  Top-k union operator

Integrated Query Plan is a composition of many sub-plans

Some sub-plans might produce no results
  Pull-based operators will block until result can be produced
  Use push-based operators: execution driven by inputs instead of results

Some sub-plans might produce results earlier than others
  Rank not only results, but also rank operators
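The difference between pull-based and push-based execution can be illustrated with a small push-based join: inputs hand entries to the operator, and the operator forwards a result to its consumer only when a match exists. The class `PushJoin` and its method names are hypothetical, and the sketch leaves out ranking entirely.

```python
class PushJoin:
    """Symmetric hash join driven by its inputs rather than by requests for results."""

    def __init__(self, sink):
        self.left = {}    # join key -> values pushed from the left input
        self.right = {}   # join key -> values pushed from the right input
        self.sink = sink  # callable that receives each join result

    def push_left(self, key, value):
        self.left.setdefault(key, []).append(value)
        for other in self.right.get(key, []):
            self.sink((key, value, other))

    def push_right(self, key, value):
        self.right.setdefault(key, []).append(value)
        for other in self.left.get(key, []):
            self.sink((key, other, value))

results = []
join = PushJoin(results.append)
join.push_left("w1", "john")    # no partner yet: nothing is emitted, nothing blocks
join.push_right("w2", "steve")  # different hop node: still no result
join.push_right("w1", "steve")  # match on hop node w1: result is pushed to the sink
print(results)
```

A sub-plan that never finds a match simply never calls its sink, so it cannot stall the other sub-plans of an integrated plan.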


Operator Ranking

Prefer operators that have “promising” results

Global score of rank join operator, based on current results and upper bounds for subsequent join operations

R: intermediate results

$NB_K$: keyword neighborhoods not yet covered

Global score defined as

Join operators have a global score when they have results ready

Only the operator with the highest global score can push results to subsequent operators

Otherwise, lower level data access operators are activated
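The scheduling rule can be sketched as a single selection step. The global score is treated as an opaque number here because its definition is not reproduced in this text; the dictionary keys and the function `next_step` are hypothetical, and reading "otherwise" as "no join operator has results ready" is an assumption.

```python
def next_step(join_operators):
    """Decide what to run next.

    Each operator is a dict with the hypothetical keys 'name', 'results'
    (buffered results) and 'global_score' (higher means more promising).
    """
    ready = [op for op in join_operators if op["results"]]
    if not ready:
        # No join operator has results ready: fall back to data access.
        return ("data_access", None)
    best = max(ready, key=lambda op: op["global_score"])
    return ("push", best["name"])

operators = [
    {"name": "j1", "results": [], "global_score": 0.9},     # promising but nothing ready
    {"name": "j2", "results": ["r1"], "global_score": 0.4},
    {"name": "j3", "results": ["r2"], "global_score": 0.7},
]
print(next_step(operators))
```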



EVALUATION



Evaluation

Four approaches
  EASE: indexing at the level of graphs
  KJ: keyword join approach
  KJU: keyword join approach without operator ranking

Datasets
  BTC: 10M triples
  DBLP1/5/10: 1M, 5M, 10M triples (from SP2Bench)
  9 keyword queries for each dataset

Reduction of index storage size
  50% (DBLP1) to 79% (DBLP10)


Results

KJ, KJU outperform EASE

Operator ranking is beneficial



Results

Benefit of operator ranking more pronounced for larger queries as these need more join operators



Conclusion

Native keyword search based on data access and join

d-length 2-Hop Cover
  Index at the level of paths, instead of graphs

Top-k Keyword Join
  Exploration transformed into series of join operators
  Operator ranking

Reduces storage requirement and increases performance


Thank you for your attention! Questions?


Günter Ladwig, [email protected]


BACKUP SLIDES



Introduction

Keyword search on graph-structured data (RDF)

Query Translation
  Translate keywords into structured query using schema knowledge

Native Keyword Search
  No translation
  Match keywords to elements of the data graphs
  Find structures connecting these elements (Steiner graphs)
  More expensive than query translation approaches

Preprocess data and create special indexes
  Reduces search space during online query processing
  Requires offline preprocessing and storage


Example

Match keyword elements

Find connections between keyword elements

[Figure: example data graph with nodes p1–p4, o1, o2, l1 and labels Alice, Richard, Mary, Peter, ABC Corp, Malta, connected by knows, worksAt, and locatedIn edges]

Query: “alice malta peter”


Problem Definition

Given a graph GE=(NE,ER)

Find Steiner graphs connecting keyword elements


Scoring

Assumption: more compact Steiner graphs are more relevant

Scoring function
  GS: Steiner graph
  P: set of paths connecting its keyword elements

Other functions possible, but not part of this work


Approaches

Bidirectional Search
  Explore graph from keyword elements to find connections
  Does not scale well

EASE
  Indexes neighborhood graphs to restrict search space for exploration

Our approach
  Use database operations: data access and join
  Transform graph exploration into a series of join operations
  Improves storage requirements and performance


d-Length 2-Hop Cover

Preliminaries

Compact representation of connections in a graph
  Used to find paths between two nodes in a graph


Construction

Trivial d-length 2-hop cover is the set of all d-neighborhoods of GE, but contains redundancies

Finding a minimal 2-hop cover is NP-hard (Minimum Set Cover)

Approximation algorithm
  Select a “best” node covering a large number of paths
  Use its neighborhood to prune redundant paths from all other neighborhoods
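The greedy flavor of the approximation can be sketched as follows. This is a deliberately simplified stand-in, not the authors' construction: it covers connected node pairs rather than individual paths, treats the graph as undirected, and all names are hypothetical.

```python
from collections import deque
from itertools import combinations

def distances(graph, start, d):
    """Return {node: distance} for every node within d hops of start (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d:
            continue
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def greedy_cover(graph, d):
    """Greedily pick hop nodes until every pair within distance d shares a label."""
    dist = {n: distances(graph, n, d) for n in graph}
    uncovered = {(u, v) for u, v in combinations(sorted(graph), 2) if v in dist[u]}
    labels = {n: set() for n in graph}

    def covers(w, pairs):
        # Pairs that w connects with a combined length of at most d.
        return {(u, v) for u, v in pairs
                if u in dist[w] and v in dist[w] and dist[w][u] + dist[w][v] <= d}

    while uncovered:
        # "Best" node: the one covering the most still-uncovered pairs.
        best = max(sorted(graph), key=lambda w: len(covers(w, uncovered)))
        newly = covers(best, uncovered)
        for u, v in newly:
            labels[u].add(best)
            labels[v].add(best)
        uncovered -= newly
    return labels

# Hypothetical undirected toy graph: a - b - c
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
labels = greedy_cover(graph, 2)
print(labels)
```

On this toy graph the middle node covers all three pairs at once, so a single hop node per label suffices where the full 2-neighborhoods would store every node in every label.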


Example: Pruning

Pruned paths between two nodes can be reconstructed by intersecting their neighborhoods

Store each pruned neighborhood as a list of path entries

[Figure: pruning example for d = 2: neighborhoods over nodes p1–p4, o1, o2, l1, l2 with knows, worksAt, and locatedIn edges, before and after pruning; legend: center node, hop node]


Neighborhood Join

[Figure: neighborhood join of the center nodes p4 and p2 over shared hop nodes; legend: center node, hop node]

Result: Keyword Graphs
  p4 o1 p2
  p4 p3 p2
  ...

“p4 o1 p2” stands for all paths of length d between p4 and p2 through o1


Graph Join

Expand keyword graphs to keyword graph neighborhoods

Graph Join: joins keyword graph neighborhood with keyword neighborhood

[Figure: keyword graph “p4 o1 p2” expanded to its keyword graph neighborhood, with entries such as “p4 o1 p2 o3”, “p4 o1 p2 l2”, “p4 o1 p2 l1”, ...]


Integrated Query Plan

Number of join operators without operator sharing

Number of join operators with operator sharing

|K|   N’(K)   N(|K|, K)
2     2       1
3     12      6
4     72      24
5     480     100
