index structures and top-k joins for native keyword search databases
KIT – University of the State of Baden-Württemberg and National Large-scale Research Center of the Helmholtz Association
Institute of Applied Informatics and Formal Description Methods (AIFB)
www.kit.edu
Index Structures and Top-k Joins for Native Keyword Search Databases
Günter Ladwig, Thanh Tran
Conference on Information and Knowledge Management (CIKM 2011)
October 25th, 2011
Contents
Introduction: Native keyword search
Contributions
Index Structures: d-length 2-Hop Cover, path indexes
Keyword Query Processing: integrated query plan, operator ranking
Evaluation
Conclusion
CIKM 2011, Glasgow
Keyword Search on Graph-Structured Data
Keyword queries over structured data
Approaches:
Query translation (based on schema exploration)
Native keyword search (based on data graph exploration)
[Figure: example data graph with nodes labeled “john”, “steve”, “mary”, “alice”, “acme” and “2009”; example queries “steve 2009” and “john steve alice”]
Native Keyword Search
Match keywords to elements of the data graphs
Find structures connecting these elements (Steiner graphs)
More expensive than query translation approaches
Preprocess data to reduce online effort
[Figure: the example data graph and the Steiner graphs answering the queries “steve 2009” and “john steve alice”]
Native Keyword Search: EASE
Indexes at the level of r-maximal subgraphs:
Given a keyword query, find relevant subgraphs using the index
Explore subgraphs to construct Steiner graphs
High redundancy
Requires special operations: exploration, pruning
[Figure: EASE on the query “steve 2009”: indexed subgraphs of the example data graph are retrieved and then explored]
Native Keyword Search using Top-k Joins
Fine-grained indexing at the level of paths
More pruning, less redundancy: less storage required
Enables use of database query processing concepts: data access and top-k joins
Keyword search is now a “traditional” query processing problem
[Figure: path-level index entries for the query “steve 2009”, combined by joins]
Contributions
We propose a new processing strategy for the keyword search problem, based on the standard database operations data access and join
For efficient data access, we extend the 2-hop cover to pre-compute and materialize neighborhoods of data elements, indexing the data at the level of paths
Keyword search requires consideration of a large number of query plans: a push-based top-k join procedure ranks query plans during processing
INDEX STRUCTURES
d-length 2-Hop Cover
Compact representation of connections in a graph, used to find paths between two nodes
Extension of 2-Hop Cover to store only paths of length d or less
2-Hop Cover labels every node u with a neighborhood NBu
If two nodes u, v are connected via paths of length d or less, then NBu ∩ NBv ≠ ∅
All paths of length d or less between center nodes u and v are of the form u → … → w → … → v, with w ∈ NBu ∩ NBv
w is called a hop node
Construction prunes redundant entries from neighborhoods to reduce size of the cover
Finding Paths Using Joins
To find paths between two nodes u and v:
Retrieve neighborhoods NBu and NBv
Intersect NBu and NBv to obtain all hop nodes
Reconstruct paths between u and v through hop nodes
Intersection is performed as a rank join
Rank join requires its inputs to be sorted
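The intersection step can be sketched as a small rank join. This is an illustration only, not the authors' implementation: it assumes each neighborhood is a list of (hop node, distance) pairs sorted by ascending distance and that a connection costs the sum of both distances. The function name `rank_join` and the sample neighborhoods are hypothetical.

```python
import heapq
from math import inf

def rank_join(nb_u, nb_v, k):
    """Return up to k (total_distance, hop_node) pairs, cheapest first.

    nb_u, nb_v: lists of (hop_node, distance), sorted by ascending distance.
    """
    if not nb_u or not nb_v:
        return []
    seen_u, seen_v = {}, {}
    candidates, results = [], []      # candidates is a min-heap
    i = j = 0
    while len(results) < k:
        exhausted = i >= len(nb_u) and j >= len(nb_v)
        if not exhausted:
            # Alternate between the inputs; fall back to the one that is left.
            read_u = j >= len(nb_v) or (i < len(nb_u) and i <= j)
            if read_u:
                w, s = nb_u[i]
                i += 1
                seen_u[w] = s
                if w in seen_v:
                    heapq.heappush(candidates, (s + seen_v[w], w))
            else:
                w, s = nb_v[j]
                j += 1
                seen_v[w] = s
                if w in seen_u:
                    heapq.heappush(candidates, (s + seen_u[w], w))
        # Lower bound on the cost of any connection that uses an unread entry.
        next_u = nb_u[i][1] if i < len(nb_u) else inf
        next_v = nb_v[j][1] if j < len(nb_v) else inf
        threshold = min(next_u + nb_v[0][1], nb_u[0][1] + next_v)
        while candidates and candidates[0][0] <= threshold and len(results) < k:
            results.append(heapq.heappop(candidates))
        if exhausted and not candidates:
            break
    return results

# Hypothetical neighborhoods of two center nodes.
nb_p4 = [("p4", 0), ("o1", 1), ("p3", 1)]
nb_p2 = [("p2", 0), ("o1", 1), ("p3", 1)]
top2 = rank_join(nb_p4, nb_p2, 2)
```

Because both inputs are sorted, the threshold bounds every connection not yet seen, so a result at or below it can be emitted before the inputs are fully read.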
[Figure: example graph (nodes “john”, “steve”, “mary”, “alice”, “acme”, “2009”) with center nodes and hop nodes marked]
Index Storage
Pruned neighborhoods are stored as path entries
Path entry (w,s) for each hop node w in NBu
Path entry index maps each node to its path entries (sorted)
Path index:
Stores paths for all center nodes and their path entries
Used to reconstruct paths
Node    Path Entries
u1      (w1, 1.0), (w2, 2.0), (w3, 2.0)
u2      (w5, 1.0)
…
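The two indexes can be pictured as plain lookup tables. The sketch below is a minimal in-memory illustration, assuming the second component of a path entry is a sortable score. The function names and the contents of `path_index` are hypothetical; only the path entries for u1 and u2 come from the table above.

```python
from collections import defaultdict

def build_path_entry_index(entries):
    """Group (center, hop, score) triples by center node, sorted by score."""
    index = defaultdict(list)
    for center, hop, score in entries:
        index[center].append((hop, score))
    for path_entries in index.values():
        path_entries.sort(key=lambda entry: (entry[1], entry[0]))
    return dict(index)

# Path entries from the table, given in arbitrary order.
raw_entries = [
    ("u1", "w3", 2.0),
    ("u2", "w5", 1.0),
    ("u1", "w1", 1.0),
    ("u1", "w2", 2.0),
]
path_entry_index = build_path_entry_index(raw_entries)

# Hypothetical path index: (center, hop) -> paths from the center to the hop.
path_index = {
    ("u1", "w1"): [("u1", "w1")],
    ("u1", "w2"): [("u1", "w1", "w2")],
}

def paths_via(center, hop):
    """Look up the stored paths for one path entry (empty if none stored)."""
    return path_index.get((center, hop), [])
```

Keeping the path entries sorted at build time is what lets a rank join consume them without sorting at query time.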
KEYWORD QUERY PROCESSING
Keyword Query Processing
Use joins to find connections between matching elements for all keywords
Base inputs: keyword neighborhood for each keyword (union of the matching elements’ neighborhoods)
Process:
Data access to retrieve keyword neighborhoods
Joins to connect keyword-matching elements
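A keyword neighborhood, as the union of the matching elements' neighborhoods, can be sketched as a merge of sorted lists. This is an assumption-laden illustration: it presumes a path entry index that maps each element to (hop, score) pairs sorted by ascending score, and the function name and sample data are hypothetical.

```python
import heapq

def keyword_neighborhood(matching_elements, path_entry_index):
    """Merge the sorted neighborhoods of all elements matching one keyword.

    Returns (score, hop, center) triples in ascending score order, so the
    result can be fed directly into a rank join.
    """
    streams = [
        [(score, hop, center) for hop, score in path_entry_index.get(center, [])]
        for center in matching_elements
    ]
    return list(heapq.merge(*streams))

# Hypothetical index and keyword matches.
sample_index = {
    "u1": [("w1", 1.0), ("w2", 2.0)],
    "u2": [("w1", 1.0), ("w5", 1.5)],
}
nb_keyword = keyword_neighborhood(["u1", "u2"], sample_index)
```

Each entry keeps its center element, because a hop node reached from two different matching elements leads to two different connections.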
Are all possible plans valid?
[Figure: join plan over the keyword neighborhoods of “alice”, “john” and “steve”]
Query Plans
Join order matters:
No single join order delivers all results (some orders might even deliver none)
We do not know in advance which orders deliver results
Consider all possible join orders
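Why some join orders come back empty can be shown with a toy model. It assumes, purely for illustration, that a keyword can be joined only if one of its elements lies within d hops of an already joined keyword. The `connected` relation below is hypothetical and not taken from the slides.

```python
from itertools import permutations

def order_produces_results(order, connected):
    """Check whether a left-deep join order yields a non-empty result."""
    joined = [order[0]]
    for keyword in order[1:]:
        if not any(frozenset((keyword, other)) in connected for other in joined):
            return False      # this join is empty, so the whole plan is empty
        joined.append(keyword)
    return True

keywords = ["john", "steve", "alice"]
# Hypothetical: "steve" and "alice" are each within d hops of "john",
# but not within d hops of each other.
connected = {frozenset(("john", "steve")), frozenset(("john", "alice"))}

all_orders = list(permutations(keywords))
productive = [o for o in all_orders if order_produces_results(o, connected)]
```

In this toy setting the two orders that start by joining “steve” with “alice” produce nothing, which cannot be known before the joins are actually run.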
[Figure: join orders over “steve”, “john” and “alice” with d = 2; one of the orders yields no results]
Integrated Query Plan
Query plans for different join orders overlap: share as many operators as possible
Number of join operators in all query plans, without sharing (N’(K)) and with sharing (N(|K|, K)):
|K|    N’(K)    N(|K|, K)
2      2        1
3      12       6
4      72       24
5      480      100
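As a sanity check on the table, the N’(K) column matches |K|! · (|K| − 1), which is the count one would expect if every permutation of the keywords is its own left-deep plan with |K| − 1 joins. This is an observation about the listed numbers, not a formula taken from the slides, and the function name is hypothetical. No closed form is assumed for the sharing column.

```python
from math import factorial

# Values copied from the table: |K| -> (N'(K), N(|K|, K)).
table = {2: (2, 1), 3: (12, 6), 4: (72, 24), 5: (480, 100)}

def unshared_join_operators(k):
    """k! join orders, each a left-deep plan with k - 1 join operators."""
    return factorial(k) * (k - 1)

# Fraction of join operators saved by sharing, per query size.
savings = {k: 1 - shared / unshared for k, (unshared, shared) in table.items()}
```

For five keywords, sharing reduces 480 join operators to 100 according to the table, a saving of roughly 79%.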
Top-k Keyword-Join Processing
High number of operators
Terminate early after computing the top-k instead of all results:
Rank join operators
Top-k union operator
Integrated Query Plan is a composition of many sub-plans
Some sub-plans might produce no results:
Pull-based operators will block until a result can be produced
Use push-based operators: execution driven by inputs instead of results
Some sub-plans might produce results earlier than others:
Rank not only results, but also rank operators
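The push-based control flow can be sketched with a small symmetric hash join. It shows only how execution is driven by arriving inputs and leaves out ranking entirely. The class name `PushJoin` and its interface are hypothetical.

```python
class PushJoin:
    """Join operator driven by its inputs: every pushed tuple is joined
    immediately and matches are forwarded to the consumer callback."""

    def __init__(self, consumer):
        self.left = {}            # join key -> payloads seen on the left
        self.right = {}           # join key -> payloads seen on the right
        self.consumer = consumer

    def push(self, side, key, payload):
        if side == "left":
            own, other = self.left, self.right
        else:
            own, other = self.right, self.left
        own.setdefault(key, []).append(payload)
        for match in other.get(key, []):
            pair = (payload, match) if side == "left" else (match, payload)
            self.consumer(key, pair)

# Usage: the consumer could be the push method of the next operator.
output = []
join = PushJoin(lambda key, pair: output.append((key, pair)))
join.push("left", "o1", "p4")     # no partner yet, nothing is forwarded
join.push("right", "o1", "p2")    # partner found, result is pushed on
```

An operator written this way never waits on an input that stays empty: a sub-plan without results is simply never called.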
Operator Ranking
Prefer operators that have “promising” results
Global score of rank join operator, based on current results and upper bounds for subsequent join operations
R: intermediate results
NBK: keyword neighborhoods not yet covered
Global score defined as
Join operators have a global score when they have results ready
Only the operator with the highest global score can push results to subsequent operators
Otherwise, lower level data access operators are activated
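The scheduling rule can be sketched as follows. The exact global score formula is not preserved in this transcript, so the sketch assumes a stand-in: the best current result score plus the upper bounds of the keyword neighborhoods not yet covered, with higher scores being better. All names and numbers are hypothetical.

```python
def global_score(result_scores, uncovered_upper_bounds):
    """Assumed stand-in: best current result plus the best the remaining
    keyword neighborhoods could still contribute."""
    return max(result_scores) + sum(uncovered_upper_bounds)

def next_operator(operators):
    """Pick the join operator allowed to push results next.

    Returns None when no operator has results ready, which signals that
    a lower-level data access operator should be activated instead.
    """
    ready = {
        name: global_score(op["results"], op["uncovered"])
        for name, op in operators.items()
        if op["results"]
    }
    if not ready:
        return None
    return max(ready, key=ready.get)

# Hypothetical state of three join operators in an integrated plan.
operators = {
    "join(john,steve)": {"results": [0.5, 0.4], "uncovered": [0.5]},
    "join(john,alice)": {"results": [0.3], "uncovered": [0.5]},
    "join(steve,alice)": {"results": [], "uncovered": [0.5]},
}
chosen = next_operator(operators)
```

An operator without results has no global score and is skipped, which mirrors the rule that only operators with results ready take part in the ranking.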
EVALUATION
Evaluation
Four approaches:
EASE: indexing at the level of graphs
KJ: keyword join approach
KJU: keyword join approach without operator ranking
Datasets:
BTC: 10M triples
DBLP1/5/10: 1M, 5M, 10M triples (from SP2Bench)
9 keyword queries for each dataset
Reduction of index storage size: 50% (DBLP1) – 79% (DBLP10)
Results
KJ, KJU outperform EASE
Operator ranking is beneficial
Results
Benefit of operator ranking more pronounced for larger queries as these need more join operators
Conclusion
Native keyword search based on data access and join
d-length 2-Hop Cover: index at the level of paths, instead of graphs
Top-k Keyword Join:
Exploration transformed into a series of join operators
Operator ranking
Reduces storage requirement and increases performance
Thank you for your attention! Questions?
Günter Ladwig
BACKUP SLIDES
Introduction
Keyword search on graph-structured data (RDF)
Query Translation:
Translate keywords into a structured query using schema knowledge
Native Keyword Search:
No translation
Match keywords to elements of the data graphs
Find structures connecting these elements (Steiner graphs)
More expensive than query translation approaches
Preprocess data and create special indexes:
Reduces search space during online query processing
Requires offline preprocessing and storage
Example
Match keyword elements
Find connections between keyword elements
[Figure: example data graph with node ids p1-p4, o1, o2 and l1, labels Alice, Richard, Mary, Peter, ABC Corp and Malta, and edges labeled knows, worksAt and locatedIn]
Query: “alice malta peter”
Problem Definition
Given a graph GE=(NE,ER)
Find Steiner graphs connecting keyword elements
Scoring
Assumption: more compact Steiner graphs are more relevant
Scoring function:
GS: Steiner graph
P: set of paths connecting its keyword elements
Other functions possible, but not part of this work
Approaches
Bidirectional Search:
Explore graph from keyword elements to find connections
Does not scale well
EASE:
Indexes neighborhood graphs to restrict search space for exploration
Our approach:
Use database operations: data access and join
Transform graph exploration into a series of join operations
Improves storage requirements and performance
d-Length 2-Hop Cover
Preliminaries
Compact representation of connections in a graph, used to find paths between two nodes
Construction
Trivial d-length 2-hop cover is the set of all d-neighborhoods of GE, but contains redundancies
Finding a minimal 2-hop cover is NP-hard (Minimum Set Cover)
Approximation algorithm:
Select a “best” node covering a large number of paths
Use its neighborhood to prune redundant paths from all other neighborhoods
Example: Pruning
Pruned paths between two nodes can be reconstructed by intersecting their neighborhoods
Store each pruned neighborhood as a list of path entries
[Figure: pruning example with d = 2: neighborhoods over the nodes p1-p4, o1, o2, l1 and l2 (edges knows, worksAt, locatedIn) before and after pruning, with center nodes and hop nodes marked]
Neighborhood Join
[Figure: neighborhood join of two neighborhoods with center nodes and hop nodes marked. Result: keyword graphs such as p4-o1-p2 and p4-p3-p2, where p4-o1-p2 stands for all paths of length d between p4 and p2 through o1]
Graph Join
Expand keyword graphs to keyword graph neighborhoods
Graph Join: joins keyword graph neighborhood with keyword neighborhood
[Figure: the keyword graph p4-o1-p2 is expanded into its keyword graph neighborhood, with entries such as p4-o1-p2 + o3, p4-o1-p2 + l2 and p4-o1-p2 + l1]
Integrated Query Plan
Number of join operators without operator sharing
Number of join operators with operator sharing
|K|    N’(K)    N(|K|, K)
2      2        1
3      12       6
4      72       24
5      480      100