Page 1: On the Efficiency of Joining Group Patterns in SPARQL Queries

On the Efficiency of Joining Group Patterns in SPARQL Queries

María-Esther Vidal¹, Edna Ruckhaus¹, Tomás Lampo¹, Amadís Martínez¹, Javier Sierra¹ and Axel Polleres²

¹ Universidad Simón Bolívar, Caracas, Venezuela
² DERI, National University of Ireland, Galway

On the Efficiency of Joining Group Patterns in SPARQL Queries. Vidal, Ruckhaus, Lampo, Martinez, Sierra and Polleres. ESWC'2010. Greece

Page 2: On the Efficiency of Joining Group Patterns in SPARQL Queries

Motivation: Cloud of Linked Data

• Explosion in the number of:
– Linking Open Data datasets.
– Controlled vocabularies: MeSH, GO, PO…
• Extremely large datasets of linked data.
• Open Government Data.
• Social Networks.
• DBpedia.
• Geonames.
• Sensorpedia.

In October 2007, the Cloud of Linked Data datasets consisted of over two billion RDF triples, interlinked by over two million RDF links. By May 2009 it had grown to 4.2 billion RDF triples, interlinked by around 142 million RDF links. Today the Cloud of Linked Data has at least 13,112,409,691 triples.

Techniques to efficiently store and query Linked Data are required!

Page 3: On the Efficiency of Joining Group Patterns in SPARQL Queries

Query Planning: Example

A left-linear query plan: {?E vote:winner 'Nay' . ?E dc:title ?T . ?E vote:hasBallot ?I . ?I vote:option ?X . ?J vote:option ?X . ?E vote:hasBallot ?J . ?J vote:voter 'people:L000174' . FILTER (?I != ?J)}

Evaluation cost is the sum of the sizes of all the intermediate results.

Sample instantiations of the first three patterns:

?E pro1 obj1
2004-53 winner 'Nay'
2004-49 winner 'Nay'
… … …

?E pro2 obj2
2004-53 title 'Dodd Amdt. No. 2762'
2004-49 title 'Motion To Waive CBA'
2004-104 title 'Confirmation F. Dennis Saylor'
… … …

?E pro3 ?I
2004-178 hasBallot 'anon1'
2004-53 hasBallot 'anon2'
2004-187 hasBallot 'anon3'
… … …

Result of joining the first two patterns on ?E:

?E pro1 obj1 pro2 obj2
2004-53 winner 'Nay' title 'Dodd Amdt. No. 2762'
2004-49 winner 'Nay' title 'Motion To Waive CBA'
… … …

[Figure: left-linear plan tree with leaves 'Nay' and 'L000174', annotated with input and intermediate result sizes: 82, 216, 8,200, 21,600, 78,631,005, 394,720 and 3,826.]

Evaluation cost = 82 + 8,200 + 8,200 + 78,631,005 + 394,720 + 3,826 = 79,046,033

Page 4: On the Efficiency of Joining Group Patterns in SPARQL Queries

Query Planning: Example

Small star-shaped groups:

{?V voter 'L000174' . ?V option ?X . ?E hasBallot ?V}

[Figure: the patterns voter, option and hasBallot share the variable ?V and are combined with two njoin operators.]

Main idea: Optimize and arrange these separately!

Assumption: Such groups are prototypical to identify/filter objects in many RDF queries.

Page 5: On the Efficiency of Joining Group Patterns in SPARQL Queries

Query Planning: Example

A bushy query plan: {{{?E vote:winner 'Nay' . ?E dc:title ?T} . {?E vote:hasBallot ?I . ?I vote:option ?X}} . {?J vote:voter 'people:L000174' . ?J vote:option ?X . ?E vote:hasBallot ?J} . FILTER (?I != ?J)}

[Figure: bushy plan tree over the star-shaped groups, with leaves 'Nay' and 'L000174', annotated with input and intermediate result sizes: 82, 216, 8,200, 21,600 and 3,826.]

Evaluation cost = 216 + 216 + 82 + 21,600 + 8,200 + 3,826 = 34,140

Page 6: On the Efficiency of Joining Group Patterns in SPARQL Queries

Query Planning: Example

[Figure: the left-linear plan and the bushy plan side by side, each with leaves 'Nay' and 'L000174'.]

Evaluation cost of the left-linear plan: 79,046,033
Evaluation cost of the bushy plan: 34,140

Plans comprised of small star-shaped groups may reduce query evaluation cost by several orders of magnitude!
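The comparison between the two example plans can be checked with a few lines of arithmetic. The sketch below simply sums the intermediate result sizes read off the left-linear and the bushy plan; the variable and function names are illustrative and not part of the original slides.

```python
# Intermediate result sizes read off the two example plans.
left_linear_sizes = [82, 8_200, 8_200, 78_631_005, 394_720, 3_826]
bushy_sizes = [216, 216, 82, 21_600, 8_200, 3_826]


def evaluation_cost(intermediate_sizes):
    """Evaluation cost = sum of the sizes of all intermediate results."""
    return sum(intermediate_sizes)


left_linear_cost = evaluation_cost(left_linear_sizes)
bushy_cost = evaluation_cost(bushy_sizes)
ratio = left_linear_cost / bushy_cost

if __name__ == "__main__":
    print("left-linear:", left_linear_cost)
    print("bushy:      ", bushy_cost)
    print("ratio:       %.1f" % ratio)
```

Under this cost measure the bushy plan is roughly 2,300 times cheaper, that is, about three orders of magnitude.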

Page 7: On the Efficiency of Joining Group Patterns in SPARQL Queries

Query Planning: Example

Transforming a left-linear query plan into a bushy query plan

[Figure: the left-linear plan and the resulting bushy plan, each with leaves 'Nay' and 'L000174'.]

• Fold into a star-shaped group rule: A njoin B ⇒ {A njoin B}
• Grouping: {A njoin B} njoin {C njoin D} ⇒ {A njoin B} gjoin {C njoin D}
• Associativity: {A njoin B} njoin C ⇒ A njoin {B njoin C}
• Symmetry: A njoin B ⇒ B njoin A
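The four rewrite rules can be sketched as functions over a small plan tree. This is a minimal illustration, not the authors' implementation: the `Pattern` and `Join` classes, the `grouped` flag (standing for the braces in the slide notation) and the `show` helper are all hypothetical names introduced here.

```python
from dataclasses import dataclass, replace
from typing import Union


@dataclass(frozen=True)
class Pattern:
    name: str


@dataclass(frozen=True)
class Join:
    op: str                 # "njoin" or "gjoin"
    left: "Plan"
    right: "Plan"
    grouped: bool = False   # True when the join is written in braces {...}


Plan = Union[Pattern, Join]


def is_group(p):
    return isinstance(p, Join) and p.op == "njoin" and p.grouped


def symmetry(p):
    """A njoin B => B njoin A"""
    if isinstance(p, Join) and p.op == "njoin":
        return replace(p, left=p.right, right=p.left)
    return p


def associativity(p):
    """{A njoin B} njoin C => A njoin {B njoin C}"""
    if isinstance(p, Join) and p.op == "njoin" and is_group(p.left):
        a, b, c = p.left.left, p.left.right, p.right
        return replace(p, left=a, right=Join("njoin", b, c, grouped=True))
    return p


def grouping(p):
    """{A njoin B} njoin {C njoin D} => {A njoin B} gjoin {C njoin D}"""
    if (isinstance(p, Join) and p.op == "njoin"
            and is_group(p.left) and is_group(p.right)):
        return replace(p, op="gjoin")
    return p


def fold(p):
    """A njoin B => {A njoin B}"""
    if isinstance(p, Join) and p.op == "njoin" and not p.grouped:
        return replace(p, grouped=True)
    return p


def show(p):
    """Render a plan in the brace notation used on the slide."""
    if isinstance(p, Pattern):
        return p.name
    text = f"{show(p.left)} {p.op} {show(p.right)}"
    return "{" + text + "}" if p.grouped else text


if __name__ == "__main__":
    A, B, C, D = (Pattern(n) for n in "ABCD")
    ab, cd = fold(Join("njoin", A, B)), fold(Join("njoin", C, D))
    print(show(grouping(Join("njoin", ab, cd))))
    print(show(associativity(Join("njoin", ab, C))))
```

Each function returns the plan unchanged when its rule does not apply, which keeps the rules safe to call from a random-walk optimizer.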

Page 8: On the Efficiency of Joining Group Patterns in SPARQL Queries

Our Challenges

• Develop a query engine able to efficiently evaluate SPARQL-based queries:
– Provide a set of physical operators able to exploit the properties of the small star-shaped groups.
– Develop cost models to accurately estimate plan evaluation cost.
– Implement query optimization techniques able to explore the space of plans comprised of small star-shaped groups.

Page 9: On the Efficiency of Joining Group Patterns in SPARQL Queries

Let's go into details…

• Physical Operators
• Query Optimizer:
– Hybrid Cost Model
  • Adaptive Sampling
– Simulated Annealing Query Optimizer
• … and what did we gain?
– Experimental Results (Datasets, Results)

Page 10: On the Efficiency of Joining Group Patterns in SPARQL Queries

Star-Shaped Groups

Groups of triple patterns that are joined on exactly one common variable.

{?A1 vote:voter people:L000174 . ?A1 vote:option ?O1 . ?E1 vote:hasBallot ?A1 }
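One way to read this definition is as a simple check over the variables of a group. The sketch below assumes that triple patterns are 3-tuples of strings, that variables start with `?`, and that "exactly one common variable" means a single variable occurs in every pattern while no other variable is shared; the function names are hypothetical.

```python
def variables(pattern):
    """Variables of a triple pattern (terms starting with '?')."""
    return {term for term in pattern if term.startswith("?")}


def star_variable(group):
    """Return the shared variable if `group` is star-shaped, else None.

    Assumed reading: exactly one variable occurs in every pattern,
    and no other variable occurs in more than one pattern.
    """
    if len(group) < 2:
        return None
    var_sets = [variables(p) for p in group]
    common = set.intersection(*var_sets)
    shared = {
        v
        for vs in var_sets
        for v in vs
        if sum(v in other for other in var_sets) > 1
    }
    if len(common) == 1 and shared == common:
        return next(iter(common))
    return None


star = [
    ("?A1", "vote:voter", "people:L000174"),
    ("?A1", "vote:option", "?O1"),
    ("?E1", "vote:hasBallot", "?A1"),
]
chain = [
    ("?E", "vote:hasBallot", "?I"),
    ("?I", "vote:option", "?X"),
    ("?J", "vote:option", "?X"),
]

if __name__ == "__main__":
    print(star_variable(star))
    print(star_variable(chain))
```

For the group on this slide the check returns `?A1`; the chain-shaped group has no variable common to all patterns and is rejected.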

Page 11: On the Efficiency of Joining Group Patterns in SPARQL Queries

Physical Operators

Njoin: on star-shaped groups we perform (index) nested loop joins: scan the triples of the outer group and loop on the inner group for matching triples.

Outer and inner groups can be of any shape.

Gjoin: we join star-shaped groups by independently evaluating each group and then matching their results.

Each group can be a star-shaped group or a sub-tree of any shape.
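The difference between the two operators can be illustrated on a handful of triples. The following is a toy sketch, not any engine's implementation: the triples are taken from the sample tables in the slides, the in-memory index layout is an assumption, predicates are assumed to be constants, and all function names are hypothetical. Here gjoin matches the two independent results with a hash table, which is one possible matching strategy.

```python
from collections import defaultdict

TRIPLES = [
    ("2004-53", "winner", "Nay"),
    ("2004-49", "winner", "Nay"),
    ("2004-53", "title", "Dodd Amdt. No. 2762"),
    ("2004-49", "title", "Motion To Waive CBA"),
    ("2004-104", "title", "Confirmation F. Dennis Saylor"),
    ("2004-178", "hasBallot", "anon1"),
    ("2004-53", "hasBallot", "anon2"),
    ("2004-187", "hasBallot", "anon3"),
    ("anon1", "option", "Aye"),
    ("anon2", "option", "Aye"),
    ("anon3", "option", "Aye"),
]


def build_index(triples):
    """Index triples by (predicate,) and by (subject, predicate)."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[(p,)].append((s, p, o))
        index[(s, p)].append((s, p, o))
    return index


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def eval_pattern(index, pattern, binding=None):
    """Evaluate one triple pattern, probing the index with bound values."""
    binding = binding or {}
    s = binding.get(pattern[0], pattern[0])
    p = pattern[1]
    key = (p,) if s.startswith("?") else (s, p)
    results = []
    for triple in index.get(key, []):
        extended = match(pattern, triple, binding)
        if extended is not None:
            results.append(extended)
    return results


def njoin(index, outer, inner_pattern):
    """Index nested loop join: probe the inner pattern once per outer binding."""
    return [b for ob in outer for b in eval_pattern(index, inner_pattern, ob)]


def gjoin(left, right):
    """Join two independently evaluated groups by matching shared variables."""
    if not left or not right:
        return []
    shared = sorted(set(left[0]) & set(right[0]))
    table = defaultdict(list)
    for b in left:
        table[tuple(b[v] for v in shared)].append(b)
    return [
        {**l, **r}
        for r in right
        for l in table.get(tuple(r[v] for v in shared), [])
    ]


index = build_index(TRIPLES)
left = njoin(index, eval_pattern(index, ("?E", "winner", "Nay")), ("?E", "title", "?T"))
right = njoin(index, eval_pattern(index, ("?E", "hasBallot", "?I")), ("?I", "option", "?X"))
result = gjoin(left, right)

if __name__ == "__main__":
    for row in result:
        print(row)
```

In this sketch njoin never materializes the inner side: it only touches the triples reachable from each outer binding. gjoin pays for evaluating both sides in full, which is worthwhile when both groups are selective.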

Page 12: On the Efficiency of Joining Group Patterns in SPARQL Queries

Physical Operators - njoin

?E pro1 obj1
2004-53 winner 'Nay'
2004-49 winner 'Nay'
… … …

?E pro2 ?T
2004-53 title 'Dodd Amdt. No. 2762'
2004-49 title 'Motion To Waive CBA'
2004-104 title 'Confirmation F. D…'
… … …

?E pro3 ?I
2004-178 hasBallot 'anon1'
2004-53 hasBallot 'anon2'
2004-187 hasBallot 'anon3'
… … …

[Figure: the patterns winner, title and hasBallot are combined with two njoin operators. Instantiations of ?E passed at each njoin: 2004-53, 2004-49.]

Page 13: On the Efficiency of Joining Group Patterns in SPARQL Queries

Physical Operators - gjoin

?E pro1 obj1
2004-53 winner 'Nay'
2004-49 winner 'Nay'
… … …

?E pro2 ?T
2004-53 title 'Dodd Amdt. No. 2762'
2004-49 title 'Motion To Waive CBA'
2004-104 title 'Confirmation F. D…'
… … …

?E pro3 ?I
2004-178 hasBallot 'anon1'
2004-53 hasBallot 'anon2'
2004-187 hasBallot 'anon3'
… … …

?I pro4 ?X
'anon1' option 'Aye'
'anon2' option 'Aye'
'anon3' option 'Aye'
… … …

Result of hasBallot njoin option:

?E pro3 ?I pro4 ?X
2004-178 hasBallot 'anon1' option 'Aye'
2004-53 hasBallot 'anon2' option 'Aye'
2004-187 hasBallot 'anon3' option 'Aye'
… … …

[Figure: plan (winner njoin title) gjoin (hasBallot njoin option); the two groups are evaluated independently and their results are matched by the gjoin.]

Page 14: On the Efficiency of Joining Group Patterns in SPARQL Queries

Hybrid Cost Model

Estimate cost and cardinality:

1. Techniques based on adaptive sampling (Lipton et al., 1990) to:
– Accurately estimate cost and cardinality for star-shaped groups.

2. Cost formulas similar to well-known relational database cost models:
– Applied to patterns and sub-trees.

Page 15: On the Efficiency of Joining Group Patterns in SPARQL Queries

Adaptive Sampling – Example: Estimating cardinality

P = {?A1 vote:voter people:L000174 . ?A1 vote:option ?O1 . ?E1 vote:hasBallot ?A1}

card(P) ≈ ( Σ_{a ∈ M} card({a vote:option ?O1 . ?E1 vote:hasBallot a}) / |M| ) × N

• The population U is all the valid instantiations of one of the patterns,

?A1 vote:voter people:L000174

and the population is partitioned according to ?A1, where N is the number of different instantiations.

• A sample M of instantiations of ?A1 is randomly selected, and for each a in M the following group is evaluated and its cardinality computed:

{a vote:option ?O1 . ?E1 vote:hasBallot a}
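The estimator above can be sketched in a few lines. This is a simplified illustration that draws a fixed-size sample rather than choosing the sample size adaptively as in Lipton et al.; the function name, the `count_matches` callback and the toy counts are assumptions made for the example, not data from the experiments.

```python
import random


def estimate_cardinality(population, count_matches, sample_size, rng=random):
    """Estimate card(P) as (average matches per sampled instantiation) x N.

    population    -- list of the N instantiations of the partitioning variable
    count_matches -- function a -> cardinality of the remaining patterns for a
    sample_size   -- number of instantiations to sample (|M|)
    """
    n = len(population)
    m = min(sample_size, n)
    if m == 0:
        return 0.0
    sample = rng.sample(population, m)
    return sum(count_matches(a) for a in sample) / m * n


# Toy data (hypothetical): ballots cast by the voter, and for each ballot the
# number of solutions of {a vote:option ?O1 . ?E1 vote:hasBallot a}.
ballots = ["anon1", "anon2", "anon3", "anon4"]
matches_per_ballot = {"anon1": 1, "anon2": 1, "anon3": 2, "anon4": 0}

exact = sum(matches_per_ballot.values())
full_sample_estimate = estimate_cardinality(
    ballots, matches_per_ballot.__getitem__, sample_size=len(ballots)
)
partial_estimate = estimate_cardinality(
    ballots, matches_per_ballot.__getitem__, sample_size=2, rng=random.Random(7)
)

if __name__ == "__main__":
    print("exact:", exact)
    print("estimate with full sample:", full_sample_estimate)
    print("estimate with 2 samples:  ", partial_estimate)
```

When the sample covers the whole population the estimate coincides with the exact cardinality; smaller samples trade accuracy for fewer pattern evaluations.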

Page 16: On the Efficiency of Joining Group Patterns in SPARQL Queries

Cost Model-Formulas

Njoin cost model formula:

Gjoin cost model formula:

Page 17: On the Efficiency of Joining Group Patterns in SPARQL Queries

Simulated Annealing Query Optimizer

• Random walks:
– Over the space of bushy query execution plans.
– Performed in stages:
  • Each stage consists of an initial plan generation step and one or more plan transformation steps.
– Stages start with the generation of a random plan.
  • Successive plan transformations are applied.
  • The probability of transforming a plan p into a plan p' depends on an acceptance probability function P(p, p', T), where T is the temperature.

• Note: Allows "anytime query optimization".
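A single stage of such a random walk might look like the sketch below. The slides do not define P(p, p', T), so the standard Metropolis criterion is used here as an assumption; the geometric cooling factor, the toy cost function and the swap-based neighbor function are likewise hypothetical stand-ins for the hybrid cost model and the transformation rules. The defaults of temperature 700 and 20 iterations follow the optimizer set-up reported in the experimental slides.

```python
import math
import random


def acceptance_probability(cost_p, cost_q, temperature):
    """Metropolis criterion (an assumption; P(p, p', T) is not given in the slides)."""
    if cost_q <= cost_p:
        return 1.0
    return math.exp((cost_p - cost_q) / temperature)


def simulated_annealing(initial_plan, cost, neighbor,
                        temperature=700.0, iterations=20, cooling=0.9, rng=None):
    """One stage of the random walk; returns the best plan seen so far."""
    rng = rng or random.Random()
    current = best = initial_plan
    for _ in range(iterations):
        candidate = neighbor(current, rng)
        p = acceptance_probability(cost(current), cost(candidate), temperature)
        if rng.random() < p:
            current = candidate
        if cost(current) < cost(best):
            best = current          # "anytime": best can be returned at any point
        temperature *= cooling
    return best


def toy_cost(order):
    """Toy cost: sum of the running products of the input sizes."""
    total, running = 0, 1
    for size in order:
        running *= size
        total += running
    return total


def swap_neighbor(plan, rng):
    """Toy transformation: swap two inputs of the plan."""
    i, j = rng.sample(range(len(plan)), 2)
    new = list(plan)
    new[i], new[j] = new[j], new[i]
    return tuple(new)


initial = (21_600, 216, 82)
best = simulated_annealing(initial, toy_cost, swap_neighbor, rng=random.Random(0))

if __name__ == "__main__":
    print("initial cost:", toy_cost(initial))
    print("best cost:   ", toy_cost(best))
```

Because the best plan seen so far is tracked separately from the current plan, the loop can be stopped after any iteration and still return a plan no worse than the starting one.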

Page 18: On the Efficiency of Joining Group Patterns in SPARQL Queries

Simulated Annealing Transformation Rules

Rule1: Symmetry:

Rule2: Associativity:

Rule3: Distributivity (Linear to Bushy):

Page 19: On the Efficiency of Joining Group Patterns in SPARQL Queries

Simulated Annealing Transformation Rules

Rule4: Grouping:

Rule5: Fold into a star-shaped group:

Rule6: Unfold a star-shaped group:

Page 20: On the Efficiency of Joining Group Patterns in SPARQL Queries

Related Work & Experiments

• RDF-based engines:
– Jena-ARQ
– Jena TDB
– Sesame
– YARS2
– RDF-3X

None of them has native optimization strategies tailored to small star-shaped groups!

Page 21: On the Efficiency of Joining Group Patterns in SPARQL Queries

Experimental Study

Datasets:

DS1: US Congress bills 2004, http://www.govtrack.us/data/

DS2: YAGO, http://www.mpi-inf.mpg.de/~suchanek/downloads/yago/

Dataset       Number of triples
DS1           67,392
DS2 (YAGO)    more than 18 million

Page 22: On the Efficiency of Joining Group Patterns in SPARQL Queries

Experimental Study

Queries:

– Benchmark1 – US Congress Bills 2004
  • 9 queries, between 3 and 7 patterns.
  • At least one pattern is instantiated.
– Benchmark2 – US Congress Bills 2004
  • 60 queries, more than 12 triple patterns, between 1 and 7 gjoins between small star-shaped groups.
– Benchmark3 – US Congress Bills 2004
  • 16 queries, between 3 and 9 triple patterns, small star-shaped groups.
– Benchmark4 – YAGO
  • 11 queries, between 17 and 25 patterns; only one query has a non-empty answer.
– Benchmark5 – YAGO
  • 9 queries, between 17 and 26 patterns; all the queries have non-empty answers.

Page 23: On the Efficiency of Joining Group Patterns in SPARQL Queries

Experimental Set Up

Metric: evaluation time measured by the time command.

Optimizer Set Up:
• Initial temperature 700, 20 iterations.
• Linux Ubuntu machine with an Intel Pentium Core2 Duo 3.0 GHz and 8GB RAM.
• Probabilities of transformation rules:

Page 24: On the Efficiency of Joining Group Patterns in SPARQL Queries

Experiment Goals

• Efficiency of the Star-Shaped Group Physical Operators.
– Njoin and Gjoin were implemented in:
  • Jena 2.3 (ARQ query engine)
  • Jena 2.7 and Jena TDB (native storage engine)
  • RDF-3X 0.3.3 and RDF-3X 0.3.4
  • DVLDB
  • OneQL (our own system)
  • RDFJoin (new; experiments not in the paper)

Page 25: On the Efficiency of Joining Group Patterns in SPARQL Queries

Experiment Goals

• Effectiveness of the Proposed Optimization Techniques.
– Star-shaped optimal plans generated by OneQL were evaluated in:
  • Jena 2.3 and 2.7
  • Jena TDB
  • RDF-3X 0.3.3 and RDF-3X 0.3.4
  • DVLDB
  • OneQL
  • Sesame
  • RDFJoin

Page 26: On the Efficiency of Joining Group Patterns in SPARQL Queries

Experiment I: Efficiency of the Physical Operators in Different RDF Query Engines

Page 27: On the Efficiency of Joining Group Patterns in SPARQL Queries

Performance of Star-Shaped Physical Operators

• Compared our gjoin physical implementation against the njoin implementation provided by Jena 2.3, Jena 2.7 and Jena TDB.
• Benchmark2: 60 queries.
• Dataset: US Congress Bills 2004.
• Our gjoin implementation was able to speed up the evaluation time by up to three orders of magnitude.

Page 28: On the Efficiency of Joining Group Patterns in SPARQL Queries

Efficiency of Star-Shaped Group Physical Operators: RDF-3X

• Compared our Star-Shaped Group physical operators against the RDF-3X physical operators.
• Benchmark4: 13 queries.
• Dataset: YAGO.
• Star-Shaped Group physical operators are able to speed up the evaluation time by up to two orders of magnitude.

Page 29: On the Efficiency of Joining Group Patterns in SPARQL Queries

Experiment II: Effectiveness of the Proposed Optimization in Different RDF Query Engines

Page 30: On the Efficiency of Joining Group Patterns in SPARQL Queries

Effectiveness of the Proposed Optimization Techniques in Different RDF Engines

• Benchmark1: 9 queries.
• Dataset: US Congress Bills 2004.
• Plans comprised of small star-shaped groups are able to speed up the evaluation time by up to two orders of magnitude.

Page 31: On the Efficiency of Joining Group Patterns in SPARQL Queries

Effectiveness of the Proposed Optimization Techniques in Jena

• Comparison of the original and the optimal evaluation times in GJena 2.3.
• Benchmark3: 16 queries.
• Dataset: US Congress Bills 2004.
• The optimal query has a significantly lower cost than the original query.
• Evaluation time is sped up by up to two orders of magnitude.
• The plans with the most significant improvement are bushy trees composed of small star-shaped groups.

Page 32: On the Efficiency of Joining Group Patterns in SPARQL Queries

Effectiveness of the Proposed Optimization and Evaluation Techniques in RDF-3X

• Comparison of Star-Shaped and RDF-3X optimizations versus the Optimal Plan.
• Benchmark4: 11 queries.
• Dataset: YAGO.
• Evaluation times of the optimal plans can be up to six orders of magnitude lower than those of the RDF-3X optimal plans.
• Star-Shaped plans outperform RDF-3X plans in evaluation time; Star-Shaped plans were comprised of small star-shaped groups.
• Results are similar in RDF-3X 0.3.3 and RDF-3X 0.3.4.

NEW!

Page 33: On the Efficiency of Joining Group Patterns in SPARQL Queries

Effectiveness of the Proposed Optimization and Evaluation Techniques in RDFJoin

• Comparison of Star-Shaped and RDF-3X optimizations versus the Optimal Plan in RDFJoin.
• Benchmark4: 11 queries.
• Dataset: YAGO.
• Evaluation times of the optimal plans can be up to two orders of magnitude lower than those of the RDF-3X optimal plans.
• Star-Shaped plans outperform RDF-3X plans in evaluation time; Star-Shaped plans were comprised of small star-shaped groups.

NEW!

Page 34: On the Efficiency of Joining Group Patterns in SPARQL Queries

Effectiveness of Star-Shaped Optimization Techniques

• Comparison of Star-Shaped and RDF-3X optimizations versus the Optimal Plan.
• Benchmark6: 9 queries.
• Dataset: YAGO.
• Evaluation times of the optimal plans can be up to three orders of magnitude lower than those of the RDF-3X optimal plans.
• Star-Shaped plans outperform RDF-3X plans in evaluation time; Star-Shaped plans were comprised of small star-shaped groups.

NEW!

Page 35: On the Efficiency of Joining Group Patterns in SPARQL Queries

What's done, what's next

We reported on optimization and evaluation techniques for SPARQL:
• Queries comprised of small star-shaped groups may outperform original queries by several orders of magnitude.
• Optimization techniques tailored to identify plans comprised of small star-shaped groups can speed up the evaluation time of SPARQL queries.

Future work includes:
• Enhancing the hybrid cost model with Bayesian inference capabilities that consider correlations between patterns in a query (cf. our IRLMES WS paper).
• Implementing star-shaped group physical operators in other RDF engines.

Thank you! Questions?

Page 36: On the Efficiency of Joining Group Patterns in SPARQL Queries

Experiment Goals

• Efficiency of the Star-Shaped Group Physical Operators.
– Njoin and Gjoin were implemented in:
  • Jena 2.3 (ARQ query engine):
    – Gjoin was implemented by extending the method stream in the com.hp.hpl.jena.sparql.engine.main.OpCompiler class to independently evaluate each inner and outer group.
  • Jena 2.7 and Jena TDB (native storage engine):
    – Gjoin was implemented by extending the function execute of the src.com.hp.hpl.jena.sparql.engine.main.OpExecutor class to independently evaluate each inner and outer group.
  • RDF-3X 0.3.3 and RDF-3X 0.3.4:
    – Njoin and Gjoin were implemented with RDF-3X Hash Joins.
    – Joins between patterns in a star-shaped group were implemented with RDF-3X Merge Join.
  • DVLDB:
    – gjoin inner and outer groups were implemented as relational views that only projected out the join variables.
  • OneQL:
    – njoin and gjoin were implemented on top of structures that provide direct access to the data. njoin was implemented by extending the sideways-passing process in Prolog to use indices; gjoin inner and outer groups are independently evaluated, and intermediate results are temporarily stored in Prolog main memory to identify matches.
  • RDFJoin:
    – gjoin inner and outer groups were implemented as relational views that only projected out the join variables.