accumulo summit 2015: rya: optimizations to support real time graph queries on accumulo [frameworks]

Rya: Optimizations to Support Real Time Graph Queries on Accumulo

Dr. Caleb Meier, Puja Valiyil, Aaron Mihalik, Dr. Adina Crainiceanu

DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited.

ONR Case Number 43-279-15 JB.01.2015


Acknowledgements

This work is the collective effort of:

Parsons’ Rya Team, sponsored by the Department of

the Navy, Office of Naval Research

Rya Founders: Roshan Punnoose, Adina Crainiceanu,

and David Rapp


Overview

Rya Overview

Query Execution within Rya

Query Optimizations

Results

Summary


Background: Rya and RDF

Rya: Resource Description Framework (RDF)

Triplestore built on top of Accumulo

RDF: W3C standard for representing

linked/graph data

Represents data as statements (assertions) about

resources

– Serialized as triples in {subject, predicate, object}

form

– Example:

• {Caleb, worksAt, Parsons}

• {Caleb, livesIn, Virginia}

[Figure: graph in which node Caleb has a worksAt edge to Parsons and a livesIn edge to Virginia]


Background: SPARQL

RDF Queries are described using SPARQL

SPARQL Protocol and RDF Query Language

SQL-like syntax for finding triples matching

specific patterns

Look for subgraphs that match triple statement patterns

Joins are performed when there are variables common

to two or more statement patterns

SELECT ?people WHERE {

?people <worksAt> <Parsons>.

?people <livesIn> <Virginia>.

}


Rya Architecture

Open RDF Interface for interacting with RDF data

stored on Accumulo

Open RDF (Sesame): Open

Source Java framework for

storing and querying RDF

data

Open RDF provides several interfaces/abstractions central for interacting with an RDF datastore

– SAIL interface for interacting with underlying persisted

RDF model

– SAIL: Storage And Inference Layer

[Figure: architecture stack, top to bottom: SPARQL, Rya Open RDF, Rya QueryPlanner, Accumulo. Annotations: "Query processing in SAIL layer" and "Data storage layer"]


Storage: Triple Table Index

3 Tables

SPO : subject, predicate, object

POS : predicate, object, subject

OSP : object, subject, predicate

Store triples in the RowID of the table

Store graph name in the Column Family

Advantages:

Native lexicographical sorting of row keys enables fast range queries

All patterns can be translated into a scan of one of these tables
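As a rough illustration of how any triple pattern can be served by one of the three tables, the sketch below picks a table whose sort order starts with the pattern's bound components. The function name `choose_index` and the tie-breaking rules are assumptions made for this example; the slides do not describe Rya's actual selection logic or row-key encoding.

```python
def choose_index(s=None, p=None, o=None):
    """Pick a table and row prefix for a triple pattern.

    None marks a variable. The bound components must form a leading
    prefix of the chosen table's sort order so that the pattern becomes
    a single range scan. (Hypothetical helper, not Rya's API.)
    """
    if s is not None and p is not None and o is not None:
        return "SPO", (s, p, o)
    if s is not None and p is not None:
        return "SPO", (s, p)
    if p is not None and o is not None:
        return "POS", (p, o)
    if o is not None and s is not None:
        return "OSP", (o, s)
    if s is not None:
        return "SPO", (s,)
    if p is not None:
        return "POS", (p,)
    if o is not None:
        return "OSP", (o,)
    # Nothing bound: a full scan of any table works.
    return "SPO", ()


# ?people <worksAt> <Parsons> is answered from the POS table.
print(choose_index(p="worksAt", o="Parsons"))
```

Every combination of bound positions maps to a prefix of SPO, POS, or OSP, which is why three tables are enough.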


Rya Query Execution

Implemented OpenRDF Sesame SAIL API

Parse queries, generate initial query plan, execute plan

Triple patterns map to range queries in Accumulo

SELECT ?x WHERE { ?x <worksAt> <Parsons>.
?x <livesIn> <Virginia>. }

Step 1: POS Table – scan range

…
worksAt, Netflix, Dan
worksAt, OfficeMax, Zack
worksAt, Parsons, Bob
worksAt, Parsons, Greta
worksAt, Parsons, John
…

Step 2: for each ?x, SPO – index lookup

…
Bob, livesIn, Georgia
…
Greta, livesIn, Virginia
…
John, livesIn, Virginia
…
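The two steps can be imitated in memory with sorted lists standing in for the Accumulo tables. This is only a sketch under that assumption: `prefix_scan` and `people_at_and_in` are hypothetical names, and real Rya issues Accumulo range scans rather than bisecting Python lists.

```python
from bisect import bisect_left

TRIPLES = [
    ("Dan", "worksAt", "Netflix"),
    ("Zack", "worksAt", "OfficeMax"),
    ("Bob", "worksAt", "Parsons"),
    ("Greta", "worksAt", "Parsons"),
    ("John", "worksAt", "Parsons"),
    ("Bob", "livesIn", "Georgia"),
    ("Greta", "livesIn", "Virginia"),
    ("John", "livesIn", "Virginia"),
]

# Sorted lists stand in for the lexicographically sorted tables.
SPO = sorted(TRIPLES)
POS = sorted((p, o, s) for s, p, o in TRIPLES)


def prefix_scan(table, prefix):
    """Yield every row whose leading components equal `prefix`."""
    start = bisect_left(table, prefix)
    for row in table[start:]:
        if row[:len(prefix)] != prefix:
            break
        yield row


def people_at_and_in(employer, state):
    results = []
    # Step 1: range scan of POS for (worksAt, employer).
    for _, _, person in prefix_scan(POS, ("worksAt", employer)):
        # Step 2: SPO lookup for each binding of ?x.
        if any(prefix_scan(SPO, (person, "livesIn", state))):
            results.append(person)
    return results


print(people_at_and_in("Parsons", "Virginia"))
```

The number of SPO lookups equals the number of rows returned by the first scan, which is why the choice of the first pattern matters so much.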


More Complicated Example of Rya Query Execution

SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <livesIn> Virginia.
?x <commuteMethod> bike.
}

Step 1: POS Table – scan range for worksAt, Parsons (?x worksAt Parsons)

…
worksAt, Netflix, Dan
worksAt, Parsons, Bob
worksAt, Parsons, Greta
worksAt, Parsons, John
worksAt, PlayStation, Alice
…

Step 2: For each ?x, SPO Table lookup (?x livesIn Virginia)

…
Bob, livesIn, Georgia
…
Greta, livesIn, Virginia
…
John, livesIn, Virginia
…

Step 3: For each remaining ?x, SPO Table lookup (?x commuteMethod bike)

…
Greta, commuteMethod, bike
…
John, commuteMethod, Bus
…


Challenges in Query Execution

Scalability and Responsiveness

Massive amounts of data

Potentially large numbers of comparisons

Consider the Previous Example:

Default query execution: comparing each “?x” returned from first

statement pattern query to all subsequent triple patterns

There are 8.3 million Virginia residents, about 15,000 Parsons

employees, and 750,000 people who commute via bike.

Only 100 people who work at Parsons commute via bike while 1000

people who work at Parsons live in Virginia.

Poor query execution plans can result in simple queries

taking minutes as opposed to milliseconds
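A back-of-the-envelope count makes the gap concrete. The sketch below uses only the figures quoted on this slide and assumes a deliberately simple cost model (one index lookup per binding produced by the first pattern, then one per binding that survives the second pattern); that model is an assumption for illustration, not Rya's cost function.

```python
def lookups(first_card, survivors_after_second):
    """Index lookups for a three-pattern query under the simple model:
    one lookup per binding of the first pattern, then one per binding
    that survives the join with the second pattern."""
    return first_card + survivors_after_second


# Figures from the slide: 8.3M Virginia residents, 15,000 Parsons employees,
# 750,000 bike commuters, 100 Parsons bike commuters, 1,000 Parsons
# employees living in Virginia.
plans = {
    "worksAt -> commuteMethod -> livesIn": lookups(15_000, 100),
    "worksAt -> livesIn -> commuteMethod": lookups(15_000, 1_000),
    "commuteMethod -> worksAt -> livesIn": lookups(750_000, 100),
    "livesIn -> worksAt -> commuteMethod": lookups(8_300_000, 1_000),
}

for order, n in sorted(plans.items(), key=lambda kv: kv[1]):
    print(f"{order}: {n:,} index lookups")
```

Under this model, starting from the Virginia pattern costs roughly 550 times as many lookups as starting from the Parsons pattern.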

[Slide graphic: three versions of the SELECT ?x query compared side by side ("vs."); the triple patterns inside each block are missing from the extracted text]


Rya Query Optimizations

Goal: Optimize query execution (joins) to better support real time responsiveness

Three Approaches:

Reduce the number of joins: Pattern Based Indices

– Pre-calculate common joins

Limit data in joins: Use more stats to improve query planning

– Cardinality estimation on individual statement patterns

– Join selectivity estimation on pairs of statement patterns

Make joins more efficient: Distribute the Join Processing

– Distribute processing using Spark SQL or MapReduce

– Use Hash Joins and Intersecting Iterators

– Work on this approach is just beginning


Rya Query Optimizations Using Cardinalities

Goal: Optimize ordering of query execution to

reduce the number of comparison operations

Order execution based on the number of triples that

match each triple pattern

SELECT ?x WHERE {
?x <livesIn> Virginia. (8.3M matches)
?x <worksAt> Parsons. (15k matches)
?x <commuteMethod> bike. (750k matches)
}


Rya Cardinality Usage

Maintain cardinalities on the following triple pattern element combinations:

– Single elements: Subject, Predicate, Object

– Composite elements: Subject-Predicate, Subject-Object, Predicate-Object

Computed periodically using MapReduce

Row ID:

– <CardinalityType><TripleElements>

• OBJECT, Parsons

• PREDICATEOBJECT, worksAt, Parsons

Cardinality stored in the value

Sparse table: Only store cardinalities above a threshold

Only need to recompute cardinalities if the

distribution of the data changes significantly
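A minimal in-memory sketch of such a table follows. The row-ID strings imitate the two examples on the slide; the type names other than OBJECT and PREDICATEOBJECT, the comma separator, and the default threshold are assumptions, and the real table is built by a MapReduce job rather than a Python loop.

```python
from collections import Counter


def cardinality_rows(triples, threshold=1):
    """Count single and composite element combinations and keep only
    those whose cardinality is above `threshold` (sparse table)."""
    counts = Counter()
    for s, p, o in triples:
        counts[("SUBJECT", s)] += 1
        counts[("PREDICATE", p)] += 1
        counts[("OBJECT", o)] += 1
        counts[("SUBJECTPREDICATE", s, p)] += 1
        counts[("SUBJECTOBJECT", s, o)] += 1
        counts[("PREDICATEOBJECT", p, o)] += 1
    # Row ID = cardinality type followed by the triple elements.
    return {", ".join(key): n for key, n in counts.items() if n > threshold}


sample = [
    ("Bob", "worksAt", "Parsons"),
    ("Greta", "worksAt", "Parsons"),
    ("John", "worksAt", "Parsons"),
    ("Dan", "worksAt", "Netflix"),
]
for row_id, cardinality in sorted(cardinality_rows(sample).items()):
    print(row_id, "->", cardinality)
```

Dropping low counts keeps the table small; a planner can treat a missing row as "at most the threshold".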


Limitations of Cardinality Approach

Consider a more complicated query

Cardinality approach does not take into account number of results returned by joins

Solution lies in estimating the “join selectivity” for each pair of triple patterns

SELECT ?x WHERE {
…
?vehicle <vehicleType> SUV.
…
?x <owns> ?vehicle.
}

Match counts annotated on the slide's triple patterns: 2.1M, 15k, 750k, 8.3M, 254M


Rya Query Optimizations Using Join Selectivity

Join Selectivity measures number of results returned by joining two triple patterns

Approach taken from: RDF-3X: a RISC-style Engine for RDF by Thomas Neumann and Gerhard Weikum in JDMR (formerly Proc. VLDB) 2008

Due to computational complexity, estimate of join selectivity for triple patterns is pre-computed and stored in Accumulo

Join selectivity estimated by computing the number of results obtained when each triple pattern is joined with the full table

Query optimized using only Cardinality Info:

SELECT ?x WHERE {
…
?x <owns> ?vehicle.
}

Query optimized using Cardinality and Join Selectivity Info:

SELECT ?x WHERE {
…
?x <owns> ?vehicle.
…
}


Join Selectivity: General Algorithm

For statement patterns <?x, p1, o1> and <?x, p2, o2> with ?x a variable and p1, o1, p2, o2 constant, estimate the number of results

Sel(<?x, p1, o1> ⋈ <?x, ?y, ?z>) and Sel(<?x, p2, o2> ⋈ <?x, ?y, ?z>) give number of results returned by joining a statement pattern with the full table along the subject component

Full table join statistics precomputed and stored in index

Join statistics for each triple pattern computed using the following equation:

[equation missing from the extracted slide text]

Use analogous definition if variables appear in predicate or object position

Join selectivity statistics used with cardinalities to generate more

efficient query plans
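The slide's equation is not available in this text. One reading consistent with the construction described under "Construction of Selectivity Tables" (sum the subject cardinality |<c, ?y, ?z>| over every stored triple of the form <c, p1, o1>) is sketched below. The symbol T for the set of stored triples and the normalization in the second line are assumptions that follow the conventional definition of join selectivity; the slides do not confirm either.

```latex
\[
\bigl|\langle ?x, p_1, o_1\rangle \bowtie \langle ?x, ?y, ?z\rangle\bigr|
  = \sum_{c \,:\, \langle c, p_1, o_1\rangle \in T}
    \bigl|\langle c, ?y, ?z\rangle\bigr|
\]
\[
\mathrm{Sel}\bigl(\langle ?x, p_1, o_1\rangle \bowtie \langle ?x, ?y, ?z\rangle\bigr)
  \approx
  \frac{\sum_{c \,:\, \langle c, p_1, o_1\rangle \in T}
        \bigl|\langle c, ?y, ?z\rangle\bigr|}
       {\bigl|\langle ?x, p_1, o_1\rangle\bigr| \cdot |T|}
\]
```

The first line is a result count; the second turns it into a fraction of the cross product, which is the form a cost term such as leftCard*rightCard*selectivity would need.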


Join Selectivity: Integration into Rya

Join Selectivity estimates used to optimize Rya queries

through a greedy algorithm approach

Query constructed starting with first triple pattern to be

evaluated (the pattern with the smallest cardinality) and then

patterns are added based on minimization of a cost function

Cost function

C = leftCard + rightCard + leftCard*rightCard*selectivity

C measures number of entries Accumulo must scan and the

number of comparisons required to perform the join

Selectivity set to one if two triple patterns share no common

variables, otherwise precomputed estimates used

Ensures that patterns with common variables are grouped

together
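The greedy construction can be sketched as follows. The cost function is the one on the slide; how a candidate's selectivity against a multi-pattern partial plan is combined (minimum over pairs) and how the intermediate cardinality is carried forward are assumptions of this sketch, as are the names `greedy_order` and `pair_selectivity`. The selectivities for the Parsons pairs are derived from the counts quoted earlier in the deck (100 and 1,000 joint matches); the Virginia/bike value is invented for illustration.

```python
def join_cost(left_card, right_card, selectivity):
    """C = leftCard + rightCard + leftCard*rightCard*selectivity."""
    return left_card + right_card + left_card * right_card * selectivity


def pair_selectivity(plan, candidate, selectivity):
    # Selectivity defaults to 1 when no estimate exists, i.e. when the
    # two patterns share no common variable.
    return min(selectivity.get(frozenset((p, candidate)), 1.0) for p in plan)


def greedy_order(cards, selectivity):
    remaining = dict(cards)
    # Start with the pattern that has the smallest cardinality.
    first = min(remaining, key=remaining.get)
    plan, left_card = [first], remaining.pop(first)
    while remaining:
        best = min(
            remaining,
            key=lambda c: join_cost(left_card, remaining[c],
                                    pair_selectivity(plan, c, selectivity)),
        )
        sel = pair_selectivity(plan, best, selectivity)
        # Assumed estimate of the intermediate result size.
        left_card = left_card * remaining.pop(best) * sel
        plan.append(best)
    return plan


cards = {
    "livesIn Virginia": 8_300_000,
    "worksAt Parsons": 15_000,
    "commuteMethod bike": 750_000,
}
selectivity = {
    frozenset(("worksAt Parsons", "commuteMethod bike")): 100 / (15_000 * 750_000),
    frozenset(("worksAt Parsons", "livesIn Virginia")): 1_000 / (15_000 * 8_300_000),
    frozenset(("livesIn Virginia", "commuteMethod bike")): 1e-7,  # hypothetical
}
print(greedy_order(cards, selectivity))
```

Because a missing estimate counts as selectivity 1, a pattern with no shared variable incurs the full cross-product term and is pushed to the end of the plan.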


Construction of Selectivity Tables

For the pattern <?x, p1, o1>, associate each RDF triple of

the form <c, p1, o1> with the cardinality |<c,?y,?z>| and

then sum the results

Given a triple <c, p1, o1> in the SPO table, Map Job 1 emits

the key-value pair (c, (p1, o1))

Map Job 2 processes the cardinality table and emits the key-value pair (c, |<c,?y,?z>|), which consists of the constant c and its single component, subject cardinality for the table

Map Job 3 merges the results from jobs 1 and 2 by emitting the key-value pair ((p1, o1), |<c,?y,?z>|)

Map Job 4 sums the cardinalities from those key-value pairs

containing (p1, o1) as a key, and the result is written to the

selectivity table
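The four jobs can be traced on a toy dataset with plain Python collections. This is a single-process sketch of the data flow only: the subject cardinalities are recounted from the triples instead of being read from the cardinality table, and the function name `selectivity_table` is made up for the example.

```python
from collections import defaultdict


def selectivity_table(triples):
    # Job 1: for each triple <c, p1, o1>, emit (c, (p1, o1)).
    job1 = [(s, (p, o)) for s, p, o in triples]

    # Job 2: emit (c, |<c, ?y, ?z>|), the subject cardinality of c.
    subject_card = defaultdict(int)
    for s, _, _ in triples:
        subject_card[s] += 1

    # Job 3: merge on c and emit ((p1, o1), |<c, ?y, ?z>|).
    job3 = [(po, subject_card[c]) for c, po in job1]

    # Job 4: sum the cardinalities that share a (p1, o1) key.
    table = defaultdict(int)
    for po, card in job3:
        table[po] += card
    return dict(table)


triples = [
    ("Greta", "worksAt", "Parsons"),
    ("Greta", "livesIn", "Virginia"),
    ("Greta", "commuteMethod", "bike"),
    ("John", "worksAt", "Parsons"),
    ("John", "livesIn", "Virginia"),
    ("Bob", "worksAt", "Parsons"),
]
for key, total in selectivity_table(triples).items():
    print(key, total)
```

For (worksAt, Parsons) the sum covers Greta (3 triples), John (2), and Bob (1), which is exactly the number of rows produced by joining <?x, worksAt, Parsons> with the full table on the subject.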


Query Optimizations Using Pre-Computed Joins

Reduce joins by pre-computing common joins

Approach taken from: Heese, Ralf, et al. "Index Support for

SPARQL." European Semantic Web Conference, Innsbruck,

Austria. 2007.

SELECT ?x WHERE {
…
?x <owns> ?vehicle.
…
}

Pre-compute using batch processing and look up during query execution


Query Optimizations Using Pre-Computed Joins

1. Pre-compute a portion of the query using MapReduce

2. Store SPARQL describing the query along with pre-computed values in Accumulo

3. Normalize query variables to match stored SPARQL variables during query execution

Query:

SELECT ?x WHERE {
…
?x <owns> ?vehicle.
…
}

Stored SPARQL:

SELECT ?person ?car
WHERE {
?person <livesIn> Virginia.
?person <owns> ?car.
?car <vehicleType> SUV.
}

Index Result Table:

…
Aaron, ToyotaRav4
Caleb, JeepCherokee
Puja, HondaCRV
…
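Step 3 (variable normalization) amounts to finding a renaming of the stored query's variables under which all of its patterns occur in the incoming query. The backtracking matcher below is one simple way to do that; it is a sketch, not Rya's implementation, and the incoming query's pattern list is illustrative because only part of it is legible on the slide.

```python
def is_var(term):
    return term.startswith("?")


def unify(stored_pattern, query_pattern, binding):
    """Extend `binding` so stored_pattern equals query_pattern, or return None."""
    new = dict(binding)
    for st, qt in zip(stored_pattern, query_pattern):
        if is_var(st):
            if not is_var(qt) or new.setdefault(st, qt) != qt:
                return None
        elif st != qt:
            return None
    # Distinct stored variables must map to distinct query variables.
    if len(set(new.values())) != len(new):
        return None
    return new


def match_index(stored, query, i=0, binding=None):
    """Return a stored-variable -> query-variable mapping, or None."""
    binding = {} if binding is None else binding
    if i == len(stored):
        return binding
    for candidate in query:
        new = unify(stored[i], candidate, binding)
        if new is not None:
            result = match_index(stored, query, i + 1, new)
            if result is not None:
                return result
    return None


stored = [
    ("?person", "livesIn", "Virginia"),
    ("?person", "owns", "?car"),
    ("?car", "vehicleType", "SUV"),
]
query = [  # illustrative incoming query
    ("?x", "worksAt", "Parsons"),
    ("?x", "livesIn", "Virginia"),
    ("?vehicle", "vehicleType", "SUV"),
    ("?x", "owns", "?vehicle"),
]
print(match_index(stored, query))
```

When a mapping is found, the three matched patterns can be replaced by a lookup in the index result table, leaving only the remaining pattern to be joined.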


Query Optimization Results

Ran 14 queries against the Lehigh University Benchmark (LUBM) dataset (33.34 million triples)

– LUBM queries 2, 5, 9, and 13 were discarded after 3 runs due to query complexity

– Remaining queries were executed 12 times

Cluster Specs:

– 8 worker nodes, each with 2 x 6-Core Xeon E5-2440 (2.4GHz) Processors and 48 GB RAM

Results indicate that cardinality and join selectivity optimizations provide improved or comparable performance


Summary

Cardinality estimation and join selectivity can

improve query response times for ad hoc queries

Effects of join selectivity are more apparent for

complex queries over large datasets

Pre-computed joins are extremely useful for

optimizing common queries

Potentially avoid large number of join operations

Maintaining pre-computed join indices is difficult


Questions?


BACK-UP


Useful Links

SPARQL http://www.w3.org/TR/rdf-sparql-query/

http://jena.apache.org/tutorials/sparql.html

RDF http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/

Rya https://github.com/LAS-NCSU/rya

– Source on github: Provides documentation and sample client code

– Email Aaron Mihalik ([email protected]) for access (US Citizens only)

Rya Working Group

– Monthly telecon / update on progress, issues, upcoming features

– Email Puja Valiyil [email protected] to join (US Citizens only)

Open RDF Tutorial: http://openrdf.callimachus.net/sesame/tutorials/getting-started.docbook?view

Open RDF Javadoc: http://openrdf.callimachus.net/sesame/2.7/apidocs/index.html

Punnoose R., Crainiceanu A., Rapp D. 2012. Rya: a scalable RDF triple store for the

clouds. Proceedings of the 1st International Workshop on Cloud Intelligence.

http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf

Roshan Punnoose, Adina Crainiceanu, David Rapp. SPARQL in the Clouds Using Rya.

Information Systems Journal (2013).

http://www.usna.edu/Users/cs/adina/research/Rya_ISjournal2013.pdf


Next Steps

Maintaining pre-computed join indices

Dynamically determining potential pre-computed

joins

Distributing query planning and execution

Spark SQL

Rya backed by other datastores

Fully open sourcing Rya


Sample LUBM Queries (1 of 3)

Query 1

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>

SELECT ?X WHERE

{ GRAPH <http://LUBM>

{?X rdf:type ub:GraduateStudent .

?X ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0>}

}

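Query 3

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Publication .
?X ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>}
}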


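Sample LUBM Queries (2 of 3)

Query 7

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Course .
?X ub:takesCourse ?Y .
<http://www.Department0.University0.edu/AssociateProfessor0> ub:teacherOf ?Y}
}

Query 8

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y ?Z WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Department .
?X ub:memberOf ?Y .
?Y ub:subOrganizationOf <http://www.University0.edu> .
?X ub:emailAddress ?Z}
}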


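Sample LUBM Queries (3 of 3)

Query 9

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y ?Z WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Faculty .
?Z rdf:type ub:Course .
?X ub:advisor ?Y .
?Y ub:teacherOf ?Z .
?X ub:takesCourse ?Z}
}

Query 11

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:ResearchGroup .
?X ub:subOrganizationOf <http://www.University0.edu>}
}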
