Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries on Accumulo

Rya: Optimizations to Support Real Time Graph Queries on Accumulo
Dr. Caleb Meier, Puja Valiyil, Aaron Mihalik, Dr. Adina Crainiceanu
DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited.
ONR Case Number 43-279-15 JB.01.2015

Author: accumulo-summit

Post on 15-Jul-2015







Abstract

The Resource Description Framework (RDF) is a standard model for expressing graph data for the World Wide Web. Developed by the W3C, RDF and related technologies such as OWL and SKOS provide a rich vocabulary for exchanging graph data in a machine-understandable manner. As the size of available data continues to grow, there has been an increased desire for methods of storing very large RDF graphs within big data architectures. Rya is a government open source, scalable RDF triple store built on top of Apache Accumulo. Originally developed by the Laboratory for Telecommunication Sciences and the US Naval Academy, Rya is currently being used by a number of government agencies for storing, inferencing, and querying large amounts of RDF data. As Rya's user base has grown, there has been a stronger requirement for near real time query responsiveness over massive RDF graphs. In this talk, we detail several query optimization strategies the Rya team has pursued to better satisfy this requirement. We describe recent work allowing for the use of additional indices to eliminate large common joins within complex SPARQL queries. Additionally, we explain a number of statistics-based optimizations to improve query planning. Specifically, we detail extensions to existing methods of estimating the selectivity of individual statement patterns (cardinality) and the selectivity of joining two statement patterns (join selectivity) to better fit a big data paradigm and utilize Accumulo. Finally, we share preliminary performance evaluation results for the optimizations that have been pursued.

Speaker

Dr. Caleb Meier, Engineer/Algorithm Developer, Parsons Corporation

Dr. Meier received a PhD in Mathematics from the University of California San Diego (UCSD) in 2012. For the past two years, he was a postdoctoral fellow at UCSD's Math department specializing in non-linear elliptic systems of partial differential equations. He received his undergraduate degree in Mathematics from Yale University in 2006. Dr. Meier is currently working as an engineer at Parsons Corporation, specializing in query optimization algorithms for large scale RDF graphs. He is an expert in semantic technologies, Accumulo, and the Hadoop Ecosystem, and is actually more fun to be around than his bio suggests.

Schedule: 2:45-3:20 on April 29, 2015

Acknowledgements
This work is the collective effort of:
- Parsons' Rya Team, sponsored by the Department of the Navy, Office of Naval Research
- Rya Founders: Roshan Punnoose, Adina Crainiceanu, and David Rapp

Overview
- Rya Overview
- Query Execution within Rya
- Query Optimizations
- Results
- Summary

Background: Rya and RDF
- Rya: Resource Description Framework (RDF) Triplestore built on top of Accumulo
- RDF: W3C standard for representing linked/graph data
- Represents data as statements (assertions) about resources
- Serialized as triples in {subject, predicate, object} form
- Example:
    {Caleb, worksAt, Parsons}
    {Caleb, livesIn, Virginia}
[Figure: graph with nodes Caleb, Parsons, Virginia and edges worksAt, livesIn]

Background: SPARQL
- RDF queries are described using SPARQL
- SPARQL Protocol and RDF Query Language
- SQL-like syntax for finding triples matching specific patterns
- Look for subgraphs that match triple statement patterns
- Joins are performed when there are variables common to two or more statement patterns
    SELECT ?people WHERE { ?people . ?people . }

Rya Architecture
- Open RDF Interface for interacting with RDF data stored on Accumulo
- Open RDF (Sesame): Open Source Java framework for storing and querying RDF data
- Open RDF provides several interfaces/abstractions central for interacting with an RDF datastore
- SAIL interface for interacting with the underlying persisted RDF model
- SAIL: Storage And Inference Layer
- Data storage layer
- Query processing in SAIL layer
[Figure: architecture stack: SPARQL, Rya Open RDF, Rya QueryPlanner, Accumulo]

Storage: Triple Table Index
- 3 Tables
    SPO: subject, predicate, object
    POS: predicate, object, subject
    OSP: object, subject, predicate
- Store triples in the RowID of the table
- Store graph name in the Column Family
- Advantages:
    Native lexicographical sorting of row keys gives fast range queries
    All patterns can be translated into a scan of one of these tables
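To make the last point concrete, here is a minimal sketch of how a triple pattern could be mapped to one of the three tables and a row prefix to scan. The function name `choose_table`, the use of `None` for a variable, and the list-of-terms prefix are assumptions made for illustration; they are not Rya's actual API or row encoding.

```python
def choose_table(s=None, p=None, o=None):
    """Pick an index table and row prefix for a triple pattern.

    None marks a variable position. Returns (table_name, prefix_terms),
    where prefix_terms are the bound terms in that table's key order.
    Hypothetical helper for illustration, not Rya's implementation.
    """
    # Subject and object bound, predicate free: only OSP gives a usable prefix.
    if s is not None and p is None and o is not None:
        return "OSP", [o, s]
    # Any other pattern with a bound subject is a prefix of an SPO row.
    if s is not None:
        prefix = [s]
        if p is not None:
            prefix.append(p)
            if o is not None:
                prefix.append(o)
        return "SPO", prefix
    # Subject free, predicate bound: prefix of a POS row.
    if p is not None:
        return "POS", [p] + ([o] if o is not None else [])
    # Only the object is bound: prefix of an OSP row.
    if o is not None:
        return "OSP", [o]
    # Nothing bound: full scan of any table (SPO chosen arbitrarily).
    return "SPO", []


print(choose_table(p="worksAt", o="Parsons"))  # ('POS', ['worksAt', 'Parsons'])
print(choose_table(s="Greta"))                 # ('SPO', ['Greta'])
```

Every combination of bound positions ends up as a contiguous prefix in one of the three sort orders, which is why three tables are enough.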


Rya Query Execution
- Implemented OpenRDF Sesame SAIL API
- Parse queries, generate initial query plan, execute plan
- Triple patterns map to range queries in Accumulo

    SELECT ?x WHERE { ?x . ?x . }

Step 1: POS Table scan range
    worksAt, Netflix, Dan
    worksAt, OfficeMax, Zack
    worksAt, Parsons, Bob
    worksAt, Parsons, Greta
    worksAt, Parsons, John
Step 2: For each ?x, SPO index lookup
    Bob, livesIn, Georgia
    Greta, livesIn, Virginia
    John, livesIn, Virginia

Find all US citizens that travel to Iran

More Complicated Example of Rya Query Execution

    SELECT ?x WHERE {
      ?x worksAt Parsons .
      ?x livesIn Virginia .
      ?x commuteMethod bike .
    }

Step 1: POS Table scan range for worksAt, Parsons
    worksAt, Netflix, Dan
    worksAt, Parsons, Bob
    worksAt, Parsons, Greta
    worksAt, Parsons, John
    worksAt, PlayStation, Alice
Step 2: For each ?x, SPO Table lookup
    Bob, livesIn, Georgia
    Greta, livesIn, Virginia
    John, livesIn, Virginia
Step 3: For each remaining ?x, SPO Table lookup
    Greta, commuteMethod, bike
    John, commuteMethod, Bus

Challenges in Query Execution
- Scalability and Responsiveness
- Massive amounts of data
- Potentially large numbers of comparisons
- Consider the previous example:

- Default query execution: each ?x returned by the first statement pattern query is compared against all subsequent triple patterns
- There are 8.3 million Virginia residents, about 15,000 Parsons employees, and 750,000 people who commute via bike.
- Only 100 people who work at Parsons commute via bike, while 1,000 people who work at Parsons live in Virginia.
- Poor query execution plans can result in simple queries taking minutes as opposed to milliseconds

    SELECT ?x WHERE { ?x livesIn Virginia . ?x worksAt Parsons . ?x commuteMethod bike . }
  vs.
    SELECT ?x WHERE { ?x worksAt Parsons . ?x livesIn Virginia . ?x commuteMethod bike . }
  vs.
    SELECT ?x WHERE { ?x worksAt Parsons . ?x commuteMethod bike . ?x livesIn Virginia . }
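The difference between the three orderings can be worked out from the figures quoted on this slide. The sketch below assumes a simple cost model that is not stated in the talk: the first pattern is one range scan, and every binding that survives a step costs one index lookup in the next step. The names `plan_lookups` and `PLANS` are illustrative only.

```python
def plan_lookups(first_pattern_matches, matches_after_second_pattern):
    """Index lookups for a three-pattern query under the assumed cost model.

    Step 2 does one lookup per binding returned by the first pattern;
    step 3 does one lookup per binding that survived step 2.
    """
    return first_pattern_matches + matches_after_second_pattern


# Counts taken from the slide: 8.3M Virginia residents, 15,000 Parsons
# employees, 1,000 Parsons employees in Virginia, 100 who commute by bike.
PLANS = {
    "livesIn, worksAt, commuteMethod": plan_lookups(8_300_000, 1_000),
    "worksAt, livesIn, commuteMethod": plan_lookups(15_000, 1_000),
    "worksAt, commuteMethod, livesIn": plan_lookups(15_000, 100),
}

for order, lookups in PLANS.items():
    print(f"{order}: {lookups:,} lookups")
```

Under this model the first ordering needs 8,301,000 lookups, while the other two need 16,000 and 15,100, which is the gap between minutes and milliseconds that the slide describes.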

Rya Query Optimizations
- Goal: Optimize query execution (joins) to better support real time responsiveness
- Three Approaches:
    Reduce the number of joins: Pattern Based Indices
      Pre-calculate common joins
    Limit data in joins: Use more stats to improve query planning
      Cardinality estimation on individual statement patterns
      Join selectivity estimation on pairs of statement patterns
    Make joins more efficient: Distribute the Join Processing
      Distribute processing using Spark SQL or MapReduce
      Use Hash Joins and Intersecting Iterators
      Just beginning to look at this

Rya Query Optimizations Using Cardinalities
- Goal: Optimize ordering of query execution to reduce the number of comparison operations
- Order execution based on the number of triples that match each triple pattern

    SELECT ?x WHERE {
      ?x worksAt Parsons .        (15k matches)
      ?x commuteMethod bike .     (750k matches)
      ?x livesIn Virginia .       (8.3M matches)
    }

Rya Cardinality Usage
- Maintain cardinalities on the following triple pattern element combinations:
    Single elements: Subject, Predicate, Object
    Composite elements: Subject-Predicate, Subject-Object, Predicate-Object
- Computed periodically using MapReduce
- Row ID:
    OBJECT, Parsons
    PREDICATEOBJECT, worksAt, Parsons
- Cardinality stored in the value
- Sparse table: Only store cardinalities above a threshold
- Only need to recompute cardinalities if the distribution of the data changes significantly

Limitations of Cardinality Approach
- Consider a more complicated query:

    SELECT ?x WHERE {
      ?x worksAt Parsons .        (15k matches)
      ?x commuteMethod bike .     (750k matches)
      ?vehicle SUV .              (2.1M matches)
      ?x livesIn Virginia .       (8.3M matches)
      ?x ?vehicle .               (254M matches)
    }

- Cardinality approach does not take into account the number of results returned by joins
- Solution lies in estimating the join selectivity for each pair of triple patterns
- Triple patterns containing no common variables can be joined together, creating an external (cross) product
- Among triple patterns with similar cardinalities and common variables, how should they be joined to obtain the best execution plan?
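The cross-product problem can be seen in a small sketch. It orders the five patterns purely by cardinality, then again with one extra rule: prefer the cheapest pattern that shares a variable with what has already been joined. That rule is a simplified stand-in chosen for illustration, not the join selectivity estimate Rya uses. The `Pattern` class and function names are hypothetical, and `[predicate]` marks predicates that are not legible in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    text: str
    variables: frozenset
    cardinality: int


PATTERNS = [
    Pattern("?x worksAt Parsons", frozenset({"x"}), 15_000),
    Pattern("?x commuteMethod bike", frozenset({"x"}), 750_000),
    Pattern("?vehicle [predicate] SUV", frozenset({"vehicle"}), 2_100_000),
    Pattern("?x livesIn Virginia", frozenset({"x"}), 8_300_000),
    Pattern("?x [predicate] ?vehicle", frozenset({"x", "vehicle"}), 254_000_000),
]


def order_by_cardinality(patterns):
    """Cardinality-only plan: smallest match count first."""
    return sorted(patterns, key=lambda p: p.cardinality)


def order_connected(patterns):
    """Cheapest-first, but prefer patterns sharing an already-bound variable."""
    remaining = order_by_cardinality(patterns)
    plan = [remaining.pop(0)]
    bound = set(plan[0].variables)
    while remaining:
        # Fall back to the cheapest pattern only if nothing is connected.
        nxt = next((p for p in remaining if p.variables & bound), remaining[0])
        remaining.remove(nxt)
        plan.append(nxt)
        bound |= nxt.variables
    return plan


def cross_products(plan):
    """Count steps that join a pattern sharing no variable with the plan so far."""
    bound, count = set(plan[0].variables), 0
    for p in plan[1:]:
        if not (p.variables & bound):
            count += 1
        bound |= p.variables
    return count


print(cross_products(order_by_cardinality(PATTERNS)))  # 1
print(cross_products(order_connected(PATTERNS)))       # 0
```

The cardinality-only plan places the SUV pattern third, where it shares no variable with the two patterns before it. The connected plan defers it to the end, after the pattern that binds ?vehicle.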

Rya Query Optimizations Using Join Selectivity

Query optimized using only Cardinality Info:

    SELECT ?x WHERE {
      ?x worksAt Parsons .
      ?x commuteMethod bike .
      ?vehicle SUV .
      ?x livesIn Virginia .
      ?x ?vehicle .
    }

Query optimized using Cardinality and Join Selectivity Info:

    SELECT ?x WHERE {
      ?x worksAt Parsons .
      ?x commuteMethod bike .
      ?x livesIn Virginia .
      ?x ?vehicle .
      ?vehicle SUV .
    }

- Join Selectivity measures the number of results returned by joining two triple patterns
- Approach taken from: "RDF-3X: a RISC-style Engine for RDF" by Thomas Neumann and Gerhard Weikum in JDMR (formerly Proc. VLDB) 2008
- Due to computational complexity, the estimate of join selectivity for triple patterns is pre-computed and stored in Accumulo
- Join selectivity is estimated by computing the number of results obtained when each triple pattern is joined with the full table
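As a rough illustration of the last bullet, the sketch below counts the results of joining one triple pattern with the full table on the subject position, using the toy triples from the earlier execution slides. Restricting the join to a shared subject, the function name `join_with_full_table`, and the in-memory dictionary standing in for the Accumulo table are all assumptions. The talk does not spell out Rya's stored format or its full estimator, so this shows only the counting idea.

```python
from collections import Counter

# Toy data copied from the query execution slides.
TRIPLES = [
    ("Dan", "worksAt", "Netflix"),
    ("Bob", "worksAt", "Parsons"),
    ("Greta", "worksAt", "Parsons"),
    ("John", "worksAt", "Parsons"),
    ("Alice", "worksAt", "PlayStation"),
    ("Bob", "livesIn", "Georgia"),
    ("Greta", "livesIn", "Virginia"),
    ("John", "livesIn", "Virginia"),
    ("Greta", "commuteMethod", "bike"),
    ("John", "commuteMethod", "Bus"),
]


def join_with_full_table(triples, predicate, obj):
    """Count pairs (pattern match, any triple) that share the same subject.

    The pattern is `?x predicate obj`; the "full table" is every triple.
    """
    triples_per_subject = Counter(s for s, _, _ in triples)
    matching_subjects = [s for s, p, o in triples if p == predicate and o == obj]
    return sum(triples_per_subject[s] for s in matching_subjects)


# Pre-compute once and keep the counts for the query planner to look up.
PRECOMPUTED = {
    (p, o): join_with_full_table(TRIPLES, p, o)
    for p, o in [("worksAt", "Parsons"), ("livesIn", "Virginia"), ("commuteMethod", "bike")]
}
print(PRECOMPUTED)
```

On this data the counts are 8 for worksAt Parsons, 6 for livesIn Virginia, and 3 for commuteMethod bike. The planner can treat a pattern with a smaller count as likely to yield fewer join results.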

Join Selectivity: General Algorithm
For statement patterns