


The VLDB Journal
DOI 10.1007/s00778-015-0415-0

REGULAR PAPER

Processing SPARQL queries over distributed RDF graphs

Peng Peng1 · Lei Zou1 · M. Tamer Özsu2 · Lei Chen3 · Dongyan Zhao1

Received: 30 March 2015 / Revised: 10 September 2015 / Accepted: 17 November 2015
© Springer-Verlag Berlin Heidelberg 2016

Abstract We propose techniques for processing SPARQL queries over a large RDF graph in a distributed environment. We adopt a "partial evaluation and assembly" framework. Answering a SPARQL query Q is equivalent to finding subgraph matches of the query graph Q over RDF graph G. Based on properties of subgraph matching over a distributed graph, we introduce local partial matches as partial answers in each fragment of RDF graph G. For assembly, we propose two methods: centralized and distributed assembly. We analyze our algorithms both theoretically and experimentally. Extensive experiments over both real and benchmark RDF repositories of billions of triples confirm that our method is superior to the state-of-the-art methods in both performance and scalability.

Electronic supplementary material The online version of this article (doi:10.1007/s00778-015-0415-0) contains supplementary material, which is available to authorized users.

B M. Tamer Özsu
[email protected]

Peng Peng
[email protected]

Lei Zou
[email protected]

Lei Chen
[email protected]

Dongyan Zhao
[email protected]

1 Institute of Computer Science and Technology, Peking University, Beijing, China

2 David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada

3 Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong, China

Keywords RDF · SPARQL · RDF graph · Distributed queries

1 Introduction

The semantic Web data model, called the "Resource Description Framework," or RDF, represents data as a collection of triples of the form 〈subject, property, object〉. A triple can be naturally seen as a pair of entities connected by a named relationship, or an entity associated with a named attribute value. Hence, an RDF dataset can be represented as a graph where subjects and objects are vertices, and triples are edges with property names as edge labels. With the increasing amount of RDF data published on the Web, system performance and scalability issues have become increasingly pressing. For example, the Linking Open Data (LOD) project builds an RDF data cloud by linking more than 3000 datasets, which currently have more than 84 billion triples¹. Recent work [40] shows that the number of data sources has doubled within 3 years (2011–2014). Obviously, the computational and storage requirements coupled with rapidly growing datasets have stressed the limits of single-machine processing.
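As a quick illustration of the triples-as-graph view described above, the sketch below builds an adjacency-list representation from a handful of triples. The entity names loosely echo Fig. 1, but the helper `build_rdf_graph` and the sample data are our own illustrative assumptions, not part of any system discussed in the paper.

```python
# Minimal sketch: an RDF dataset as a directed, edge-labeled graph.
from collections import defaultdict

def build_rdf_graph(triples):
    """Map <subject, property, object> triples to an adjacency list:
    vertex -> list of (property, neighbor) pairs."""
    graph = defaultdict(list)
    for s, p, o in triples:
        graph[s].append((p, o))   # edge s -> o, labeled with property p
        graph.setdefault(o, [])   # objects are vertices too
    return graph

triples = [
    ("s1:dir1", "directed", "s1:movie1"),
    ("s2:act1", "isMarriedTo", "s1:dir1"),  # a crossing link between sites
]
g = build_rdf_graph(triples)
```

Subjects and objects become vertices, and each triple becomes one labeled edge, exactly as in the graph view above.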

There have been a number of recent efforts in distributed evaluation of SPARQL queries over large RDF datasets [20]. We broadly classify these solutions into three categories: cloud-based, partition-based and federated approaches. These are discussed in detail in Sect. 2; the highlights are as follows.

Cloud-based approaches (e.g., [23,27,33,34,37,48,49]) maintain a large RDF graph using existing cloud computing platforms, such as Hadoop (http://hadoop.apache.org) or Cassandra (http://cassandra.apache.org), and employ triple pattern-based join processing, most commonly using MapReduce.

¹ The statistic is reported at http://stats.lod2.eu/.

Partition-based approaches [15,18,21,22,28,29] divide the RDF graph G into a set of subgraphs (fragments) {Fi} and decompose the SPARQL query Q into subqueries {Qi}. These subqueries are then executed over the partitioned data using techniques similar to those of relational distributed databases.

Federated SPARQL processing systems [16,19,36,38,39] evaluate queries over multiple SPARQL endpoints. These systems typically target LOD and follow a query-processing-over-data-integration approach. They operate in a very different environment from the one we are targeting, since we focus on exploiting distributed execution for speedup and scalability.

In this paper, we propose an alternative strategy that partitions only the data graph without decomposing the query. Our approach is based on the "partial evaluation and assembly" framework [24]. An RDF graph is partitioned using some graph partitioning algorithm, such as METIS [26], into vertex-disjoint fragments (edges that cross fragments are replicated in the source and target fragments). Each site receives the full SPARQL query Q and executes it on its local RDF graph fragment, providing data-parallel computation. To the best of our knowledge, this is the first work that adopts the partial evaluation and assembly strategy to evaluate SPARQL queries over a distributed RDF data store. The most important advantage of this approach is that the number of vertices and edges involved in the intermediate results is minimized, which is proven theoretically (see Proposition 3 in Sect. 4).

The basic idea of the partial evaluation strategy is the following: given a function f(s, d), where s is the known input and d is the yet unavailable input, the part of f's computation that depends only on s generates a partial answer. In our setting, each site Si treats fragment Fi as the known input in the partial evaluation stage; the unavailable input is the rest of the graph (Ḡ = G \ Fi). The partial evaluation technique has been used in compiler optimization [24] and querying XML trees [7]. Within the context of graph processing, the technique has been used to evaluate reachability queries [13] and graph simulation [14,31] over graphs. However, SPARQL query semantics differs from these (SPARQL is based on graph homomorphism [35]) and poses additional challenges. Graph simulation defines a relation between vertices in the query graph Q (i.e., V(Q)) and those in the data graph G (i.e., V(G)), whereas graph homomorphism is a function (not a relation) from V(Q) to V(G) [14]. Thus, the solutions proposed for graph simulation [14] and graph pattern matching [31] cannot be applied to the problem studied in this paper.
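The two-stage f(s, d) idea can be sketched in a few lines. The arithmetic below is arbitrary and purely illustrative; only the staging matters: precompute on the known input s, then finish when the unavailable input d arrives.

```python
# Illustrative sketch of partial evaluation: stage the part of f(s, d)
# that depends only on the known input s, and defer the rest until the
# yet unavailable input d becomes known.

def make_partial(s):
    partial = [x * x for x in s]            # work that needs only s
    def finish(d):                          # runs once d is available
        return [a + b for a, b in zip(partial, d)]
    return finish

assemble = make_partial([1, 2, 3])          # partial answer computed now
result = assemble([10, 20, 30])             # full answer once d arrives
```

In the paper's setting, s is the local fragment Fi, d is the rest of the graph, the partial answers are local partial matches, and `finish` corresponds to the assembly stage.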

Because of interconnections between graph fragments, application of graph homomorphism over distributed graphs requires special care. For example, consider the distributed RDF graph in Fig. 1. Each entity in RDF is represented by a URI (uniform resource identifier), the prefix of which always denotes

Fig. 1 A Distributed RDF graph

the location of the dataset. For example, "s1:dir1" has the prefix "s1," meaning that the entity is located at site s1. Here, the prefix is just for simplifying presentation, not a general assumption made by the approach. There are crossing links between two datasets, identified in bold font. For example, "〈s2:act1 isMarriedTo s1:dir1〉" is a crossing link (a link between different datasets), which means that act1 (at site s2) is married to dir1 (at site s1).

Now consider the following SPARQL query Q that consists of five triple patterns (e.g., ?a isMarriedTo ?d) over this distributed RDF graph:

SELECT WHERE

Some SPARQL query matches are contained within a fragment; we call these inner matches. Inner matches can be found locally by existing centralized techniques at each site. However, if we consider the four datasets independently and ignore the crossing links, some correct answers will be missed, such as (?a=s2:act1, ?d=s1:dir1). The key issue in the distributed environment is how to find subgraph matches that cross multiple fragments; these are called crossing matches. For query Q in Fig. 2, the subgraph induced by vertices 014, 007, 001, 002, 009 and 018 is a crossing match between fragments F1 and F2 in Fig. 1 (shown in the shaded vertices and red edges). This is the focus of this paper.

Fig. 2 SPARQL query graph Q

There are two important issues to be addressed in this framework. The first is to compute the partial evaluation results at each site given a query graph Q (i.e., the local partial matches), which, intuitively, are the overlapping parts between a crossing match and a fragment. This is discussed in Sect. 4. The second is the assembly of these local partial matches to compute crossing matches. We consider two different strategies: centralized assembly, where all local partial matches are sent to a single site (Sect. 5.2), and distributed assembly, where the local partial matches are assembled at a number of sites in parallel (Sect. 5.3).

The main benefits of our solution are twofold:

– Our solution does not depend on any specific partitioning strategy. In existing partition-based methods, query processing always depends on a certain RDF graph partitioning strategy, which may be difficult to enforce in certain circumstances. The partition-agnostic framework enables us to adopt any partition-based optimization, although this is orthogonal to our solution in this paper.

– Our method is guaranteed to involve fewer vertices and edges in intermediate results than other partition-based solutions, which we prove in Sect. 4 (Proposition 3). This property often results in a smaller number of intermediate results and lowers the cost of our approach, which we demonstrate experimentally in Sect. 7.

The rest of the paper is organized as follows: We discuss related work in the areas of distributed SPARQL query processing and partial query evaluation in Sect. 2. Section 3 provides the fundamental definitions that form the background for this work and introduces the overall execution framework. Computation of local matches at each site is covered in Sect. 4, and the centralized and distributed assembly of partial results to compute the final query result is discussed in Sect. 5. We also study how to evaluate general SPARQL queries in Sect. 6. We evaluate our approach, both in terms of its internal characteristics and in terms of its relative performance against other approaches, in Sect. 7. Section 8 concludes the paper and outlines some future research directions.

2 Related work

2.1 Distributed SPARQL query processing

As noted above, there are three general approaches to distributed SPARQL query processing: cloud-based approaches, partition-based approaches and federated SPARQL query systems.

2.1.1 Cloud-based approaches

There have been a number of works (e.g., [23,27,33,34,37,47–49]) focused on managing large RDF datasets using existing cloud platforms; a very good survey of these is [25]. Many of these approaches follow the MapReduce paradigm; in particular, they use HDFS [23,37,48,49] and store RDF triples in flat files in HDFS. When a SPARQL query is issued, the HDFS files are scanned to find the matches of each triple pattern, which are then joined using one of the MapReduce join implementations (see [30] for a more detailed description of these). The most important difference among these approaches is how the RDF triples are stored in HDFS files; this determines how the triples are accessed and the number of MapReduce jobs. In particular, SHARD [37] directly stores the data in a single file, and each line of the file represents all triples associated with a distinct subject. HadoopRDF [23] and PredicateJoin [49] further partition RDF triples based on the predicate and store each partition within one HDFS file. EAGRE [48] first groups all subjects with similar properties into an entity class and then constructs a compressed RDF graph containing only entity classes and the connections between them. It partitions the compressed RDF graph using the METIS algorithm [26]. Entities are placed into HDFS according to the partition set they belong to.

Besides the HDFS-based approaches, there are also some works that use other NoSQL distributed data stores to manage RDF datasets. JenaHBase [27] and H2RDF [33,34] use some permutations of subject, predicate, object to build indices that are then stored in HBase (http://hbase.apache.org). Trinity.RDF [47] uses the distributed memory-cloud graph system Trinity [44] to index and store the RDF graph. It uses hashing on the vertex values to obtain a disjoint partitioning of the RDF graph that is placed on the nodes of a cluster.

These approaches benefit from the high scalability and fault tolerance offered by cloud platforms, but may suffer lower performance due to the difficulties of adapting MapReduce to graph computation.

2.1.2 Partition-based approaches

The partition-based approaches [15,18,21,22,28,29] partition an RDF graph G into several fragments and place each at a different site in a parallel/distributed system. Each site hosts a centralized RDF store of some kind. At run time, a SPARQL query Q is decomposed into several subqueries such that each subquery can be answered locally at one site, and the results are then aggregated. Each of these papers proposes its own data partitioning strategy, and different partitioning strategies result in different query processing methods.

In GraphPartition [22], an RDF graph G is partitioned into n fragments, and each fragment is extended by including N-hop neighbors of boundary vertices. According to the partitioning strategy, the diameter of the graph corresponding to each decomposed subquery should not be larger than N, to enable subquery processing at each local site. WARP [21] uses some frequent structures in the workload to further extend the results of GraphPartition. Partout [15] extends the concept of minterm predicates in relational database systems and uses the results of minterm predicates as the fragmentation units. Lee et al. [28,29] define the partition unit as a vertex and its neighbors, which they call a "vertex block." The vertex blocks are distributed based on a set of heuristic rules. A query is partitioned into blocks that can be executed among all sites in parallel and without any communication. TriAD uses METIS [26] to divide the RDF graph into many partitions, where the number of partitions is much larger than the number of sites. Each partition is considered as a unit and distributed among different sites. At each site, TriAD maintains six large, in-memory vectors of triples, which correspond to all SPO permutations of triples. Meanwhile, TriAD constructs a summary graph to maintain the partitioning information.

All of the above methods require partitioning and distributing the RDF data according to the specific requirements of their approaches. However, in some applications, the RDF repository partitioning strategy is not controlled by the distributed RDF system itself. There may be administrative requirements that influence the data partitioning. For example, in some applications, the RDF knowledge bases are partitioned according to topics (i.e., different domains) or according to different data contributors. Therefore, partition-tolerant SPARQL processing may be desirable. This is the motivation of our partial evaluation and assembly approach.

Also, these approaches evaluate the SPARQL query based on query decomposition, which generates more intermediate results. We provide a detailed experimental comparison in Sect. 7.

2.1.3 Federated SPARQL query systems

Federated queries run SPARQL queries over multiple SPARQL endpoints. A typical example is linked data, where different RDF repositories are interconnected, providing a virtually integrated distributed database. Federated SPARQL query processing is a very different environment from what we target in this paper, but we discuss these systems for completeness.

A common technique is to precompute metadata for each individual SPARQL endpoint. Based on the metadata, the original SPARQL query is decomposed into several subqueries, where each subquery is sent to its relevant SPARQL endpoints. The results of the subqueries are then joined together to answer the original SPARQL query. In DARQ [36], the metadata are called service descriptions, which describe which triple patterns (i.e., predicates) can be answered. In [19], the metadata are called a Q-Tree, which is a variant of the R-tree. Each leaf node in the Q-Tree stores a set of source identifiers, including one for each source of a triple approximated by the node. SPLENDID [16] uses the Vocabulary of Interlinked Datasets (VOID) as the metadata. HiBISCuS [38] relies on capabilities to compute the metadata. For each source, HiBISCuS defines a set of capabilities that map the properties to their subject and object authorities. TopFed [39] is a biological federated SPARQL query engine. Its metadata comprise an N3 specification file and a Tissue Source Site to Tumour (TSS-to-Tumour) hash table, which is devised based on the data distribution.

In contrast to these, FedX [42] does not require preprocessing, but sends "SPARQL ASK" queries to collect the metadata on the fly. Based on the results of the "SPARQL ASK" queries, it decomposes the query into subqueries and assigns the subqueries to relevant SPARQL endpoints.

Global query optimization in this context has also been studied. Most federated query engines employ existing optimizers, such as dynamic programming [3], for optimizing the join order of local queries. Furthermore, DARQ [36] and FedX [42] discuss the use of semijoins to compute a join between intermediate results at the control site and SPARQL endpoints.

2.2 Partial evaluation

Partial evaluation has been used in many applications ranging from compiler optimization to distributed evaluation of functional programming languages [24]. Recently, partial evaluation has also been used for evaluating queries on distributed XML trees and graphs [6–8,13]. In [6–8], partial evaluation is used to evaluate some XPath queries on distributed XML. These works serialize XPath queries into a vector of subqueries and find the partial results of all subqueries at each site by using a top-down [7] or bottom-up [6] traversal over the XML tree. Finally, all partial results are assembled at the server site to form the final results. Note that since XML is a tree-based data structure, these works serialize XPath queries and traverse XML trees in a topological order. However, RDF data and SPARQL queries are graphs rather than trees; serializing SPARQL queries and traversing the RDF graph in a topological order are not intuitive.

There are some prior works that consider partial evaluation on graphs. For example, Fan et al. [13] study reachability query processing over distributed graphs using the partial evaluation strategy. Partial evaluation-based graph simulation is well studied by Fan et al. [14] and Shuai et al. [31]. However, SPARQL query semantics is based on graph homomorphism [35], not graph simulation. The two concepts are formally different (i.e., they produce different results), and the two problems have very different complexities. Homomorphism defines a "function," while simulation defines a "relation"; a relation allows "one-to-many" mappings while a function does not. Consequently, the results are different.


The computational hardness of the two problems is also different. Graph homomorphism is a classical NP-complete problem [11], while graph simulation has a polynomial-time algorithm (O((|V(G)| + |V(Q)|)(|E(G)| + |E(Q)|))) [12], where |V(G)| (|V(Q)|) and |E(G)| (|E(Q)|) denote the number of vertices and edges in the RDF data graph G (query graph Q), respectively. Thus, solutions based on graph simulation cannot be applied to the problem studied in this paper. To the best of our knowledge, there is no prior work applying partial evaluation to SPARQL query processing.

3 Background and framework

An RDF dataset can be represented as a graph where subjects and objects are vertices and triples are labeled edges.

Definition 1 (RDF graph) An RDF graph is denoted as G = {V, E, Σ}, where V is a set of vertices that correspond to all subjects and objects in the RDF data; E ⊆ V × V is a multiset of directed edges that correspond to all triples in the RDF data; and Σ is a set of edge labels. For each edge e ∈ E, its edge label is its corresponding property.

Similarly, a SPARQL query can also be represented as a query graph Q. In this paper, we first focus on basic graph pattern (BGP) queries, as they are foundational to SPARQL, and focus on techniques for handling these. We extend this discussion in Sect. 6 to general SPARQL queries involving FILTER, UNION, and OPTIONAL.

Definition 2 (SPARQL BGP query) A SPARQL BGP query is denoted as Q = {V^Q, E^Q, Σ^Q}, where V^Q ⊆ V ∪ V_Var is a set of vertices (V denotes all vertices in the RDF graph G and V_Var is a set of variables); E^Q ⊆ V^Q × V^Q is a multiset of edges in Q; and each edge e in E^Q either has an edge label in Σ (i.e., a property) or its edge label is a variable.

We assume that Q is a connected graph; otherwise, all connected components of Q are considered separately. Answering a SPARQL query is equivalent to finding all subgraph matches (Definition 3) of Q over the RDF graph G.

Definition 3 (SPARQL match) Consider an RDF graph G and a connected query graph Q that has n vertices {v1, …, vn}. A subgraph M with m vertices {u1, …, um} (in G) is said to be a match of Q if and only if there exists a function f from {v1, …, vn} to {u1, …, um} (n ≥ m), where the following conditions hold:

1. if vi is not a variable, f(vi) and vi have the same URI or literal value (1 ≤ i ≤ n);

2. if vi is a variable, there is no constraint over f(vi) except that f(vi) ∈ {u1, …, um};

3. if there exists an edge vi→vj in Q, there also exists an edge f(vi)→f(vj) in G. Let L(vi→vj) denote the multiset of labels between vi and vj in Q, and L(f(vi)→f(vj)) denote the multiset of labels between f(vi) and f(vj) in G. There must exist an injective function from the edge labels in L(vi→vj) to the edge labels in L(f(vi)→f(vj)). Note that a variable edge label in L(vi→vj) can match any edge label in L(f(vi)→f(vj)).

Vector [f(v1), …, f(vn)] is a serialization of a SPARQL match. Note that we allow f(vi) = f(vj) for 1 ≤ i ≠ j ≤ n. In other words, a match of SPARQL query Q defines a graph homomorphism.
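To make Definition 3 concrete, here is a simplified membership check under two assumptions not made by the definition itself: at most one edge between any ordered vertex pair (so the injective multiset mapping degenerates to label comparison), and variables written with a leading "?". The function names and sample data are ours.

```python
# Simplified check of Definition 3 (single-edge case, illustrative only).

def is_match(q_edges, g_edges, f):
    """q_edges / g_edges: dict (u, v) -> label; f: query vertex -> data vertex."""
    for v in {x for e in q_edges for x in e}:
        # Condition 1: a non-variable vertex must map to the same identifier.
        if not v.startswith("?") and f[v] != v:
            return False
        # Condition 2 (variables unconstrained) needs no check here.
    for (vi, vj), label in q_edges.items():
        # Condition 3: every query edge must be preserved under f.
        g_label = g_edges.get((f[vi], f[vj]))
        if g_label is None:
            return False
        if not label.startswith("?") and g_label != label:
            return False
    return True

q = {("?a", "?d"): "isMarriedTo"}
g = {("act1", "dir1"): "isMarriedTo"}
```

Because f need not be injective, two query vertices may map to the same data vertex, which is exactly the homomorphism (rather than isomorphism) semantics noted above.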

In the context of this paper, an RDF graph G is vertex-disjoint partitioned into a number of fragments, each of which resides at one site. Vertex-disjoint partitioning has been used in most distributed RDF systems, such as GraphPartition [22], EAGRE [48] and TripleGroup [28]. Different distributed RDF systems utilize different vertex-disjoint partitioning algorithms, and the partitioning algorithm is orthogonal to our approach. Any vertex-disjoint partitioning method can be used in our method, such as METIS [26] and MLP [46].

Vertex-disjoint partitioning methods guarantee that there are no overlapping vertices between fragments. However, to guarantee data integrity and consistency, we store replicas of crossing edges. Since the RDF graph G is partitioned by our system, metadata are readily available regarding crossing edges (both outgoing and incoming edges) and the endpoints of crossing edges. Formally, we define the distributed RDF graph as follows.

Definition 4 (Distributed RDF graph) A distributed RDF graph G = {V, E, Σ} consists of a set of fragments F = {F1, F2, …, Fk}, where each Fi is specified by (Vi ∪ Vi^e, Ei ∪ Ei^c, Σi) (i = 1, …, k) such that:

1. {V1, …, Vk} is a partitioning of V, i.e., Vi ∩ Vj = ∅ for 1 ≤ i, j ≤ k, i ≠ j, and ⋃_{i=1,…,k} Vi = V;

2. Ei ⊆ Vi × Vi, i = 1, …, k;

3. Ei^c is the set of crossing edges between Fi and other fragments, i.e.,

Ei^c = (⋃_{1≤j≤k ∧ j≠i} {u→u′ | u ∈ Fi ∧ u′ ∈ Fj ∧ u→u′ ∈ E}) ∪ (⋃_{1≤j≤k ∧ j≠i} {u′→u | u ∈ Fi ∧ u′ ∈ Fj ∧ u′→u ∈ E});

4. A vertex u′ ∈ Vi^e if and only if vertex u′ resides in another fragment Fj and u′ is an endpoint of a crossing edge between fragments Fi and Fj (Fi ≠ Fj), i.e.,

Vi^e = (⋃_{1≤j≤k ∧ j≠i} {u′ | u→u′ ∈ Ei^c ∧ u ∈ Fi}) ∪ (⋃_{1≤j≤k ∧ j≠i} {u′ | u′→u ∈ Ei^c ∧ u ∈ Fi});

5. Vertices in Vi^e are called extended vertices of Fi, and all vertices in Vi are called internal vertices of Fi;

6. Σi is the set of edge labels in Fi.
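Definition 4's bookkeeping can be derived mechanically from any vertex-disjoint partitioning. The sketch below is our own illustrative code (the vertex IDs loosely echo Example 1, but the partitioning is invented): given the edges and a vertex-to-fragment assignment, it computes each fragment's internal edges Ei, crossing edges Ei^c and extended vertices Vi^e.

```python
# Derive per-fragment metadata from a vertex-disjoint partitioning.
# Edges are (u, v, label) triples; `partition` maps vertex -> fragment id.

def fragment_metadata(edges, partition):
    """Return fragment id -> (internal_edges, crossing_edges, extended_vertices)."""
    frags = {}
    for u, v, label in edges:
        fu, fv = partition[u], partition[v]
        for fid in {fu, fv}:
            frags.setdefault(fid, (set(), set(), set()))
        if fu == fv:
            frags[fu][0].add((u, v, label))      # internal edge of E_i
        else:
            frags[fu][1].add((u, v, label))      # crossing edge, replicated
            frags[fv][1].add((u, v, label))      # on both sides (E_i^c)
            frags[fu][2].add(v)                  # v is an extended vertex of F_fu
            frags[fv][2].add(u)                  # u is an extended vertex of F_fv
    return frags

edges = [("001", "012", "p"), ("002", "001", "q"), ("001", "007", "r")]
partition = {"001": 1, "002": 2, "007": 1, "012": 3}
meta = fragment_metadata(edges, partition)
```

Replicating each crossing edge on both sides mirrors the replication of crossing edges mentioned in Sect. 1 and above.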

Example 1 Figure 1 shows a distributed RDF graph G consisting of four fragments F1, F2, F3 and F4. The numbers beside the vertices are vertex IDs, introduced for ease of presentation. In Fig. 1, 002→001 is a crossing edge between F1 and F2. Also, edges 004→011, 001→012 and 006→008 are crossing edges between F1 and F3. Hence, V1^e = {002, 006, 012, 004} and E1^c = {002→001, 004→011, 001→012, 006→008}.

Definition 5 (Problem statement) Let G be a distributed RDF graph that consists of a set of fragments F = {F1, …, Fk}, and let S = {S1, …, Sk} be a set of computing nodes such that Fi is located at Si. Given a SPARQL query graph Q, our goal is to find all SPARQL matches of Q in G.

Note that, for simplicity of exposition, we assume that each site hosts one fragment. Inner matches can be computed locally using a centralized RDF triple store, such as RDF-3x [32], SW-store [1] or gStore [50]. In our prototype development and experiments, we modify gStore, a graph-based SPARQL query engine [50], to perform partial evaluation. The main issue in answering SPARQL queries over a distributed RDF graph is finding crossing matches efficiently. That is a major focus of this paper.

Example 2 Given the SPARQL query graph Q in Fig. 2, the subgraph induced by vertices 014, 007, 001, 002, 009 and 018 (shown in the shaded vertices and red edges in Fig. 1) is a crossing match of Q.

We utilize the partial evaluation and assembly [24] framework to answer SPARQL queries over a distributed RDF graph G. Each site Si treats fragment Fi as the known input s and the other fragments as the yet unavailable input Ḡ (as defined in Sect. 1) [13].

In our execution model, each site Si receives the full query graph Q. In the partial evaluation stage, at each site Si, we find all local partial matches (Definition 6) of Q in Fi. We prove that the overlapping part between any crossing match and fragment Fi must be a local partial match in Fi (see Proposition 1).

To demonstrate the intuition behind dealing with crossing edges, consider the case in Example 2. The crossing match M overlaps with two fragments, F1 and F2. If we can find the overlapping parts between M and F1, and between M and F2, we can assemble them to form a crossing match. For example, the subgraph induced by vertices 014, 007, 001 and 002 is the overlapping part between M and F1. Similarly, we can also find the overlapping part between M and F2. We assemble them based on the common edge 002→001 to form a crossing match, as shown in Fig. 3.

Fig. 3 Assemble local partial matches

In the assembly stage, these local partial matches are assembled to form crossing matches. In this paper, we consider two assembly strategies: centralized and distributed (or parallel). In the centralized strategy, all local partial matches are sent to a single site for assembly. In the distributed/parallel strategy, local partial matches are combined at a number of sites in parallel (see Sect. 5).

There are three steps in our method.

Step 1 (Initialization): A SPARQL query Q is input and sent to each site in S.

Step 2 (Partial Evaluation): Each site Si finds local partial matches of Q over fragment Fi. This step is executed in parallel at each site (Sect. 4).

Step 3 (Assembly): Finally, we assemble all local partial matches to compute complete crossing matches. The system can use the centralized (Sect. 5.2) or the distributed (Sect. 5.3) assembly approach to find crossing matches.
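The three steps can be mocked up as follows. Here `find_local_partial_matches` and `assemble` are placeholders standing in for the algorithms of Sects. 4 and 5, and a thread pool stands in for the parallel sites; all names and behaviors in this sketch are ours, not the paper's implementation.

```python
# Hedged sketch of the three-step execution model: broadcast Q,
# partial-evaluate per fragment in parallel, then assemble.
from concurrent.futures import ThreadPoolExecutor

def find_local_partial_matches(query, fragment):
    # Placeholder: pretend each fragment binds the query variables in order.
    return [{v: u for v, u in zip(query, fragment)}]

def assemble(all_partial_matches):
    # Placeholder for centralized assembly (Sect. 5.2): flatten results.
    return [pm for pms in all_partial_matches for pm in pms]

def evaluate(query, fragments):
    with ThreadPoolExecutor() as pool:                    # Step 2, in parallel
        partial = list(pool.map(
            lambda frag: find_local_partial_matches(query, frag), fragments))
    return assemble(partial)                              # Step 3

matches = evaluate(["?a", "?d"], [["act1", "dir1"], ["act2", "dir2"]])
```

Step 1 corresponds to passing the same `query` to every fragment; no query decomposition takes place.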

4 Partial evaluation

We first formally define a local partial match (Sect. 4.1) andthen discuss how to compute it efficiently (Sect. 4.2).

4.1 Local partial match: definition

Recall that each site Si receives the full query graph Q (i.e., there is no query decomposition). In order to answer query Q, each site Si computes the partial answers (called local partial matches) based on the known input Fi (recall that, for simplicity of exposition, we assume that each site hosts one fragment, as indicated by its subscript). Intuitively, a local partial match PMi is an overlapping part between a crossing match M and fragment Fi at the partial evaluation stage. Moreover, M may or may not exist depending on the yet unavailable input Ḡ. Based only on the known input Fi, we cannot judge whether or not M exists. For example, the subgraph induced by vertices 014, 007, 001 and 002 (shown in shaded vertices and red edges) in Fig. 1 is a local partial match between M and F1.

Definition 6 (Local partial match) Given a SPARQL query graph Q with n vertices {v1, ..., vn} and a connected subgraph PM with m vertices {u1, ..., um} (m ≤ n) in a fragment Fk, PM is a local partial match in fragment Fk if and only if there exists a function f : {v1, ..., vn} → {u1, ..., um} ∪ {NULL}, where the following conditions hold:

1. If vi is not a variable, f(vi) and vi have the same URI or literal, or f(vi) = NULL.

2. If vi is a variable, f(vi) ∈ {u1, ..., um} or f(vi) = NULL.

3. If there exists an edge vi→vj in Q (1 ≤ i ≠ j ≤ n), then PM must meet one of the following five conditions: (1) there also exists an edge f(vi)→f(vj) in PM with property p, and p is the same as the property of vi→vj; (2) there also exists an edge f(vi)→f(vj) in PM with property p, and the property of vi→vj is a variable; (3) there does not exist an edge f(vi)→f(vj), but f(vi) and f(vj) are both in V_k^e (the extended vertices of Fk); (4) f(vi) = NULL; (5) f(vj) = NULL.

4. PM contains at least one crossing edge, which guarantees that an empty match does not qualify.

5. If f(vi) ∈ Vk (i.e., f(vi) is an internal vertex in Fk) and there exists an edge vi→vj ∈ Q (or vj→vi ∈ Q), then there must exist f(vj) ≠ NULL and f(vi)→f(vj) ∈ PM (or f(vj)→f(vi) ∈ PM). Furthermore, if vi→vj (or vj→vi) has a property p, then f(vi)→f(vj) (or f(vj)→f(vi)) has the same property p.

6. Any two vertices vi and vj (in query Q), where f(vi) and f(vj) are both internal vertices in PM, are weakly connected (see Definition 7) in Q.

Vector [f(v1), ..., f(vn)] is a serialization of a local partial match.

Example 3 Given a SPARQL query Q with six vertices in Fig. 2, the subgraph induced by vertices 001, 002, 007 and 014 (shown in shaded circles and red edges) is a local partial match of Q in fragment F1. The function is {(v1, 002), (v2, 001), (v3, NULL), (v4, 007), (v5, NULL), (v6, 014)}. The five different local partial matches in F1 are shown in Fig. 4.

Definition 6 formally defines a local partial match, which is a subset of a complete SPARQL match. Therefore, some conditions in Definition 6 are analogous to those of a SPARQL match, with some subtle differences. In Definition 6, some vertices of query Q are not matched in a local partial match; they are allowed to match a special value NULL (e.g., v3 and v5 in Example 3). As mentioned earlier, a local partial match is the overlapping part of an unknown crossing match and a fragment Fi. Therefore, it must have a crossing edge, i.e., Condition 4.

The basic intuition of Condition 5 is that if a vertex vi (in query Q) is matched to an internal vertex, all of vi's neighbors should be matched in this local partial match as well. The following example illustrates the intuition.

Example 4 Let us recall the local partial match PM_1^2 of fragment F1 in Fig. 4. An internal vertex 001 in fragment F1 is matched to vertex v2 in query Q. Assume that PM_1^2 is an overlapping part between a crossing match M and fragment F1. Obviously, v2's neighbors, such as v1 and v4, should also be matched in M. Furthermore, the matching vertices should be 001's neighbors. Since 001 is an internal vertex in F1, 001's neighbors are also in fragment F1.

Therefore, if a PM violates Condition 5, it cannot be a subgraph of a crossing match. In other words, we are not interested in such subgraphs when finding local partial matches, since they do not contribute to any crossing match.

Definition 7 Two vertices are weakly connected in a directed graph if and only if there exists a connected path between the two vertices when all directed edges are replaced with undirected edges. The path is called a weakly connected path between the two vertices.
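Definition 7 amounts to ignoring edge directions and testing ordinary connectivity. A minimal sketch, with the graph represented as a collection of directed edges (the representation is an illustrative assumption):

```python
from collections import defaultdict, deque

def weakly_connected(edges, a, b):
    """Return True if vertices a and b are weakly connected, i.e.,
    connected once every directed edge is treated as undirected."""
    adj = defaultdict(set)
    for u, v in edges:        # replace each directed edge u -> v
        adj[u].add(v)         # with an undirected edge {u, v}
        adj[v].add(u)
    seen, queue = {a}, deque([a])
    while queue:              # plain BFS on the undirected version
        x = queue.popleft()
        if x == b:
            return True
        for y in adj[x] - seen:
            seen.add(y)
            queue.append(y)
    return a == b
```

For example, with edges v1→v2 and v3→v2, vertices v1 and v3 are weakly connected through v2 even though no directed path exists between them.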

Condition 6 will be used to prove the correctness of our algorithm in Propositions 1 and 2. The following example shows all local partial matches in the running example.

Example 5 Given the query Q in Fig. 2 and the RDF graph G in Fig. 1, Fig. 4 shows all local partial matches and their serialization vectors in each fragment. A local partial match in fragment Fi is denoted as PM_i^j, where the superscript distinguishes different local partial matches in the same fragment. Furthermore, we underline all extended vertices in the serialization vectors.

The correctness of our method is stated in the following propositions.

1. The overlapping part between any crossing match M and internal vertices of fragment Fi (i = 1, ..., k) must be a local partial match (see Proposition 1).

2. Missing any local partial match may lead to result dismissal. Thus, the algorithm should find all local partial matches in each fragment (see Proposition 2).

3. It is impossible to find two local partial matches M and M′ in a fragment F, where M′ is a subgraph of M; i.e., each local partial match is maximal (see Proposition 4).

Fig. 4 Local partial matches of Q in each fragment

Proposition 1 Given any crossing match M of SPARQL query Q in an RDF graph G, if M overlaps with some fragment Fi, let (M ∩ Fi) denote the overlapping part between M and fragment Fi. Assume that (M ∩ Fi) consists of several weakly connected components, denoted as (M ∩ Fi) = {PM1, ..., PMn}. Each weakly connected component PMa (1 ≤ a ≤ n) in (M ∩ Fi) must be a local partial match in fragment Fi.

Proof (1) Since PMa (1 ≤ a ≤ n) is a subset of a SPARQL match, it is easy to show that Conditions 1–3 of Definition 6 hold.

(2) We prove that each weakly connected component PMa (1 ≤ a ≤ n) must have at least one crossing edge (i.e., Condition 4) as follows.

Since M is a crossing match of SPARQL query Q, M must be weakly connected, i.e., any two vertices in M are weakly connected. Assume that (M ∩ Fi) consists of several weakly connected components, denoted as (M ∩ Fi) = {PM1, ..., PMn}. Let M = (M ∩ Fi) ∪ (M \ (M ∩ Fi)), where M \ (M ∩ Fi) denotes the part of M outside (M ∩ Fi). It is straightforward to show that M \ (M ∩ Fi) must occur in other fragments; otherwise, it would be found in (M ∩ Fi). The components PMa (1 ≤ a ≤ n) are weakly disconnected from each other because we remove M \ (M ∩ Fi) from M. Hence, each PMa must have at least one crossing edge connecting PMa with M \ (M ∩ Fi): the latter lies in other fragments, and only crossing edges can connect fragment Fi with other fragments. Otherwise, PMa would be a separated part of the crossing match M. Since M is weakly connected, PMa has at least one crossing edge, i.e., Condition 4.

(3) For Condition 5: for any internal vertex u in PMa (1 ≤ a ≤ n), PMa retains all of u's incident edges. Thus, Condition 5 holds.

(4) We define PMa (1 ≤ a ≤ n) as a weakly connected part of (M ∩ Fi). Thus, Condition 6 holds.

To summarize, the overlapping part between M and fragment Fi satisfies all conditions of Definition 6. Thus, Proposition 1 holds. □

Let us recall Example 5. There are some local partial matches that do not contribute to any crossing match, such as PM_1^5 in Fig. 4. We call these local partial matches false positives. However, the partial evaluation stage only depends on the known input. If we do not know the structures of the other fragments, we cannot judge whether or not PM_1^5 is a false positive. Formally, we have the following proposition, stating that we have to find all local partial matches in each fragment Fi in the partial evaluation stage.

Proposition 2 The partial evaluation and assembly algorithm does not miss any crossing matches in the answer set if and only if all local partial matches in each fragment are found in the partial evaluation stage.

Proof In two parts:

(1) The “If” part (proven by contradiction). Assume that all local partial matches are found in each fragment Fi but a crossing match M is missed in the answer set. Since M is a crossing match, suppose that M overlaps with m fragments F1, ..., Fm. According to Proposition 1, the overlapping part between M and Fi (i = 1, ..., m) must be a local partial match PMi in Fi. According to the assumption, these local partial matches have been found in the partial evaluation stage. Obviously, we can assemble these partial matches PMi (i = 1, ..., m) to form the complete crossing match M. In other words, M would not be missed if all local partial matches are found. This contradicts the assumption.

(2) The “Only If” part (proven by contradiction). We assume that a local partial match PMi in fragment Fi is missed and that the answer set can still satisfy the no-false-negative requirement. Suppose that PMi matches a part of Q, denoted as Q′. Assume that there exists another local partial match PMj in Fj that matches the complementary graph of Q′, denoted as Q̄ = Q \ Q′. In this case, we can obtain a complete match M by assembling the two local partial matches. If PMi in Fi is missed, then match M is missed. In other words, the answer set cannot satisfy the no-false-negative requirement. This also contradicts the assumption. □

Proposition 2 guarantees that no local partial matches will be missed. This is important to avoid false negatives. Based on Proposition 2, we can further prove the following proposition, which guarantees that the intermediate results in our method involve the smallest number of vertices and edges.

Proposition 3 Given the same underlying partitioning over RDF graph G, the number of vertices and edges involved in the intermediate results (in our approach) is not larger than that in any other partition-based solution.

Proof In Proposition 2, we prove that every local partial match should be found for result completeness (i.e., to avoid false negatives). The same proposition proves that our method produces complete results. Therefore, if a partition-based solution omits some of the partial matches (i.e., intermediate results) that are in our solution (i.e., has intermediate results smaller than ours), then it cannot produce complete results. Assuming that they all produce complete results, what remains to be proven is that our set of partial matches is a subset of those generated by other partition-based solutions. We prove that by contradiction.

Let A be a solution generated by an alternative partition-based approach. Assume that there exists one vertex u in a local partial match PM produced by our method, but u is not in the intermediate results of the partition-based solution A. This would mean that, during the assembly phase to produce the final result, any edges adjacent to u will be missed. This would produce an incomplete answer, which contradicts the completeness assumption.

Similarly, it can be argued that it is impossible that there exists an edge in our local partial matches (i.e., intermediate results) that is not in the intermediate results of other partition-based approaches.

In other words, all vertices and edges in local partial matches must occur in the intermediate results of other partition-based approaches. Therefore, Proposition 3 holds. □

Finally, we discuss another feature of a local partial match PMi in fragment Fi: a PMi cannot be enlarged by introducing more vertices or edges to become a larger local partial match. The following proposition formalizes this.

Proposition 4 Given a query graph Q and an RDF graph G, if PMi is a local partial match under function f in fragment Fi, then there exists no local partial match PMi′ under function f′ in Fi, where f ⊂ f′.

Proof (by contradiction) Assume that there exists another local partial match PMi′ of query Q in fragment Fi, where PMi is a subgraph of PMi′. Since PMi is a subgraph of PMi′, there must exist at least one edge e = u→u′ where e ∈ PMi′ and e ∉ PMi. Assume that u→u′ matches edge v→v′ in query Q. Obviously, at least one endpoint of e should be an internal vertex; we assume that u is an internal vertex. According to Condition (5) of Definition 6 and Claim (1), edge v→v′ should also be matched in PMi, since PMi is a local partial match. However, edge u→u′ (matching v→v′) does not exist in PMi. This contradicts PMi being a local partial match. Thus, Proposition 4 holds. □

4.2 Computing local partial matches

Given a SPARQL query Q and a fragment Fi, the goal of partial evaluation is to find all local partial matches (according to Definition 6) in Fi. The matching process consists of determining a function f that associates vertices of Q with vertices of Fi. The matches are expressed as a set of pairs (v, u) (v ∈ Q and u ∈ Fi). A pair (v, u) represents the matching of a vertex v of query Q with a vertex u of fragment Fi. The set of vertex pairs (v, u) constitutes the function f referred to in Definition 6.

A high-level description of finding local partial matches is outlined in Algorithm 1 and Function ComParMatch. According to Conditions 1 and 2 of Definition 6, each vertex v in query graph Q has a candidate list of vertices in fragment Fi. Since function f is a set of vertex pairs (v, u) (v ∈ Q and u ∈ Fi), we start with an empty set. In each step, we introduce a candidate vertex pair (v, u) to expand the current function f, where vertex u (in fragment Fi) is a candidate of vertex v (in query Q).

Assume that we introduce a new candidate vertex pair (v′, u′) into the current function f to form another function f′. If f′ violates any condition except for Conditions 4 and 5 of Definition 6, the new function f′ cannot lead to a local partial match (Lines 6–7 in Function ComParMatch). If f′ satisfies all conditions except for Conditions 4 and 5, it means that f′ can be further expanded (Lines 8–9 in Function ComParMatch). If f′ satisfies all conditions, then f′ specifies a local partial match and it is reported (Lines 10–11 in Function ComParMatch).

At each step, a new candidate vertex pair (v′, u′) is added to an existing function f to form a new function f′. The order of selecting the query vertex can be arbitrarily defined. However, QuickSI [43] proposes several heuristic rules to select an optimized order that can speed up the matching process. These rules are also utilized in our experiments.

To compute local partial matches (Algorithm 1), we revise a graph-based SPARQL query engine, gStore, which is our previous work. Since gStore adopts a "subgraph matching" technique to answer SPARQL queries, it is easy to revise its subgraph matching algorithm to find "local partial matches" in each fragment. gStore adopts a state transformation technique to find SPARQL matches. Here, a state corresponds to a partial match (i.e., a function from Q to G).

Algorithm 1: Computing Local Partial Matches
Input: A fragment Fi and a query graph Q.
Output: The set of all local (maximal) partial matches in Fi, denoted as Ω(Fi).
1 Select one vertex v in Q
2 for each candidate vertex u with regard to v do
3     Initialize a function f with (v, u)
4     Call Function ComParMatch(f)
5 Return Ω(Fi)

Function ComParMatch(f)
1 if all vertices of query Q have been matched in the function f then
2     Return
3 Select an unmatched vertex v′ adjacent to a matched vertex v in the function f
4 for each candidate vertex u′ with regard to v′ do
5     f′ ← f ∪ (v′, u′)
6     if f′ violates any condition (except for Conditions 4 and 5 of Definition 6) then
7         Continue
8     if f′ satisfies all conditions (except for Conditions 4 and 5 of Definition 6) then
9         ComParMatch(f′)
10    if f′ satisfies all conditions of Definition 6 then
11        f′ specifies a local partial match PM that is inserted into the answer set Ω(Fi)

Our state transformation algorithm is as follows. Assume that vertex v matches vertex u in SPARQL query Q. We first initialize a state with v. Then, we search the RDF data graph for v's neighbor v′ corresponding to u′ in Q, where u′ is one of u's neighbors and edge v→v′ satisfies query edge u→u′. The search extends the state step by step. A search branch terminates when a state corresponding to a match is found or when the search cannot continue. In the latter case, the algorithm backtracks and tries another search branch.

The only change that is required to implement Algorithm 1 is in the termination condition (i.e., the final state), so that it stops when a partial match is found rather than looking for a complete match.
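As an illustration of this backtracking scheme, the following heavily simplified sketch enumerates maximal partial mappings of a query graph into a data graph. It omits edge properties, candidate filtering, the NULL semantics and the crossing-edge condition of Definition 6, so it only mirrors the control flow of Algorithm 1, not its exact semantics; every name is illustrative.

```python
# Heavily simplified sketch of the state-transformation search: a state is
# a partial function f from query vertices to data vertices, extended one
# vertex pair at a time; a branch is reported when it cannot be extended
# further (standing in for "stop at a partial match" in Algorithm 1).
# Properties, candidate filtering, NULL semantics and the crossing-edge
# condition of Definition 6 are all omitted.

def partial_matches(q_edges, d_edges):
    q_verts = {v for e in q_edges for v in e}
    d_verts = {v for e in d_edges for v in e}
    results = set()

    def consistent(f):
        # every query edge whose endpoints are both mapped must be
        # matched by a data edge between the mapped vertices
        return all((f[a], f[b]) in d_edges
                   for a, b in q_edges if a in f and b in f)

    def expand(f):
        # unmapped query vertices adjacent to an already mapped one
        frontier = [w for w in q_verts - f.keys()
                    if any(w in (a, b) and ({a, b} - {w}) <= f.keys()
                           for a, b in q_edges)]
        grown = False
        for w in frontier:
            for u in d_verts - set(f.values()):
                g = {**f, w: u}
                if consistent(g):
                    grown = True
                    expand(g)
        if not grown:
            results.add(frozenset(f.items()))  # maximal branch: report it

    for v in q_verts:          # try every seed pair (v, u)
        for u in d_verts:
            expand({v: u})
    return results
```

For a path query v1→v2→v3 over the data path a→b→c, the reported mappings include the complete match {v1: a, v2: b, v3: c} as well as maximal partial mappings such as {v1: b, v2: c}.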

Example 6 Figure 5 shows how to compute Q's local partial matches in fragment F1. Suppose that we initialize a function f with (v3, 005). In the second step, we expand to v1 and consider v1's candidates, which are 002 and 028. Hence, we introduce two vertex pairs (v1, 002) and (v1, 028) to expand f. Similarly, we introduce (v5, 027) into the function {(v3, 005), (v1, 002)} in the third step. Then, {(v3, 005), (v1, 002), (v5, 027)} satisfies all conditions of Definition 6; thus, it is a local partial match and is returned. In another search branch, we check the function {(v3, 005), (v1, 028)}, which cannot be expanded, i.e., we cannot introduce a new matching pair without violating some conditions of Definition 6. Therefore, this search branch is terminated.

Fig. 5 Finding local partial matches

5 Assembly

Each site Si finds all local partial matches in fragment Fi. The next step is to assemble the partial matches to compute crossing matches and the final results. We propose two assembly strategies: centralized and distributed (or parallel). In centralized assembly, all local partial matches are sent to a single site for assembly; for example, in a client/server system, all local partial matches may be sent to the server. In distributed/parallel assembly, local partial matches are combined at a number of sites in parallel. When Si sends the local partial matches to the assembly site for joining, it also tags which vertices in the local partial matches are internal vertices or extended vertices of Fi. This will be useful for avoiding some computations, as discussed in this section.

In Sect. 5.1, we define a basic join operator for assembly. Then, we propose a centralized assembly algorithm in Sect. 5.2 using the join operator. In Sect. 5.3, we study how to assemble local partial matches in a distributed manner.

5.1 Join-based assembly

We first define the conditions under which two partial matches are joinable. Obviously, crossing matches can only be formed by assembling partial matches from different fragments. If local partial matches from the same fragment could be assembled, this would result in a larger local partial match in the same fragment, which is contrary to Proposition 4.

Definition 8 (Joinable) Given a query graph Q and two fragments Fi and Fj (i ≠ j), let PMi and PMj be the corresponding local partial matches over fragments Fi and Fj under functions fi and fj. PMi and PMj are joinable if and only if the following conditions hold:


1. There exist no two distinct vertices u and u′ in PMi and PMj, respectively, such that fi⁻¹(u) = fj⁻¹(u′).

2. There exists at least one crossing edge u→u′ such that u is an internal vertex and u′ is an extended vertex in Fi, while u is an extended vertex and u′ is an internal vertex in Fj. Furthermore, fi⁻¹(u) = fj⁻¹(u) and fi⁻¹(u′) = fj⁻¹(u′).

The first condition says that the same query vertex cannot be matched by different internal vertices in joinable partial matches. The second condition says that two local partial matches share at least one common crossing edge that corresponds to the same query edge.

Example 7 Let us recall query Q in Fig. 2. Figure 3 shows two different local partial matches, PM_1^2 and PM_2^2, together with their functions. There do not exist two different vertices in the two local partial matches that match the same query vertex. Furthermore, they share a common crossing edge 002→001, where 002 and 001 match query vertices v2 and v1 in the two local partial matches, respectively. Hence, they are joinable.

The join result of two joinable local partial matches is defined as follows.

Definition 9 (Join result) Given a query graph Q and two fragments Fi and Fj (i ≠ j), let PMi and PMj be two joinable local partial matches of Q over fragments Fi and Fj under functions fi and fj, respectively. The join of PMi and PMj is defined under a new function f (denoted as PM = PMi ⋈_f PMj), where f is defined as follows for any vertex v in Q (here fj(v) = NULL means that vertex v in query Q is not matched in local partial match PMj, as in Condition 2 of Definition 6, and "←" denotes the assignment operator):

1. if fi(v) ≠ NULL ∧ fj(v) = NULL, f(v) ← fi(v);
2. if fi(v) = NULL ∧ fj(v) ≠ NULL, f(v) ← fj(v);
3. if fi(v) ≠ NULL ∧ fj(v) ≠ NULL, f(v) ← fi(v) (in this case, fi(v) = fj(v));
4. if fi(v) = NULL ∧ fj(v) = NULL, f(v) ← NULL.

Figure 3 shows the join result of PM_1^2 ⋈_f PM_2^2.
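The joinability test of Definition 8 and the join of Definition 9 can be sketched as follows. The encoding of a local partial match as a pair (f, set of internal data vertices) and all names are illustrative assumptions; the sketch checks Condition 2 in one orientation only, so it should be called both ways to cover both role assignments.

```python
# Sketch of Definitions 8 and 9. A local partial match is modelled as a
# pair (f, internal): f maps query vertices to data vertices (or None for
# NULL), and internal is the set of data vertices internal to the match's
# fragment. This encoding is an illustrative assumption.

def joinable(pm_i, pm_j, q_edges):
    (f_i, int_i), (f_j, int_j) = pm_i, pm_j
    # Condition 1: a query vertex matched in both local partial matches
    # must be matched to the same data vertex.
    for v in f_i:
        if f_i[v] is not None and f_j.get(v) is not None and f_i[v] != f_j[v]:
            return False
    # Condition 2: a shared crossing edge u -> u', internal on one side
    # and extended (non-internal) on the other, in opposite roles.
    for a, b in q_edges:
        u, w = f_i.get(a), f_i.get(b)
        if u is None or w is None or (f_j.get(a), f_j.get(b)) != (u, w):
            continue
        if u in int_i and w not in int_i and u not in int_j and w in int_j:
            return True
    return False

def join(pm_i, pm_j):
    # Definition 9: take f_i(v) when it is non-NULL, else f_j(v); when
    # both are non-NULL they agree by Condition 1 of Definition 8.
    (f_i, int_i), (f_j, int_j) = pm_i, pm_j
    f = {v: f_i.get(v) if f_i.get(v) is not None else f_j.get(v)
         for v in f_i.keys() | f_j.keys()}
    return (f, int_i | int_j)
```

With data shaped like Example 7 (query edge v2→v1, crossing edge 002→001 internal in one fragment and extended in the other), `joinable` accepts the pair and `join` merges the two functions.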

5.2 Centralized assembly

In centralized assembly, all local partial matches are sent to a final assembly site. We propose an iterative join algorithm (Algorithm 2) to find all crossing matches. In each iteration, a pair of local partial matches is joined. When the join is complete (i.e., a match has been found), the result is returned (Lines 12–13 in Algorithm 2); otherwise, it is joined with other local partial matches in the next iteration (Lines 14–15). There are |V(Q)| iterations of Lines 4–16 in the worst case, since at each iteration only a single new matching vertex is introduced (in the worst case) and Q has |V(Q)| vertices. If no new intermediate results are generated at some iteration, the algorithm can stop early (Lines 5–6).

Fig. 6 Joining PM_1^4, PM_3^1 and PM_4^1

Example 8 Figure 3 shows the join result of PM_1^2 ⋈_f PM_2^2. In this example, we consider a crossing match formed by three local partial matches. Let us consider the three local partial matches PM_1^4, PM_4^1 and PM_3^1 in Fig. 4. In the first iteration, we obtain the intermediate result PM_1^4 ⋈_f PM_3^1 in Fig. 6. Then, in the next iteration, (PM_1^4 ⋈_f PM_3^1) joins with PM_4^1 to obtain a crossing match.

5.2.1 Partitioning-based join processing

The join space in Algorithm 2 is large, since we need to check whether every pair of local partial matches PMi and PMj is joinable. This subsection proposes an optimized technique to reduce the join space.

The intuition of our method is as follows. We divide all local partial matches into multiple partitions such that two local partial matches in the same partition cannot be joinable; we only consider joining local partial matches from different partitions. The following theorem specifies which local partial matches can be put in the same partition.

Algorithm 2: Centralized Join-based Assembly
Input: Ω(Fi), i.e., the set of local partial matches in each fragment Fi, i = 1, ..., k.
Output: The set RS of all crossing matches.
1 Each fragment Fi sends its set of local partial matches Ω(Fi) to a single site for assembly
2 Let Ω ← ⋃_{i=1}^{k} Ω(Fi)
3 Set MS ← Ω
4 while MS ≠ ∅ do
5     Set MS′ ← ∅
6     for each local partial match PM in MS do
7         for each local partial match PM′ in Ω do
8             if PM and PM′ are joinable then
9                 Set PM″ ← PM ⋈ PM′
10                if PM″ is a complete match of Q then
11                    Put PM″ into RS
12                else
13                    Put PM″ into MS′
14    MS ← MS′
15 Return RS
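The control flow of Algorithm 2 can be sketched as follows. Here `joinable`, `join` and `is_complete` are passed in as stand-ins for the notions of Sect. 5.1, and termination relies on each successful join strictly enlarging a partial match, which Definition 8 guarantees for real local partial matches.

```python
# Sketch of Algorithm 2 (centralized join-based assembly). joinable, join
# and is_complete are caller-supplied stand-ins for the notions of
# Sect. 5.1; all names here are illustrative.

def centralized_assembly(omega, joinable, join, is_complete):
    rs = []               # complete crossing matches
    ms = list(omega)      # current generation of intermediate results
    while ms:
        ms_next = []
        for pm in ms:
            for pm2 in omega:
                if joinable(pm, pm2):
                    merged = join(pm, pm2)
                    if is_complete(merged):
                        rs.append(merged)        # a match has been found
                    else:
                        ms_next.append(merged)   # extend it next round
        ms = ms_next
    return rs
```

With toy partial matches modelled as sets of covered query vertices, joinable when the other side adds something new, the loop assembles the complete match in two rounds.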

Fig. 7 The local partial match partition on v1

Theorem 1 Given two local partial matches PMi and PMj from fragments Fi and Fj with functions fi and fj, respectively, if there exists a query vertex v where both fi(v) and fj(v) are internal vertices of fragments Fi and Fj, respectively, then PMi and PMj are not joinable.

Proof If fi(v) ≠ fj(v), then vertex v in query Q matches two different vertices in PMi and PMj, respectively. Obviously, PMi and PMj cannot be joinable.

If fi(v) = fj(v), since fi(v) and fj(v) are both internal vertices, both PMi and PMj are from the same fragment. As mentioned earlier, it is impossible to assemble two local partial matches from the same fragment (see the first paragraph of Sect. 5.1); thus, PMi and PMj cannot be joinable. □

Example 9 Figure 7 shows the serialization vectors (defined in Definition 6) of four local partial matches. For each local partial match, there is an internal vertex that matches v1 in the query graph. The underline indicates the extended vertices in each local partial match. According to Theorem 1, none of them are joinable.

Definition 10 (Local partial match partitioning) Consider a SPARQL query Q with n vertices {v1, ..., vn}. Let Ω denote all local partial matches. P = {Pv1, ..., Pvn} is a partitioning of Ω if and only if the following conditions hold:

1. Each partition Pvi (i = 1, ..., n) consists of a set of local partial matches, each of which has an internal vertex that matches vi.
2. Pvi ∩ Pvj = ∅, where 1 ≤ i ≠ j ≤ n.
3. Pv1 ∪ ... ∪ Pvn = Ω.

Example 10 Let us consider all local partial matches of our running example in Fig. 4. Figure 8 shows two different partitionings.

As mentioned earlier, we only need to consider joining local partial matches from different partitions of P. Given a partitioning P = {Pv1, ..., Pvn}, Algorithm 3 shows how to perform the partitioning-based join of local partial matches. Note that different partitionings and different join orders within a partitioning will impact the performance of Algorithm 3. In Algorithm 3, we assume that the partitioning P = {Pv1, ..., Pvn} is given and that the join order is from Pv1 to Pvn, i.e., the order in P. Choosing a good partitioning and the optimal join order will be discussed in Sects. 5.2.2 and 5.2.3.

Algorithm 3: Partitioning-based Joining of Local Partial Matches
Input: A partitioning P = {Pv1, ..., Pvn} of all local partial matches.
Output: The set RS of all crossing matches.
1 MS ← Pv1
2 for i ← 2 to n do
3     MS′ ← ∅
4     for each partial match PM in MS do
5         for each partial match PM′ in Pvi do
6             if PM and PM′ are joinable then
7                 Set PM″ ← PM ⋈ PM′
8                 if PM″ is a complete match then
9                     Put PM″ into the answer set RS
10                else
11                    Put PM″ into MS′
12            Put PM′ into MS′
13    Insert MS′ into MS
14 Return RS

The basic idea of Algorithm 3 is to iterate the join process over the partitions of P. First, we set MS ← Pv1 (Line 1 in Algorithm 3). Then, we try to join the local partial matches PM in MS with the local partial matches PM′ in Pv2 (the first loop of Lines 3–13). If a join result is a complete match, it is inserted into the answer set RS (Lines 8–9). If a join result is an intermediate result, we insert it into a temporary set MS′ (Lines 10–11). We also need to insert PM′ into MS′, since the local partial match PM′ (in Pv2) may join with local partial matches in later partitions of P (Line 12). At the end of the iteration, we insert all intermediate results (in MS′) into MS, and they may join with local partial matches in later partitions of P in subsequent iterations (Line 13). We iterate the above steps for each partition of P (Lines 3–13).
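The pass-per-partition structure of Algorithm 3 can be sketched as follows; as before, `joinable`, `join` and `is_complete` are stand-ins for the notions of Sect. 5.1, and the encoding of partial matches is illustrative.

```python
# Sketch of Algorithm 3 (partitioning-based join). partitions is the
# ordered list [P_v1, ..., P_vn]; joinable, join and is_complete are
# caller-supplied stand-ins for the notions of Sect. 5.1.

def partitioning_based_join(partitions, joinable, join, is_complete):
    rs = []
    ms = list(partitions[0])          # Line 1: MS <- P_v1
    for p_vi in partitions[1:]:       # Lines 2-13: one pass per partition
        ms_next = []
        for pm2 in p_vi:
            for pm in ms:
                if joinable(pm, pm2):
                    merged = join(pm, pm2)
                    if is_complete(merged):
                        rs.append(merged)
                    else:
                        ms_next.append(merged)
            ms_next.append(pm2)       # pm2 may join with later partitions
        ms.extend(ms_next)            # Line 13: insert MS' into MS
    return rs
```

Compared with the sketch of Algorithm 2, each intermediate result is only tested against the partitions that come later in the order, which is what shrinks the join space.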

Fig. 8 Evaluation of two partitionings (a) and (b) of local partial matches

5.2.2 Finding the optimal partitioning

Obviously, given a set Ω of local partial matches, there may be multiple feasible local partial match partitionings, each of which leads to a different join performance. In this subsection, we discuss how to find the “optimal” local partial match partitioning over Ω, which minimizes the join time of Algorithm 4.

First, there is a need for a measure that defines more precisely the join cost of a local partial match partitioning. We define it as follows.

Definition 11 (Join cost) Given a query graph Q with n vertices v1, ..., vn and a partitioning P = {Pv1, ..., Pvn} over all local partial matches Ω, the join cost is

Cost(Ω) = O( ∏_{i=1}^{n} (|Pvi| + 1) )    (1)

where |Pvi| is the number of local partial matches in Pvi and 1 is introduced to avoid a “0” factor in the product.

Definition 11 assumes that each pair of local partial matches (from different partitions of P) is joinable so that we can quantify the worst-case performance. Naturally, more sophisticated and more realistic cost functions can be used instead, but finding the most appropriate cost function is a major research issue in itself and outside the scope of this paper.

Example 11 The cost of the partitioning in Fig. 8a is 5 × 4 × 4 = 80, while that of Fig. 8b is 6 × 3 × 4 = 72. Hence, the partitioning in Fig. 8b has the lower join cost.
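Under Definition 11, the cost is just a product of (|Pvi| + 1) factors over the partition sizes; a minimal helper, assuming the factors quoted in Example 11 are these (|Pvi| + 1) terms:

```python
from math import prod

def join_cost(partition_sizes):
    """Join cost of Definition 11: the product of (|P_vi| + 1)
    over the given partition sizes."""
    return prod(size + 1 for size in partition_sizes)
```

Under that assumption, `join_cost([4, 3, 3])` reproduces the factors 5 × 4 × 4 = 80 of Fig. 8a and `join_cost([5, 2, 3])` the factors 6 × 3 × 4 = 72 of Fig. 8b, so the second partitioning is preferred.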

Based on the definition of join cost, the “optimal” local partial match partitioning is the one with the minimal join cost. We formally define the optimal partitioning as follows.

Definition 12 (Optimal partitioning) Given a partitioning P over all local partial matches Ω, P is the optimal partitioning if and only if there exists no other partitioning with a smaller join cost.

Unfortunately, Theorem 2 shows that finding the optimal partitioning is NP-complete.

Theorem 2 Finding the optimal partitioning is an NP-complete problem.

Proof We can reduce a 0–1 integer programming problem to finding the optimal partitioning. We build a bipartite graph B, which contains two vertex groups B1 and B2. Each vertex aj in B1 corresponds to a local partial match PMj in Ω, j = 1, ..., |Ω|. Each vertex bi in B2 corresponds to a query vertex vi, i = 1, ..., n. We introduce an edge between aj and bi if and only if PMj has an internal vertex that matches query vertex vi. Let a variable xji denote the label of edge aj bi. Figure 9 shows an example bipartite graph of all local partial matches in Fig. 4.

We formulate the 0–1 integer programming problem as follows:

min ∏_{i=1}^{n} ( ∑_{j} xji + 1 )
s.t. ∀j, ∑_{i} xji = 1

The constraint means that each local partial match should be assigned to exactly one query vertex.

The equivalence between the 0–1 integer programming problem and finding the optimal partitioning is straightforward. The former is a classical NP-complete problem. Thus, the theorem holds. □

Although finding the optimal partitioning is NP-complete (see Theorem 2), in this work we propose an algorithm with time complexity O(2^n × |Ω|), where n (i.e., |V(Q)|) is small in practice. Theoretically, such an algorithm is called fixed-parameter tractable [10]: an algorithm is fixed-parameter tractable for a problem of size l, with respect to a parameter n, if it can be solved in time O(f(n)g(l)), where f(n) can be any function but g(l) must be polynomial [10].

Fig. 9 Example bipartite graph

Fig. 10 Uv1 and Uv3

Our algorithm is based on the following feature of the optimal partitioning (see Theorem 3). Consider a query graph Q with n vertices v1, ..., vn. Let Uvi (i = 1, ..., n) denote all local partial matches (in Ω) that have internal vertices matching vi. Unlike the partitioning defined in Definition 10, Uvi and Uvj (1 ≤ i ≠ j ≤ n) may overlap. For example, PM_2^3 (in Fig. 10) contains an internal vertex 002 that matches v1; thus, PM_2^3 is in Uv1. PM_2^3 also has an internal vertex 010 that matches v3; thus, PM_2^3 is also in Uv3. However, the partitioning defined in Definition 10 does not allow overlaps among the partitions of P.

Theorem 3 Given a query graph Q with n vertices {v1, ..., vn} and the set Ω of all local partial matches, let Uvi (i = 1, ..., n) be all local partial matches (in Ω) that have internal vertices matching vi. For the optimal partitioning Popt = {Pv1, ..., Pvn} where Pvn has the largest size (i.e., the number of local partial matches in Pvn is maximum) in Popt, Pvn = Uvn.

Proof (by contradiction) Assume that Pvn ≠ Uvn in the optimal partitioning Popt = {Pv1, ..., Pvn}. Then, there exists a local partial match PM with PM ∉ Pvn and PM ∈ Uvn. We assume that PM ∈ Pvj, j ≠ n. The cost of Popt = {Pv1, ..., Pvn} is:

Cost(Ω)opt = ( ∏_{1≤i<n, i≠j} (|Pvi| + 1) ) × (|Pvj| + 1) × (|Pvn| + 1)    (2)

Since PM ∈ Uvn, PM has an internal vertex matching vn. Hence, we can also put PM into Pvn. Then, we get a new partitioning P′ = {Pv1, ..., Pvj − {PM}, ..., Pvn ∪ {PM}}. The cost of the new partitioning is:

Cost(Ω) = ( ∏_{1≤i<n, i≠j} (|Pvi| + 1) ) × |Pvj| × (|Pvn| + 2)    (3)

Let C = ∏_{1≤i<n, i≠j} (|Pvi| + 1), which appears in both Eqs. 2 and 3. Obviously, C > 0. Then:

Cost(Ω)opt − Cost(Ω) = C × (|Pvn| + 1) × (|Pvj| + 1) − C × (|Pvn| + 2) × |Pvj| = C × (|Pvn| + 1 − |Pvj|)

Because Pvn is the largest partition in Popt, |Pvn| + 1 − |Pvj| > 0. Furthermore, C > 0. Hence, Cost(Ω)opt − Cost(Ω) > 0, meaning that the optimal partitioning would have larger cost. Obviously, this cannot happen.

Therefore, in the optimal partitioning Popt, where |Pvn| is the largest, we cannot find a local partial match PM with PM ∉ Pvn and PM ∈ Uvn. In other words, Pvn = Uvn in the optimal partitioning. □

Let Ω denote all local partial matches. Assume that the optimal partitioning is P_opt = {P_{v_1}, P_{v_2}, …, P_{v_n}}. We reorder the partitions of P_opt in non-ascending order of size, i.e., P_opt = {P_{v_{k_1}}, …, P_{v_{k_n}}}, |P_{v_{k_1}}| ≥ |P_{v_{k_2}}| ≥ … ≥ |P_{v_{k_n}}|. According to Theorem 3, we can conclude that P_{v_{k_1}} = U_{v_{k_1}} in the optimal partitioning P_opt.

Let Ω_{v_{k_1}} = Ω − U_{v_{k_1}}, i.e., the set of local partial matches excluding the ones with an internal vertex matching v_{k_1}. It is straightforward to see that Cost(Ω)_opt = |P_{v_{k_1}}| × Cost(Ω_{v_{k_1}})_opt = |U_{v_{k_1}}| × Cost(Ω_{v_{k_1}})_opt. In the optimal partitioning over Ω_{v_{k_1}}, we assume that P_{v_{k_2}} has the largest size. Iteratively, according to Theorem 3, we know that P_{v_{k_2}} = U′_{v_{k_2}}, where U′_{v_{k_2}} denotes the set of local partial matches with an internal vertex matching v_{k_2} in Ω_{v_{k_1}}.

According to the above analysis, once a vertex order is given, the partitioning over Ω is fixed. Assume that the optimal vertex order, i.e., the one leading to the minimum join cost, is given as {v_{k_1}, …, v_{k_n}}. The partitioning algorithm works as follows.

Let U_{v_{k_1}} denote all local partial matches (in Ω) that have internal vertices matching vertex v_{k_1}.^5 Obviously, U_{v_{k_1}} is fixed once Ω and the vertex order are given. We set P_{v_{k_1}} = U_{v_{k_1}}. In the second iteration, we remove all local partial matches in U_{v_{k_1}} from Ω, i.e., Ω_{v_{k_1}} = Ω − U_{v_{k_1}}. We set U′_{v_{k_2}} to be all local partial matches (in Ω_{v_{k_1}}) that have internal vertices matching vertex v_{k_2}, and set P_{v_{k_2}} = U′_{v_{k_2}}. Iteratively, we can obtain P_{v_{k_3}}, …, P_{v_{k_n}}.

^5 When we find local partial matches in fragment F_i and send them to join, we tag which vertices in local partial matches are internal vertices of F_i.

123

Processing SPARQL queries over distributed RDF graphs

Fig. 11 Example of partitioning local partial matches

Example 12 Consider all local partial matches in Fig. 11. Assume that the optimal vertex order is {v_3, v_1, v_2}. We will discuss how to find the optimal order later. In the first iteration, we set P_{v_3} = U_{v_3}, which contains five local partial matches. For example, PM^1_1 = [002^6, NULL, 005, NULL, 027, NULL] is in U_{v_3}, since internal vertex 005 matches v_3. In the second iteration, we set Ω_{v_3} = Ω − P_{v_3}. Let U′_{v_1} be all local partial matches in Ω_{v_3} that have internal vertices matching vertex v_1. Then, we set P_{v_1} = U′_{v_1}. Iteratively, we obtain the partitioning {P_{v_3}, P_{v_1}, P_{v_2}}, as shown in Fig. 11.
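The iterative assignment above can be sketched in a few lines of Python. Here a local partial match is reduced to just the set of query vertices it covers with internal vertices, which is a simplifying assumption; `partition_by_order`, `omega` and `order` are illustrative names, not from the paper.

```python
def partition_by_order(omega, order):
    """Greedy partitioning of local partial matches given a vertex order.

    omega: list of sets; each set holds the query vertices that a local
           partial match covers with *internal* vertices.
    order: vertex order [vk1, ..., vkn].
    Returns {vertex: list of local partial matches assigned to P_v}.
    """
    remaining = list(omega)
    partitioning = {}
    for v in order:
        # U'_v: matches left in Omega that have an internal vertex matching v
        u_v = [pm for pm in remaining if v in pm]
        partitioning[v] = u_v
        # remove the assigned matches from Omega before the next iteration
        remaining = [pm for pm in remaining if v not in pm]
    return partitioning
```

With the matches of Example 12 reduced to their covered vertices, the first partition collects everything with an internal vertex matching v3, and later partitions only see what is left over.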

Therefore, the challenging problem is how to find the optimal vertex order {v_{k_1}, …, v_{k_n}}. Let Ω_{v_{k_1}} denote all local partial matches (in Ω) that do not contain internal vertices matching v_{k_1}, i.e., Ω_{v_{k_1}} = Ω − U_{v_{k_1}}. It is straightforward to obtain the following optimal substructure^7 in Eq. 4.

Cost(Ω)_opt = |P_{v_{k_1}}| × Cost(Ω_{v_{k_1}})_opt = |U_{v_{k_1}}| × Cost(Ω_{v_{k_1}})_opt   (4)

Since we do not know which vertex is v_{k_1}, we introduce the following recurrence, which is used in our dynamic programming algorithm (Lines 3–7 in Algorithm 4).

Cost(Ω)_opt = MIN_{1≤i≤n} (|P_{v_i}| × Cost(Ω_{v_i})_opt) = MIN_{1≤i≤n} (|U_{v_i}| × Cost(Ω_{v_i})_opt)   (5)

Obviously, it is easy to design a naive dynamic programming algorithm based on Eq. 5. However, it can be further optimized by recording some intermediate results. Based on Eq. 5, we can prove the following equation.

^6 We underline all extended vertices in serialization vectors.
^7 A problem is said to have optimal substructure if an optimal solution can be constructed efficiently from optimal solutions of its subproblems [9]. This property is often used in dynamic programming formulations.

Algorithm 4: Finding the Optimal Partitioning
Input: all local partial matches Ω
Output: the optimal partitioning P_opt and Cost_opt(Ω)
1  minID ← Φ
2  Cost(Ω)_opt ← ∞
3  for i = 1 to n do
4    Cost_opt(Ω − U_{v_i}), P′_i ← ComCost(Ω − U_{v_i}, {v_i})
     /* Call Function ComCost; U_{v_i} denotes all local partial matches (in Ω) that have internal vertices matching v_i */
5    if Cost(Ω)_opt > |U_{v_i}| × Cost_opt(Ω − U_{v_i}) then
6      Cost(Ω)_opt ← |U_{v_i}| × Cost_opt(Ω − U_{v_i})
7      minID ← i
8  P_opt ← {U_{v_minID}} ∪ P′_minID
9  Return P_opt

Cost(Ω)_opt = MIN_{1≤i≤n; 1≤j≤n; i≠j} (|P_{v_i}| × |P_{v_j}| × Cost(Ω_{v_i v_j})_opt)
            = MIN_{1≤i≤n; 1≤j≤n; i≠j} (|U_{v_i}| × |U′_{v_j}| × Cost(Ω_{v_i v_j})_opt)   (6)

where Ω_{v_i v_j} denotes all local partial matches that do not contain internal vertices matching v_i or v_j, and U′_{v_j} denotes all local partial matches (in Ω_{v_i}) that contain internal vertices matching vertex v_j.

However, if Eq. 6 is used naively in the dynamic programming formulation, it results in repeated computations. For example, Cost(Ω_{v_1 v_2})_opt would be computed twice, once in |U_{v_1}| × |U′_{v_2}| × Cost(Ω_{v_1 v_2})_opt and once in |U_{v_2}| × |U′_{v_1}| × Cost(Ω_{v_1 v_2})_opt. To avoid this, we introduce a map that records each Cost(Ω′) that has already been calculated (Line 16 in Function ComCost), so that subsequent uses of Cost(Ω′) can be served directly by looking up the map (Lines 8–10 in Function ComCost).

We can prove that there are ∑_{i=1}^{n} C(n, i) = 2^n − 1 items in the map (worst case), where n = |V(Q)|. Thus, the time complexity of the algorithm is O(2^n × |Ω|). Since n (i.e., |V(Q)|) is small in practice, this algorithm is fixed-parameter tractable.^4
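The memoized dynamic program behind Algorithm 4 and Function ComCost can be sketched as follows. This is a simplified model, not the paper's implementation: each local partial match is reduced to the set of query vertices it covers with internal vertices, the memo map is realized with `functools.lru_cache`, and the cost uses the (|P| + 1) factors of Eq. 2.

```python
from functools import lru_cache

def optimal_partitioning_cost(matches, vertices):
    """Memoized DP over subsets of local partial matches.

    matches: dict {match id: set of query vertices covered by internal
             vertices}; vertices: tuple of query vertex names.
    Returns (optimal cost, vertex order achieving it).
    """
    @lru_cache(maxsize=None)
    def cost(omega, used):
        # omega: frozenset of match ids still to be partitioned
        # used: frozenset of vertices already consumed in the order
        if not omega:
            return 1, ()
        best = None
        for v in vertices:
            if v in used:
                continue
            # U'_v: matches in omega with an internal vertex matching v
            u_v = frozenset(m for m in omega if v in matches[m])
            sub_cost, sub_order = cost(omega - u_v, used | {v})
            c = (len(u_v) + 1) * sub_cost  # (|P_v| + 1) factor of Eq. 2
            if best is None or c < best[0]:
                best = (c, (v,) + sub_order)
        return best
    return cost(frozenset(matches), frozenset())
```

The `lru_cache` plays the role of the MAP in Function ComCost: every distinct remaining set Ω′ is costed only once, and later recursive branches reuse the stored result.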

P. Peng et al.

Function ComCost(Ω′, W)
Input: a local partial match set Ω′ and a set W of vertices that have already been used
Output: the optimal partitioning P′_opt over Ω′ and Cost_opt(Ω′)
1  if Ω′ = Φ then
2    Return P′_opt ← Φ; Cost(Ω′)_opt ← 1
3  else
4    minID ← Φ
5    Cost(Ω′)_opt ← ∞
6    for i = 1 to n do
7      if v_i ∉ W then
8        if MAP contains the key (Ω′ − U′_{v_i}) then
           /* Cost(Ω′ − U′_{v_i})_opt has been calculated before */
9          Cost_opt(Ω′ − U′_{v_i}), P′_i ← MAP[(Ω′ − U′_{v_i})]
           /* fetch the join cost and the optimal partitioning over (Ω′ − U′_{v_i}) from the map */
10       else
11         Cost_opt(Ω′ − U′_{v_i}), P′_i ← ComCost(Ω′ − U′_{v_i}, W ∪ {v_i})
           /* call Function ComCost recursively; U′_{v_i} denotes all local partial matches (in Ω′) that have internal vertices matching v_i */
12       if Cost(Ω′)_opt > |U′_{v_i}| × Cost_opt(Ω′ − U′_{v_i}) then
13         Cost(Ω′)_opt ← |U′_{v_i}| × Cost_opt(Ω′ − U′_{v_i})
14         minID ← i
15   P′_opt ← {U′_{v_minID}} ∪ P′_minID
16   Insert (key = Ω′, value = (Cost(Ω′)_opt, P′_opt)) into the MAP
17   Return Cost(Ω′)_opt and P′_opt

5.2.3 Join order

When we determine the optimal partitioning of local partial matches, the join order is also determined. If the optimal partitioning is P_opt = {P_{v_{k_1}}, …, P_{v_{k_n}}} and |P_{v_{k_1}}| ≥ |P_{v_{k_2}}| ≥ … ≥ |P_{v_{k_n}}|, then the join order must be P_{v_{k_1}} ⋈ P_{v_{k_2}} ⋈ … ⋈ P_{v_{k_n}}. The reasons are as follows.

First, changing the join order cannot prune any intermediate results. Recall the example optimal partitioning {P_{v_3}, P_{v_2}, P_{v_1}} shown in Fig. 8b. The join order should be P_{v_3} ⋈ P_{v_2} ⋈ P_{v_1}, and any change in the join order would not prune intermediate results. For example, if we first join P_{v_2} with P_{v_1}, we cannot prune the local partial matches in P_{v_2} that fail to join with any local partial match in P_{v_1}. This is because there may be some local partial matches in P_{v_3} that have an internal vertex matching v_1 and can join with local partial matches in P_{v_2}. In other words, the result of P_{v_2} ⋈ P_{v_1} is no smaller than P_{v_2}. Similarly, we can show that any other change of the join order has no pruning effect either.

Second, in some special cases, the join order may affect performance. Given a partitioning P_opt = {P_{v_{k_1}}, …, P_{v_{k_n}}} with |P_{v_{k_1}}| ≥ |P_{v_{k_2}}| ≥ … ≥ |P_{v_{k_n}}|, if the set of the first n′ vertices, {v_{k_1}, v_{k_2}, …, v_{k_{n′}}}, is a vertex cut of the query graph, the join order for the remaining n − n′ partitions of P has an effect. For example, consider the partitioning {P_{v_1}, P_{v_3}, P_{v_2}} in Fig. 8a. If this partitioning is optimal, then joining P_{v_1} with P_{v_2} first and joining P_{v_1} with P_{v_3} first both work; however, their costs may differ.^8 In the extreme case where the query graph is a complete graph, the join order has no effect on performance.

In conclusion, when the optimal partitioning is determined as P_opt = {P_{v_{k_1}}, …, P_{v_{k_n}}} with |P_{v_{k_1}}| ≥ |P_{v_{k_2}}| ≥ … ≥ |P_{v_{k_n}}|, the join order must be P_{v_{k_1}} ⋈ P_{v_{k_2}} ⋈ … ⋈ P_{v_{k_n}}. The join cost can be estimated based on the cost function (Definition 11).

5.3 Distributed assembly

An alternative to centralized assembly is to assemble the local partial matches in a distributed fashion. We adopt the Bulk Synchronous Parallel (BSP) model [45] to design a synchronous algorithm for distributed assembly. A BSP computation proceeds in a series of global supersteps, each of which consists of three components: local computation, communication and barrier synchronization. In the following, we discuss how we apply this strategy to distributed assembly.

5.3.1 Local computation

Each processor performs computation on the data stored in its local memory. The computations on different processors are independent and proceed in parallel.

Consider the mth superstep. For each fragment F_i, let Δ_in^m(F_i) denote all intermediate results received in the mth superstep, and let Ω^m(F_i) denote all local partial matches and intermediate results generated in the first (m − 1) supersteps. In the mth superstep, we join the local partial matches in Δ_in^m(F_i) with the local partial matches in Ω^m(F_i) using Algorithm 5. For each intermediate result PM, we check whether it can join with some local partial match PM′ in Ω^m(F_i) ∪ Δ_in^m(F_i). If the join result PM″ = PM ⋈ PM′ is a complete crossing match, it is returned. If the join result PM″ is an intermediate result, we check in the next iteration whether PM″ can join further with another local partial match in Ω^m(F_i) ∪ Δ_in^m(F_i). We also insert the intermediate result PM″ into Δ_out^m(F_i), which will be sent to other fragments in the communication step discussed below. Of course, we could also use the partitioning-based solution (Sect. 5.2.1) to optimize this join processing, but we do not discuss it due to space limitations.
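The superstep join loop of Algorithm 5 can be sketched as follows. Partial matches are modeled as variable-binding dicts, and "joinable" is simplified to "the shared bindings agree", which omits the paper's edge-based joinability conditions; `superstep_local_join` is an illustrative name.

```python
def superstep_local_join(query_vars, omega_m, delta_in):
    """One BSP superstep of local computation (simplified sketch).

    A partial match is a dict {query vertex: data vertex}.
    Returns (complete matches, intermediate results for Delta_out).
    """
    results, delta_out = [], []
    ms = list(delta_in)                      # MS in Algorithm 5
    pool = list(omega_m) + list(delta_in)    # Omega^m(F_i) + Delta_in^m(F_i)
    for _ in range(len(query_vars)):         # at most |V(Q)| rounds
        if not ms:
            break
        next_ms = []
        for pm in ms:
            for other in pool:
                if pm == other:
                    continue
                if any(other.get(v, pm[v]) != pm[v] for v in pm):
                    continue                 # conflicting bindings: not joinable
                joined = {**pm, **other}
                if len(joined) == len(pm):   # no new query vertex covered
                    continue
                if set(joined) == set(query_vars):
                    results.append(joined)   # complete match: put into RS
                else:
                    next_ms.append(joined)   # keep as intermediate result
        delta_out.extend(next_ms)
        ms = next_ms
    return results, delta_out
```
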

5.3.2 Communication

Processors exchange data among themselves. Consider the mth superstep. A straightforward communication strategy is as follows: if an intermediate result PM in Δ_out^m(F_i) shares a crossing edge with fragment F_j, PM is sent from site S_i to site S_j (assuming fragments F_i and F_j are stored at sites S_i and S_j, respectively).

^8 Note that, in this example, the two cost values happen to be the same, but in general they can differ.


Algorithm 5: Local Computation in Each Fragment F_i
Input: Ω^m(F_i), the local partial matches in fragment F_i
Output: RS, the crossing matches found at this superstep; Δ_out^m(F_i), the intermediate results that will be sent
1  Let Ω = Ω^m(F_i) ∪ Δ_in^m(F_i)
2  Set MS = Δ_in^m(F_i)
3  for N = 1 to |V(Q)| do
4    if |MS| = 0 then
5      Break
6    Set MS′ = φ
7    for each local partial match PM in MS do
8      for each local partial match PM′ in Ω^m(F_i) ∪ Δ_in^m(F_i) do
9        if PM and PM′ are joinable then
10         PM″ ← PM ⋈ PM′
11         if PM″ is a SPARQL match then
12           Put PM″ into the answer set RS
13         else
14           Put PM″ into MS′
15   Insert MS′ into Δ_out^m(F_i)
16   Clear MS and set MS ← MS′
17 Ω^{m+1}(F_i) = Ω^m(F_i) ∪ Δ_out^m(F_i)
18 Return RS and Δ_out^m(F_i)

However, the above communication strategy may generate duplicate results. For example, as shown in Fig. 4, we can assemble PM^4_1 (at site S_1) and PM^1_3 (at site S_3) to form a complete crossing match. According to the straightforward communication strategy, PM^4_1 is sent from S_1 to S_3 to produce PM^4_1 ⋈ PM^1_3 at S_3. Similarly, PM^1_3 is sent from S_3 to S_1 to be assembled at site S_1. In other words, we obtain the join result PM^4_1 ⋈ PM^1_3 at both sites S_1 and S_3. This wastes resources and increases the total evaluation time.

To avoid duplicate result computation, we introduce a "divide-and-conquer" approach. We define a total order (≺) over the fragments F in non-descending order of |Ω(F_i)|, i.e., the number of local partial matches found in fragment F_i at the partial evaluation stage.

Definition 13 Given any two fragments F_i and F_j (1 ≤ i, j ≤ n), F_i ≺ F_j if and only if |Ω(F_i)| ≤ |Ω(F_j)|.

Without loss of generality, we assume that F_1 ≺ F_2 ≺ … ≺ F_n in the remainder. The basic idea of the divide-and-conquer approach is as follows. Assume that a crossing match M is formed by joining local partial matches from different fragments F_{i_1}, …, F_{i_m}, where F_{i_1} ≺ F_{i_2} ≺ … ≺ F_{i_m} (1 ≤ i_1, …, i_m ≤ n). The crossing match should only be generated at fragment site S_{i_m}, not at any other fragment site.

For example, at site S_2, we generate crossing matches by joining local partial matches from F_1 and F_2. The crossing matches generated at S_2 should not contain any local partial matches from F_3 or larger fragments (such as F_4, …, F_n). Similarly, at site S_3, we should generate crossing matches by joining local partial matches from F_3 and fragments smaller than F_3; these crossing matches should not contain any local partial match from F_4 or larger fragments (such as F_5, …, F_n).

The divide-and-conquer framework avoids duplicate results, since each crossing match can only be generated at a single site according to the divided search space. To enable this framework, we introduce a constraint over data communication: the transmission (of local partial matches) from fragment site S_i to S_j is allowed only if F_i ≺ F_j.

Consider an intermediate result PM in Δ_out^m(F_i). Assume that PM is generated by joining intermediate results from m different fragments F_{i_1}, …, F_{i_m}, where F_{i_1} ≺ F_{i_2} ≺ … ≺ F_{i_m}. We send PM to another fragment F_j if and only if two conditions hold: (1) F_{i_m} ≺ F_j; and (2) F_j shares common crossing edges with at least one fragment of F_{i_1}, …, F_{i_m}.

5.3.3 Barrier synchronization

All communication in the mth superstep must finish before the (m + 1)th superstep begins.

We now discuss the initial state (i.e., 0th superstep) andthe system termination condition.

Initial state In the 0th superstep, each fragment F_i has only its own local partial matches, i.e., Ω(F_i). Since it is impossible to assemble local partial matches within the same fragment, the 0th superstep requires no local computation and enters the communication stage directly. Each site S_i sends Ω(F_i) to other fragments according to the communication strategy discussed above.

5.3.4 System termination condition

A key problem in the BSP algorithm is determining the number of supersteps after which the system terminates. To facilitate the analysis, we introduce the fragmentation topology graph.

Definition 14 (Fragmentation topology graph) Given a fragmentation F over an RDF graph G, the corresponding fragmentation topology graph T is defined as follows: each node in T is a fragment F_i, i = 1, …, k; there is an edge between nodes F_i and F_j in T, 1 ≤ i ≠ j ≤ k, if and only if there is at least one crossing edge between F_i and F_j in the RDF graph G.

Let Dia(T) be the diameter of T. We need at most Dia(T) supersteps to transfer the local partial matches in one fragment F_i to any other fragment F_j. Hence, the number of supersteps in the BSP-based algorithm is at most Dia(T).
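Dia(T) can be computed with plain breadth-first search over the fragmentation topology graph; a minimal sketch, assuming T is connected and given as an adjacency dict.

```python
from collections import deque

def diameter(topology):
    """Diameter of a fragmentation topology graph T.

    topology: {fragment: set of neighbor fragments}. The BSP algorithm
    needs at most this many supersteps.
    """
    def eccentricity(src):
        # BFS distances from src to every other fragment
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in topology[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(f) for f in topology)
```
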

6 Handling general SPARQL

So far, we have only considered basic graph pattern (BGP) query evaluation. In this section, we discuss how to extend our method to general SPARQL queries involving UNION, OPTIONAL and FILTER statements.


Fig. 12 Example general SPARQL query with UNION, OPTIONAL and FILTER

A general SPARQL query and SPARQL query results canbe defined recursively based on BGP queries.

Definition 15 (General SPARQL query) Any BGP is a SPARQL query. If Q_1 and Q_2 are SPARQL queries, then the expressions (Q_1 AND Q_2), (Q_1 UNION Q_2), (Q_1 OPT Q_2) and (Q_1 FILTER F) are also SPARQL queries.

Figure 12 shows an example general SPARQL query with multiple operators, including UNION, OPTIONAL and FILTER. The set of all matches for Q is denoted as [[Q]].

Definition 16 (Match of general SPARQL query) Given an RDF graph G, the match set of a SPARQL query Q over G, denoted as [[Q]], is defined recursively as follows:

1. If Q is a BGP, [[Q]] is the set of matches defined in Definition 3 of Sect. 3.
2. If Q = Q_1 AND Q_2, then [[Q]] = [[Q_1]] ⋈ [[Q_2]].
3. If Q = Q_1 UNION Q_2, then [[Q]] = [[Q_1]] ∪ [[Q_2]].
4. If Q = Q_1 OPT Q_2, then [[Q]] = ([[Q_1]] ⋈ [[Q_2]]) ∪ ([[Q_1]] \ [[Q_2]]).
5. If Q = Q_1 FILTER F, then [[Q]] = Θ_F([[Q_1]]).
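Definition 16 can be turned into a small recursive evaluator over sets of solution mappings; a sketch under the usual compatible-mapping semantics, with `bgp_matches` standing in for the BGP evaluation of Sect. 3 (duplicate handling and result order are ignored).

```python
def evaluate(q, bgp_matches):
    """Recursive semantics of a general SPARQL query (sketch).

    A query is a tuple: ('BGP', name), ('AND', q1, q2), ('UNION', q1, q2),
    ('OPT', q1, q2) or ('FILTER', q1, predicate). Solution mappings are
    dicts; bgp_matches maps a BGP name to its precomputed match set.
    """
    def compatible(m1, m2):
        # two mappings are compatible if shared variables agree
        return all(m1[k] == m2[k] for k in m1.keys() & m2.keys())

    def join(s1, s2):
        return [{**a, **b} for a in s1 for b in s2 if compatible(a, b)]

    op = q[0]
    if op == 'BGP':
        return list(bgp_matches[q[1]])
    if op == 'AND':
        return join(evaluate(q[1], bgp_matches), evaluate(q[2], bgp_matches))
    if op == 'UNION':
        return evaluate(q[1], bgp_matches) + evaluate(q[2], bgp_matches)
    if op == 'OPT':
        s1, s2 = evaluate(q[1], bgp_matches), evaluate(q[2], bgp_matches)
        # joined part, plus mappings of s1 that join with nothing in s2
        left = [a for a in s1 if not any(compatible(a, b) for b in s2)]
        return join(s1, s2) + left
    if op == 'FILTER':
        return [m for m in evaluate(q[1], bgp_matches) if q[2](m)]
    raise ValueError(op)
```
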

We can parse each SPARQL query into a parse tree,^9 where the root is a pattern group. A pattern group specifies a SPARQL statement and consists of a BGP query with UNION, OPTIONAL and FILTER statements. The UNION and OPTIONAL statements may recursively contain multiple pattern groups. Each leaf node in the parse tree is a BGP query, whose evaluation was discussed earlier. We design a recursive algorithm (Algorithm 6) to handle UNION, OPTIONAL and FILTER. Specifically, we perform a left-outer join between the BGP results and the OPTIONAL query results (Lines 4–5 in Function RecursiveEvaluation). Then, we join the answer set with the UNION query results (Line 9 in Function RecursiveEvaluation). Finally, we evaluate the FILTER operator (Line 13) (Fig. 13).

Further optimizing general SPARQL evaluation is also possible (e.g., [4]); however, this issue is independent of the problem studied in this paper.

^9 We use ANTLR v3's grammar, which is an implementation of the SPARQL grammar specification. It is available at http://www.antlr3.org/grammar/1200929755392/.

Algorithm 6: Handling General SPARQL Queries
Input: a SPARQL query Q
Output: the result set RS of Q
1  Parse Q into a parse tree T
2  RS = RecursiveEvaluation(T)  // Call Function RecursiveEvaluation

Function RecursiveEvaluation(T)
1  Evaluate the BGP in T and put all its results into RS
2  for each subtree T′ in an OPTIONAL statement of T do
3    // Handling the OPTIONAL statement
4    RS′ = RecursiveEvaluation(T′)
5    RS = RS ⟕ RS′  /* left-outer join */
6  RS′ = ∅
7  for each subtree T′ of a pattern group in the UNIONs of T do
8    // Handling the UNION statement
9    RS′ = RS′ UNION RecursiveEvaluation(T′)
10 RS = RS ⋈ RS′
11 for each expression F in the FILTER operators do
12   // Handling the FILTER operator
13   Filter RS using expression F
14 Return RS

Fig. 13 Parse tree of example SPARQL query

7 Experiments

We evaluate our method over both real and synthetic RDF datasets and compare our approach with state-of-the-art distributed RDF systems, including a cloud-based approach (EAGRE [48]), two partition-based approaches (GraphPartition [22] and TripleGroup [28]), two memory-based systems (TriAD [18] and Trinity.RDF [47]) and two federated SPARQL query systems (FedX [42] and SPLENDID [16]). The results of the federated system comparisons are given in Appendix E since, as argued earlier, the environment targeted by these systems is different from ours.

7.1 Setting

We use two benchmark datasets of different sizes and one real dataset in our experiments, in addition to FedBench, which is used in the federated system experiments. Table 1 summarizes the statistics of these datasets. All sample queries are shown in Appendix B.


Table 1 Datasets

Dataset | Number of triples | RDF N3 file size (KB) | Number of entities
WatDiv 100M | 109,806,750 | 15,386,213 | 5,212,745
WatDiv 300M | 329,539,576 | 46,552,961 | 15,636,385
WatDiv 500M | 549,597,531 | 79,705,831 | 26,060,385
WatDiv 700M | 769,065,496 | 110,343,152 | 36,486,007
WatDiv 1B | 1,098,732,423 | 159,625,433 | 52,120,385
LUBM 1000 | 133,553,834 | 15,136,798 | 21,715,108
LUBM 10000 | 1,334,481,197 | 153,256,699 | 217,006,852
BTC | 1,056,184,911 | 238,970,296 | 183,835,054

1. WatDiv [2] is a benchmark that enables diversified stress testing of RDF data management systems. In WatDiv, instances of the same type can have different attribute sets. We generate datasets varying in size from 100 million to 1 billion triples. We use 20 queries from the basic testing templates provided by WatDiv [2] to evaluate our method. We randomly partition the WatDiv datasets into several fragments (except in Exp. 6, where we test different partitioning strategies): we assign each vertex v in the RDF graph to the ith fragment if H(v) MOD N = i, where H(v) is a hash function and N is the number of fragments. By default, we use a uniform hash function and N = 10. Each machine stores a single fragment.

2. LUBM [17] is a benchmark that adopts an ontology for the university domain and can generate synthetic OWL data scalable to an arbitrary size. We set the university number to 10,000; the resulting number of triples is about 1.33 billion. We partition the LUBM datasets according to the university identifiers. Although LUBM defines 14 queries, some of these are similar; therefore, we use the 7 benchmark queries that have been used in some recent studies [5,50]. We report the results over all 14 queries in Appendix B for completeness. As expected, the results over the 14 benchmark queries are similar to the results over the 7 queries.

3. BTC 2012 (http://km.aifb.kit.edu/projects/btc-2012/) is a real dataset that serves as the basis of submissions to the Billion Triples Track of the Semantic Web Challenge. After eliminating all redundant triples, this dataset contains about 1 billion triples. We use METIS to partition the RDF graph and use the 7 queries in [48].

4. FedBench [41] is used for testing against federated systems; it is described in Appendix E along with the results.
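The hash placement used for the WatDiv fragments can be sketched as follows; the setup above only requires some uniform hash H, so the md5-based digest here is an assumption chosen because it is deterministic across runs and machines (unlike Python's built-in `hash` for strings).

```python
import hashlib

def assign_fragment(vertex_id, n_fragments=10):
    """Assign a vertex to a fragment: i = H(v) MOD N.

    vertex_id: string identifier of the vertex (e.g., an IRI).
    Returns the fragment index in [0, n_fragments).
    """
    # deterministic uniform-ish hash of the vertex identifier
    digest = hashlib.md5(vertex_id.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_fragments
```

Each machine then stores the fragment whose index matches its own id, so vertex placement needs no coordination beyond agreeing on H and N.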

We conduct all experiments on a cluster of 10 machines running Linux, each of which has one CPU with four cores at 3.06 GHz, 16 GB of memory and 500 GB of disk storage. Each site holds one fragment of the dataset. At each site, we install gStore [50] to find inner matches, since it supports the graph-based SPARQL evaluation paradigm. We revise gStore to find all local partial matches in each fragment, as discussed in Sect. 4. All implementations are in standard C++. We use the MPICH-3.0.4 library for communication.

7.2 Exp 1: Evaluating each stage’s performance

In this experiment, we study the performance of our system at each stage (i.e., partial evaluation and assembly) for different queries on WatDiv 1B and LUBM 1000. We report the running time of each stage and the numbers of local partial matches, inner matches and crossing matches for the different query types in Tables 2 and 3. We also compare the centralized and distributed assembly strategies; the time for assembly includes the time for computing the optimal join order. Note that we classify SPARQL queries into four categories according to the structure of their query graphs: star, linear, snowflake (several stars linked by a path) and complex (a combination of the above with complex structure).

7.2.1 Partial evaluation

Tables 2 and 3 show that if a query contains some selective triple patterns,^10 its partial evaluation is much faster than that of the other queries. Our partial evaluation algorithm (Algorithm 1) is based on state transformation, and selective triple patterns reduce the search space. Furthermore, the running time also depends on the number of inner matches and local partial matches, as given in Tables 2 and 3: more inner matches and local partial matches lead to a higher running time in the partial evaluation stage.

7.2.2 Assembly

In this experiment, we compare the centralized and distributed assembly approaches. Obviously, there is no assembly process for a star query; thus, we only study the performance of linear, snowflake and complex queries. We find that distributed assembly beats centralized assembly when there are many local partial matches and crossing matches. The reason is as follows: in centralized assembly, all local partial matches need to be sent to the server, where they are assembled, so if there are many local partial matches, the server becomes the bottleneck. In distributed assembly, we can instead exploit parallelism to speed up both the network communication and the assembly. For example, for F3, there are 4,065,632 local partial matches; transferring them to the server and assembling them there takes a long time, so distributed assembly outperforms the centralized alternative. However, if the number of local partial matches and the number of crossing matches are small, the barrier

^10 A triple pattern t is a "selective triple pattern" if it has no more than 100 matches in RDF graph G.


Table 2 Evaluation of each stage on WatDiv 1B

Query | PE time (ms) | # of LPMs^b | # of IMs^c | Centralized asm. time (ms) | Distributed asm. time (ms) | # of CMs^d | PECA^e total (ms) | PEDA^f total (ms) | # of Matches^g | # of LPMFs^h | # of CMFs^i

Star:
S1√^a | 43,803 | 0 | 1 | 0 | 0 | 0 | 43,803 | 43,803 | 1 | 0 | 0
S2√ | 74,479 | 0 | 13,432 | 0 | 0 | 0 | 74,479 | 74,479 | 13,432 | 0 | 0
S3√ | 8087 | 0 | 13,335 | 0 | 0 | 0 | 8087 | 8087 | 13,335 | 0 | 0
S4√ | 16,520 | 0 | 2 | 0 | 0 | 0 | 16,520 | 16,520 | 1 | 0 | 0
S5√ | 1861 | 0 | 112 | 0 | 0 | 0 | 1861 | 1861 | 940 | 0 | 0
S6√ | 50,865 | 0 | 14 | 0 | 0 | 0 | 50,865 | 50,865 | 14 | 0 | 0
S7√ | 56,784 | 0 | 1 | 0 | 0 | 0 | 56,784 | 56,784 | 1 | 0 | 0

Linear:
L1√ | 15,340 | 2 | 0 | 1 | 16 | 1 | 15,341 | 15,356 | 1 | 2 | 2
L2√ | 1492 | 794 | 88 | 18 | 130 | 793 | 1510 | 1622 | 881 | 10 | 10
L3√ | 16,889 | 0 | 5 | 0 | 0 | 0 | 16,889 | 16,889 | 5 | 0 | 0
L4√ | 261 | 0 | 6005 | 0 | 0 | 0 | 261 | 261 | 6005 | 0 | 0
L5√ | 48,055 | 1274 | 141 | 572 | 1484 | 1273 | 48,627 | 49,539 | 1414 | 10 | 10

Snowflake:
F1√ | 64,699 | 29 | 1 | 9 | 49 | 14 | 64,708 | 64,748 | 15 | 10 | 10
F2√ | 203,968 | 2184 | 99 | 1598 | 3757 | 1092 | 205,566 | 207,725 | 1191 | 10 | 10
F3√ | 2,341,932 | 4,065,632 | 58 | 3,673,409 | 2,489,325 | 6200 | 6,015,341 | 4,831,257 | 6258 | 10 | 10
F4√ | 251,546 | 6909 | 0 | 13,693 | 8864 | 1808 | 265,239 | 260,410 | 1808 | 10 | 10
F5√ | 25,180 | 92 | 3 | 58 | 1028 | 46 | 25,238 | 26,208 | 49 | 10 | 10

Complex:
C1 | 206,864 | 161,803 | 4 | 9195 | 5265 | 356 | 216,059 | 212,129 | 360 | 10 | 10
C2 | 1,613,525 | 937,198 | 0 | 229,381 | 174,167 | 155 | 1,842,906 | 1,787,692 | 155 | 10 | 10
C3 | 123,349 | 0 | 80,997 | 0 | 0 | 0 | 123,349 | 123,349 | 80,997 | 0 | 0

a √ means that the query involves some selective triple patterns
b "# of LPMs" is the number of local partial matches
c "# of IMs" is the number of inner matches
d "# of CMs" is the number of crossing matches
e "PECA" is the abbreviation of Partial Evaluation & Centralized Assembly
f "PEDA" is the abbreviation of Partial Evaluation & Distributed Assembly
g "# of Matches" is the number of matches
h "# of LPMFs" is the number of fragments containing local partial matches
i "# of CMFs" is the number of fragments containing crossing matches

Table 3 Evaluation of each stage on LUBM 1000

Query | PE time (ms) | # of LPMs | # of IMs | Centralized asm. time (ms) | Distributed asm. time (ms) | # of CMs | PECA total (ms) | PEDA total (ms) | # of Matches | # of LPMFs | # of CMFs

Star:
Q2 | 1818 | 0 | 1,081,187 | 0 | 0 | 0 | 1818 | 1818 | 1,081,187 | 0 | 0
Q4√ | 82 | 0 | 10 | 0 | 0 | 0 | 82 | 82 | 10 | 0 | 0
Q5√ | 8 | 0 | 10 | 0 | 0 | 0 | 8 | 8 | 10 | 0 | 0

Snowflake:
Q6√ | 158 | 6707 | 110 | 164 | 125 | 15 | 322 | 283 | 125 | 10 | 10

Complex:
Q1 | 52,548 | 3033 | 2524 | 53 | 60 | 4 | 52,601 | 52,608 | 2528 | 10 | 10
Q3 | 920 | 3358 | 0 | 36 | 48 | 0 | 956 | 968 | 0 | 10 | 0
Q7 | 3945 | 167,621 | 42,479 | 211,670 | 35,856 | 1709 | 215,615 | 39,801 | 44,190 | 10 | 10

synchronization cost dominates the total cost in distributed assembly, and the advantage of distributed assembly is not clear. A quantitative comparison between the distributed and centralized assembly approaches needs more statistics about the network communication, CPU and other parameters; a sophisticated quantitative study is beyond the scope of this paper and is left as future work.

In Tables 2 and 3, we also show the number of fragments involved in each test query. For most queries, the local partial matches and crossing matches involve all fragments.


Queries containing selective triple patterns (L1 in WatDiv)may only involve a part of the fragmentation.

7.3 Exp 2: Evaluating optimizations in assembly

In this experiment, we use WatDiv 1B to evaluate two different optimization techniques in the assembly: the partitioning-based join strategy (Sect. 5.1) and the divide-and-conquer approach in distributed assembly (Sect. 5.3). If a query does not have any local partial matches in RDF graph G, it does not need the assembly process. Therefore, we only use the benchmark queries that need assembly (L1, L2, L5, F1, F2, F3, F4, F5, C1 and C2) in our experiments.

7.3.1 Partitioning-based join

First, we compare partitioning-based join (i.e., Algorithm 3) with naive join processing (i.e., Algorithm 2) in Table 4, which shows that the partitioning-based strategy greatly reduces the join cost. Second, we evaluate the effectiveness of our cost model. Note that the join order depends on the partitioning strategy, which is based on our cost model as discussed in Sect. 5.2.2; in other words, once the partitioning is given, the join order is fixed. So, we use the cost model to find the optimal partitioning and report the running time of the assembly process in Table 4. We find that assembly with the optimal partitioning is faster than assembly with a random partitioning, which confirms the effectiveness of our cost model. Especially for C2, assembly with the optimal partitioning is an order of magnitude faster than assembly with a random partitioning.

7.3.2 Divide-and-conquer in distributed assembly

Table 5 shows that dividing the search space speeds up distributed assembly. Otherwise, duplicate results are generated, as discussed in Sect. 5.3; eliminating these duplicates, together with parallelization, speeds up distributed assembly. For example, for C1, dividing the search space reduces the assembly time by more than a factor of two compared with not dividing it.

7.4 Exp 3: Scalability test

In this experiment, we vary the RDF dataset size from 100 million triples (WatDiv 100M) to 1 billion triples (WatDiv 1B) to study the scalability of our methods. Figures 14 and 15 show the performance of the different queries using centralized and distributed assembly, respectively.

Query response time is affected by both the increase in data size (1× → 10× in these experiments) and the query type. For star queries, the query response time increases proportionally to the data size, as shown in Figs. 14b and 15b. For the other query types, the query response time may grow faster than the data size. In particular, for F3, the query response time

Table 4 Running time of partitioning-based join versus naive join (in ms)

Query | Partitioning-based join, optimal partitioning | Partitioning-based join, random partitioning | Naive join
L1 | 1 | 1 | 1
L2 | 18 | 23 | 139
L5 | 572 | 622 | 3419
F1 | 1 | 1 | 1
F2 | 1598 | 2286 | 48,096
F3 | 3,673,409 | 4,005,409 | Timeout^a
F4 | 13,693 | 13,972 | Timeout
F5 | 58 | 80 | 8383
C1 | 9195 | 10,582 | Timeout
C2 | 229,381 | 4,083,181 | Timeout

a A timeout is issued if query evaluation does not terminate in 10 h

Table 5 Dividing versus no dividing the search space (in ms)

Query | Distributed assembly time, dividing | Distributed assembly time, no dividing
L1 | 16 | 19
L2 | 130 | 151
L5 | 1484 | 1684
F1 | 49 | 55
F2 | 3757 | 5481
F3 | 2,489,325 | 4,439,430
F4 | 8864 | 19,759
F5 | 1028 | 1267
C1 | 5265 | 12,194
C2 | 174,167 | 225,062

increases 30 times as the data size increases 10 times. This is because the complex query graph shape causes more complex operations in query processing, such as joining and assembly. However, even for complex queries, the query performance scales with the RDF graph size on the benchmark datasets.

Note that, as mentioned in Exp. 1, there is no assembly process for star queries, since matches of a star query cannot cross two fragments. Therefore, the query response times for star queries under centralized and distributed assembly are the same. In contrast, for the other query types, local partial matches and crossing matches cause differences between the performance of centralized and distributed assembly. Here, L3, L4 and C3 are special cases: although they are not star queries, they have few local partial matches, and their crossing match counts are 0 (Table 2). Therefore, the assembly times for L3, L4 and C3 are so small that the query response times under centralized and distributed assembly are almost the same.


Fig. 14  Scalability test of PECA: a star queries, b linear queries, c snowflake queries, d complex queries

Fig. 15  Scalability test of PEDA: a star queries, b linear queries, c snowflake queries, d complex queries

7.5 Exp 4: Intermediate result size and query performance versus query decomposition approaches

Table 6 compares the number of intermediate results in our method with two typical query decomposition approaches, i.e., GraphPartition and TripleGroup. We use the undirected 1-hop guarantee for GraphPartition and the 1-hop bidirection semantic hash partition for TripleGroup. The dataset is still WatDiv 1B.

A star query has no intermediate results, so every method can answer it locally at each fragment. Thus, all methods have the same response time, as given in Table 7 (S1–S7).

For other query types, both GraphPartition and TripleGroup need to decompose them into several star subqueries and find these subquery matches (in each fragment) as intermediate results. Neither GraphPartition nor TripleGroup

Table 6  Number of intermediate results of different approaches on different partitioning strategies

Query   PECA & PEDA   GraphPartition   TripleGroup
S1–S7   0             0                0
L1      2             249,571          249,598
L2      794           73,307           79,630
L3–L4   0             0                0
L5      1274          99,363           99,363
F1      29            76,228           15,702
F2      2184          501,146          1,119,881
F3      4,065,632     4,515,731        4,515,752
F4      6909          132,193          329,426
F5      92            2,500,773        9,000,762
C1      161,803       4,551,562        4,451,693
C2      937,198       1,457,156        2,368,405
C3      0             0                0

Table 7  Query response time of different approaches (in milliseconds)

Query   PECA        PEDA        GraphPartition   TripleGroup
S1      43,803      43,803      43,803           43,803
S2      74,479      74,479      74,479           74,479
S3      8087        8087        8087             8087
S4      16,520      16,520      16,520           16,520
S5      1861        1861        1861             1861
S6      50,865      50,865      50,865           50,865
S7      56,784      56,784      56,784           56,784
L1      15,341      15,776      40,840           39,570
L2      1510        1622        36,150           36,420
L3      16,889      16,889      16,889           16,889
L4      261         261         261              261
L5      48,627      49,539      57,550           57,480
F1      64,708      64,748      66,230           66,200
F2      205,566     207,725     240,700          248,180
F3      6,015,341   4,831,257   6,244,000        6,142,800
F4      265,239     260,410     340,540          340,600
F5      25,238      29,208      52,180           91,110
C1      216,059     212,129     216,720          223,670
C2      1,842,906   1,787,692   1,954,800        2,168,300
C3      123,349     123,349     123,349          123,349

distinguishes the star subquery matches that contribute to crossing matches from those that contribute to inner matches: all star subquery matches are involved in the assembly process. However, in our method, only local partial matches are involved in the assembly process, leading to lower communication cost and lower assembly computation cost. Therefore, the intermediate results that need to be assembled with others are smaller in our approach.

More intermediate results typically lead to more assembly time. Furthermore, both GraphPartition and TripleGroup employ MapReduce jobs for assembly, which take much more time than our method. Table 7 shows that our query response time is lower than that of the others.


Existing partition-based solutions, such as GraphPartition and TripleGroup, use MapReduce jobs to join intermediate results to find SPARQL matches. In order to evaluate the cost of MapReduce jobs, we perform the following experiments over WatDiv 100M. We revise join processing in both GraphPartition and TripleGroup by applying joins where intermediate results are sent to a central server using MPI. We use WatDiv 100M and only consider the benchmark queries that need join processing (L1, L2, L5, F1, F2, F3, F4, F5, C1 and C2) in our experiments. Moreover, all partition-based methods generate intermediate results and merge them at a central server, sharing the same framework as PECA, so we only compare them with PECA. The detailed results are given in Appendix C. Our technique is always faster regardless of whether MPI- or MapReduce-based join is used. This is because our method produces smaller intermediate result sets, and the MapReduce-based join dominates the query cost. Our partial evaluation process is more expensive in evaluating local queries than GraphPartition and TripleGroup in many cases. This is easy to understand: since the subquery structures in GraphPartition and TripleGroup are fixed, such as stars, it is cheaper to find these local query results than to find local partial matches. Our system generally outperforms GraphPartition and TripleGroup significantly if they use MapReduce-based join. Even when GraphPartition and TripleGroup use distributed joins, our system is still faster in most cases (8 out of 10 queries used in this experiment; see Appendix C for details).

7.6 Exp 5: Performance on RDF datasets with one billion triples

This experiment is a comparative evaluation of our method against GraphPartition, TripleGroup and EAGRE on three

very large RDF datasets with more than one billion triples: WatDiv 1B, LUBM 10000 and BTC. Figure 16 shows the performance of the different approaches.

Note that almost half of the queries (S1, S2, S3, S4, S5, S6, S7, L3, L4 and C3 in WatDiv; Q2, Q4 and Q5 in LUBM; Q1, Q2 and Q3 in BTC) have no intermediate results generated in any of the approaches. For these queries, the response times of our approaches and the partition-based approaches are the same. However, for other queries, the gap between our approach and the others is significant. For example, for L2 in WatDiv, Q3, Q6 and Q7 in LUBM and Q3, Q4, Q5 and Q6 in BTC, our approach outperforms the others by one or more orders of magnitude. We already explained the reasons for GraphPartition and TripleGroup in Exp 4; the reasons for EAGRE's performance follow.

EAGRE stores all triples as flat files in HDFS and answers SPARQL queries by scanning the files. Because HDFS does not provide fine-grained data access, a query can only be evaluated by a full scan of the files followed by a MapReduce job to join the intermediate results. Although EAGRE proposes some techniques to reduce I/O and data processing, it is still very costly. In contrast, we use graph matching to answer queries, which avoids scanning the whole dataset.

7.7 Exp 6: Impact of different partitioning strategies

In this experiment, we test the performance under three different partitioning strategies over WatDiv 100M. The impact of the different partitioning strategies is shown in Table 8. We implement three partitioning strategies: uniformly distributed hash partitioning, exponentially distributed hash partitioning, and minimum-cut graph partitioning.

The first partitioning strategy uniformly hashes a vertex v in RDF graph G to a fragment (machine). Thus, fragments on different machines have approximately the same size. The second strategy uses an exponentially distributed hash function with a rate parameter of 0.5: each vertex v has a probability of 0.5^k of being assigned to fragment (machine) k. This partitioning strategy results in skewed fragment sizes. Finally, we use a min-cut-based partitioning strategy (i.e., the METIS algorithm) to partition graph G.

Fig. 16  Online performance comparison: a WatDiv 1B, b LUBM 10000, c BTC

Table 8  Query response time under different partitioning strategies (in ms)

Query   Method   Uniform   Exponential   Min-cut
S1      PECA     4095      7472          3210
        PEDA     4095      7472          3210
S2      PECA     5910      5830          5053
        PEDA     5910      5830          5053
S3      PECA     869       2003          1098
        PEDA     869       2003          1098
S4      PECA     1506      1532          1525
        PEDA     1506      1532          1525
S5      PECA     208       384           255
        PEDA     208       384           255
S6      PECA     5153      5642          4145
        PEDA     5153      5642          4145
S7      PECA     5047      5720          4085
        PEDA     5047      5720          4085
L1      PECA     2301      4271          3162
        PEDA     2325      4296          3168
L2      PECA     271       502           261
        PEDA     339       505           297
L3      PECA     1115      2122          1334
        PEDA     1115      2122          1334
L4      PECA     37        54            27
        PEDA     37        54            27
L5      PECA     7741      6736          4984
        PEDA     7863      6946          5163
F1      PECA     5754      7889          4386
        PEDA     5768      7943          4415
F2      PECA     11,809    16,461        10,209
        PEDA     11,832    16,598        10,539
F3      PECA     246,277   155,064       122,539
        PEDA     163,642   115,214       103,618
F4      PECA     26,439    37,608        21,979
        PEDA     26,421    36,817        22,030
F5      PECA     11,630    16,433        8735
        PEDA     11,654    16,501        8262
C1      PECA     14,980    30,271        14,131
        PEDA     14,667    29,861        13,807
C2      PECA     147,962   105,926       36,038
        PEDA     147,406   104,084       35,220
C3      PECA     11,631    16,368        13,959
        PEDA     11,631    16,368        13,959

Bold values indicate fastest response time
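As a sketch, the two hash-based strategies can be expressed as follows. This is a hypothetical Python illustration: the paper does not specify the hash function, so an MD5-derived pseudo-uniform value stands in, and the function names are ours, not the system's.

```python
import hashlib
import math

def _pseudo_uniform(vertex: str) -> float:
    """Deterministic pseudo-uniform value in [0, 1) derived from a vertex label."""
    h = int(hashlib.md5(vertex.encode("utf-8")).hexdigest(), 16)
    return (h % 10**9) / 10**9

def uniform_fragment(vertex: str, num_fragments: int) -> int:
    """Uniform hash partitioning: every fragment receives roughly |V|/num_fragments vertices."""
    return int(_pseudo_uniform(vertex) * num_fragments)

def exponential_fragment(vertex: str, num_fragments: int) -> int:
    """Exponential partitioning with rate 0.5: fragment k receives a vertex with
    probability ~0.5^(k+1) (0-based k), producing deliberately skewed fragment sizes."""
    u = _pseudo_uniform(vertex)
    k = int(-math.log2(1.0 - u)) if u < 1.0 else num_fragments - 1
    return min(k, num_fragments - 1)
```

With 0-based fragment indices, roughly half of the vertices land in fragment 0 under the exponential scheme, a quarter in fragment 1, and so on, with the last fragment absorbing the tail.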

The minimum-cut partitioning strategy generally leads to fewer crossing edges than the other two. Thus, it beats the other two approaches in most cases, especially for complex queries (such as F and C category queries). For example, for C2, minimum-cut partitioning is faster than uniform partitioning by more than four times. For star queries (i.e., S category queries), since there exist no crossing matches, uniform partitioning and minimum-cut partitioning have similar performance. Sometimes, uniform partitioning is better, but the performance gap is very small. Due to the skew in fragment sizes, exponentially distributed hashing performs worse, in most cases, than uniformly distributed hashing.

Although our partial evaluation and assembly framework is agnostic to the particular partitioning strategy, it is clear that it works better when fragment sizes are balanced and the crossing edges are minimized. Many heuristic minimum-cut graph partitioning algorithms (a typical one is METIS [26]) satisfy these requirements.

7.8 Exp 7: Comparing with memory-based distributed RDF systems

We compare our approach (which is disk-based) against TriAD [18] and Trinity.RDF [47], which are memory-based distributed systems. To enable a fair comparison, we cache the whole RDF graph together with the corresponding index in memory. Experiments show that our system is faster than Trinity.RDF and TriAD on these benchmark queries. Results are given in Appendix D.

7.9 Exp 8: Comparing with federated SPARQL systems

In this experiment, we compare our methods with some federated SPARQL query systems, including FedX [42] and


Table 9  Comparison with centralized system (in ms)

Query   RDF-3X      PECA      PEDA
Q1      1,084,047   326,167   309,361
Q2      81,373      23,685    23,685
Q3      72,257      10,239    10,368
Q4      7           753       753
Q5      6           125       125
Q6      355         3388      1914
Q7      146,325     143,779   46,123

SPLENDID [16]. We evaluate our methods on the standardized benchmark for federated SPARQL query processing, FedBench [41]. Results are given in Appendix E.

7.10 Exp 9: Comparing with centralized RDF systems

In this experiment, we compare our method with RDF-3X on LUBM 10000. Table 9 shows the results.

Our method is generally faster than RDF-3X when the query graph is complex, as in Q1, Q2, Q3 and Q7. Since these queries do not contain selective triple patterns and the query graph structure is complex, the search space for these queries is very large. Our method can take advantage of parallel processing and reduce query response time significantly relative to a centralized system. If the queries (Q4, Q5 and Q6) contain selective triple patterns, the search space is small. The centralized system (RDF-3X) is faster than our method on these queries, since our approach incurs communication cost between different machines. These queries take less than 1–3 s in both RDF-3X and our distributed system. However, for the challenging queries (such as Q1, Q2, Q3 and Q7), our method outperforms RDF-3X significantly. For example, RDF-3X spends about 1000 s on Q1, while our approach only spends about 300 s. The performance advantage of our distributed system is clearer for these challenging queries.

8 Conclusion

In this paper, we propose a graph-based approach to distributed SPARQL query processing that adopts the partial evaluation and assembly approach. This is a two-step process. In the first step, we evaluate a query Q on each graph fragment in parallel to find local partial matches, which, intuitively, are the overlapping parts between crossing matches and a fragment. The second step is to assemble these local partial matches to compute crossing matches. Two different assembly strategies are proposed in this work: centralized assembly, where all local partial matches are sent to a single site, and distributed assembly, where the local partial matches are assembled at a number of sites in parallel.
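The two steps above can be sketched in Python under heavy simplification: here, local evaluation matches individual triple patterns per fragment and a coordinator joins bindings that agree on shared variables, whereas the paper's local partial matches are maximal partial subgraph matches with additional structural conditions. All names (`match_pattern`, `partial_evaluate`, `assemble`) are illustrative, not from the paper's implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def match_pattern(pattern, triples):
    """All variable bindings for one triple pattern against a fragment's triples."""
    out = []
    for t in triples:
        binding, ok = {}, True
        for q, v in zip(pattern, t):
            if q.startswith("?"):
                if binding.get(q, v) != v:
                    ok = False
                    break
                binding[q] = v
            elif q != v:
                ok = False
                break
        if ok:
            out.append(binding)
    return out

def partial_evaluate(query, fragment):
    """Step 1: evaluate every triple pattern of the query over one fragment."""
    return [(i, b) for i, pat in enumerate(query) for b in match_pattern(pat, fragment)]

def assemble(query, partials):
    """Step 2 (centralized): join per-pattern bindings that agree on shared variables."""
    results = [{}]
    for i in range(len(query)):
        nxt = []
        for binding in results:
            for j, b in partials:
                if j == i and all(binding.get(k, v) == v for k, v in b.items()):
                    nxt.append({**binding, **b})
        results = nxt
    return results

# Toy data: two fragments on different "sites", one crossing match between them.
fragments = [
    {("s1", "knows", "s2")},
    {("s2", "livesIn", "city1")},
]
query = [("?x", "knows", "?y"), ("?y", "livesIn", "?z")]
with ThreadPoolExecutor() as pool:  # fragments evaluated in parallel
    partials = [p for ps in pool.map(lambda f: partial_evaluate(query, f), fragments)
                for p in ps]
answers = assemble(query, partials)
# one crossing match: ?x=s1, ?y=s2, ?z=city1
```

The join in `assemble` mirrors the centralized strategy; the distributed strategy would instead exchange and merge partial bindings among the sites themselves before any site reports complete matches.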

The main benefits of our method are twofold. First, our solution is partition-agnostic, as opposed to existing partition-based methods, each of which depends on a particular RDF graph partitioning strategy that may be infeasible to enforce in certain circumstances. Our method is, therefore, much more flexible. Second, compared with other partition-based methods, the number of vertices and edges involved in the intermediate results is minimized in our method, which is proven theoretically and demonstrated experimentally.

There are a number of extensions we are currently working on. An important one is handling SPARQL queries over linked open data (LOD). We can treat the interconnected RDF repositories (in LOD) as a virtually integrated distributed database. Some RDF repositories provide SPARQL endpoints while others may not have query capability. Therefore, data at these sites need to be moved for processing, which will affect the algorithm and cost functions. Furthermore, multiple SPARQL query optimization in the context of distributed RDF graphs is also ongoing work. In real applications, queries issued at the same time commonly overlap. Thus, there is much room for sharing computation when executing these queries. This observation motivates us to revisit the classical problem of multi-query optimization in the context of distributed RDF graphs.

References

1. Abadi, D.J., Marcus, A., Madden, S., Hollenbach, K.: SW-Store: a vertically partitioned DBMS for semantic web data management. VLDB J. 18(2), 385–406 (2009)

2. Aluç, G., Hartig, O., Özsu, M.T., Daudjee, K.: Diversified stress testing of RDF data management systems. In: Proceedings of 13th International Semantic Web Conference, pp 197–212 (2014)

3. Astrahan, M.M., Blasgen, H.W., Chamberlin, D.D., Eswaran, K.P., Gray, J.N., Griffiths, P.P., King, W.F., Lorie, R.A., Mehl, J.W., Putzolu, G.R., Traiger, I.L., Wade, B.W., Watson, V.: System R: relational approach to database management. ACM Trans. Database Syst. 1, 97–137 (1976)

4. Atre, M.: Left Bit Right: for SPARQL join queries with OPTIONAL patterns (left-outer-joins). In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp 1793–1808 (2015)

5. Atre, M., Chaoji, V., Zaki, M.J., Hendler, J.A.: Matrix "bit" loaded: a scalable lightweight join query processor for RDF data. In: Proceedings of 19th International World Wide Web Conference, pp 41–50 (2010)

6. Buneman, P., Cong, G., Fan, W., Kementsietsidis, A.: Using partial evaluation in distributed query evaluation. In: Proceedings of 32nd International Conference on Very Large Data Bases, pp 211–222 (2006)

7. Cong, G., Fan, W., Kementsietsidis, A.: Distributed query evaluation with performance guarantees. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp 509–520 (2007)

8. Cong, G., Fan, W., Kementsietsidis, A., Li, J., Liu, X.: Partial evaluation for distributed XPath query processing and beyond. ACM Trans. Database Syst. 37(4), 32 (2012)

9. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT Press, Cambridge (2009)

10. Downey, R.G., Fellows, M.R., Vardy, A., Whittle, G.: The parametrized complexity of some fundamental problems in coding theory. SIAM J. Comput. 29(2), 545–570 (1999)

11. Dyer, M.E., Greenhill, C.S.: The complexity of counting graph homomorphisms. Random Struct. Algorithms 17(3–4), 260–289 (2000)

12. Fan, W., Li, J., Ma, S., Tang, N., Wu, Y., Wu, Y.: Graph pattern matching: from intractable to polynomial time. Proc. VLDB Endow. 3(1), 264–275 (2010)

13. Fan, W., Wang, X., Wu, Y.: Performance guarantees for distributed reachability queries. Proc. VLDB Endow. 5(11), 1304–1315 (2012)

14. Fan, W., Wang, X., Wu, Y., Deng, D.: Distributed graph simulation: impossibility and possibility. Proc. VLDB Endow. 7(12), 1083–1094 (2014)

15. Galarraga, L., Hose, K., Schenkel, R.: Partout: a distributed engine for efficient RDF processing. In: Proceedings of 23rd International World Wide Web Conference (Companion Volume), pp 267–268 (2014)

16. Görlitz, O., Staab, S.: SPLENDID: SPARQL endpoint federation exploiting VOID descriptions. In: Proceedings of ISWC 2011 Workshop on Consuming Linked Data (2011)

17. Guo, Y., Pan, Z., Heflin, J.: LUBM: a benchmark for OWL knowledge base systems. J. Web Semant. 3(2–3), 158–182 (2005)

18. Gurajada, S., Seufert, S., Miliaraki, I., Theobald, M.: TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp 289–300 (2014)

19. Harth, A., Hose, K., Karnstedt, M., Polleres, A., Sattler, K., Umbrich, J.: Data summaries for on-demand queries over linked data. In: Proceedings of 19th International World Wide Web Conference, pp 411–420 (2010)

20. Hartig, O., Özsu, M.T.: Linked data query processing (Tutorial). In: Proceedings of 30th International Conference on Data Engineering, pp 1286–1289 (2014)

21. Hose, K., Schenkel, R.: WARP: workload-aware replication and partitioning for RDF. In: Proceedings of Workshops of 29th International Conference on Data Engineering, pp 1–6 (2013)

22. Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL querying of large RDF graphs. Proc. VLDB Endow. 4(11), 1123–1134 (2011)

23. Husain, M.F., McGlothlin, J.P., Masud, M.M., Khan, L.R., Thuraisingham, B.M.: Heuristics-based query processing for large RDF graphs using cloud computing. IEEE Trans. Knowl. Data Eng. 23(9), 1312–1327 (2011)

24. Jones, N.D.: An introduction to partial evaluation. ACM Comput. Surv. 28(3), 480–503 (1996)

25. Kaoudi, Z., Manolescu, I.: RDF in the clouds: a survey. VLDB J. 24(1), 67–91 (2015)

26. Karypis, G., Kumar, V.: Analysis of multilevel graph partitioning. In: Proceedings of ACM/IEEE Conference on Supercomputing, Article No. 29 (1995)

27. Khadilkar, V., Kantarcioglu, M., Thuraisingham, B.M., Castagna, P.: Jena-HBase: a distributed, scalable and efficient RDF triple store. In: Proceedings of International Semantic Web Conference Posters & Demos Track (2012)

28. Lee, K., Liu, L.: Scaling queries over big RDF graphs with semantic hash partitioning. Proc. VLDB Endow. 6(14), 1894–1905 (2013)

29. Lee, K., Liu, L., Tang, Y., Zhang, Q., Zhou, Y.: Efficient and customizable data partitioning framework for distributed big RDF data processing in the cloud. In: Proceedings of IEEE 6th International Conference on Cloud Computing, pp 327–334 (2013)

30. Li, F., Ooi, B.C., Özsu, M.T., Wu, S.: Distributed data management using MapReduce. ACM Comput. Surv. 46(3), 31 (2014)

31. Ma, S., Cao, Y., Huai, J., Wo, T.: Distributed graph pattern matching. In: Proceedings of 21st International World Wide Web Conference, pp 949–958 (2012)

32. Neumann, T., Weikum, G.: RDF-3X: a RISC-style engine for RDF. Proc. VLDB Endow. 1(1), 647–659 (2008)

33. Papailiou, N., Konstantinou, I., Tsoumakos, D., Koziris, N.: H2RDF: adaptive query processing on RDF data in the cloud. In: Proceedings of 21st International World Wide Web Conference (Companion Volume), pp 397–400 (2012)

34. Papailiou, N., Tsoumakos, D., Konstantinou, I., Karras, P., Koziris, N.: H2RDF+: an efficient data management system for big RDF graphs. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp 909–912 (2014)

35. Pérez, J., Arenas, M., Gutierrez, C.: Semantics and complexity of SPARQL. ACM Trans. Database Syst. 34(3) (2009)

36. Quilitz, B., Leser, U.: Querying distributed RDF data sources with SPARQL. In: Proceedings of 5th European Semantic Web Conference, pp 524–538 (2008)

37. Rohloff, K., Schantz, R.E.: High-performance, massively scalable distributed systems using the MapReduce software framework: the SHARD triple-store. In: Proceedings of International Workshop on Programming Support Innovations for Emerging Distributed Applications, Article No. 4 (2010)

38. Saleem, M., Ngomo, A.N.: HiBISCuS: hypergraph-based source selection for SPARQL endpoint federation. In: Proceedings of 11th Extended Semantic Web Conference, pp 176–191 (2014)

39. Saleem, M., Padmanabhuni, S.S., Ngomo, A.N., Iqbal, A., Almeida, J.S., Decker, S., Deus, H.F.: TopFed: TCGA tailored federated query processing and linking to LOD. J. Biomed. Semant. 5, 47 (2014)

40. Schmachtenberg, M., Bizer, C., Paulheim, H.: Adoption of best data practices in different topical domains. In: Proceedings of 13th International Semantic Web Conference, pp 245–260 (2014)

41. Schmidt, M., Görlitz, O., Haase, P., Ladwig, G., Schwarte, A., Tran, T.: FedBench: a benchmark suite for federated semantic data query processing. In: Proceedings of 10th International Semantic Web Conference, pp 585–600 (2011)

42. Schwarte, A., Haase, P., Hose, K., Schenkel, R., Schmidt, M.: FedX: optimization techniques for federated query processing on linked data. In: Proceedings of 10th International Semantic Web Conference, pp 601–616 (2011)

43. Shang, H., Zhang, Y., Lin, X., Yu, J.X.: Taming verification hardness: an efficient algorithm for testing subgraph isomorphism. Proc. VLDB Endow. 1(1), 364–375 (2008)

44. Shao, B., Wang, H., Li, Y.: Trinity: a distributed graph engine on a memory cloud. In: Proceedings of ACM SIGMOD International Conference on Management of Data, pp 505–516 (2013)

45. Valiant, L.G.: A bridging model for parallel computation. Commun. ACM 33(8), 103–111 (1990)

46. Wang, L., Xiao, Y., Shao, B., Wang, H.: How to partition a billion-node graph. In: Proceedings of 30th International Conference on Data Engineering, pp 568–579 (2014)

47. Zeng, K., Yang, J., Wang, H., Shao, B., Wang, Z.: A distributed graph engine for web scale RDF data. Proc. VLDB Endow. 6(4), 265–276 (2013)

48. Zhang, X., Chen, L., Tong, Y., Wang, M.: EAGRE: towards scalable I/O efficient SPARQL query evaluation on the cloud. In: Proceedings of 29th International Conference on Data Engineering, pp 565–576 (2013)

49. Zhang, X., Chen, L., Wang, M.: Towards efficient join processing over large RDF graph using MapReduce. In: Proceedings of 24th International Conference on Scientific and Statistical Database Management, pp 250–259 (2012)

50. Zou, L., Özsu, M.T., Chen, L., Shen, X., Huang, R., Zhao, D.: gStore: a graph-based SPARQL query engine. VLDB J. 23(4), 565–590 (2014)
