querying federations of triple pattern fragments

Querying Federations of Triple Pattern Fragments

Ruben Verborgh

Tutorial

Linked Data Fragments

Triple Pattern Fragments

Federated querying


A whole spectrum of trade-offs exists between the two extremes.

high server costlow server cost

datadump

SPARQLendpoint

interface offered by the server

high availability low availabilityhigh bandwidth low bandwidthout-of-date data live data

low client costhigh client cost

Linked Datadocuments

data

metadata

controls

What triples does it contain?

What do we know about it?

How to access more data?

All RDF interfaces offer fragments with the following characteristics.

all dataset triples

(none)

data dump

number of triples, file size

data

metadata

controls

Each type of Linked Data Fragment is defined by three characteristics.

triples matching the query

(none)

(none)

SPARQL query resultdata

metadata

controls

Each type of Linked Data Fragment is defined by three characteristics.



Federated querying


We design new mixes of trade-offs with much lower server-side cost.

high server costlow server cost

datadump

SPARQLquery results

high availability low availabilityhigh bandwidth low bandwidthout-of-date data live data

low client costhigh client cost


low server cost

datadump

SPARQLquery results

high availabilitylive data


Triple PatternFragments

A Triple Pattern Fragments interfaceis low-cost and enables clients to query.

matches of a triple pattern

total number of matches

access to all other fragments

data

metadata

controls

(paged)

A Triple Pattern Fragments interfaceis low-cost and enables clients to query.

data (first 100)

controls (other fragments)

metadata (total count)

Give them a SPARQL query.Give them a URL of any dataset fragment.

How can intelligent clientssolve SPARQL queries over fragments?

They look inside the fragmentto see how to access the dataset

and use the metadatato decide how to plan the query.

Let’s follow the execution of an example SPARQL query.

SELECT ?artist ?name WHERE { ?artist a dbpedia-owl:Artist; rdfs:label ?name; dbpedia-owl:birthPlace dbpedia:Padua. FILTER LANGMATCHES(LANG(?name), "EN") }

Find names of artists born in Padua, Italy.

Fragment: http://fragments.dbpedia.org/2014/en

The client looks inside the fragment to see how to access the dataset.

<http://fragments.dbpedia.org/2014/en#dataset> hydra:search [ hydra:template "http://fragments.dbpedia.org/2014/en {?subject,predicate,object}"; hydra:mapping [ hydra:variable "subject"; hydra:property rdf:subject ], [ hydra:variable "predicate"; hydra:property rdf:predicate ], [ hydra:variable "object"; hydra:property rdf:object ] ].

Fragment: http://fragments.dbpedia.org/2014/en

“I can query the dataset by triple pattern.”

The client splits the queryinto the available fragments.


The client gets the fragments and inspects their metadata.

?artist a dbpedia-owl:Artist.first 100 triples

96.000

?artist rdfs:label ?name.first 100 triples

12.000.000

?artist dbont:birthPlace dbpedia:Padua.first 100 triples

135

?artist a dbpedia-owl:Artist. 96.000

?artist rdfs:label ?name. 12.000.000

?artist dbont:birthPlace dbpedia:Padua.dbpedia:Alberto_Benettin dbont:birthPlace dbpedia:Padua.

135

dbpedia:Alberto_Bigon dbont:birthPlace dbpedia:Padua.

The metadata enables the client to choose the right starting point.

dbp:Alberto_Benettin a dbont:Artist.

dbp:Alberto_Benettin rdfs:label ?name.

Clients execute the query in 3 secondson a highly available, low-cost server.


Try it yourself: bit.ly/artistspadua

Querying Datasets on the Web with High Availability 13

1 10 100

10

100

100010000

clients

throughput

(q/hr)

Virtuoso 6 Virtuoso 7

Fuseki–tdb Fuseki–hdttriple pattern fragments

Fig. 3.1: Server performance (log-log plot)

1 10 1000

2

4

clients

data

sent

(mb)

Fig. 3.2: Server network traffic

1 10 1000

50

100

150

200

clients

#tim

eouts

Fig. 3.3: Query timeouts

1 10 1000

10

20

clients

sent

(mb)

Fig. 3.4: Cache network traffic

1 10 1000

50

100

clients

cpu

use

(%

)

Fig. 3.5: Server processor usage per core

1 10 1000

2

4

6

8

clients

ram

use

(gb)

Fig. 3.6: Server memory usage

1 10 1000

50

100

clients

cpu

use

(%

)

Fig. 3.7: Client processor usage per core

1 10 10000.20.40.60.81

clients

ram

use

(gb)

Fig. 3.8: Client memory usage

1 10 1000

0.01

0.1

1

10

clients

avg.tim

e(s)

Fig. 3.9: Query 3 execution time (log-log plot)

1 10 1000

0.01

0.1

1

10

clients

avg.tim

e(s)

Fig. 3.10: Query 3 response time (log-log plot)

Fig. 3.2 shows the outbound network traffic on the server with an increasingnumber of clients. This traffic is substantially higher with triple pattern fragmentservers, because clients need to ask for several responses to evaluate a singlequery. The cache ensures that responses to identical requests are reused; Fig. 3.4indeed shows that caching is far more effective with triple pattern fragments.

Some of the bsbm queries execute slowly on triple pattern fragments clients,especially those queries that strongly depend on operators such as FILTER, whichin a triple-pattern-based interface can only be evaluated on the client. The exe-cution times of these queries exceed the timeout limit of 60s (Fig. 3.3). There-fore, we separately study bsbm query template 3 (finding products that satisfy2 numerical inequalities and an OPTIONAL clause), which is one of the templateswhose queries cause few timeouts. Note how its execution time (Fig. 3.9) fortriple pattern fragments starts high, but only increases very gradually, whereasthe execution time on sparql endpoints rises very rapidly. Furthermore, theresponse times increase more slowly with increased load (Fig. 3.10). Only on thetriple pattern fragments server, cpu usage remains low for this query at all times.

The query throughput is lower, but resilient to high client numbers.

executed SPARQL queries per hour

The server traffic is higher,but requests are significantly lighter.


1 10 100

10

100

100010000

clients

throughput

(q/hr)




1 10 1000

2

4

clients

data

sent

(mb)


1 10 1000

50

100

150

200

clients

#tim

eouts


1 10 1000

10

20

clients

sent

(mb)


1 10 1000

50

100

clients

cpu

use

(%

)


1 10 1000

2

4

6

8

clients

ram

use

(gb)


1 10 1000

50

100

clients

cpu

use

(%

)


1 10 10000.20.40.60.81

clients

ram

use

(gb)


1 10 1000

0.01

0.1

1

10

clients

avg.tim

e(s)


1 10 1000

0.01

0.1

1

10

clients

avg.tim

e(s)




data sent by server in MB

Caching is significantly more effective,as clients reuse fragments for queries.


1 10 100

10

100

100010000

clients

throughput

(q/hr)




1 10 1000

2

4

clients

data

sent

(mb)


1 10 1000

50

100

150

200

clients

#tim

eouts


1 10 1000

10

20

clients

sent

(mb)


1 10 1000

50

100

clients

cpu

use

(%

)


1 10 1000

2

4

6

8

clients

ram

use

(gb)


1 10 1000

50

100

clients

cpu

use

(%

)


1 10 10000.20.40.60.81

clients

ram

use

(gb)


1 10 1000

0.01

0.1

1

10

clients

avg.tim

e(s)


1 10 1000

0.01

0.1

1

10

clients

avg.tim

e(s)




data sent by cache in MB

The server uses much less CPU,allowing for higher availability.

server CPU usage per core


1 10 100

10

100

100010000

clients

throughput

(q/hr)




1 10 1000

2

4

clients

data

sent

(mb)


1 10 1000

50

100

150

200

clients

#tim

eouts


1 10 1000

10

20

clients

sent

(mb)


1 10 1000

50

100

clients

cpu

use

(%

)


1 10 1000

2

4

6

8

clients

ram

use

(gb)


1 10 1000

50

100

clients

cpu

use

(%

)


1 10 10000.20.40.60.81

clients

ram

use

(gb)


1 10 1000

0.01

0.1

1

10

clients

avg.tim

e(s)


1 10 1000

0.01

0.1

1

10

clients

avg.tim

e(s)






Federated querying


Federated querying is native to Triple Pattern Fragment clients.

Every query is decomposed locally.

Clients send simple requests to a server.

For clients, it doesn’t matterwhich server they send queries to.

For federation, we just send queriesto multiple servers.

No prior source selection.

Each triple pattern is sent to all servers.

If a certain pattern has no result,just don’t send more specific patterns.

Federation compares pretty well to SPARQL endpoint federation.

�.�. Experiment �: Querying federated knowledgegraphs

With this �nal experiment, we aim to validate??: “” Thus, we study whether the evaluation ofSPARQL queries over a federation of TPF inter-faces is practically feasible. To this end, we mea-sure result completeness and response times onthe public Web. As a result, we aim to discover inwhich directions further improvements are pos-sible.

�.�.�. Experimental setupWe implemented the extension from ?? in the

TPF client, which now accepts a set of TPF inter-faces and a SPARQL query. We chose the popularFedBench benchmark [��], enabling us to com-pare against other approaches. The datasets�

were �rst cleaned through the LOD Laundromatservice [��]. Then, each dataset was publishedthrough a TPF API on a dedicated Amazon EC�machine (� virtual CPUs, �.� GB RAM) locatedin the US, each running an HTTP cache.

As a query-mix, we selected the Linked Data(LD), Life Science (LS), and Cross Domain (CD)queries from FedBench, appended with the com-plex queries (C) by Montoya et al. [��] to gaindeeper insights. The complete query-mix wasran �� times in sequence on the public Web, ac-cessed from a desktop computer in Belgium inorder to represent realistic long-distance latency.The timeout per query was set to � minutes.

We compare our measurements with numbersreported by Castillo et al. [��], who tested thefollowing SPARQL endpoint federation systems:ANAPSID [��], ANAPSID EG (ANAPSID using Ex-clusive Groups), FedX [��] (with a warmed-upcache), and SPLENDID [��]. Castillo et al. ob-tained their measurements with one client and�� SPARQL endpoints on di�erent machines ina fast local network. We opted to perform ourtests on a public network in order to validate thefederated TPF solution within the context of theWeb. Measuring recall and execution time onthe Web gives an indication of the practical fea-sibility of real-world TPF-based federation, anda comparison with measurements of the state-of-the-art on a closed network positions TPF relativeto an ideal baseline without connection delays.

TPF

ANAP

SID

ANAP

SID

EG

FedX

(war

m)

SPLE

ND

ID

LD� �.�� .�� .�� .�� .��LD� �.�� .�� .�� .�� .��LD� �.�� .�� .�� .�� .��LD� �.�� .�� .�� .�� .��LD� �.�� .�� .�� .�� .��LD� �.�� .�� .�� .�� .��LD� �.�� .�� .�� .�� .��LD� �.�� .�� .�� .�� .��LD� �.�� .�� .�� .�� .��LD�� .�� .�� .�� .�� .��LD�� .�� .�� .�� .�� .��LS� �.�� .�� .�� .�� .��LS� �.�� .�� .�� .�� .��LS� �.�� .�� .�� .�� .��LS� �.�� .�� .�� .�� .��LS� �.�� .�� .�� .�� .��LS� �.�� .�� .�� .�� .��LS� �.�� .�� .�� .�� .��CD� �.�� .�� .�� .�� .��CD� �.�� .�� .�� .�� .��CD� �.�� .�� .�� .�� .��CD� �.�� .�� .�� .�� .��CD� �.�� .�� .�� .�� .��CD� �.�� .�� .�� .�� .��CD� �.�� .�� .�� .�� .��C� �.�� .�� .�� .�� .��C� �.�� .�� .�� .�� .��C� �.�� .�� .�� .�� .��C� �.�� .�� .�� .�� .��C� �.�� .�� .�� .�� .��C� �.�� .�� .�� .�� .��C� �.�� .�� .�� .�� .��C� �.�� .�� .�� .�� .��C� �.�� .�� .�� .�� .��C�� .�� .�� .�� .�� .��

# queries= �.�� .�� .�� > �.��

Table �: Recall of FedBench query execution on the TPFclient/server setup compared to SPARQL endpoint federa-tion systems (timeout of ��s). All occurrences of incompleterecall are highlighted. The TPF-related measurements wereperformed in the context of this article; the numbers for theother four systems are adopted from [��].

�

FedBench

recall

Federation compares pretty well to SPARQL endpoint federation.








TPF

ANAP

SID

ANAP

SID

EG

FedX

(war

m)

SPLE

ND

ID


# queries= �.�� .�� .�� > �.��


�

recall

Complex queries








TPF

ANAP

SID

ANAP

SID

EG

FedX

(war

m)

SPLE

ND

ID


# queries= �.�� .�� .�� > �.��


�

Federation compares pretty well, even time-wise in some cases.

LD� LD� LD� LD� LD� LD� LD� LD� LD� LD�� LD�� CD� CD� CD� CD� CD� CD� CD�

50

100

exec

utio

nti

me

(s) ��

LS� LS� LS� LS� LS� LS� LS� C� C� C� C� C� C� C� C� C� C��0

50

100

150

200

250

300

exec

utio

nti

me

(s)

TPF ANAPSID ANAPSID EG FedX SPLENDID

Figure �: Evaluation times of FedBench query execution on the TPF client/server setup compared to SPARQL endpoint federationsystems (timeout of ��s). These measurements should be considered together with the recall for each query (Table �). TheTPF-related measurements were performed in the context of this article; the numbers for the four SPARQL endpoint federationsystems are adopted from [��].

tive results. This execution time is comparable tothat of ANAPSID and SPLENDID, and even approx-imates FedX. A likely explanation is the presenceof highly selective triple patterns in the query,for which the simplicity of TPF requests seems tocompensate the overhead of query planning inother systems. For other queries (CD�, CD�, LD�,LD�), the TPF client performs worse than ANAP-SID and SPLENDID, but remains within compa-rable bounds. A probable cause is the presenceof patterns like ?x rdf:type ?y, which can be an-swered by all data sources. Thus, other systemsclearly bene�t from prior source selection, whichthe current TPF client does not use.

For the remaining regular FedBench queries,the TPF client distinctly reveals its limitations.These queries (LS�, LS�) time out, or executesigni�cantly slower (CD�, CD�, LD�, LS�) thanSPARQL endpoint federation systems. Theycontain common predicates like owl:sameAs orfoaf:name that trigger requests to all interfaces.Additionally, a high number of subject joinscauses ine�ciencies, since they potentially pro-duce many “membership requests” to all sources,checking whether a triple is present or not. Anpossible enhancement is to include metadatathat prevents this, at the expense of a morecostly server-side interface [��]. Another cause

is the presence of a FILTER statement (LS�),which is currently executed client-side. This in-dicates room for interface extensions for suchclauses [��].

The limitations of TPF become more apparentwith the C queries, � of which time out and an-other � end prematurely (as discussed above).The high number of produced HTTP requests(�,�� on average) caused by the many triplepatterns, contributes signi�cantly to this delay.SPLENDID, FedX, and ANAPSID show similar re-sults, but fail on di�erent queries. For the querieswhere TPF reaches complete recall (C�, C�), thetotal execution time is comparable and even out-performs ANAPSID. These �ndings, measured onthe public Web, motivate a more in-depth studyof complex queries for TPF, to discover possibleclient or server enhancements.

AcknowledgmentsThe research activities in this article were

funded by Ghent University, iMinds, the Institutefor the Promotion of Innovation by Science andTechnology in Flanders (IWT), and the EuropeanUnion. R. Verborgh is a postdoctoral fellow of theResearch Foundation Flanders.

[�] O. Erling, I. Mikhailov, Virtuoso: RDF support in a na-tive RDBMS, in: Semantic Web Information Manage-

�


50

100

exec

utio

nti

me

(s) ��


50

100

150

200

250

300

exec

utio

nti

me

(s)










�

FedBench

Federation compares pretty well, even time-wise in some cases.


50

100

exec

utio

nti

me

(s) ��


50

100

150

200

250

300

exec

utio

nti

me

(s)










�


50

100

exec

utio

nti

me

(s) ��


50

100

150

200

250

300

exec

utio

nti

me

(s)










�


50

100

exec

utio

nti

me

(s) ��


50

100

150

200

250

300

exec

utio

nti

me

(s)










�

Complex queries

Note the different setupin the previous comparisons.

SPARQL endpoint federation was measured with local servers.

Triple Pattern Fragments federationwas measured over the Web.



Federated querying


Triple Pattern Fragments are easy:all software is available as open source.

github.com/LinkedDataFragments

linkeddatafragments.org

Software

Documentation and specification

More than 650.000 TPF interfaces are available for federated querying.

fragments.dbpedia.org

lodlaundromat.org/wardrobe/

data.linkeddatafragments.org

tutorial.linkeddatafragments.org

querying federations of triple pattern fragments

Internet