Querying Federations of Triple Pattern Fragments
Ruben Verborgh
Tutorial
Linked Data Fragments
Triple Pattern Fragments
Federated querying
Linked Data Fragments
A whole spectrum of trade-offs exists between the two extremes.
[Figure: spectrum of interfaces offered by the server, from data dump through Linked Data documents to SPARQL endpoint. Data dump side: low server cost, high client cost, high availability, high bandwidth, out-of-date data. SPARQL endpoint side: high server cost, low client cost, low availability, low bandwidth, live data.]
All RDF interfaces offer fragments with the following characteristics.
data: What triples does it contain?
metadata: What do we know about it?
controls: How to access more data?
Each type of Linked Data Fragment is defined by three characteristics.
Example: data dump
data: all dataset triples
metadata: number of triples, file size
controls: (none)
Each type of Linked Data Fragment is defined by three characteristics.
Example: SPARQL query result
data: triples matching the query
metadata: (none)
controls: (none)
Triple Pattern Fragments
We design new mixes of trade-offs with much lower server-side cost.
[Figure: the same spectrum of interfaces (data dump, Linked Data documents, SPARQL query results), now with Triple Pattern Fragments added as an interface that combines low server cost, high availability, and live data.]
A Triple Pattern Fragments interface is low-cost and enables clients to query.
data: matches of a triple pattern (paged)
metadata: total number of matches
controls: access to all other fragments
[Screenshot of a fragment, annotated: data (first 100), metadata (total count), controls (other fragments).]
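The three characteristics can be written down as a small data structure. The sketch below is only an illustration of the slides' classification; the class name `FragmentType` and its field layout are assumptions, not part of any Linked Data Fragments library.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FragmentType:
    """One kind of Linked Data Fragment, described by data, metadata, controls."""
    name: str
    data: str
    metadata: List[str] = field(default_factory=list)  # empty list means "(none)"
    controls: List[str] = field(default_factory=list)  # empty list means "(none)"


# The three fragment types as characterised on the slides.
DATA_DUMP = FragmentType(
    name="data dump",
    data="all dataset triples",
    metadata=["number of triples", "file size"],
)
SPARQL_RESULT = FragmentType(
    name="SPARQL query result",
    data="triples matching the query",
)
TRIPLE_PATTERN_FRAGMENT = FragmentType(
    name="Triple Pattern Fragment",
    data="matches of a triple pattern (paged)",
    metadata=["total number of matches"],
    controls=["access to all other fragments"],
)

for fragment in (DATA_DUMP, SPARQL_RESULT, TRIPLE_PATTERN_FRAGMENT):
    print(f"{fragment.name}: data={fragment.data!r}, "
          f"metadata={fragment.metadata or '(none)'}, "
          f"controls={fragment.controls or '(none)'}")
```

Of the three, only the Triple Pattern Fragment carries both metadata and controls, which is what the client relies on in the slides that follow.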
How can intelligent clients solve SPARQL queries over fragments?
Give them a SPARQL query. Give them a URL of any dataset fragment.
They look inside the fragment to see how to access the dataset and use the metadata to decide how to plan the query.
Let’s follow the execution of an example SPARQL query.
SELECT ?artist ?name WHERE {
  ?artist a dbpedia-owl:Artist;
          rdfs:label ?name;
          dbpedia-owl:birthPlace dbpedia:Padua.
  FILTER LANGMATCHES(LANG(?name), "EN")
}
Find names of artists born in Padua, Italy.
The client looks inside the fragment to see how to access the dataset.
Fragment: http://fragments.dbpedia.org/2014/en
<http://fragments.dbpedia.org/2014/en#dataset> hydra:search [
  hydra:template "http://fragments.dbpedia.org/2014/en{?subject,predicate,object}";
  hydra:mapping [ hydra:variable "subject";   hydra:property rdf:subject ],
                [ hydra:variable "predicate"; hydra:property rdf:predicate ],
                [ hydra:variable "object";    hydra:property rdf:object ]
].
“I can query the dataset by triple pattern.”
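The hydra:template above is a URI template with a form-style query expansion: each bound variable becomes a query parameter, and unbound ones are left out. A minimal sketch of that expansion, using only the Python standard library, could look as follows. The function name `fragment_url` is hypothetical, and the full IRIs in the usage example assume the usual DBpedia expansions of the `dbpedia-owl:` and `dbpedia:` prefixes.

```python
from typing import Optional
from urllib.parse import quote, urlencode


def fragment_url(base: str,
                 subject: Optional[str] = None,
                 predicate: Optional[str] = None,
                 obj: Optional[str] = None) -> str:
    """Expand base{?subject,predicate,object} for the given triple pattern.

    Components left as None are treated as variables and omitted from the URL.
    """
    params = {"subject": subject, "predicate": predicate, "object": obj}
    bound = {name: value for name, value in params.items() if value is not None}
    if not bound:
        return base
    # quote_via=quote percent-encodes reserved characters such as ":" and "/".
    return base + "?" + urlencode(bound, quote_via=quote)


# The pattern "?artist dbpedia-owl:birthPlace dbpedia:Padua" from the example query.
url = fragment_url(
    "http://fragments.dbpedia.org/2014/en",
    predicate="http://dbpedia.org/ontology/birthPlace",
    obj="http://dbpedia.org/resource/Padua",
)
print(url)
```

A real client would read the template and the variable mapping from the fragment itself rather than hard-coding the parameter names as this sketch does.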
The client splits the query into the available fragments.
The client gets the fragments and inspects their metadata.
?artist a dbpedia-owl:Artist. (first 100 triples; total count: 96,000)
?artist rdfs:label ?name. (first 100 triples; total count: 12,000,000)
?artist dbont:birthPlace dbpedia:Padua. (first 100 triples; total count: 135)
The metadata enables the client to choose the right starting point.
?artist a dbpedia-owl:Artist. (96,000 matches)
?artist rdfs:label ?name. (12,000,000 matches)
?artist dbont:birthPlace dbpedia:Padua. (135 matches)
  dbpedia:Alberto_Benettin dbont:birthPlace dbpedia:Padua.
  dbpedia:Alberto_Bigon dbont:birthPlace dbpedia:Padua.
The client starts from the smallest fragment and, for each of its matches, requests the remaining patterns with ?artist filled in:
  dbp:Alberto_Benettin a dbont:Artist.
  dbp:Alberto_Benettin rdfs:label ?name.
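The strategy on these slides (ask for every pattern, read the counts, start from the smallest fragment, then recurse with the bindings filled in) can be sketched in a few lines. This is a simplified illustration and not the actual client implementation: `fetch_fragment` is a hypothetical stand-in that filters an in-memory list instead of doing HTTP requests, the toy triples are made up for the example, paging is ignored, and FILTER is not handled.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[str, str, str]   # terms starting with "?" are variables
Bindings = Dict[str, str]

# Toy stand-in for the remote dataset (made-up triples, not real DBpedia data).
DATASET: List[Triple] = [
    ("dbp:Alberto_Benettin", "rdf:type", "dbont:Artist"),
    ("dbp:Alberto_Benettin", "rdfs:label", "Alberto Benettin"),
    ("dbp:Alberto_Benettin", "dbont:birthPlace", "dbp:Padua"),
    ("dbp:Alberto_Bigon", "rdfs:label", "Alberto Bigon"),
    ("dbp:Alberto_Bigon", "dbont:birthPlace", "dbp:Padua"),
    ("dbp:Other_Artist_1", "rdf:type", "dbont:Artist"),
    ("dbp:Other_Artist_1", "rdfs:label", "Other Artist 1"),
    ("dbp:Other_Artist_2", "rdf:type", "dbont:Artist"),
    ("dbp:Other_Artist_2", "rdfs:label", "Other Artist 2"),
]


def is_var(term: str) -> bool:
    return term.startswith("?")


def fetch_fragment(pattern: Pattern) -> Tuple[List[Triple], int]:
    """Simulate one fragment request: matching triples plus the count metadata."""
    matches = [t for t in DATASET
               if all(is_var(p) or p == v for p, v in zip(pattern, t))]
    return matches, len(matches)


def bind(pattern: Pattern, bindings: Bindings) -> Pattern:
    """Replace already-bound variables in a pattern by their values."""
    s, p, o = (bindings.get(term, term) for term in pattern)
    return (s, p, o)


def evaluate(patterns: List[Pattern],
             bindings: Optional[Bindings] = None) -> Iterator[Bindings]:
    """Yield all solutions, always expanding the pattern with the lowest count."""
    bindings = bindings or {}
    if not patterns:
        yield dict(bindings)
        return
    bound = [bind(p, bindings) for p in patterns]
    fragments = [fetch_fragment(p) for p in bound]
    # Use the count metadata to choose the starting point.
    best = min(range(len(bound)), key=lambda i: fragments[i][1])
    matches, _count = fragments[best]
    rest = patterns[:best] + patterns[best + 1:]
    for triple in matches:
        extended = dict(bindings)
        for term, value in zip(bound[best], triple):
            if is_var(term):
                extended[term] = value
        yield from evaluate(rest, extended)


QUERY: List[Pattern] = [
    ("?artist", "rdf:type", "dbont:Artist"),
    ("?artist", "rdfs:label", "?name"),
    ("?artist", "dbont:birthPlace", "dbp:Padua"),
]
solutions = list(evaluate(QUERY))
print(solutions)
```

In this toy dataset the birthPlace pattern has the fewest matches, so evaluation starts there, just as in the slides; the candidate without an Artist triple is dropped as soon as its type pattern comes back empty.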
Clients execute the query in 3 seconds on a highly available, low-cost server.
Try it yourself: bit.ly/artistspadua
[Excerpt from "Querying Datasets on the Web with High Availability", p. 13. The figures compare Virtuoso 6, Virtuoso 7, Fuseki–tdb, Fuseki–hdt, and triple pattern fragments against the number of clients.]
Fig. 3.1: Server performance (log-log plot); throughput (q/hr)
Fig. 3.2: Server network traffic; data sent (MB)
Fig. 3.3: Query timeouts; number of timeouts
Fig. 3.4: Cache network traffic; data sent (MB)
Fig. 3.5: Server processor usage per core; CPU use (%)
Fig. 3.6: Server memory usage; RAM use (GB)
Fig. 3.7: Client processor usage per core; CPU use (%)
Fig. 3.8: Client memory usage; RAM use (GB)
Fig. 3.9: Query 3 execution time (log-log plot); average time (s)
Fig. 3.10: Query 3 response time (log-log plot); average time (s)
Fig. 3.2 shows the outbound network traffic on the server with an increasing number of clients. This traffic is substantially higher with triple pattern fragment servers, because clients need to ask for several responses to evaluate a single query. The cache ensures that responses to identical requests are reused; Fig. 3.4 indeed shows that caching is far more effective with triple pattern fragments.
Some of the BSBM queries execute slowly on triple pattern fragments clients, especially those queries that strongly depend on operators such as FILTER, which in a triple-pattern-based interface can only be evaluated on the client. The execution times of these queries exceed the timeout limit of 60 s (Fig. 3.3). Therefore, we separately study BSBM query template 3 (finding products that satisfy 2 numerical inequalities and an OPTIONAL clause), which is one of the templates whose queries cause few timeouts. Note how its execution time (Fig. 3.9) for triple pattern fragments starts high, but only increases very gradually, whereas the execution time on SPARQL endpoints rises very rapidly. Furthermore, the response times increase more slowly with increased load (Fig. 3.10). Only on the triple pattern fragments server, CPU usage remains low for this query at all times.
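Why identical requests are so common with triple pattern fragments can be illustrated with a toy cache keyed by fragment URL. This is only a sketch of the idea: the class `FragmentCache`, the function `fake_origin`, and the shortened URLs are all hypothetical, and a real deployment would rely on a standard HTTP cache in front of the server rather than on code like this.

```python
from typing import Callable, Dict


class FragmentCache:
    """Toy cache: one stored response per fragment URL."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def get(self, url: str) -> str:
        if url in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[url] = self._fetch(url)  # only misses reach the server
        return self._store[url]


def fake_origin(url: str) -> str:
    """Stand-in for the fragments server."""
    return f"<fragment for {url}>"


cache = FragmentCache(fake_origin)

# Two different queries that happen to share the pattern "?s rdfs:label ?o".
query_a = ["/en?predicate=rdfs%3Alabel", "/en?predicate=dbont%3AbirthPlace"]
query_b = ["/en?predicate=rdfs%3Alabel", "/en?predicate=rdf%3Atype"]

for url in query_a + query_b:
    cache.get(url)

print(cache.hits, cache.misses)
```

Different SPARQL queries rarely share a full query string, but they often share individual triple patterns, so the second query above can reuse a fragment that the first one already fetched.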
The query throughput is lower, but resilient to high client numbers.
executed SPARQL queries per hour
The server traffic is higher, but requests are significantly lighter.
data sent by server in MB
Caching is significantly more effective, as clients reuse fragments for queries.
data sent by cache in MB
The server uses much less CPU, allowing for higher availability.
server CPU usage per core
Federated querying
Federated querying is native to Triple Pattern Fragment clients.
Every query is decomposed locally.
Clients send simple requests to a server.
For clients, it doesn’t matter which server they send queries to.
For federation, we just send queries to multiple servers.
No prior source selection.
Each triple pattern is sent to all servers.
If a certain pattern has no result, just don’t send more specific patterns.
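The federation strategy described above can be sketched as follows: every pattern goes to every server, results are concatenated, and a server is skipped for a pattern once a more general pattern has come back empty from it. This is an illustrative sketch under stated assumptions, not the TPF client's code: the servers are in-memory lists, and the names `SOURCES`, `request`, `generalizes`, and `FederatedClient` are hypothetical. Patterns are assumed not to repeat a variable.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[str, str, str]   # terms starting with "?" are variables

# Hypothetical in-memory stand-ins for two Triple Pattern Fragments servers.
SOURCES: Dict[str, List[Triple]] = {
    "serverA": [("ex:alice", "foaf:name", "Alice"),
                ("ex:alice", "foaf:knows", "ex:bob")],
    "serverB": [("ex:bob", "foaf:name", "Bob")],
}

REQUESTS: List[Tuple[str, Pattern]] = []   # log of simulated HTTP requests


def is_var(term: str) -> bool:
    return term.startswith("?")


def request(server: str, pattern: Pattern) -> List[Triple]:
    """Simulate fetching the fragment for `pattern` from one server."""
    REQUESTS.append((server, pattern))
    return [t for t in SOURCES[server]
            if all(is_var(p) or p == v for p, v in zip(pattern, t))]


def generalizes(general: Pattern, specific: Pattern) -> bool:
    """True if every triple matching `specific` also matches `general`."""
    return all(is_var(g) or g == s for g, s in zip(general, specific))


class FederatedClient:
    def __init__(self, servers: Iterable[str]) -> None:
        self.servers = list(servers)
        # Per server: patterns that are known to have no results there.
        self.empty: Dict[str, List[Pattern]] = {s: [] for s in self.servers}

    def match(self, pattern: Pattern) -> List[Triple]:
        results: List[Triple] = []
        for server in self.servers:
            # A more general pattern was empty here, so this one must be too.
            if any(generalizes(e, pattern) for e in self.empty[server]):
                continue
            triples = request(server, pattern)
            if not triples:
                self.empty[server].append(pattern)
            results.extend(triples)
        return results


client = FederatedClient(SOURCES)
knows = client.match(("?s", "foaf:knows", "?o"))          # asks both servers
bob_knows = client.match(("ex:bob", "foaf:knows", "?o"))  # skips serverB
names = client.match(("?s", "foaf:name", "?o"))           # asks both servers
print(len(REQUESTS))
```

No source selection happens up front; the only pruning comes from remembering which patterns were empty on which server, which is why the second call issues one request instead of two.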
Federation compares pretty well to SPARQL endpoint federation.
Experiment […]: Querying federated knowledge graphs
With this final experiment, we aim to validate ??: “”. Thus, we study whether the evaluation of SPARQL queries over a federation of TPF interfaces is practically feasible. To this end, we measure result completeness and response times on the public Web. As a result, we aim to discover in which directions further improvements are possible.
Experimental setup
We implemented the extension from ?? in the TPF client, which now accepts a set of TPF interfaces and a SPARQL query. We chose the popular FedBench benchmark […], enabling us to compare against other approaches. The datasets were first cleaned through the LOD Laundromat service […]. Then, each dataset was published through a TPF API on a dedicated Amazon EC2 machine ([…] virtual CPUs, […] GB RAM) located in the US, each running an HTTP cache.
As a query mix, we selected the Linked Data (LD), Life Science (LS), and Cross Domain (CD) queries from FedBench, appended with the complex queries (C) by Montoya et al. […] to gain deeper insights. The complete query mix was run […] times in sequence on the public Web, accessed from a desktop computer in Belgium in order to represent realistic long-distance latency. The timeout per query was set to […] minutes.
We compare our measurements with numbers reported by Castillo et al. […], who tested the following SPARQL endpoint federation systems: ANAPSID […], ANAPSID EG (ANAPSID using Exclusive Groups), FedX […] (with a warmed-up cache), and SPLENDID […]. Castillo et al. obtained their measurements with one client and […] SPARQL endpoints on different machines in a fast local network. We opted to perform our tests on a public network in order to validate the federated TPF solution within the context of the Web. Measuring recall and execution time on the Web gives an indication of the practical feasibility of real-world TPF-based federation, and a comparison with measurements of the state of the art on a closed network positions TPF relative to an ideal baseline without connection delays.
[Table: recall per FedBench query (LD, LS, CD, and C groups) for TPF, ANAPSID, ANAPSID EG, FedX (warm), and SPLENDID, followed by summary rows counting queries per recall level; the numeric values are not recoverable.]
Table […]: Recall of FedBench query execution on the TPF client/server setup compared to SPARQL endpoint federation systems (timeout of […] s). All occurrences of incomplete recall are highlighted. The TPF-related measurements were performed in the context of this article; the numbers for the other four systems are adopted from […].
FedBench
recall
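For readers unfamiliar with the metric: recall is the fraction of the expected results that a system actually returned. The sketch below uses the standard set-based definition, which is an assumption here, since the excerpt does not spell out how recall was computed; the function name and the example bindings are made up.

```python
from typing import Hashable, Iterable


def recall(returned: Iterable[Hashable], expected: Iterable[Hashable]) -> float:
    """Fraction of the expected results that appear among the returned ones."""
    expected_set = set(expected)
    if not expected_set:
        return 1.0  # nothing was expected, so nothing is missing
    return len(set(returned) & expected_set) / len(expected_set)


# A query that timed out after delivering two of its four expected results.
expected_results = {"r1", "r2", "r3", "r4"}
partial_results = {"r1", "r2"}
print(recall(partial_results, expected_results))
```

A recall of 1.0 means the result set is complete; a query that times out or ends prematurely shows up as a value below 1.0.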
Complex queries
Federation compares pretty well, even time-wise in some cases.
[Figure: bar charts of execution time (s) per FedBench query (LD, CD, LS, and C groups) for TPF, ANAPSID, ANAPSID EG, FedX, and SPLENDID.]
Figure […]: Evaluation times of FedBench query execution on the TPF client/server setup compared to SPARQL endpoint federation systems (timeout of […] s). These measurements should be considered together with the recall for each query (Table […]). The TPF-related measurements were performed in the context of this article; the numbers for the four SPARQL endpoint federation systems are adopted from […].
[…]tive results. This execution time is comparable to that of ANAPSID and SPLENDID, and even approximates FedX. A likely explanation is the presence of highly selective triple patterns in the query, for which the simplicity of TPF requests seems to compensate for the overhead of query planning in other systems. For other queries (two CD and two LD queries), the TPF client performs worse than ANAPSID and SPLENDID, but remains within comparable bounds. A probable cause is the presence of patterns like ?x rdf:type ?y, which can be answered by all data sources. Thus, other systems clearly benefit from prior source selection, which the current TPF client does not use.
For the remaining regular FedBench queries, the TPF client distinctly reveals its limitations. These queries either time out (two LS queries) or execute significantly slower (two CD queries, one LD query, and one LS query) than SPARQL endpoint federation systems. They contain common predicates like owl:sameAs or foaf:name that trigger requests to all interfaces. Additionally, a high number of subject joins causes inefficiencies, since they potentially produce many “membership requests” to all sources, checking whether a triple is present or not. A possible enhancement is to include metadata that prevents this, at the expense of a more costly server-side interface […]. Another cause is the presence of a FILTER statement (in one LS query), which is currently executed client-side. This indicates room for interface extensions for such clauses […].
The limitations of TPF become more apparent with the C queries, […] of which time out and another […] end prematurely (as discussed above). The high number of produced HTTP requests ([…] on average), caused by the many triple patterns, contributes significantly to this delay. SPLENDID, FedX, and ANAPSID show similar results, but fail on different queries. For the queries where TPF reaches complete recall (two C queries), the total execution time is comparable and even outperforms ANAPSID. These findings, measured on the public Web, motivate a more in-depth study of complex queries for TPF, to discover possible client or server enhancements.
Acknowledgments
The research activities in this article were funded by Ghent University, iMinds, the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), and the European Union. R. Verborgh is a postdoctoral fellow of the Research Foundation Flanders.
FedBench
Federation compares pretty well, even time-wise in some cases.
LD� LD� LD� LD� LD� LD� LD� LD� LD� LD�� LD�� CD� CD� CD� CD� CD� CD� CD�
50
100
exec
utio
nti
me
(s) ��� ��� ��� ���
LS� LS� LS� LS� LS� LS� LS� C� C� C� C� C� C� C� C� C� C��0
50
100
150
200
250
300
exec
utio
nti
me
(s)
TPF ANAPSID ANAPSID EG FedX SPLENDID
Figure �: Evaluation times of FedBench query execution on the TPF client/server setup compared to SPARQL endpoint federationsystems (timeout of ���s). These measurements should be considered together with the recall for each query (Table �). TheTPF-related measurements were performed in the context of this article; the numbers for the four SPARQL endpoint federationsystems are adopted from [��].
tive results. This execution time is comparable tothat of ANAPSID and SPLENDID, and even approx-imates FedX. A likely explanation is the presenceof highly selective triple patterns in the query,for which the simplicity of TPF requests seems tocompensate the overhead of query planning inother systems. For other queries (CD�, CD�, LD�,LD�), the TPF client performs worse than ANAP-SID and SPLENDID, but remains within compa-rable bounds. A probable cause is the presenceof patterns like ?x rdf:type ?y, which can be an-swered by all data sources. Thus, other systemsclearly bene�t from prior source selection, whichthe current TPF client does not use.
For the remaining regular FedBench queries, the TPF client distinctly reveals its limitations. These queries (LS�, LS�) time out, or execute significantly slower (CD�, CD�, LD�, LS�) than SPARQL endpoint federation systems. They contain common predicates like owl:sameAs or foaf:name that trigger requests to all interfaces. Additionally, a high number of subject joins causes inefficiencies, since they potentially produce many “membership requests” to all sources, checking whether a triple is present or not. A possible enhancement is to include metadata that prevents this, at the expense of a more costly server-side interface [��]. Another cause is the presence of a FILTER statement (LS�), which is currently executed client-side. This indicates room for interface extensions for such clauses [��].
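The cost of membership requests in a subject join can be made concrete with a small simulation. The following Python sketch is an illustration under simplifying assumptions, not the actual client algorithm: `Source` is a hypothetical in-memory stand-in for a TPF interface that counts the requests it receives, paging is ignored, and the data is invented.

```python
from typing import Optional

Triple = tuple[str, str, str]
Pattern = tuple[Optional[str], Optional[str], Optional[str]]  # None = variable


class Source:
    """In-memory stand-in for one TPF interface that counts its requests."""

    def __init__(self, triples: set[Triple]) -> None:
        self.triples = triples
        self.requests = 0

    def fragment(self, pattern: Pattern) -> list[Triple]:
        """Return all triples matching the pattern (one request, no paging)."""
        self.requests += 1
        return [t for t in self.triples
                if all(p is None or p == v for v, p in zip(t, pattern))]


def subject_join(sources: list[Source], first: Pattern, second: Pattern) -> list[str]:
    """Return subjects matching both patterns (both have a variable subject)."""
    # Step 1: ask every source for the first pattern and collect the subjects.
    subjects = {t[0] for s in sources for t in s.fragment(first)}
    results = []
    for subject in sorted(subjects):
        bound = (subject, second[1], second[2])
        # Step 2: one membership request per subject to every source.
        # The list is built in full so that no source is skipped.
        if any([s.fragment(bound) for s in sources]):
            results.append(subject)
    return results


a = Source({("ex:a", "owl:sameAs", "ex:a2"), ("ex:b", "owl:sameAs", "ex:b2")})
b = Source({("ex:a", "rdf:type", "ex:Drug"), ("ex:c", "owl:sameAs", "ex:c2")})

found = subject_join([a, b], (None, "owl:sameAs", None), (None, "rdf:type", "ex:Drug"))
print(found)                    # subjects that satisfy both patterns
print(a.requests + b.requests)  # total requests across the federation
```

With 2 sources and 3 distinct subjects from the first pattern, this naive strategy issues 2 × (1 + 3) = 8 requests to find a single result, so the request count grows with the product of bindings and sources.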
The limitations of TPF become more apparent with the C queries, � of which time out and another � end prematurely (as discussed above). The high number of produced HTTP requests (�,��� on average), caused by the many triple patterns, contributes significantly to this delay. SPLENDID, FedX, and ANAPSID show similar results, but fail on different queries. For the queries where TPF reaches complete recall (C�, C�), the total execution time is comparable and even outperforms ANAPSID. These findings, measured on the public Web, motivate a more in-depth study of complex queries for TPF, to discover possible client or server enhancements.
Acknowledgments
The research activities in this article were funded by Ghent University, iMinds, the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), and the European Union. R. Verborgh is a postdoctoral fellow of the Research Foundation Flanders.
Complex queries
Note the different setup in the previous comparisons.
SPARQL endpoint federation was measured with local servers.
Triple Pattern Fragments federation was measured over the Web.
Linked Data Fragments
Triple Pattern Fragments
Federated querying
Querying Federations of Triple Pattern Fragments
Triple Pattern Fragments are easy: all software is available as open source.
github.com/LinkedDataFragments
linkeddatafragments.org
Software
Documentation and specification
More than 650,000 TPF interfaces are available for federated querying.
fragments.dbpedia.org
lodlaundromat.org/wardrobe/
data.linkeddatafragments.org
tutorial.linkeddatafragments.org
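As a rough sketch of what addressing several interfaces with one triple pattern could look like, the Python snippet below builds one fragment URL per interface. The query parameter names `subject`, `predicate`, and `object` are an assumption made for illustration; a client is expected to discover the actual request form through the hypermedia controls that each fragment contains. The base URLs are hypothetical placeholders, not real dataset addresses.

```python
from typing import Optional
from urllib.parse import urlencode


def fragment_url(base: str, subject: Optional[str] = None,
                 predicate: Optional[str] = None,
                 obj: Optional[str] = None) -> str:
    """Build the URL of the fragment for a triple pattern.

    Parameter names are assumed; unbound positions (None) are left out.
    """
    bound = (("subject", subject), ("predicate", predicate), ("object", obj))
    params = {name: value for name, value in bound if value is not None}
    return f"{base}?{urlencode(params)}" if params else base


# Hypothetical federation of two interfaces.
interfaces = ["http://example.org/dataset-a", "http://example.org/dataset-b"]

# The same pattern (?s foaf:name ?o) is sent to every interface.
for base in interfaces:
    print(fragment_url(base, predicate="http://xmlns.com/foaf/0.1/name"))
# first line printed:
# http://example.org/dataset-a?predicate=http%3A%2F%2Fxmlns.com%2Ffoaf%2F0.1%2Fname
```

Sending each pattern to every interface is the simplest federation strategy, and also the one that makes common predicates expensive.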