semantic web 0 (0) 1 ios press publishing and archiving planned … · julian rojasa, david...

19
Semantic Web 0 (0) 1 1 IOS Press 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 15 15 16 16 17 17 18 18 19 19 20 20 21 21 22 22 23 23 24 24 25 25 26 26 27 27 28 28 29 29 30 30 31 31 32 32 33 33 34 34 35 35 36 36 37 37 38 38 39 39 40 40 41 41 42 42 43 43 44 44 45 45 46 46 47 47 48 48 49 49 50 50 51 51 Publishing and archiving planned and live public transport events with the Linked Connections framework Julian Rojas a , David Chaves-Fraga b , Pieter Colpaert a , Oscar Corcho b and Ruben Verborgh a a IDLab, Department of Electronics and Information Systems, Ghent University-imec, Belgium b Ontology Engineering Group, Universidad Politécnica de Madrid, Spain E-mails: [email protected], dchaves@fi.upm.es, [email protected], ocorcho@fi.upm.es, [email protected] Abstract. Using Linked Data based approaches, companies and institutions are seeking ways to automate the adoption of Open Datasets. In the transport domain, data about planned events, live updates and historical data have to coexist to provide reliable data to route planning assistants. Linked Connections (LC) introduces a preliminary specification that allows cost-efficient pub- lishing of the raw public transport data in linked information resources. This paper gives an overview of Linked Connections so far and supports claims with existing and novel experiments. Furthermore, (i) an extension of the current Linked Connections specification providing methods and vocabulary to deal with live data is provided; (ii) a Linked Connections Live server is de- veloped that is able to process GTFS-RT feeds providing consistent identifiers; and (iii) an efficient management of historical data taking into account the size of each fragments exposed on the Web is described. We discover that the size of the fragments has a relevant impact on the performance of query evaluation. Based on our experiments conducted in 2018, an ideal Linked Connections fragment – for the use case of route planning with a client developed for this work – weighs about 50kb. This research scratches the surface on a Web ecosystem for route planning. In future works, we envision to find optimal fragmentation strategies of larger public transit networks for automated federated route planning. Keywords: Linked Connections, Historical Data, Real time data, Reliable data, Linked Data Fragments 1. Introduction In the current state of the Web of Data, a wide amount of that data are exposed following the prin- ciples of Linked Data [1]. Giving unique identifiers to each resource, representing the data using a shared and common vocabulary of a domain or the possibil- ity of dereferencing each URI are some of the rele- vant aspects that made Linked Data as one the most common approaches to organize and expose the data on the Web [2]. This features allow third parties to query the data in a standard way, using for example, the corresponding query language for RDF, SPARQL [3], and the possibility of doing federation across multiple datasets [4]. However, today, a lot of domains need to deal with live and historical data where is important to maintain stable identifiers that remain valid over time. The problems regarding the manage of these types of data and the relation with Linked Data have not widely researched. One of the most relevant domains where contributions will have a relevant impact is the trans- port domain, where a complex environment with mul- tiple types of data sources have to be managed to pro- vide reliable data to information systems. Since March 2017, one of the main motivations for developing solutions about multimodal and integrated travel information services is the publication of the new directive by the EU Commission about discover- ability and access to public transport data across Eu- rope 1 . This document proposes the making of public 1 https://ec.europa.eu/transport/sites/transport/files/ 2017-sustainable-urban-mobility-policy-context.pdf 1570-0844/0-1900/$35.00 c 0 – IOS Press and the authors. All rights reserved

Upload: others

Post on 28-Jun-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

Semantic Web 0 (0) 1 1IOS Press

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Publishing and archiving planned and livepublic transport events with theLinked Connections frameworkJulian Rojas a David Chaves-Fraga b Pieter Colpaert a Oscar Corcho b and Ruben Verborgh a

a IDLab Department of Electronics and Information Systems Ghent University-imec Belgiumb Ontology Engineering Group Universidad Politeacutecnica de Madrid SpainE-mails julianandresrojasmelendezugentbe dchavesfiupmes pietercolpaertugentbeocorchofiupmes rubenverborghugentbe

Abstract Using Linked Data based approaches companies and institutions are seeking ways to automate the adoption of OpenDatasets In the transport domain data about planned events live updates and historical data have to coexist to provide reliabledata to route planning assistants Linked Connections (LC) introduces a preliminary specification that allows cost-efficient pub-lishing of the raw public transport data in linked information resources This paper gives an overview of Linked Connections sofar and supports claims with existing and novel experiments Furthermore (i) an extension of the current Linked Connectionsspecification providing methods and vocabulary to deal with live data is provided (ii) a Linked Connections Live server is de-veloped that is able to process GTFS-RT feeds providing consistent identifiers and (iii) an efficient management of historicaldata taking into account the size of each fragments exposed on the Web is described We discover that the size of the fragmentshas a relevant impact on the performance of query evaluation Based on our experiments conducted in 2018 an ideal LinkedConnections fragment ndash for the use case of route planning with a client developed for this work ndash weighs about 50kb Thisresearch scratches the surface on a Web ecosystem for route planning In future works we envision to find optimal fragmentationstrategies of larger public transit networks for automated federated route planning

Keywords Linked Connections Historical Data Real time data Reliable data Linked Data Fragments

1 Introduction

In the current state of the Web of Data a wideamount of that data are exposed following the prin-ciples of Linked Data [1] Giving unique identifiersto each resource representing the data using a sharedand common vocabulary of a domain or the possibil-ity of dereferencing each URI are some of the rele-vant aspects that made Linked Data as one the mostcommon approaches to organize and expose the dataon the Web [2] This features allow third parties toquery the data in a standard way using for example thecorresponding query language for RDF SPARQL [3]and the possibility of doing federation across multipledatasets [4] However today a lot of domains need todeal with live and historical data where is important tomaintain stable identifiers that remain valid over time

The problems regarding the manage of these types ofdata and the relation with Linked Data have not widelyresearched One of the most relevant domains wherecontributions will have a relevant impact is the trans-port domain where a complex environment with mul-tiple types of data sources have to be managed to pro-vide reliable data to information systems

Since March 2017 one of the main motivations fordeveloping solutions about multimodal and integratedtravel information services is the publication of thenew directive by the EU Commission about discover-ability and access to public transport data across Eu-rope1 This document proposes the making of public

1httpseceuropaeutransportsitestransportfiles2017-sustainable-urban-mobility-policy-contextpdf

1570-08440-1900$3500 ccopy 0 ndash IOS Press and the authors All rights reserved

2 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 1 Example of a connection at LC in JSON-LD

transport data from providers available on national orcommon access points saved on databases data ware-house or repositories All the states will provide accessto a unique common point following different staticstandards as Transmodel2 Datex II3 or GTFS4 andreal-time standards like GTFS-RT5 or SIRI6 So thedomain requires solutions able to provide reliable dataand to deal with the heterogeneity of access points andthe data formats

One of the main challenges when the Linked Dataapproaches deal with transport domain where live andhistorical data has to be taken into account for provid-ing consistent data to the information services is howto ensure that the identifiers of each resource is stableand valid over time For example if we define the con-nection entity as a departure-arrival pair as has beendone in previous works on Linked Connections [5] theURI of each connection was defined getting informa-tion from static GTFS datasets At the moment that livedata is involved those URIs are not consistent becausethe live information creates new instances of a connec-tion at different points in time An example of the con-ceptualization of a connection is shown in the Figure 1Other relevant challenge is how to manage the exposedhistorical data on the Web to allow an optimal queryperformance One of the requirements of most relevantroute planning algorithms is that the data have to besorted by the time In previous works of Linked Con-nections [6] we developed a server that paginates thelist of connections in departure time intervals (10 min-utes) and publishes these pages over HTTP We havenoticed that the clients were able to analyze the histor-ical data but the performance was too low

Our work is focused on providing an improvementof the current state of Linked Connections frameworkby allowing an efficient management of historical andlive data with a standard vocabulary for the transportdomain This is relevant because of two main reasons

2httpwwwtransmodel-ceneu3httpwwwdatex2eu4httpsdevelopersgooglecomtransitgtfs5httpsdevelopersgooglecomtransitgtfs-realtime6httpwwwtransmodel-ceneustandardssiri

(i) currently a lot of transport companies are starting toprovide access to their live services but the most com-mon way to do it is to develop an ad-hoc solution likeAPIs or web services that only work locally [7] and(ii) the heterogeneity of the domain in terms of dataformats will be a very relevant problem next years es-pecially in Europe based on the proposal of the EUCommission so a standard framework will be needed

In this paper we present the Linked Connectionsframework to provide reliable access to live and his-torical data Our main contribution is the extension ofprevious version of the LC server by providing an ef-ficient management of live and historical data Firstwe develop a library that gets information from staticGTFS datasets and GTFS-RT data streams and inte-grate them following the LC vocabulary Second wemodify the approach of Linked Connections for split-ting the fragments of the data using a specific size ofeach fragment instead of the time Third we programa route planning algorithm in the top of the LC serverto test if our approach is able to provide reliable dataFinally we evaluate the improvements comparing thenew version of the LC framework with our previousapproaches as base line and we analyse the impact ofthe fragment size in the performance of a route plan

The paper is organized as follows Section 2presents some of the related work about relevant ap-proaches for exposing data on the web efficiently pre-vious steps about the Linked Connections frameworka description of the de-facto standard GTFS and thespecification of relevant route planning algorithmsSection 3 describes our proposal about the improve-ments of the Linked Connections framework and aworking example description Section 4 presents thedesign of our experiments Section 5 describes the re-sults we obtained evaluating our main contributionsSection 6 provides a brief discussion about the rele-vance of our contributions and Section 7 presents con-clusions and areas for future work

2 Related Work

On the current state of the Web huge amount of dataare exposed following the principles of Linked Data Inthis section we describe the main contributions on thistopic focused on an efficient publication of data on thetransport domain a description of previous approachesdeveloped using the Linked Connections frameworkand the analysis of the model for transport data that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 3

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

supports our work Finally we analyse the most rele-vant algorithms for route planning

One of the most well-known alternatives to publishdata on the Web is Linked Data [1] Linked Data al-lows to identify in an unique way resources on theWeb using identifiers or HTTP URIs It is a methodto distribute and scale data over large organizationssuch as the Web When looking up this identifier byusing the HTTP protocol or using a Web browser adefinition must be returned including links towardsother related resources a practice called dereferencingThe triple format to be used in combination with URIsis standardized within RDF The URIs used for thesetriples already existed in other data sources and wethus favoured using the same identifiers It is up to adata publisher to make a choice on which data sourcescan provide the identifiers for a certain type of entities

A common problem in Linked Data is the availabil-ity of the triple stores They provide a way to gettingdata using the SPARQL query language but at the mo-ment of queries involving long periods of time theseapproaches are not efficient [8] The Linked Data Frag-ments(LDF) [9 10] solve this issue fragmenting thedata in several HTTP documents Following this ap-proach the store moves the load from server side toclient side improving its availability Comunica [11]is a framework that extends the possibilities of LDFallowing to query other semantic interfaces as com-mon SPARQL endpoint RDF data dumps or HDTdatasets [12]

Linked Connections [5] applies this approach to de-velop a cost-efficient solution based on a HTTP inter-face for transport data The main assumption of LC isthat the relevant data for route planners can be basedon the connection concept Basically as the LC vo-cabulary7 defines it a connection describes a depar-ture at a certain stop and an arrival at a different stopwith their corresponding departure and arrival timesand without any intermediate stop A basic implemen-tation of LC is shown in Figure 2 where the routeplanning algorithms have to analyse the connections(small rectangles) through the pages (big rectangles)and across time to find the expected route The joinbetween the connections is possible because same re-sources have same identifiers based on one of the prin-ciples of Linked Data The Hydra Ontology [13] isused to specify the next and previous page links as wellas how the resource itself should be discovered

7httpsemwebmmlabbenslinkedconnections

Fig 2 Linked Connections implementation

It also relevant to describe the previous works thathas been carried out using the specification of LinkedConnections For example [14] describes and anal-yses the behaviour of a basic transport API and theLinked Connections framework for public transit routeplanning comparing the CPU and query executiontime The authors found that at the expense of ahigher bandwidth consumption more queries can beanswered using LC than the origin-destination APIIn [15] studies the impact of taking into account userpreferences in a public transit route planning addingthat features both on server and client and comparingthe two solution on query execution time cache per-formance and CPU usage on both sides A first stepfor providing reliable access to historical and live datausing Linked Connections is described in [6] where amechanism is introduced to tackle the problem of themanagement of data modifications when live data is in-volved in route planning queries Tripscore8 a LinkedData client that consume several Linked Connectionsservers with live and historical data is also described in[16] where the Connection Scan Algorithm (CSA) isimplemented as the route planning algorithm in top ofthe client [17] Tripscore add multiple user preferencesat the client side to provide an score for each possibleroute

All the solutions that we aforementioned rely onthe de-facto standard for representing public transportdata the General Transit Feed Specification or GTFSThis model and its extension for real-time (GTFS-RT)is used by Google Maps9 since 2005 but also by otherroute planners like Open Trip Planner 10 or Navitaio11It is also the most common model used by the trans-port companies to expose their data on open data por-tals like for example the Consorcio General de Trans-portes de Madrid12 the TRAM in Barcelona13 or the

8wwwtripscoreeu9httpmapsgooglees10httpwwwopentripplannerorg11httpswwwnavitiaio12httpdatoscrtmes13httpsopendatatramcat

4 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Belgium National Train System (NMBS) GTFS de-fines the headers of 13 types of CSV files and a set ofrules the must be take into account when the datasetis created Each file as well as their headers can bemandatory or optional and have relations among themas show in Figure 3 Linked Connections is getting thenecessary information from a subset of the full dataset

ndash stops Individual locations where vehicles pick upor drop off passengers

ndash calendar Dates for service IDs using a weeklyschedule Specify when service starts and ends aswell as days of the week where service is avail-able

ndash calendar_dates Exceptions for the service IDsdefined in the calendar file

ndash stop_times Times that a vehicle arrives at and de-parts from individual stops for each trip

ndash trips Trips for each route A trip is a sequence oftwo or more stops that occurs at specific time

ndash routes Transit routes A route is a group of tripsthat are displayed to riders as a single service

ndash transfers Rules for making connections at trans-fer points between routes

In order to link the terms and identifiers defined inthese files with the Linked Open Data cloud we usedthe Linked GTFS14 vocabulary We create mappingsable to transform GTFS files to Linked GTFS follow-ing the CSV2RDF [18] W3C recommendation15 butalso using other standard OBDA mapping languagesthat are able to deal with CSV files16 like RML [19]or R2RML [20]

The extension of GTFS for real-time GTFS-RT17 isa feed specification that allows public transport agen-cies to provide real time updates about their fleetThe specification supports three types of information(i) trips updates like delays cancellations or changeroutes (ii) service alerts like stop moved unforeseenevents affecting stations routes etc and (iii) informa-tion of vehicle positions including location and con-gestion level The data exchange format is based onProtocol Buffers18

It is also important to mention the works about routeplanning algorithms that can be developed on the topof the Linked Connections framework The problem

14httpvocabgtfsorgterms15httpsgithubcomOpenTransportgtfs-csv2rdf16httpsgithubcomdachafragtfsmappings17httpsdevelopersgooglecomtransitgtfs-realtime18httpsdevelopersgooglecomprotocol-buffers

Fig 3 The GTFS model and its primary relations

that these algorithms has to solve using the data isthe Earliest Arrival Time (EAT) An EAT query con-sists of a departure stop a departure time and a desti-nation stop The main goal is to find the fastest jour-ney to the destination stop starting from the depar-ture stop at the departure time There are more com-plex route planning questions like the Minimum Ex-pected Arrival Time (MEAT) [17] or multi-criteriaprofile queries [21ndash23] The Connection Scan Algo-rithm (CSA) [17] is an approach that models thetimetable data as a directed acyclic graph [24] usinga stream of connections As we defined before a con-nection is a combination of a departure stop (cdepstop)with a departure time (cdeptime) and an arrival stop(carrstop) with an arrival time (carrtime) All the connec-tions are combined in a stream of connections sortedby increasing departure time Thanks to this feature itis sufficient to only consider the connections where thecdeptime is equals to or later than the desirable depar-ture time The CSA algorithm works as follows whena new scanned connection leads to a faster route to thearrival stop the minimum spanning tree (MST) will beupdated This is the case when carrtime is earlier thanthe actual EAT at carrstop Finally the algorithm endswhen the destination stop is added to the MST and theresulting journey can be obtained by following the pathin the MST backwards

In summary after some proofs of concepts devel-oped taking into account live and historical data in theLC framework and the current advances in the state ofthe work exposing the data on the Web efficiently we

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 5

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

think that it is the moment to improve the features ofthe current LC framework to create a standard mecha-nism for publishing reliable public transport data Themain motivations to do that are on one hand the ne-cessity to improve the current access of the transportinformation services to the data where at the momentthat real-time is involved the solutions are basicallyad-hoc and they are not feasible if we are based onthe proposal of the EU Commission for publish pub-lic transport data On the other hand supported bythe characteristics of the transport data where a com-plex data management environment emerges the newapproach that we present in this paper can serve asa source of inspiration for historical and live LinkedData management on the Web in other domains

3 The Linked Connections Framework

In this section we describe the Linked Connectionsframework for providing reliable access to live andhistorical data First we describe a summary of theLinked Connections specification including new prop-erties about features of live data Second we describethe extensions we develop to deal with these types ofdata the linked connections library for transforminglive feeds into connections and the LC server for man-aging and exposing that connections on the Web effi-ciently

31 The Linked Connections specification

The LC specification19 explains how to implement adata publishing sever and explains what you can relyon when writing a route planning client Following theLinked Data principles and the REST constraints wemake sure that from any HTTP response hypermediacontrols can be followed to discover the rest of thedataset A Linked Connections graph is a paged col-lection of connections describing the time transit ve-hicles leave and arrive The Linked Connections vo-cabulary20 defines the basic properties used within thelcConnection class

ndash lcdepartureTime It is a date-time includ-ing delay at which the vehicle will leave for thelcarrivalStop

ndash lcdepartureStop The departure stop URI

19httpslinkedconnectionsorgspecification20httpsemwebmmlabbenslinkedconnections

ndash lcdepartureDelay Provides time in sec-onds when the lcdepartureTime is not theplanned departureTime

ndash lcarrivalTime It is a date-time includ-ing delay at which the vehicle will arrives atlcarrivalStop

ndash lcarrivalStop The arrival stop URIndash lcarrivalDelay Provides time in second

when the lcarrivalTime is not the plannedarrivalTime

We also reuse the terms from the Linked GTFS vo-cabulary21 in order to describe other properties of alcConnection class This vocabulary it is alignedwith the de-facto standard to exchange public transportdata on the Web GTFS

ndash gtfstrip Must be set to link a gtfsTripwith a lcConnection identifying whetheranother connection is part of the current trip of avehicle

ndash gtfspickupType Should be set to indi-cate whether people can be picked up at thisstop The possible values gtfsRegulargtfsNotAvailable gtfsMustPhoneand gtfsMustCoordinateWithDriver

ndash gtfsdropOffType Should be set to indicatewhether people can be dropped off at this stopThe possible values are the same as the defined inthe gtfspickupType property

It is important to remark that each Linked Connec-tions page must contain some metadata about itselfThe URL after redirection or the one indicated by theLocation HTTP header therefore must occur in thetriples of the response The current page must containthe hypermedia controls to discover under what condi-tions the data can be legally reused and must containthe hypermedia controls to understand how to navigatethrough the paged collection(s) Different ways existto implement the paging strategy At least one of thestrategies must be implemented for a Linked Connec-tions client to find the next pages to be processed

1 Each response must contain a hydranext andhydraprevious page link

2 Each response should contain ahydrasearch describing that you cansearch for a specific time and that the clientwill be redirected to a page containing in-

21httpvocabgtfsorgterms

6 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 4 Metadata LC response

formation about that timestamp To describethis functionality the hydrapropertylcdepartureTimeQuery is used Anexample is shown in the Figure 4

Finally for each document that is published by aLinked Connections server a Cross Origin ResourceSharing22 HTTP header must set the property Access-Control-Allow-Origin for sharing the response withany origin The server also should implement thor-ough caching strategies as the cachability is one of thebiggest advantages of the Linked Connections frame-work Both conditional requests23 as regular caching24

are recommended As every connection will need tohave a unique an persistent identifier an HTTP URIthe server must support at least one RDF11 formatsuch as JSON-LD [25] TriG [26] or N-Quads [27]The connection URI should follow the Linked Dataprinciples [1]

32 Live data in Linked Connections

As we aforementioned the transport domain shouldinvolve in their route planning algorithms the live datato provide consistent information to the passengers andimprove the current informational services The de-facto standard for exchange transport data on the WebGTFS has created a version for live updates GTFS-RT Providing globally unique identifiers to the differ-ent entities that comprise a public transport network isfundamental to lower the adoption cost of public trans-port data in route-planning applications Specificallyin the case of live updates about the schedules is im-portant to maintain stable identifiers that remain validover time Here we use the Linked Data principles totransform schedule updates given in the GTFS-RT for-

22httpenable-corsorg23httpsdevelopermozillaorgen-USdocsWebHTTP

Conditional_requests24httpsdevelopermozillaorgen-USdocsWebHTTPCaching

mat to Linked Connections and we give the option toserialize them in JSON CSV or RDF (turtle N-Triplesor JSON-LD) format

The URI strategy to be used during the conversionprocess is given following the RFC 657025 specifica-tion for URI templates The parameters used to buildthe URIs are given following an object-like notation(objectvariable) where the left side references a CSVfile present in the provided GTFS data source and theright side references a specific column of such fileWe use the data from a reference GTFS data sourceto create the URIs as with only the data present in aGTFS-RT update may not be feasible to create persis-tent URIs The GFTS files that can be used to createthe URIs are routes and trips files As for the variablesany column that exists in those files can be referencedA simple example is shown in Figure 5 and an standardtemplate is also available26 Next we describe how arethe entities URIs build based on these templates

ndash stop A Linked Connection references two differ-ent stops (departure and arrival stop) The dataused to build these specific URIs comes directlyfrom the GTFS-RT update so we do not specifyany CSV file and header from the reference GTFSdata source The variable name chosen in our caseis the stop_id but it can be freely named

ndash route For the route identifier we relyon the routesroute_short_name and thetripstrip_short_name variables

ndash trip In the case of the trip we add theexpected departure time on top of theroute URI based on the associated connec-tiondepartureTime(YYYYMMDD) and theinformation about the delay

ndash connection For a connection identifier weresort to its departure stop with connec-tiondepartureStop its departure time withconnectiondepartureTime(YYYYMMDD)and the routesroute_short_name and thetripstrip_short_name In this case we referencea special entity we called connection whichcontains the related basic data that can beextracted from a GTFS-RT update for everyLinked Connection A connection entity containsthese parameters that can be used on the URIsdefinition connectiondepartureStop connec-

25httpstoolsietforghtmlrfc657026httpsgithubcomlinkedconnectionsgtfsrt2lcblobmaster

uris_template_examplejson

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 7

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 5 Connection example with live updates

Fig 6 The GTFSRT2LC tool

tionarrivalStop connectiondepartureTime andconnectionarrivalTime As both departureTimeand arrivalTime are date objects the expectedformat can be defined using brackets

Finally we developed a tool27 that analyses the in-formation from an static GTFS dataset and the updatesfrom a GTFS-RT feed exploits the implicit relationsamong the CSV files and creates the live Linked Con-nections feed in the desirable format with the help ofthe URIs template as it is shown in Figure 6

33 Managing live and historical data with theLinked Connections server

The third main contribution of this paper after de-velop a way to create Linked Connections feeds tak-ing into account live data is how to provide reliableaccess to historical and live transport data in an cost-efficiently way For that reason we develop a LinkedConnections server which exposes data fragments us-ing JSON-LD serialization format First the serveruses the GTFS2LC28 library to convert a GTFS datasetinto a time sorted directed acyclic graph of connec-tions After that the fragmenting process starts using

27httpsgithubcomlinkedconnectionsgtfsrt2lc28httpsgithubcomlinkedconnectionsgtfs2lc

the information provided in the configuration of theserver about the fragment size Once the fragmentationprocess is completed it starts using the GTFSRT2LClibrary for mapping real time updates into a LinkedConnections feed as we describe in the above section

As we aforementioned the previous process forfragmenting LC was based on a predefined time spanwhich used this variable for the pagination of the dataand created an heterogeneous environment in terms ofthe fragment size In other words setting a fixed timewindow for the creation of the fragments (eg 10 min-utes) could lead to the creation of fragments contain-ing a high number of connections specially in the rushhours and thus being big fragments in terms of sizeAt the same time smaller fragments could also be cre-ated at times where there are few or no vehicles de-parting This is the main reason why we start to frag-menting the historical data of the LC feed homoge-neously providing a specific fragment size this has arelevant effect in terms of performance as we demon-strate in the Section 5 As for the LC live feed wekeep an unique timeline of connection updates that isstored and fragmented using a predefined time span incontrast to static timetable updates which may containa complete re-write or partially overlap with previousversions from a time perspective

The server is able to manage both the historical andthe live connections First the server create the LinkedConnections from the information provided by a GTFSdataset and fragment the historical data into severalfragments After that if the transport system has anopen live service the sever starts to process the updatesfrom the GTFS-RT feed and create another streamof connections based on that information Then theserver searches the correspondence between the his-torical and live connections based on the URIs of thetrips and keeps an registry over time of the historicalevolution of the connections regarding delays andorcancellations Finally the server exposes the stream ofhistorical and live connections on an API following thefragmentation approach This process shown in Fig-ure 7 is repeated every time that the server process anupdate

The server also allows querying historic data bymeans of the Memento Framework [28] which enablestime-based content negotiation over HTTP By usingthe Accept-Datetime header a client can request thestate of a resource at a given moment If existing theserver will respond with a 302 Found containing theURI of the stored version of such resource So if a re-quest is made to obtain a Linked Connections fragment

8 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 7 The Linked Connections server

identified by a departure time the headers of the frag-ment will provide the information about the Accept-Datetime That means that is possible to know what isthe state of the delays at a given date-time for a set ofconnections enabling different kind of analysis of thebehaviour of a transport network that may contributeon the improvement of its design This approach alsoguarantees the coherence of the data used to answer agive query ie when a query is issued by a client itstarts to download data fragments of relevant connec-tions to form a MST that leads to an appropriate an-swer but if a new update is processed by the serverwhile the client is still processing the query some ofthe connections that the client has already processedmay reappear to the client in later fragments due toupdated delays which will render invalid the MSTcreated so far and may lead to incoherent answersHowever by using Memento and including a Accept-Datetime header in the requests made by clients theserver will guarantee that all the provided responseswill belong to the same moment and will lead to validanswers This mechanism is further described in [6]

34 A running example Providing reliable access toNMBS historical and real time data

The NMBS is the national company of trains in Bel-gium They provide their static and live data as opendata in GTFS and GTFS-RT formats We use this caseto explain how our approach works

As we incorporate the GTFS2LC and GTFSRT2LCtools to the Linked Connections server we only have toconfigure it to start producing the fragments of live andhistorical data The basic configuration for the GTFSdataset is shown in Figure 8 where the companyNamemust be a unique identifier of the dataset the down-loadUrl must be a public URL where the server has

Fig 8 NMBS GTFS configuration

Fig 9 NMBS GTFS-RT configuration

to download the GTFS dataset the downloadOnLanchboolean indicates if the dataset must be downloadedand processed upon server launch the updatePeriod isa node-cron29 expression to check if a new version ofthe dataset has been published for creating a new LCfeed and finally the fragmentSize represents the sizeof each fragment in bytes

For the GTFS-RT feed the configuration is showin the Figure 9 where the downloadUrl is the URLwhere the updates are published the updatePeriod is anode-cron expression that indicates how often shouldthe server look for and process a new version of thefeed the fragmentTimeSpan defines the fragmentationof live data and it represents the time span of everyfragment in seconds and finally the compressionPeriodis also a node-cron expression that defines how oftenwill the live data be compressed using gzip in order toreduce storage consumption

Finally we have to configure the URIs using thetemplate we describe in the Section 32 and launch theserver After the GTFS datasets have been processedwe are able to start querying the data All the frag-ments have a uniform starting point which is the ex-pected departure time of its first lcconnectionSo the clients can start accessing the transport datausing using the departure time as a parameter likethis example httphostportNMBSconnectionsdepartureTime=2017-08-11T164500000Z In thisexample we use a client implementing the CSA routeplanning algorithm to perform a query The clientsource code is available at github30 The parameters ofthe query are as follows

29httpswwwnpmjscompackagenode-cron30httpsgithubcomzoeparmanconnectionscan

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 9

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Departure Stop httpirailbestationsNMBS008811189

ndash Arrival Stop httpirailbestationsNMBS008821246

ndash Departure Time 2018-06-14T150000000Z

The client starts by sending a request to the serverusing the departure time defined in the query as a pa-rameter for the fragments query URL Once the serverreceives the request it will determine which staticfragment contains connections relevant for the query Itdoes this through a binary search looking into the firstconnection of every fragment until it finds a fragmentFk whose first connectionrsquos departure time is greaterthan the requested value and a fragment Fk-1 whosefirst connectionrsquos departure time is less or equal to therequested value determining then that Fk-1 is the frag-ment to be returned in the response to client After thisthe server checks if there are live updates for the con-nections contained in the selected fragment and if soit proceeds to update the departure and arrival times ofthe connections according to the reported delays Thisprocess may involve the inclusion of new connectionsthat originally belonged to previous fragments but dueto the delays are now relevant for the query In thesame way some connections may have to be removedas their delays make them belong to further fragmentsnow Once this is done the connections are resorted bytheir new departure times to ensure a correct executionof the CSA algorithm on the client side Finally theserver adds the necessary metadata (as shown on fig-ure 4) and gives back the response using the JSON-LDformat

Once the client receives the first response it startslooking for the first connection with a departure stopequals to the one defined in the query When it findsone the algorithm starts building a MST where eachbranch represent a different vehicle departing from thespecified departure stop at a different time The algo-rithm process the connections and requests new frag-ments for it and if such a route exists within the trans-port network eventually it will complete at least one ofthe branches of the MST It will keep processing con-nections until the departure time of the next connec-tions to be processed is greater than the arrival time ofthe fastest found route as this means that is no longerpossible to find a faster route Finally the client will re-port the found routes in a JSON object as seen in figure10

Fig 10 CSA result for a query over the NMBS Linked Connectionsstream

4 Evaluation Design

We design a set of experiments in order to evaluateour approach In this section we describe the main fea-tures of our developments the data used for the exper-iments and the testbed we created We implement dif-ferent components in JavaScript for the Nodejs plat-form We chose JavaScript as it allows us to use thecomponents both on a command-line environment aswell as in the browser which is ideal for client sideapplications

ndash GTFSRT2LC A tool to convert live updates andtimetables as open data to Linked ConnectionsVocabulary

ndash LC server Publishes streams of connections inJSON-LD as we explain in the Section 3

ndash CSA algorithm We developed the CSA algorithmon the top of the Linked Connections Frameworkas the client

We also use other tools that we developed previ-ously as the GTFS2LC library that it is able to convertstatic timetables as open data to Linked ConnectionsVocabulary and the Memento Protocol on the top ofthe Linked Connections Server [6] to enable reliableaccess to historical data as we describe in the Section3

These tools are combined in different set-ups with aroute planning algorithm in the client-side and are usedin the following scenarios

10 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Static data fragmentation by size without cacheThe first experiment executes the queries tak-ing into account only timetable static data wherethe size of the exposed fragments are always thesame Client caching is disabled making this sim-ulate the LC without cache set-up where everyrequest could originate from an end-userrsquos deviceWe test different fragments size to test how it af-fects the query answering performance

ndash Static data fragmentation by size with cache Thesecond experiment does the same as the first ex-cept that it uses a client side cache and simulatesthe LC with cache set-up

ndash Historical data fragmentation by size withoutcache The third experiment does the same as thefirst except that it uses the Memento protocol toensure data coherence As the Memento protocolinvolves HTTP redirections to access specific re-sources at different points in time this experimentis meant to measure its impact on query answer-ing performance

ndash Historical data fragmentation by size with cacheThe fourth experiment does the same as the thirdexcept that it uses a client side cache and simu-lates the LC with cache set-up

ndash Live and historical data fragmentation by sizewithout cache The fifth experiment does thesame as the third but at this tame live data is alsotake into account modifying the fragments sizedue to the delays We also test different fragmentssizes

ndash Live and historical data fragmentation by sizewith cache The sixth experiment does the sameas the fifth except that it uses a client side cacheand simulates the LC with cache set-up

ndash Live and historical data fragmentation by timewithout cache The seventh experiment fragmentsthe data based on time as our previous approacheswith the client cache disable We want to test ifthe performance of our pagination by the size isbetter than by time

ndash Live and historical data fragmentation by timewith cache The eighth does the same as the sev-enth but it uses the client side cache

As far as we are aware of there are no general-purpose benchmarks in the state of the art for testingour main contributions with the Linked Connectionsframework Therefore we have created a testbed asfollows 1) we have selected and downloaded 4 dif-ferent representative GTFS datasets available as open

Dataset Stops Routes Trips ConnectionsTBS Barcelona 27 3 5186 1957354

MLM Madrid 81 4 2872 3293575

NMBS Belgium 2615 479 12621 6941813

De Lijn Flanders 36050 1412 305079 63006596Table 1

Characteristics of the transport networks datasets used for thebenchmarks

data that also cover several geographical locations 2)we have created several route planning queries (depar-ture and arrival stop with a departure time) that reflectreal-life trips with multiple levels of complexity and3) we have evaluated these queries in the different setups taking into account the features of each datasetThe query set created for each GTFS dataset was de-fined by randomly taking departure stops and arrivalstops with a random departure time and using the CSAalgorithm to find a route between them This processwas repeated for each transport network and spanninga 24 hours time frame of a regular week day In the endthis gave us a set of queries for each network with dif-ferent levels of complexity measured in terms of theamount of connections that need to be processed bya client to obtain a valid answer The complexity ofthe queries is directly related to the complexity of thetransport network itself Meaning that the amount ofconnections that need to be processed in a query in-crease as the amount of stops and vehicles of the net-work also increase Table 1 shows the characteristics(amount of stops routes and trips) of each transportnetwork and table 2 portrays the least and the mostcomplex queries used during the benchmarks for eachtransport network

The used GTFS datasets and GTFS-RT feeds areavailable as open data on the Web There are twodatasets that represent the Belgian Rail Transport Sys-tem (NMBS) and the Flemish Transport Company (DeLijn) and other two that represent the Tram Companyof Barcelona (TBS) and the Tram Company of Madrid(MLM) NMBS also provides open access to its real-time updates following the GTFS-RT format thereforewe use this data to carry out the evaluation that takesinto account live updates TBS have also developedan ad-hoc live API31 and we are currently developinga tool to transform the ad-hoc live information to theGTFS-RT format but at this moment only historicaldata can be taken into account in this case De Lijn

31httpsopendatatramcatmanual_enpdf

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 11

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Query Dataset of ConnectionsdepStop httpirailbestationsNMBS008841608

NMBS 3371arrStop httpirailbestationsNMBS008841319

depTime 2018-06-14T170000000Z

depStop httpirailbestationsNMBS008886504

NMBS 13797arrStop httpirailbestationsNMBS008841004

depTime 2018-06-14T140000000Z

depStop httpsdatadelijnbestops120281

De Lijn 34696arrStop httpsdatadelijnbestops126536

depTime 2018-06-07T130000000Z

depStop httpsdatadelijnbestops112358

De Lijn 114059arrStop httpsdatadelijnbestops111446

depTime 2018-06-07T130000000Z

depStop httpsbarcelonatbsesstops22

TBS 41arrStop httpsbarcelonatbsesstops21

depTime 2018-06-07T200000000Z

depStop httpsbarcelonatbsesstops14

TBS 556arrStop httpsbarcelonatbsesstops10

depTime 2018-06-07T050000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 148arrStop httpsmadridtramesstopspar_10_32

depTime 2018-06-07T210000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 799arrStop httpsmadridtramesstopspar_10_26

depTime 2018-06-07T160000000Z

Table 2Least and most complex queries of the LC testbed created for everybechmarked transport network

and MLM does not provide access to their live updateswhich is why we only perform static and historicalevaluations over this transport datasets

We ran the experiments on a laptop with an IntelCore i5-7440HQ 28GHz x 4 and 16GB of RAMEach query is executed at least 2 times and we takethe arithmetic average response time Our experimentscan be reproduced using the code at httpsgithubcomjulianrojas87linked-connections-server and thetestbed and the data at httpsgithubcomcef-oasislinkedconnections-tests

5 Results

Next we present the obtained results for each of thedescribed scenarios in the above section

51 Static data with fragmentation by size

These results are obtained from the evaluation madeover Linked Connection stream derived from GTFS

datasets of each transport network The queries exe-cuted on this evaluation do not follow the Mementoprotocol We present the results for both the evaluationwith and without an active client-side cache coveringthe first and second experiment described above

511 NMBSFor NMBS we created fragmentations of 10KB

50KB 300KB 500KB 1MB 3MB 7MB and 10MBWe used the same query set for both cache and nocache evaluations

Fig 11 NMBS static evaluation without cache

Figure 11 shows the response time distribution foreach fragmentation in a set up without a cache Herewe have that for a fragmentation of 300KB we havethe best performance with a median response time of1196ms for the used query set

Fig 12 NMBS static evaluation with cache

Figure 12 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 704ms

12 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

512 TBS BarcelonaFor TBS we used fragmentations going from 10KB

to 3MB and we used the same query set for both cacheand no cache set ups

Fig 13 TBS static evaluation without cache

Figure 13 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of75ms

Fig 14 TBS static evaluation with cache

Figure 14 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 10KB and a medianresponse time of 28ms

513 MLMFor MLM we used fragmentations going from

10KB to 3MB and we used the same query set for bothcache and no cache set ups

Fig 15 MLM static evaluation without cache

Figure 15 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of128ms

Fig 16 MLM static evaluation with cache

Figure 16 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 57ms

514 De LijnFor De Lijn the fragmentations created go from

500KB to 15MB The query set used for the bench-marks is the same in both active and inactive cache setups

Figure 17 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained witha fragmentation of 500KB and a median response timeof 33863ms

Figure 18 shows the response time distribution foreach fragmentation in a set up with an active cache on

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 2: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

2 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 1 Example of a connection at LC in JSON-LD

transport data from providers available on national orcommon access points saved on databases data ware-house or repositories All the states will provide accessto a unique common point following different staticstandards as Transmodel2 Datex II3 or GTFS4 andreal-time standards like GTFS-RT5 or SIRI6 So thedomain requires solutions able to provide reliable dataand to deal with the heterogeneity of access points andthe data formats

One of the main challenges when the Linked Dataapproaches deal with transport domain where live andhistorical data has to be taken into account for provid-ing consistent data to the information services is howto ensure that the identifiers of each resource is stableand valid over time For example if we define the con-nection entity as a departure-arrival pair as has beendone in previous works on Linked Connections [5] theURI of each connection was defined getting informa-tion from static GTFS datasets At the moment that livedata is involved those URIs are not consistent becausethe live information creates new instances of a connec-tion at different points in time An example of the con-ceptualization of a connection is shown in the Figure 1Other relevant challenge is how to manage the exposedhistorical data on the Web to allow an optimal queryperformance One of the requirements of most relevantroute planning algorithms is that the data have to besorted by the time In previous works of Linked Con-nections [6] we developed a server that paginates thelist of connections in departure time intervals (10 min-utes) and publishes these pages over HTTP We havenoticed that the clients were able to analyze the histor-ical data but the performance was too low

Our work is focused on providing an improvementof the current state of Linked Connections frameworkby allowing an efficient management of historical andlive data with a standard vocabulary for the transportdomain This is relevant because of two main reasons

2httpwwwtransmodel-ceneu3httpwwwdatex2eu4httpsdevelopersgooglecomtransitgtfs5httpsdevelopersgooglecomtransitgtfs-realtime6httpwwwtransmodel-ceneustandardssiri

(i) currently a lot of transport companies are starting toprovide access to their live services but the most com-mon way to do it is to develop an ad-hoc solution likeAPIs or web services that only work locally [7] and(ii) the heterogeneity of the domain in terms of dataformats will be a very relevant problem next years es-pecially in Europe based on the proposal of the EUCommission so a standard framework will be needed

In this paper we present the Linked Connectionsframework to provide reliable access to live and his-torical data Our main contribution is the extension ofprevious version of the LC server by providing an ef-ficient management of live and historical data Firstwe develop a library that gets information from staticGTFS datasets and GTFS-RT data streams and inte-grate them following the LC vocabulary Second wemodify the approach of Linked Connections for split-ting the fragments of the data using a specific size ofeach fragment instead of the time Third we programa route planning algorithm in the top of the LC serverto test if our approach is able to provide reliable dataFinally we evaluate the improvements comparing thenew version of the LC framework with our previousapproaches as base line and we analyse the impact ofthe fragment size in the performance of a route plan

The paper is organized as follows Section 2presents some of the related work about relevant ap-proaches for exposing data on the web efficiently pre-vious steps about the Linked Connections frameworka description of the de-facto standard GTFS and thespecification of relevant route planning algorithmsSection 3 describes our proposal about the improve-ments of the Linked Connections framework and aworking example description Section 4 presents thedesign of our experiments Section 5 describes the re-sults we obtained evaluating our main contributionsSection 6 provides a brief discussion about the rele-vance of our contributions and Section 7 presents con-clusions and areas for future work

2 Related Work

On the current state of the Web huge amount of dataare exposed following the principles of Linked Data Inthis section we describe the main contributions on thistopic focused on an efficient publication of data on thetransport domain a description of previous approachesdeveloped using the Linked Connections frameworkand the analysis of the model for transport data that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 3

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

supports our work Finally we analyse the most rele-vant algorithms for route planning

One of the most well-known alternatives to publishdata on the Web is Linked Data [1] Linked Data al-lows to identify in an unique way resources on theWeb using identifiers or HTTP URIs It is a methodto distribute and scale data over large organizationssuch as the Web When looking up this identifier byusing the HTTP protocol or using a Web browser adefinition must be returned including links towardsother related resources a practice called dereferencingThe triple format to be used in combination with URIsis standardized within RDF The URIs used for thesetriples already existed in other data sources and wethus favoured using the same identifiers It is up to adata publisher to make a choice on which data sourcescan provide the identifiers for a certain type of entities

A common problem in Linked Data is the availabil-ity of the triple stores They provide a way to gettingdata using the SPARQL query language but at the mo-ment of queries involving long periods of time theseapproaches are not efficient [8] The Linked Data Frag-ments(LDF) [9 10] solve this issue fragmenting thedata in several HTTP documents Following this ap-proach the store moves the load from server side toclient side improving its availability Comunica [11]is a framework that extends the possibilities of LDFallowing to query other semantic interfaces as com-mon SPARQL endpoint RDF data dumps or HDTdatasets [12]

Linked Connections [5] applies this approach to de-velop a cost-efficient solution based on a HTTP inter-face for transport data The main assumption of LC isthat the relevant data for route planners can be basedon the connection concept Basically as the LC vo-cabulary7 defines it a connection describes a depar-ture at a certain stop and an arrival at a different stopwith their corresponding departure and arrival timesand without any intermediate stop A basic implemen-tation of LC is shown in Figure 2 where the routeplanning algorithms have to analyse the connections(small rectangles) through the pages (big rectangles)and across time to find the expected route The joinbetween the connections is possible because same re-sources have same identifiers based on one of the prin-ciples of Linked Data The Hydra Ontology [13] isused to specify the next and previous page links as wellas how the resource itself should be discovered

7httpsemwebmmlabbenslinkedconnections

Fig 2 Linked Connections implementation

It also relevant to describe the previous works thathas been carried out using the specification of LinkedConnections For example [14] describes and anal-yses the behaviour of a basic transport API and theLinked Connections framework for public transit routeplanning comparing the CPU and query executiontime The authors found that at the expense of ahigher bandwidth consumption more queries can beanswered using LC than the origin-destination APIIn [15] studies the impact of taking into account userpreferences in a public transit route planning addingthat features both on server and client and comparingthe two solution on query execution time cache per-formance and CPU usage on both sides A first stepfor providing reliable access to historical and live datausing Linked Connections is described in [6] where amechanism is introduced to tackle the problem of themanagement of data modifications when live data is in-volved in route planning queries Tripscore8 a LinkedData client that consume several Linked Connectionsservers with live and historical data is also described in[16] where the Connection Scan Algorithm (CSA) isimplemented as the route planning algorithm in top ofthe client [17] Tripscore add multiple user preferencesat the client side to provide an score for each possibleroute

All the solutions that we aforementioned rely onthe de-facto standard for representing public transportdata the General Transit Feed Specification or GTFSThis model and its extension for real-time (GTFS-RT)is used by Google Maps9 since 2005 but also by otherroute planners like Open Trip Planner 10 or Navitaio11It is also the most common model used by the trans-port companies to expose their data on open data por-tals like for example the Consorcio General de Trans-portes de Madrid12 the TRAM in Barcelona13 or the

8wwwtripscoreeu9httpmapsgooglees10httpwwwopentripplannerorg11httpswwwnavitiaio12httpdatoscrtmes13httpsopendatatramcat

4 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Belgium National Train System (NMBS) GTFS de-fines the headers of 13 types of CSV files and a set ofrules the must be take into account when the datasetis created Each file as well as their headers can bemandatory or optional and have relations among themas show in Figure 3 Linked Connections is getting thenecessary information from a subset of the full dataset

ndash stops Individual locations where vehicles pick upor drop off passengers

ndash calendar Dates for service IDs using a weeklyschedule Specify when service starts and ends aswell as days of the week where service is avail-able

ndash calendar_dates Exceptions for the service IDsdefined in the calendar file

ndash stop_times Times that a vehicle arrives at and de-parts from individual stops for each trip

ndash trips Trips for each route A trip is a sequence oftwo or more stops that occurs at specific time

ndash routes Transit routes A route is a group of tripsthat are displayed to riders as a single service

ndash transfers Rules for making connections at trans-fer points between routes

In order to link the terms and identifiers defined inthese files with the Linked Open Data cloud we usedthe Linked GTFS14 vocabulary We create mappingsable to transform GTFS files to Linked GTFS follow-ing the CSV2RDF [18] W3C recommendation15 butalso using other standard OBDA mapping languagesthat are able to deal with CSV files16 like RML [19]or R2RML [20]

The extension of GTFS for real-time GTFS-RT17 isa feed specification that allows public transport agen-cies to provide real time updates about their fleetThe specification supports three types of information(i) trips updates like delays cancellations or changeroutes (ii) service alerts like stop moved unforeseenevents affecting stations routes etc and (iii) informa-tion of vehicle positions including location and con-gestion level The data exchange format is based onProtocol Buffers18

It is also important to mention the works about routeplanning algorithms that can be developed on the topof the Linked Connections framework The problem

14httpvocabgtfsorgterms15httpsgithubcomOpenTransportgtfs-csv2rdf16httpsgithubcomdachafragtfsmappings17httpsdevelopersgooglecomtransitgtfs-realtime18httpsdevelopersgooglecomprotocol-buffers

Fig 3 The GTFS model and its primary relations

that these algorithms has to solve using the data isthe Earliest Arrival Time (EAT) An EAT query con-sists of a departure stop a departure time and a desti-nation stop The main goal is to find the fastest jour-ney to the destination stop starting from the depar-ture stop at the departure time There are more com-plex route planning questions like the Minimum Ex-pected Arrival Time (MEAT) [17] or multi-criteriaprofile queries [21ndash23] The Connection Scan Algo-rithm (CSA) [17] is an approach that models thetimetable data as a directed acyclic graph [24] usinga stream of connections As we defined before a con-nection is a combination of a departure stop (cdepstop)with a departure time (cdeptime) and an arrival stop(carrstop) with an arrival time (carrtime) All the connec-tions are combined in a stream of connections sortedby increasing departure time Thanks to this feature itis sufficient to only consider the connections where thecdeptime is equals to or later than the desirable depar-ture time The CSA algorithm works as follows whena new scanned connection leads to a faster route to thearrival stop the minimum spanning tree (MST) will beupdated This is the case when carrtime is earlier thanthe actual EAT at carrstop Finally the algorithm endswhen the destination stop is added to the MST and theresulting journey can be obtained by following the pathin the MST backwards

In summary after some proofs of concepts devel-oped taking into account live and historical data in theLC framework and the current advances in the state ofthe work exposing the data on the Web efficiently we

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 5

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

think that it is the moment to improve the features ofthe current LC framework to create a standard mecha-nism for publishing reliable public transport data Themain motivations to do that are on one hand the ne-cessity to improve the current access of the transportinformation services to the data where at the momentthat real-time is involved the solutions are basicallyad-hoc and they are not feasible if we are based onthe proposal of the EU Commission for publish pub-lic transport data On the other hand supported bythe characteristics of the transport data where a com-plex data management environment emerges the newapproach that we present in this paper can serve asa source of inspiration for historical and live LinkedData management on the Web in other domains

3 The Linked Connections Framework

In this section we describe the Linked Connectionsframework for providing reliable access to live andhistorical data First we describe a summary of theLinked Connections specification including new prop-erties about features of live data Second we describethe extensions we develop to deal with these types ofdata the linked connections library for transforminglive feeds into connections and the LC server for man-aging and exposing that connections on the Web effi-ciently

31 The Linked Connections specification

The LC specification19 explains how to implement adata publishing sever and explains what you can relyon when writing a route planning client Following theLinked Data principles and the REST constraints wemake sure that from any HTTP response hypermediacontrols can be followed to discover the rest of thedataset A Linked Connections graph is a paged col-lection of connections describing the time transit ve-hicles leave and arrive The Linked Connections vo-cabulary20 defines the basic properties used within thelcConnection class

ndash lcdepartureTime It is a date-time includ-ing delay at which the vehicle will leave for thelcarrivalStop

ndash lcdepartureStop The departure stop URI

19httpslinkedconnectionsorgspecification20httpsemwebmmlabbenslinkedconnections

ndash lcdepartureDelay Provides time in sec-onds when the lcdepartureTime is not theplanned departureTime

ndash lcarrivalTime It is a date-time includ-ing delay at which the vehicle will arrives atlcarrivalStop

ndash lcarrivalStop The arrival stop URIndash lcarrivalDelay Provides time in second

when the lcarrivalTime is not the plannedarrivalTime

We also reuse the terms from the Linked GTFS vo-cabulary21 in order to describe other properties of alcConnection class This vocabulary it is alignedwith the de-facto standard to exchange public transportdata on the Web GTFS

ndash gtfstrip Must be set to link a gtfsTripwith a lcConnection identifying whetheranother connection is part of the current trip of avehicle

ndash gtfspickupType Should be set to indi-cate whether people can be picked up at thisstop The possible values gtfsRegulargtfsNotAvailable gtfsMustPhoneand gtfsMustCoordinateWithDriver

ndash gtfsdropOffType Should be set to indicatewhether people can be dropped off at this stopThe possible values are the same as the defined inthe gtfspickupType property

It is important to remark that each Linked Connec-tions page must contain some metadata about itselfThe URL after redirection or the one indicated by theLocation HTTP header therefore must occur in thetriples of the response The current page must containthe hypermedia controls to discover under what condi-tions the data can be legally reused and must containthe hypermedia controls to understand how to navigatethrough the paged collection(s) Different ways existto implement the paging strategy At least one of thestrategies must be implemented for a Linked Connec-tions client to find the next pages to be processed

1 Each response must contain a hydranext andhydraprevious page link

2 Each response should contain ahydrasearch describing that you cansearch for a specific time and that the clientwill be redirected to a page containing in-

21httpvocabgtfsorgterms

6 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 4 Metadata LC response

formation about that timestamp To describethis functionality the hydrapropertylcdepartureTimeQuery is used Anexample is shown in the Figure 4

Finally for each document that is published by aLinked Connections server a Cross Origin ResourceSharing22 HTTP header must set the property Access-Control-Allow-Origin for sharing the response withany origin The server also should implement thor-ough caching strategies as the cachability is one of thebiggest advantages of the Linked Connections frame-work Both conditional requests23 as regular caching24

are recommended As every connection will need tohave a unique an persistent identifier an HTTP URIthe server must support at least one RDF11 formatsuch as JSON-LD [25] TriG [26] or N-Quads [27]The connection URI should follow the Linked Dataprinciples [1]

32 Live data in Linked Connections

As we aforementioned the transport domain shouldinvolve in their route planning algorithms the live datato provide consistent information to the passengers andimprove the current informational services The de-facto standard for exchange transport data on the WebGTFS has created a version for live updates GTFS-RT Providing globally unique identifiers to the differ-ent entities that comprise a public transport network isfundamental to lower the adoption cost of public trans-port data in route-planning applications Specificallyin the case of live updates about the schedules is im-portant to maintain stable identifiers that remain validover time Here we use the Linked Data principles totransform schedule updates given in the GTFS-RT for-

22httpenable-corsorg23httpsdevelopermozillaorgen-USdocsWebHTTP

Conditional_requests24httpsdevelopermozillaorgen-USdocsWebHTTPCaching

mat to Linked Connections and we give the option toserialize them in JSON CSV or RDF (turtle N-Triplesor JSON-LD) format

The URI strategy to be used during the conversionprocess is given following the RFC 657025 specifica-tion for URI templates The parameters used to buildthe URIs are given following an object-like notation(objectvariable) where the left side references a CSVfile present in the provided GTFS data source and theright side references a specific column of such fileWe use the data from a reference GTFS data sourceto create the URIs as with only the data present in aGTFS-RT update may not be feasible to create persis-tent URIs The GFTS files that can be used to createthe URIs are routes and trips files As for the variablesany column that exists in those files can be referencedA simple example is shown in Figure 5 and an standardtemplate is also available26 Next we describe how arethe entities URIs build based on these templates

ndash stop A Linked Connection references two differ-ent stops (departure and arrival stop) The dataused to build these specific URIs comes directlyfrom the GTFS-RT update so we do not specifyany CSV file and header from the reference GTFSdata source The variable name chosen in our caseis the stop_id but it can be freely named

ndash route For the route identifier we relyon the routesroute_short_name and thetripstrip_short_name variables

ndash trip In the case of the trip we add theexpected departure time on top of theroute URI based on the associated connec-tiondepartureTime(YYYYMMDD) and theinformation about the delay

ndash connection For a connection identifier weresort to its departure stop with connec-tiondepartureStop its departure time withconnectiondepartureTime(YYYYMMDD)and the routesroute_short_name and thetripstrip_short_name In this case we referencea special entity we called connection whichcontains the related basic data that can beextracted from a GTFS-RT update for everyLinked Connection A connection entity containsthese parameters that can be used on the URIsdefinition connectiondepartureStop connec-

25httpstoolsietforghtmlrfc657026httpsgithubcomlinkedconnectionsgtfsrt2lcblobmaster

uris_template_examplejson

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 7

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 5 Connection example with live updates

Fig 6 The GTFSRT2LC tool

tionarrivalStop connectiondepartureTime andconnectionarrivalTime As both departureTimeand arrivalTime are date objects the expectedformat can be defined using brackets

Finally we developed a tool27 that analyses the in-formation from an static GTFS dataset and the updatesfrom a GTFS-RT feed exploits the implicit relationsamong the CSV files and creates the live Linked Con-nections feed in the desirable format with the help ofthe URIs template as it is shown in Figure 6

33 Managing live and historical data with theLinked Connections server

The third main contribution of this paper after de-velop a way to create Linked Connections feeds tak-ing into account live data is how to provide reliableaccess to historical and live transport data in an cost-efficiently way For that reason we develop a LinkedConnections server which exposes data fragments us-ing JSON-LD serialization format First the serveruses the GTFS2LC28 library to convert a GTFS datasetinto a time sorted directed acyclic graph of connec-tions After that the fragmenting process starts using

27httpsgithubcomlinkedconnectionsgtfsrt2lc28httpsgithubcomlinkedconnectionsgtfs2lc

the information provided in the configuration of theserver about the fragment size Once the fragmentationprocess is completed it starts using the GTFSRT2LClibrary for mapping real time updates into a LinkedConnections feed as we describe in the above section

As we aforementioned the previous process forfragmenting LC was based on a predefined time spanwhich used this variable for the pagination of the dataand created an heterogeneous environment in terms ofthe fragment size In other words setting a fixed timewindow for the creation of the fragments (eg 10 min-utes) could lead to the creation of fragments contain-ing a high number of connections specially in the rushhours and thus being big fragments in terms of sizeAt the same time smaller fragments could also be cre-ated at times where there are few or no vehicles de-parting This is the main reason why we start to frag-menting the historical data of the LC feed homoge-neously providing a specific fragment size this has arelevant effect in terms of performance as we demon-strate in the Section 5 As for the LC live feed wekeep an unique timeline of connection updates that isstored and fragmented using a predefined time span incontrast to static timetable updates which may containa complete re-write or partially overlap with previousversions from a time perspective

The server is able to manage both the historical andthe live connections First the server create the LinkedConnections from the information provided by a GTFSdataset and fragment the historical data into severalfragments After that if the transport system has anopen live service the sever starts to process the updatesfrom the GTFS-RT feed and create another streamof connections based on that information Then theserver searches the correspondence between the his-torical and live connections based on the URIs of thetrips and keeps an registry over time of the historicalevolution of the connections regarding delays andorcancellations Finally the server exposes the stream ofhistorical and live connections on an API following thefragmentation approach This process shown in Fig-ure 7 is repeated every time that the server process anupdate

The server also allows querying historic data bymeans of the Memento Framework [28] which enablestime-based content negotiation over HTTP By usingthe Accept-Datetime header a client can request thestate of a resource at a given moment If existing theserver will respond with a 302 Found containing theURI of the stored version of such resource So if a re-quest is made to obtain a Linked Connections fragment

8 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 7 The Linked Connections server

identified by a departure time the headers of the frag-ment will provide the information about the Accept-Datetime That means that is possible to know what isthe state of the delays at a given date-time for a set ofconnections enabling different kind of analysis of thebehaviour of a transport network that may contributeon the improvement of its design This approach alsoguarantees the coherence of the data used to answer agive query ie when a query is issued by a client itstarts to download data fragments of relevant connec-tions to form a MST that leads to an appropriate an-swer but if a new update is processed by the serverwhile the client is still processing the query some ofthe connections that the client has already processedmay reappear to the client in later fragments due toupdated delays which will render invalid the MSTcreated so far and may lead to incoherent answersHowever by using Memento and including a Accept-Datetime header in the requests made by clients theserver will guarantee that all the provided responseswill belong to the same moment and will lead to validanswers This mechanism is further described in [6]

34 A running example Providing reliable access toNMBS historical and real time data

The NMBS is the national company of trains in Bel-gium They provide their static and live data as opendata in GTFS and GTFS-RT formats We use this caseto explain how our approach works

As we incorporate the GTFS2LC and GTFSRT2LCtools to the Linked Connections server we only have toconfigure it to start producing the fragments of live andhistorical data The basic configuration for the GTFSdataset is shown in Figure 8 where the companyNamemust be a unique identifier of the dataset the down-loadUrl must be a public URL where the server has

Fig 8 NMBS GTFS configuration

Fig 9 NMBS GTFS-RT configuration

to download the GTFS dataset the downloadOnLanchboolean indicates if the dataset must be downloadedand processed upon server launch the updatePeriod isa node-cron29 expression to check if a new version ofthe dataset has been published for creating a new LCfeed and finally the fragmentSize represents the sizeof each fragment in bytes

For the GTFS-RT feed the configuration is showin the Figure 9 where the downloadUrl is the URLwhere the updates are published the updatePeriod is anode-cron expression that indicates how often shouldthe server look for and process a new version of thefeed the fragmentTimeSpan defines the fragmentationof live data and it represents the time span of everyfragment in seconds and finally the compressionPeriodis also a node-cron expression that defines how oftenwill the live data be compressed using gzip in order toreduce storage consumption

Finally we have to configure the URIs using thetemplate we describe in the Section 32 and launch theserver After the GTFS datasets have been processedwe are able to start querying the data All the frag-ments have a uniform starting point which is the ex-pected departure time of its first lcconnectionSo the clients can start accessing the transport datausing using the departure time as a parameter likethis example httphostportNMBSconnectionsdepartureTime=2017-08-11T164500000Z In thisexample we use a client implementing the CSA routeplanning algorithm to perform a query The clientsource code is available at github30 The parameters ofthe query are as follows

29httpswwwnpmjscompackagenode-cron30httpsgithubcomzoeparmanconnectionscan

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 9

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Departure Stop httpirailbestationsNMBS008811189

ndash Arrival Stop httpirailbestationsNMBS008821246

ndash Departure Time 2018-06-14T150000000Z

The client starts by sending a request to the serverusing the departure time defined in the query as a pa-rameter for the fragments query URL Once the serverreceives the request it will determine which staticfragment contains connections relevant for the query Itdoes this through a binary search looking into the firstconnection of every fragment until it finds a fragmentFk whose first connectionrsquos departure time is greaterthan the requested value and a fragment Fk-1 whosefirst connectionrsquos departure time is less or equal to therequested value determining then that Fk-1 is the frag-ment to be returned in the response to client After thisthe server checks if there are live updates for the con-nections contained in the selected fragment and if soit proceeds to update the departure and arrival times ofthe connections according to the reported delays Thisprocess may involve the inclusion of new connectionsthat originally belonged to previous fragments but dueto the delays are now relevant for the query In thesame way some connections may have to be removedas their delays make them belong to further fragmentsnow Once this is done the connections are resorted bytheir new departure times to ensure a correct executionof the CSA algorithm on the client side Finally theserver adds the necessary metadata (as shown on fig-ure 4) and gives back the response using the JSON-LDformat

Once the client receives the first response it startslooking for the first connection with a departure stopequals to the one defined in the query When it findsone the algorithm starts building a MST where eachbranch represent a different vehicle departing from thespecified departure stop at a different time The algo-rithm process the connections and requests new frag-ments for it and if such a route exists within the trans-port network eventually it will complete at least one ofthe branches of the MST It will keep processing con-nections until the departure time of the next connec-tions to be processed is greater than the arrival time ofthe fastest found route as this means that is no longerpossible to find a faster route Finally the client will re-port the found routes in a JSON object as seen in figure10

Fig 10 CSA result for a query over the NMBS Linked Connectionsstream

4 Evaluation Design

We design a set of experiments in order to evaluateour approach In this section we describe the main fea-tures of our developments the data used for the exper-iments and the testbed we created We implement dif-ferent components in JavaScript for the Nodejs plat-form We chose JavaScript as it allows us to use thecomponents both on a command-line environment aswell as in the browser which is ideal for client sideapplications

ndash GTFSRT2LC A tool to convert live updates andtimetables as open data to Linked ConnectionsVocabulary

ndash LC server Publishes streams of connections inJSON-LD as we explain in the Section 3

ndash CSA algorithm We developed the CSA algorithmon the top of the Linked Connections Frameworkas the client

We also use other tools that we developed previ-ously as the GTFS2LC library that it is able to convertstatic timetables as open data to Linked ConnectionsVocabulary and the Memento Protocol on the top ofthe Linked Connections Server [6] to enable reliableaccess to historical data as we describe in the Section3

These tools are combined in different set-ups with aroute planning algorithm in the client-side and are usedin the following scenarios

10 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Static data fragmentation by size without cacheThe first experiment executes the queries tak-ing into account only timetable static data wherethe size of the exposed fragments are always thesame Client caching is disabled making this sim-ulate the LC without cache set-up where everyrequest could originate from an end-userrsquos deviceWe test different fragments size to test how it af-fects the query answering performance

ndash Static data fragmentation by size with cache Thesecond experiment does the same as the first ex-cept that it uses a client side cache and simulatesthe LC with cache set-up

ndash Historical data fragmentation by size withoutcache The third experiment does the same as thefirst except that it uses the Memento protocol toensure data coherence As the Memento protocolinvolves HTTP redirections to access specific re-sources at different points in time this experimentis meant to measure its impact on query answer-ing performance

ndash Historical data fragmentation by size with cacheThe fourth experiment does the same as the thirdexcept that it uses a client side cache and simu-lates the LC with cache set-up

ndash Live and historical data fragmentation by sizewithout cache The fifth experiment does thesame as the third but at this tame live data is alsotake into account modifying the fragments sizedue to the delays We also test different fragmentssizes

ndash Live and historical data fragmentation by sizewith cache The sixth experiment does the sameas the fifth except that it uses a client side cacheand simulates the LC with cache set-up

ndash Live and historical data fragmentation by timewithout cache The seventh experiment fragmentsthe data based on time as our previous approacheswith the client cache disable We want to test ifthe performance of our pagination by the size isbetter than by time

ndash Live and historical data fragmentation by timewith cache The eighth does the same as the sev-enth but it uses the client side cache

As far as we are aware of there are no general-purpose benchmarks in the state of the art for testingour main contributions with the Linked Connectionsframework Therefore we have created a testbed asfollows 1) we have selected and downloaded 4 dif-ferent representative GTFS datasets available as open

Dataset Stops Routes Trips ConnectionsTBS Barcelona 27 3 5186 1957354

MLM Madrid 81 4 2872 3293575

NMBS Belgium 2615 479 12621 6941813

De Lijn Flanders 36050 1412 305079 63006596Table 1

Characteristics of the transport networks datasets used for thebenchmarks

data that also cover several geographical locations 2)we have created several route planning queries (depar-ture and arrival stop with a departure time) that reflectreal-life trips with multiple levels of complexity and3) we have evaluated these queries in the different setups taking into account the features of each datasetThe query set created for each GTFS dataset was de-fined by randomly taking departure stops and arrivalstops with a random departure time and using the CSAalgorithm to find a route between them This processwas repeated for each transport network and spanninga 24 hours time frame of a regular week day In the endthis gave us a set of queries for each network with dif-ferent levels of complexity measured in terms of theamount of connections that need to be processed bya client to obtain a valid answer The complexity ofthe queries is directly related to the complexity of thetransport network itself Meaning that the amount ofconnections that need to be processed in a query in-crease as the amount of stops and vehicles of the net-work also increase Table 1 shows the characteristics(amount of stops routes and trips) of each transportnetwork and table 2 portrays the least and the mostcomplex queries used during the benchmarks for eachtransport network

The used GTFS datasets and GTFS-RT feeds areavailable as open data on the Web There are twodatasets that represent the Belgian Rail Transport Sys-tem (NMBS) and the Flemish Transport Company (DeLijn) and other two that represent the Tram Companyof Barcelona (TBS) and the Tram Company of Madrid(MLM) NMBS also provides open access to its real-time updates following the GTFS-RT format thereforewe use this data to carry out the evaluation that takesinto account live updates TBS have also developedan ad-hoc live API31 and we are currently developinga tool to transform the ad-hoc live information to theGTFS-RT format but at this moment only historicaldata can be taken into account in this case De Lijn

31httpsopendatatramcatmanual_enpdf

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 11

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Query Dataset of ConnectionsdepStop httpirailbestationsNMBS008841608

NMBS 3371arrStop httpirailbestationsNMBS008841319

depTime 2018-06-14T170000000Z

depStop httpirailbestationsNMBS008886504

NMBS 13797arrStop httpirailbestationsNMBS008841004

depTime 2018-06-14T140000000Z

depStop httpsdatadelijnbestops120281

De Lijn 34696arrStop httpsdatadelijnbestops126536

depTime 2018-06-07T130000000Z

depStop httpsdatadelijnbestops112358

De Lijn 114059arrStop httpsdatadelijnbestops111446

depTime 2018-06-07T130000000Z

depStop httpsbarcelonatbsesstops22

TBS 41arrStop httpsbarcelonatbsesstops21

depTime 2018-06-07T200000000Z

depStop httpsbarcelonatbsesstops14

TBS 556arrStop httpsbarcelonatbsesstops10

depTime 2018-06-07T050000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 148arrStop httpsmadridtramesstopspar_10_32

depTime 2018-06-07T210000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 799arrStop httpsmadridtramesstopspar_10_26

depTime 2018-06-07T160000000Z

Table 2Least and most complex queries of the LC testbed created for everybechmarked transport network

and MLM does not provide access to their live updateswhich is why we only perform static and historicalevaluations over this transport datasets

We ran the experiments on a laptop with an IntelCore i5-7440HQ 28GHz x 4 and 16GB of RAMEach query is executed at least 2 times and we takethe arithmetic average response time Our experimentscan be reproduced using the code at httpsgithubcomjulianrojas87linked-connections-server and thetestbed and the data at httpsgithubcomcef-oasislinkedconnections-tests

5 Results

Next we present the obtained results for each of thedescribed scenarios in the above section

51 Static data with fragmentation by size

These results are obtained from the evaluation madeover Linked Connection stream derived from GTFS

datasets of each transport network The queries exe-cuted on this evaluation do not follow the Mementoprotocol We present the results for both the evaluationwith and without an active client-side cache coveringthe first and second experiment described above

511 NMBSFor NMBS we created fragmentations of 10KB

50KB 300KB 500KB 1MB 3MB 7MB and 10MBWe used the same query set for both cache and nocache evaluations

Fig 11 NMBS static evaluation without cache

Figure 11 shows the response time distribution foreach fragmentation in a set up without a cache Herewe have that for a fragmentation of 300KB we havethe best performance with a median response time of1196ms for the used query set

Fig 12 NMBS static evaluation with cache

Figure 12 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 704ms

12 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

512 TBS BarcelonaFor TBS we used fragmentations going from 10KB

to 3MB and we used the same query set for both cacheand no cache set ups

Fig 13 TBS static evaluation without cache

Figure 13 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of75ms

Fig 14 TBS static evaluation with cache

Figure 14 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 10KB and a medianresponse time of 28ms

513 MLMFor MLM we used fragmentations going from

10KB to 3MB and we used the same query set for bothcache and no cache set ups

Fig 15 MLM static evaluation without cache

Figure 15 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of128ms

Fig 16 MLM static evaluation with cache

Figure 16 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 57ms

514 De LijnFor De Lijn the fragmentations created go from

500KB to 15MB The query set used for the bench-marks is the same in both active and inactive cache setups

Figure 17 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained witha fragmentation of 500KB and a median response timeof 33863ms

Figure 18 shows the response time distribution foreach fragmentation in a set up with an active cache on

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 3: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 3

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

supports our work Finally we analyse the most rele-vant algorithms for route planning

One of the most well-known alternatives to publishdata on the Web is Linked Data [1] Linked Data al-lows to identify in an unique way resources on theWeb using identifiers or HTTP URIs It is a methodto distribute and scale data over large organizationssuch as the Web When looking up this identifier byusing the HTTP protocol or using a Web browser adefinition must be returned including links towardsother related resources a practice called dereferencingThe triple format to be used in combination with URIsis standardized within RDF The URIs used for thesetriples already existed in other data sources and wethus favoured using the same identifiers It is up to adata publisher to make a choice on which data sourcescan provide the identifiers for a certain type of entities

A common problem in Linked Data is the availabil-ity of the triple stores They provide a way to gettingdata using the SPARQL query language but at the mo-ment of queries involving long periods of time theseapproaches are not efficient [8] The Linked Data Frag-ments(LDF) [9 10] solve this issue fragmenting thedata in several HTTP documents Following this ap-proach the store moves the load from server side toclient side improving its availability Comunica [11]is a framework that extends the possibilities of LDFallowing to query other semantic interfaces as com-mon SPARQL endpoint RDF data dumps or HDTdatasets [12]

Linked Connections [5] applies this approach to de-velop a cost-efficient solution based on a HTTP inter-face for transport data The main assumption of LC isthat the relevant data for route planners can be basedon the connection concept Basically as the LC vo-cabulary7 defines it a connection describes a depar-ture at a certain stop and an arrival at a different stopwith their corresponding departure and arrival timesand without any intermediate stop A basic implemen-tation of LC is shown in Figure 2 where the routeplanning algorithms have to analyse the connections(small rectangles) through the pages (big rectangles)and across time to find the expected route The joinbetween the connections is possible because same re-sources have same identifiers based on one of the prin-ciples of Linked Data The Hydra Ontology [13] isused to specify the next and previous page links as wellas how the resource itself should be discovered

7httpsemwebmmlabbenslinkedconnections

Fig 2 Linked Connections implementation

It also relevant to describe the previous works thathas been carried out using the specification of LinkedConnections For example [14] describes and anal-yses the behaviour of a basic transport API and theLinked Connections framework for public transit routeplanning comparing the CPU and query executiontime The authors found that at the expense of ahigher bandwidth consumption more queries can beanswered using LC than the origin-destination APIIn [15] studies the impact of taking into account userpreferences in a public transit route planning addingthat features both on server and client and comparingthe two solution on query execution time cache per-formance and CPU usage on both sides A first stepfor providing reliable access to historical and live datausing Linked Connections is described in [6] where amechanism is introduced to tackle the problem of themanagement of data modifications when live data is in-volved in route planning queries Tripscore8 a LinkedData client that consume several Linked Connectionsservers with live and historical data is also described in[16] where the Connection Scan Algorithm (CSA) isimplemented as the route planning algorithm in top ofthe client [17] Tripscore add multiple user preferencesat the client side to provide an score for each possibleroute

All the solutions that we aforementioned rely onthe de-facto standard for representing public transportdata the General Transit Feed Specification or GTFSThis model and its extension for real-time (GTFS-RT)is used by Google Maps9 since 2005 but also by otherroute planners like Open Trip Planner 10 or Navitaio11It is also the most common model used by the trans-port companies to expose their data on open data por-tals like for example the Consorcio General de Trans-portes de Madrid12 the TRAM in Barcelona13 or the

8wwwtripscoreeu9httpmapsgooglees10httpwwwopentripplannerorg11httpswwwnavitiaio12httpdatoscrtmes13httpsopendatatramcat

4 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Belgium National Train System (NMBS) GTFS de-fines the headers of 13 types of CSV files and a set ofrules the must be take into account when the datasetis created Each file as well as their headers can bemandatory or optional and have relations among themas show in Figure 3 Linked Connections is getting thenecessary information from a subset of the full dataset

ndash stops Individual locations where vehicles pick upor drop off passengers

ndash calendar Dates for service IDs using a weeklyschedule Specify when service starts and ends aswell as days of the week where service is avail-able

ndash calendar_dates Exceptions for the service IDsdefined in the calendar file

ndash stop_times Times that a vehicle arrives at and de-parts from individual stops for each trip

ndash trips Trips for each route A trip is a sequence oftwo or more stops that occurs at specific time

ndash routes Transit routes A route is a group of tripsthat are displayed to riders as a single service

ndash transfers Rules for making connections at trans-fer points between routes

In order to link the terms and identifiers defined inthese files with the Linked Open Data cloud we usedthe Linked GTFS14 vocabulary We create mappingsable to transform GTFS files to Linked GTFS follow-ing the CSV2RDF [18] W3C recommendation15 butalso using other standard OBDA mapping languagesthat are able to deal with CSV files16 like RML [19]or R2RML [20]

The extension of GTFS for real-time GTFS-RT17 isa feed specification that allows public transport agen-cies to provide real time updates about their fleetThe specification supports three types of information(i) trips updates like delays cancellations or changeroutes (ii) service alerts like stop moved unforeseenevents affecting stations routes etc and (iii) informa-tion of vehicle positions including location and con-gestion level The data exchange format is based onProtocol Buffers18

It is also important to mention the works about routeplanning algorithms that can be developed on the topof the Linked Connections framework The problem

14httpvocabgtfsorgterms15httpsgithubcomOpenTransportgtfs-csv2rdf16httpsgithubcomdachafragtfsmappings17httpsdevelopersgooglecomtransitgtfs-realtime18httpsdevelopersgooglecomprotocol-buffers

Fig 3 The GTFS model and its primary relations

that these algorithms has to solve using the data isthe Earliest Arrival Time (EAT) An EAT query con-sists of a departure stop a departure time and a desti-nation stop The main goal is to find the fastest jour-ney to the destination stop starting from the depar-ture stop at the departure time There are more com-plex route planning questions like the Minimum Ex-pected Arrival Time (MEAT) [17] or multi-criteriaprofile queries [21ndash23] The Connection Scan Algo-rithm (CSA) [17] is an approach that models thetimetable data as a directed acyclic graph [24] usinga stream of connections As we defined before a con-nection is a combination of a departure stop (cdepstop)with a departure time (cdeptime) and an arrival stop(carrstop) with an arrival time (carrtime) All the connec-tions are combined in a stream of connections sortedby increasing departure time Thanks to this feature itis sufficient to only consider the connections where thecdeptime is equals to or later than the desirable depar-ture time The CSA algorithm works as follows whena new scanned connection leads to a faster route to thearrival stop the minimum spanning tree (MST) will beupdated This is the case when carrtime is earlier thanthe actual EAT at carrstop Finally the algorithm endswhen the destination stop is added to the MST and theresulting journey can be obtained by following the pathin the MST backwards

In summary after some proofs of concepts devel-oped taking into account live and historical data in theLC framework and the current advances in the state ofthe work exposing the data on the Web efficiently we

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 5

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

think that it is the moment to improve the features ofthe current LC framework to create a standard mecha-nism for publishing reliable public transport data Themain motivations to do that are on one hand the ne-cessity to improve the current access of the transportinformation services to the data where at the momentthat real-time is involved the solutions are basicallyad-hoc and they are not feasible if we are based onthe proposal of the EU Commission for publish pub-lic transport data On the other hand supported bythe characteristics of the transport data where a com-plex data management environment emerges the newapproach that we present in this paper can serve asa source of inspiration for historical and live LinkedData management on the Web in other domains

3 The Linked Connections Framework

In this section we describe the Linked Connectionsframework for providing reliable access to live andhistorical data First we describe a summary of theLinked Connections specification including new prop-erties about features of live data Second we describethe extensions we develop to deal with these types ofdata the linked connections library for transforminglive feeds into connections and the LC server for man-aging and exposing that connections on the Web effi-ciently

31 The Linked Connections specification

The LC specification19 explains how to implement adata publishing sever and explains what you can relyon when writing a route planning client Following theLinked Data principles and the REST constraints wemake sure that from any HTTP response hypermediacontrols can be followed to discover the rest of thedataset A Linked Connections graph is a paged col-lection of connections describing the time transit ve-hicles leave and arrive The Linked Connections vo-cabulary20 defines the basic properties used within thelcConnection class

ndash lcdepartureTime It is a date-time includ-ing delay at which the vehicle will leave for thelcarrivalStop

ndash lcdepartureStop The departure stop URI

19httpslinkedconnectionsorgspecification20httpsemwebmmlabbenslinkedconnections

ndash lcdepartureDelay Provides time in sec-onds when the lcdepartureTime is not theplanned departureTime

ndash lcarrivalTime It is a date-time includ-ing delay at which the vehicle will arrives atlcarrivalStop

ndash lcarrivalStop The arrival stop URIndash lcarrivalDelay Provides time in second

when the lcarrivalTime is not the plannedarrivalTime

We also reuse the terms from the Linked GTFS vo-cabulary21 in order to describe other properties of alcConnection class This vocabulary it is alignedwith the de-facto standard to exchange public transportdata on the Web GTFS

ndash gtfstrip Must be set to link a gtfsTripwith a lcConnection identifying whetheranother connection is part of the current trip of avehicle

ndash gtfspickupType Should be set to indi-cate whether people can be picked up at thisstop The possible values gtfsRegulargtfsNotAvailable gtfsMustPhoneand gtfsMustCoordinateWithDriver

ndash gtfsdropOffType Should be set to indicatewhether people can be dropped off at this stopThe possible values are the same as the defined inthe gtfspickupType property

It is important to remark that each Linked Connec-tions page must contain some metadata about itselfThe URL after redirection or the one indicated by theLocation HTTP header therefore must occur in thetriples of the response The current page must containthe hypermedia controls to discover under what condi-tions the data can be legally reused and must containthe hypermedia controls to understand how to navigatethrough the paged collection(s) Different ways existto implement the paging strategy At least one of thestrategies must be implemented for a Linked Connec-tions client to find the next pages to be processed

1 Each response must contain a hydranext andhydraprevious page link

2 Each response should contain ahydrasearch describing that you cansearch for a specific time and that the clientwill be redirected to a page containing in-

21httpvocabgtfsorgterms

6 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 4 Metadata LC response

formation about that timestamp To describethis functionality the hydrapropertylcdepartureTimeQuery is used Anexample is shown in the Figure 4

Finally for each document that is published by aLinked Connections server a Cross Origin ResourceSharing22 HTTP header must set the property Access-Control-Allow-Origin for sharing the response withany origin The server also should implement thor-ough caching strategies as the cachability is one of thebiggest advantages of the Linked Connections frame-work Both conditional requests23 as regular caching24

are recommended As every connection will need tohave a unique an persistent identifier an HTTP URIthe server must support at least one RDF11 formatsuch as JSON-LD [25] TriG [26] or N-Quads [27]The connection URI should follow the Linked Dataprinciples [1]

32 Live data in Linked Connections

As we aforementioned the transport domain shouldinvolve in their route planning algorithms the live datato provide consistent information to the passengers andimprove the current informational services The de-facto standard for exchange transport data on the WebGTFS has created a version for live updates GTFS-RT Providing globally unique identifiers to the differ-ent entities that comprise a public transport network isfundamental to lower the adoption cost of public trans-port data in route-planning applications Specificallyin the case of live updates about the schedules is im-portant to maintain stable identifiers that remain validover time Here we use the Linked Data principles totransform schedule updates given in the GTFS-RT for-

22httpenable-corsorg23httpsdevelopermozillaorgen-USdocsWebHTTP

Conditional_requests24httpsdevelopermozillaorgen-USdocsWebHTTPCaching

mat to Linked Connections and we give the option toserialize them in JSON CSV or RDF (turtle N-Triplesor JSON-LD) format

The URI strategy to be used during the conversionprocess is given following the RFC 657025 specifica-tion for URI templates The parameters used to buildthe URIs are given following an object-like notation(objectvariable) where the left side references a CSVfile present in the provided GTFS data source and theright side references a specific column of such fileWe use the data from a reference GTFS data sourceto create the URIs as with only the data present in aGTFS-RT update may not be feasible to create persis-tent URIs The GFTS files that can be used to createthe URIs are routes and trips files As for the variablesany column that exists in those files can be referencedA simple example is shown in Figure 5 and an standardtemplate is also available26 Next we describe how arethe entities URIs build based on these templates

ndash stop A Linked Connection references two differ-ent stops (departure and arrival stop) The dataused to build these specific URIs comes directlyfrom the GTFS-RT update so we do not specifyany CSV file and header from the reference GTFSdata source The variable name chosen in our caseis the stop_id but it can be freely named

ndash route For the route identifier we relyon the routesroute_short_name and thetripstrip_short_name variables

ndash trip In the case of the trip we add theexpected departure time on top of theroute URI based on the associated connec-tiondepartureTime(YYYYMMDD) and theinformation about the delay

ndash connection For a connection identifier weresort to its departure stop with connec-tiondepartureStop its departure time withconnectiondepartureTime(YYYYMMDD)and the routesroute_short_name and thetripstrip_short_name In this case we referencea special entity we called connection whichcontains the related basic data that can beextracted from a GTFS-RT update for everyLinked Connection A connection entity containsthese parameters that can be used on the URIsdefinition connectiondepartureStop connec-

25httpstoolsietforghtmlrfc657026httpsgithubcomlinkedconnectionsgtfsrt2lcblobmaster

uris_template_examplejson

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 7

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 5 Connection example with live updates

Fig 6 The GTFSRT2LC tool

tionarrivalStop connectiondepartureTime andconnectionarrivalTime As both departureTimeand arrivalTime are date objects the expectedformat can be defined using brackets

Finally we developed a tool27 that analyses the in-formation from an static GTFS dataset and the updatesfrom a GTFS-RT feed exploits the implicit relationsamong the CSV files and creates the live Linked Con-nections feed in the desirable format with the help ofthe URIs template as it is shown in Figure 6

33 Managing live and historical data with theLinked Connections server

The third main contribution of this paper after de-velop a way to create Linked Connections feeds tak-ing into account live data is how to provide reliableaccess to historical and live transport data in an cost-efficiently way For that reason we develop a LinkedConnections server which exposes data fragments us-ing JSON-LD serialization format First the serveruses the GTFS2LC28 library to convert a GTFS datasetinto a time sorted directed acyclic graph of connec-tions After that the fragmenting process starts using

27httpsgithubcomlinkedconnectionsgtfsrt2lc28httpsgithubcomlinkedconnectionsgtfs2lc

the information provided in the configuration of theserver about the fragment size Once the fragmentationprocess is completed it starts using the GTFSRT2LClibrary for mapping real time updates into a LinkedConnections feed as we describe in the above section

As we aforementioned the previous process forfragmenting LC was based on a predefined time spanwhich used this variable for the pagination of the dataand created an heterogeneous environment in terms ofthe fragment size In other words setting a fixed timewindow for the creation of the fragments (eg 10 min-utes) could lead to the creation of fragments contain-ing a high number of connections specially in the rushhours and thus being big fragments in terms of sizeAt the same time smaller fragments could also be cre-ated at times where there are few or no vehicles de-parting This is the main reason why we start to frag-menting the historical data of the LC feed homoge-neously providing a specific fragment size this has arelevant effect in terms of performance as we demon-strate in the Section 5 As for the LC live feed wekeep an unique timeline of connection updates that isstored and fragmented using a predefined time span incontrast to static timetable updates which may containa complete re-write or partially overlap with previousversions from a time perspective

The server is able to manage both the historical andthe live connections First the server create the LinkedConnections from the information provided by a GTFSdataset and fragment the historical data into severalfragments After that if the transport system has anopen live service the sever starts to process the updatesfrom the GTFS-RT feed and create another streamof connections based on that information Then theserver searches the correspondence between the his-torical and live connections based on the URIs of thetrips and keeps an registry over time of the historicalevolution of the connections regarding delays andorcancellations Finally the server exposes the stream ofhistorical and live connections on an API following thefragmentation approach This process shown in Fig-ure 7 is repeated every time that the server process anupdate

The server also allows querying historic data bymeans of the Memento Framework [28] which enablestime-based content negotiation over HTTP By usingthe Accept-Datetime header a client can request thestate of a resource at a given moment If existing theserver will respond with a 302 Found containing theURI of the stored version of such resource So if a re-quest is made to obtain a Linked Connections fragment

8 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 7 The Linked Connections server

identified by a departure time the headers of the frag-ment will provide the information about the Accept-Datetime That means that is possible to know what isthe state of the delays at a given date-time for a set ofconnections enabling different kind of analysis of thebehaviour of a transport network that may contributeon the improvement of its design This approach alsoguarantees the coherence of the data used to answer agive query ie when a query is issued by a client itstarts to download data fragments of relevant connec-tions to form a MST that leads to an appropriate an-swer but if a new update is processed by the serverwhile the client is still processing the query some ofthe connections that the client has already processedmay reappear to the client in later fragments due toupdated delays which will render invalid the MSTcreated so far and may lead to incoherent answersHowever by using Memento and including a Accept-Datetime header in the requests made by clients theserver will guarantee that all the provided responseswill belong to the same moment and will lead to validanswers This mechanism is further described in [6]

34 A running example Providing reliable access toNMBS historical and real time data

The NMBS is the national company of trains in Bel-gium They provide their static and live data as opendata in GTFS and GTFS-RT formats We use this caseto explain how our approach works

As we incorporate the GTFS2LC and GTFSRT2LCtools to the Linked Connections server we only have toconfigure it to start producing the fragments of live andhistorical data The basic configuration for the GTFSdataset is shown in Figure 8 where the companyNamemust be a unique identifier of the dataset the down-loadUrl must be a public URL where the server has

Fig 8 NMBS GTFS configuration

Fig 9 NMBS GTFS-RT configuration

to download the GTFS dataset the downloadOnLanchboolean indicates if the dataset must be downloadedand processed upon server launch the updatePeriod isa node-cron29 expression to check if a new version ofthe dataset has been published for creating a new LCfeed and finally the fragmentSize represents the sizeof each fragment in bytes

For the GTFS-RT feed the configuration is showin the Figure 9 where the downloadUrl is the URLwhere the updates are published the updatePeriod is anode-cron expression that indicates how often shouldthe server look for and process a new version of thefeed the fragmentTimeSpan defines the fragmentationof live data and it represents the time span of everyfragment in seconds and finally the compressionPeriodis also a node-cron expression that defines how oftenwill the live data be compressed using gzip in order toreduce storage consumption

Finally we have to configure the URIs using thetemplate we describe in the Section 32 and launch theserver After the GTFS datasets have been processedwe are able to start querying the data All the frag-ments have a uniform starting point which is the ex-pected departure time of its first lcconnectionSo the clients can start accessing the transport datausing using the departure time as a parameter likethis example httphostportNMBSconnectionsdepartureTime=2017-08-11T164500000Z In thisexample we use a client implementing the CSA routeplanning algorithm to perform a query The clientsource code is available at github30 The parameters ofthe query are as follows

29httpswwwnpmjscompackagenode-cron30httpsgithubcomzoeparmanconnectionscan

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 9

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Departure Stop httpirailbestationsNMBS008811189

ndash Arrival Stop httpirailbestationsNMBS008821246

ndash Departure Time 2018-06-14T150000000Z

The client starts by sending a request to the serverusing the departure time defined in the query as a pa-rameter for the fragments query URL Once the serverreceives the request it will determine which staticfragment contains connections relevant for the query Itdoes this through a binary search looking into the firstconnection of every fragment until it finds a fragmentFk whose first connectionrsquos departure time is greaterthan the requested value and a fragment Fk-1 whosefirst connectionrsquos departure time is less or equal to therequested value determining then that Fk-1 is the frag-ment to be returned in the response to client After thisthe server checks if there are live updates for the con-nections contained in the selected fragment and if soit proceeds to update the departure and arrival times ofthe connections according to the reported delays Thisprocess may involve the inclusion of new connectionsthat originally belonged to previous fragments but dueto the delays are now relevant for the query In thesame way some connections may have to be removedas their delays make them belong to further fragmentsnow Once this is done the connections are resorted bytheir new departure times to ensure a correct executionof the CSA algorithm on the client side Finally theserver adds the necessary metadata (as shown on fig-ure 4) and gives back the response using the JSON-LDformat

Once the client receives the first response it startslooking for the first connection with a departure stopequals to the one defined in the query When it findsone the algorithm starts building a MST where eachbranch represent a different vehicle departing from thespecified departure stop at a different time The algo-rithm process the connections and requests new frag-ments for it and if such a route exists within the trans-port network eventually it will complete at least one ofthe branches of the MST It will keep processing con-nections until the departure time of the next connec-tions to be processed is greater than the arrival time ofthe fastest found route as this means that is no longerpossible to find a faster route Finally the client will re-port the found routes in a JSON object as seen in figure10

Fig 10 CSA result for a query over the NMBS Linked Connectionsstream

4 Evaluation Design

We design a set of experiments in order to evaluateour approach In this section we describe the main fea-tures of our developments the data used for the exper-iments and the testbed we created We implement dif-ferent components in JavaScript for the Nodejs plat-form We chose JavaScript as it allows us to use thecomponents both on a command-line environment aswell as in the browser which is ideal for client sideapplications

ndash GTFSRT2LC A tool to convert live updates andtimetables as open data to Linked ConnectionsVocabulary

ndash LC server Publishes streams of connections inJSON-LD as we explain in the Section 3

ndash CSA algorithm We developed the CSA algorithmon the top of the Linked Connections Frameworkas the client

We also use other tools that we developed previ-ously as the GTFS2LC library that it is able to convertstatic timetables as open data to Linked ConnectionsVocabulary and the Memento Protocol on the top ofthe Linked Connections Server [6] to enable reliableaccess to historical data as we describe in the Section3

These tools are combined in different set-ups with aroute planning algorithm in the client-side and are usedin the following scenarios

10 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Static data fragmentation by size without cacheThe first experiment executes the queries tak-ing into account only timetable static data wherethe size of the exposed fragments are always thesame Client caching is disabled making this sim-ulate the LC without cache set-up where everyrequest could originate from an end-userrsquos deviceWe test different fragments size to test how it af-fects the query answering performance

ndash Static data fragmentation by size with cache Thesecond experiment does the same as the first ex-cept that it uses a client side cache and simulatesthe LC with cache set-up

ndash Historical data fragmentation by size withoutcache The third experiment does the same as thefirst except that it uses the Memento protocol toensure data coherence As the Memento protocolinvolves HTTP redirections to access specific re-sources at different points in time this experimentis meant to measure its impact on query answer-ing performance

ndash Historical data fragmentation by size with cacheThe fourth experiment does the same as the thirdexcept that it uses a client side cache and simu-lates the LC with cache set-up

ndash Live and historical data fragmentation by sizewithout cache The fifth experiment does thesame as the third but at this tame live data is alsotake into account modifying the fragments sizedue to the delays We also test different fragmentssizes

ndash Live and historical data fragmentation by sizewith cache The sixth experiment does the sameas the fifth except that it uses a client side cacheand simulates the LC with cache set-up

ndash Live and historical data fragmentation by timewithout cache The seventh experiment fragmentsthe data based on time as our previous approacheswith the client cache disable We want to test ifthe performance of our pagination by the size isbetter than by time

ndash Live and historical data fragmentation by timewith cache The eighth does the same as the sev-enth but it uses the client side cache

As far as we are aware of there are no general-purpose benchmarks in the state of the art for testingour main contributions with the Linked Connectionsframework Therefore we have created a testbed asfollows 1) we have selected and downloaded 4 dif-ferent representative GTFS datasets available as open

Dataset Stops Routes Trips ConnectionsTBS Barcelona 27 3 5186 1957354

MLM Madrid 81 4 2872 3293575

NMBS Belgium 2615 479 12621 6941813

De Lijn Flanders 36050 1412 305079 63006596Table 1

Characteristics of the transport networks datasets used for thebenchmarks

data that also cover several geographical locations 2)we have created several route planning queries (depar-ture and arrival stop with a departure time) that reflectreal-life trips with multiple levels of complexity and3) we have evaluated these queries in the different setups taking into account the features of each datasetThe query set created for each GTFS dataset was de-fined by randomly taking departure stops and arrivalstops with a random departure time and using the CSAalgorithm to find a route between them This processwas repeated for each transport network and spanninga 24 hours time frame of a regular week day In the endthis gave us a set of queries for each network with dif-ferent levels of complexity measured in terms of theamount of connections that need to be processed bya client to obtain a valid answer The complexity ofthe queries is directly related to the complexity of thetransport network itself Meaning that the amount ofconnections that need to be processed in a query in-crease as the amount of stops and vehicles of the net-work also increase Table 1 shows the characteristics(amount of stops routes and trips) of each transportnetwork and table 2 portrays the least and the mostcomplex queries used during the benchmarks for eachtransport network

The used GTFS datasets and GTFS-RT feeds areavailable as open data on the Web There are twodatasets that represent the Belgian Rail Transport Sys-tem (NMBS) and the Flemish Transport Company (DeLijn) and other two that represent the Tram Companyof Barcelona (TBS) and the Tram Company of Madrid(MLM) NMBS also provides open access to its real-time updates following the GTFS-RT format thereforewe use this data to carry out the evaluation that takesinto account live updates TBS have also developedan ad-hoc live API31 and we are currently developinga tool to transform the ad-hoc live information to theGTFS-RT format but at this moment only historicaldata can be taken into account in this case De Lijn

31httpsopendatatramcatmanual_enpdf

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 11

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Query Dataset of ConnectionsdepStop httpirailbestationsNMBS008841608

NMBS 3371arrStop httpirailbestationsNMBS008841319

depTime 2018-06-14T170000000Z

depStop httpirailbestationsNMBS008886504

NMBS 13797arrStop httpirailbestationsNMBS008841004

depTime 2018-06-14T140000000Z

depStop httpsdatadelijnbestops120281

De Lijn 34696arrStop httpsdatadelijnbestops126536

depTime 2018-06-07T130000000Z

depStop httpsdatadelijnbestops112358

De Lijn 114059arrStop httpsdatadelijnbestops111446

depTime 2018-06-07T130000000Z

depStop httpsbarcelonatbsesstops22

TBS 41arrStop httpsbarcelonatbsesstops21

depTime 2018-06-07T200000000Z

depStop httpsbarcelonatbsesstops14

TBS 556arrStop httpsbarcelonatbsesstops10

depTime 2018-06-07T050000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 148arrStop httpsmadridtramesstopspar_10_32

depTime 2018-06-07T210000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 799arrStop httpsmadridtramesstopspar_10_26

depTime 2018-06-07T160000000Z

Table 2Least and most complex queries of the LC testbed created for everybechmarked transport network

and MLM does not provide access to their live updateswhich is why we only perform static and historicalevaluations over this transport datasets

We ran the experiments on a laptop with an IntelCore i5-7440HQ 28GHz x 4 and 16GB of RAMEach query is executed at least 2 times and we takethe arithmetic average response time Our experimentscan be reproduced using the code at httpsgithubcomjulianrojas87linked-connections-server and thetestbed and the data at httpsgithubcomcef-oasislinkedconnections-tests

5 Results

Next we present the obtained results for each of thedescribed scenarios in the above section

51 Static data with fragmentation by size

These results are obtained from the evaluation madeover Linked Connection stream derived from GTFS

datasets of each transport network The queries exe-cuted on this evaluation do not follow the Mementoprotocol We present the results for both the evaluationwith and without an active client-side cache coveringthe first and second experiment described above

511 NMBSFor NMBS we created fragmentations of 10KB

50KB 300KB 500KB 1MB 3MB 7MB and 10MBWe used the same query set for both cache and nocache evaluations

Fig 11 NMBS static evaluation without cache

Figure 11 shows the response time distribution foreach fragmentation in a set up without a cache Herewe have that for a fragmentation of 300KB we havethe best performance with a median response time of1196ms for the used query set

Fig 12 NMBS static evaluation with cache

Figure 12 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 704ms

12 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

512 TBS BarcelonaFor TBS we used fragmentations going from 10KB

to 3MB and we used the same query set for both cacheand no cache set ups

Fig 13 TBS static evaluation without cache

Figure 13 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of75ms

Fig 14 TBS static evaluation with cache

Figure 14 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 10KB and a medianresponse time of 28ms

513 MLMFor MLM we used fragmentations going from

10KB to 3MB and we used the same query set for bothcache and no cache set ups

Fig 15 MLM static evaluation without cache

Figure 15 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of128ms

Fig 16 MLM static evaluation with cache

Figure 16 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 57ms

514 De LijnFor De Lijn the fragmentations created go from

500KB to 15MB The query set used for the bench-marks is the same in both active and inactive cache setups

Figure 17 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained witha fragmentation of 500KB and a median response timeof 33863ms

Figure 18 shows the response time distribution foreach fragmentation in a set up with an active cache on

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 4: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

4 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Belgium National Train System (NMBS) GTFS de-fines the headers of 13 types of CSV files and a set ofrules the must be take into account when the datasetis created Each file as well as their headers can bemandatory or optional and have relations among themas show in Figure 3 Linked Connections is getting thenecessary information from a subset of the full dataset

ndash stops Individual locations where vehicles pick upor drop off passengers

ndash calendar Dates for service IDs using a weeklyschedule Specify when service starts and ends aswell as days of the week where service is avail-able

ndash calendar_dates Exceptions for the service IDsdefined in the calendar file

ndash stop_times Times that a vehicle arrives at and de-parts from individual stops for each trip

ndash trips Trips for each route A trip is a sequence oftwo or more stops that occurs at specific time

ndash routes Transit routes A route is a group of tripsthat are displayed to riders as a single service

ndash transfers Rules for making connections at trans-fer points between routes

In order to link the terms and identifiers defined inthese files with the Linked Open Data cloud we usedthe Linked GTFS14 vocabulary We create mappingsable to transform GTFS files to Linked GTFS follow-ing the CSV2RDF [18] W3C recommendation15 butalso using other standard OBDA mapping languagesthat are able to deal with CSV files16 like RML [19]or R2RML [20]

The extension of GTFS for real-time GTFS-RT17 isa feed specification that allows public transport agen-cies to provide real time updates about their fleetThe specification supports three types of information(i) trips updates like delays cancellations or changeroutes (ii) service alerts like stop moved unforeseenevents affecting stations routes etc and (iii) informa-tion of vehicle positions including location and con-gestion level The data exchange format is based onProtocol Buffers18

It is also important to mention the works about routeplanning algorithms that can be developed on the topof the Linked Connections framework The problem

14httpvocabgtfsorgterms15httpsgithubcomOpenTransportgtfs-csv2rdf16httpsgithubcomdachafragtfsmappings17httpsdevelopersgooglecomtransitgtfs-realtime18httpsdevelopersgooglecomprotocol-buffers

Fig 3 The GTFS model and its primary relations

that these algorithms has to solve using the data isthe Earliest Arrival Time (EAT) An EAT query con-sists of a departure stop a departure time and a desti-nation stop The main goal is to find the fastest jour-ney to the destination stop starting from the depar-ture stop at the departure time There are more com-plex route planning questions like the Minimum Ex-pected Arrival Time (MEAT) [17] or multi-criteriaprofile queries [21ndash23] The Connection Scan Algo-rithm (CSA) [17] is an approach that models thetimetable data as a directed acyclic graph [24] usinga stream of connections As we defined before a con-nection is a combination of a departure stop (cdepstop)with a departure time (cdeptime) and an arrival stop(carrstop) with an arrival time (carrtime) All the connec-tions are combined in a stream of connections sortedby increasing departure time Thanks to this feature itis sufficient to only consider the connections where thecdeptime is equals to or later than the desirable depar-ture time The CSA algorithm works as follows whena new scanned connection leads to a faster route to thearrival stop the minimum spanning tree (MST) will beupdated This is the case when carrtime is earlier thanthe actual EAT at carrstop Finally the algorithm endswhen the destination stop is added to the MST and theresulting journey can be obtained by following the pathin the MST backwards

In summary after some proofs of concepts devel-oped taking into account live and historical data in theLC framework and the current advances in the state ofthe work exposing the data on the Web efficiently we

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 5

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

think that it is the moment to improve the features ofthe current LC framework to create a standard mecha-nism for publishing reliable public transport data Themain motivations to do that are on one hand the ne-cessity to improve the current access of the transportinformation services to the data where at the momentthat real-time is involved the solutions are basicallyad-hoc and they are not feasible if we are based onthe proposal of the EU Commission for publish pub-lic transport data On the other hand supported bythe characteristics of the transport data where a com-plex data management environment emerges the newapproach that we present in this paper can serve asa source of inspiration for historical and live LinkedData management on the Web in other domains

3 The Linked Connections Framework

In this section we describe the Linked Connectionsframework for providing reliable access to live andhistorical data First we describe a summary of theLinked Connections specification including new prop-erties about features of live data Second we describethe extensions we develop to deal with these types ofdata the linked connections library for transforminglive feeds into connections and the LC server for man-aging and exposing that connections on the Web effi-ciently

31 The Linked Connections specification

The LC specification19 explains how to implement adata publishing sever and explains what you can relyon when writing a route planning client Following theLinked Data principles and the REST constraints wemake sure that from any HTTP response hypermediacontrols can be followed to discover the rest of thedataset A Linked Connections graph is a paged col-lection of connections describing the time transit ve-hicles leave and arrive The Linked Connections vo-cabulary20 defines the basic properties used within thelcConnection class

ndash lcdepartureTime It is a date-time includ-ing delay at which the vehicle will leave for thelcarrivalStop

ndash lcdepartureStop The departure stop URI

19httpslinkedconnectionsorgspecification20httpsemwebmmlabbenslinkedconnections

ndash lcdepartureDelay Provides time in sec-onds when the lcdepartureTime is not theplanned departureTime

ndash lcarrivalTime It is a date-time includ-ing delay at which the vehicle will arrives atlcarrivalStop

ndash lcarrivalStop The arrival stop URIndash lcarrivalDelay Provides time in second

when the lcarrivalTime is not the plannedarrivalTime

We also reuse the terms from the Linked GTFS vo-cabulary21 in order to describe other properties of alcConnection class This vocabulary it is alignedwith the de-facto standard to exchange public transportdata on the Web GTFS

ndash gtfstrip Must be set to link a gtfsTripwith a lcConnection identifying whetheranother connection is part of the current trip of avehicle

ndash gtfspickupType Should be set to indi-cate whether people can be picked up at thisstop The possible values gtfsRegulargtfsNotAvailable gtfsMustPhoneand gtfsMustCoordinateWithDriver

ndash gtfsdropOffType Should be set to indicatewhether people can be dropped off at this stopThe possible values are the same as the defined inthe gtfspickupType property

It is important to remark that each Linked Connec-tions page must contain some metadata about itselfThe URL after redirection or the one indicated by theLocation HTTP header therefore must occur in thetriples of the response The current page must containthe hypermedia controls to discover under what condi-tions the data can be legally reused and must containthe hypermedia controls to understand how to navigatethrough the paged collection(s) Different ways existto implement the paging strategy At least one of thestrategies must be implemented for a Linked Connec-tions client to find the next pages to be processed

1 Each response must contain a hydranext andhydraprevious page link

2 Each response should contain ahydrasearch describing that you cansearch for a specific time and that the clientwill be redirected to a page containing in-

21httpvocabgtfsorgterms

6 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 4 Metadata LC response

formation about that timestamp To describethis functionality the hydrapropertylcdepartureTimeQuery is used Anexample is shown in the Figure 4

Finally for each document that is published by aLinked Connections server a Cross Origin ResourceSharing22 HTTP header must set the property Access-Control-Allow-Origin for sharing the response withany origin The server also should implement thor-ough caching strategies as the cachability is one of thebiggest advantages of the Linked Connections frame-work Both conditional requests23 as regular caching24

are recommended As every connection will need tohave a unique an persistent identifier an HTTP URIthe server must support at least one RDF11 formatsuch as JSON-LD [25] TriG [26] or N-Quads [27]The connection URI should follow the Linked Dataprinciples [1]

32 Live data in Linked Connections

As we aforementioned the transport domain shouldinvolve in their route planning algorithms the live datato provide consistent information to the passengers andimprove the current informational services The de-facto standard for exchange transport data on the WebGTFS has created a version for live updates GTFS-RT Providing globally unique identifiers to the differ-ent entities that comprise a public transport network isfundamental to lower the adoption cost of public trans-port data in route-planning applications Specificallyin the case of live updates about the schedules is im-portant to maintain stable identifiers that remain validover time Here we use the Linked Data principles totransform schedule updates given in the GTFS-RT for-

22httpenable-corsorg23httpsdevelopermozillaorgen-USdocsWebHTTP

Conditional_requests24httpsdevelopermozillaorgen-USdocsWebHTTPCaching

mat to Linked Connections and we give the option toserialize them in JSON CSV or RDF (turtle N-Triplesor JSON-LD) format

The URI strategy to be used during the conversionprocess is given following the RFC 657025 specifica-tion for URI templates The parameters used to buildthe URIs are given following an object-like notation(objectvariable) where the left side references a CSVfile present in the provided GTFS data source and theright side references a specific column of such fileWe use the data from a reference GTFS data sourceto create the URIs as with only the data present in aGTFS-RT update may not be feasible to create persis-tent URIs The GFTS files that can be used to createthe URIs are routes and trips files As for the variablesany column that exists in those files can be referencedA simple example is shown in Figure 5 and an standardtemplate is also available26 Next we describe how arethe entities URIs build based on these templates

ndash stop A Linked Connection references two differ-ent stops (departure and arrival stop) The dataused to build these specific URIs comes directlyfrom the GTFS-RT update so we do not specifyany CSV file and header from the reference GTFSdata source The variable name chosen in our caseis the stop_id but it can be freely named

ndash route For the route identifier we relyon the routesroute_short_name and thetripstrip_short_name variables

ndash trip In the case of the trip we add theexpected departure time on top of theroute URI based on the associated connec-tiondepartureTime(YYYYMMDD) and theinformation about the delay

ndash connection For a connection identifier weresort to its departure stop with connec-tiondepartureStop its departure time withconnectiondepartureTime(YYYYMMDD)and the routesroute_short_name and thetripstrip_short_name In this case we referencea special entity we called connection whichcontains the related basic data that can beextracted from a GTFS-RT update for everyLinked Connection A connection entity containsthese parameters that can be used on the URIsdefinition connectiondepartureStop connec-

25httpstoolsietforghtmlrfc657026httpsgithubcomlinkedconnectionsgtfsrt2lcblobmaster

uris_template_examplejson

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 7

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 5 Connection example with live updates

Fig 6 The GTFSRT2LC tool

tionarrivalStop connectiondepartureTime andconnectionarrivalTime As both departureTimeand arrivalTime are date objects the expectedformat can be defined using brackets

Finally we developed a tool27 that analyses the in-formation from an static GTFS dataset and the updatesfrom a GTFS-RT feed exploits the implicit relationsamong the CSV files and creates the live Linked Con-nections feed in the desirable format with the help ofthe URIs template as it is shown in Figure 6

33 Managing live and historical data with theLinked Connections server

The third main contribution of this paper after de-velop a way to create Linked Connections feeds tak-ing into account live data is how to provide reliableaccess to historical and live transport data in an cost-efficiently way For that reason we develop a LinkedConnections server which exposes data fragments us-ing JSON-LD serialization format First the serveruses the GTFS2LC28 library to convert a GTFS datasetinto a time sorted directed acyclic graph of connec-tions After that the fragmenting process starts using

27httpsgithubcomlinkedconnectionsgtfsrt2lc28httpsgithubcomlinkedconnectionsgtfs2lc

the information provided in the configuration of theserver about the fragment size Once the fragmentationprocess is completed it starts using the GTFSRT2LClibrary for mapping real time updates into a LinkedConnections feed as we describe in the above section

As we aforementioned the previous process forfragmenting LC was based on a predefined time spanwhich used this variable for the pagination of the dataand created an heterogeneous environment in terms ofthe fragment size In other words setting a fixed timewindow for the creation of the fragments (eg 10 min-utes) could lead to the creation of fragments contain-ing a high number of connections specially in the rushhours and thus being big fragments in terms of sizeAt the same time smaller fragments could also be cre-ated at times where there are few or no vehicles de-parting This is the main reason why we start to frag-menting the historical data of the LC feed homoge-neously providing a specific fragment size this has arelevant effect in terms of performance as we demon-strate in the Section 5 As for the LC live feed wekeep an unique timeline of connection updates that isstored and fragmented using a predefined time span incontrast to static timetable updates which may containa complete re-write or partially overlap with previousversions from a time perspective

The server is able to manage both the historical andthe live connections First the server create the LinkedConnections from the information provided by a GTFSdataset and fragment the historical data into severalfragments After that if the transport system has anopen live service the sever starts to process the updatesfrom the GTFS-RT feed and create another streamof connections based on that information Then theserver searches the correspondence between the his-torical and live connections based on the URIs of thetrips and keeps an registry over time of the historicalevolution of the connections regarding delays andorcancellations Finally the server exposes the stream ofhistorical and live connections on an API following thefragmentation approach This process shown in Fig-ure 7 is repeated every time that the server process anupdate

The server also allows querying historic data bymeans of the Memento Framework [28] which enablestime-based content negotiation over HTTP By usingthe Accept-Datetime header a client can request thestate of a resource at a given moment If existing theserver will respond with a 302 Found containing theURI of the stored version of such resource So if a re-quest is made to obtain a Linked Connections fragment

8 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 7 The Linked Connections server

identified by a departure time the headers of the frag-ment will provide the information about the Accept-Datetime That means that is possible to know what isthe state of the delays at a given date-time for a set ofconnections enabling different kind of analysis of thebehaviour of a transport network that may contributeon the improvement of its design This approach alsoguarantees the coherence of the data used to answer agive query ie when a query is issued by a client itstarts to download data fragments of relevant connec-tions to form a MST that leads to an appropriate an-swer but if a new update is processed by the serverwhile the client is still processing the query some ofthe connections that the client has already processedmay reappear to the client in later fragments due toupdated delays which will render invalid the MSTcreated so far and may lead to incoherent answersHowever by using Memento and including a Accept-Datetime header in the requests made by clients theserver will guarantee that all the provided responseswill belong to the same moment and will lead to validanswers This mechanism is further described in [6]

34 A running example Providing reliable access toNMBS historical and real time data

The NMBS is the national company of trains in Bel-gium They provide their static and live data as opendata in GTFS and GTFS-RT formats We use this caseto explain how our approach works

As we incorporate the GTFS2LC and GTFSRT2LCtools to the Linked Connections server we only have toconfigure it to start producing the fragments of live andhistorical data The basic configuration for the GTFSdataset is shown in Figure 8 where the companyNamemust be a unique identifier of the dataset the down-loadUrl must be a public URL where the server has

Fig 8 NMBS GTFS configuration

Fig 9 NMBS GTFS-RT configuration

to download the GTFS dataset the downloadOnLanchboolean indicates if the dataset must be downloadedand processed upon server launch the updatePeriod isa node-cron29 expression to check if a new version ofthe dataset has been published for creating a new LCfeed and finally the fragmentSize represents the sizeof each fragment in bytes

For the GTFS-RT feed the configuration is showin the Figure 9 where the downloadUrl is the URLwhere the updates are published the updatePeriod is anode-cron expression that indicates how often shouldthe server look for and process a new version of thefeed the fragmentTimeSpan defines the fragmentationof live data and it represents the time span of everyfragment in seconds and finally the compressionPeriodis also a node-cron expression that defines how oftenwill the live data be compressed using gzip in order toreduce storage consumption

Finally we have to configure the URIs using thetemplate we describe in the Section 32 and launch theserver After the GTFS datasets have been processedwe are able to start querying the data All the frag-ments have a uniform starting point which is the ex-pected departure time of its first lcconnectionSo the clients can start accessing the transport datausing using the departure time as a parameter likethis example httphostportNMBSconnectionsdepartureTime=2017-08-11T164500000Z In thisexample we use a client implementing the CSA routeplanning algorithm to perform a query The clientsource code is available at github30 The parameters ofthe query are as follows

29httpswwwnpmjscompackagenode-cron30httpsgithubcomzoeparmanconnectionscan

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 9

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Departure Stop httpirailbestationsNMBS008811189

ndash Arrival Stop httpirailbestationsNMBS008821246

ndash Departure Time 2018-06-14T150000000Z

The client starts by sending a request to the serverusing the departure time defined in the query as a pa-rameter for the fragments query URL Once the serverreceives the request it will determine which staticfragment contains connections relevant for the query Itdoes this through a binary search looking into the firstconnection of every fragment until it finds a fragmentFk whose first connectionrsquos departure time is greaterthan the requested value and a fragment Fk-1 whosefirst connectionrsquos departure time is less or equal to therequested value determining then that Fk-1 is the frag-ment to be returned in the response to client After thisthe server checks if there are live updates for the con-nections contained in the selected fragment and if soit proceeds to update the departure and arrival times ofthe connections according to the reported delays Thisprocess may involve the inclusion of new connectionsthat originally belonged to previous fragments but dueto the delays are now relevant for the query In thesame way some connections may have to be removedas their delays make them belong to further fragmentsnow Once this is done the connections are resorted bytheir new departure times to ensure a correct executionof the CSA algorithm on the client side Finally theserver adds the necessary metadata (as shown on fig-ure 4) and gives back the response using the JSON-LDformat

Once the client receives the first response it startslooking for the first connection with a departure stopequals to the one defined in the query When it findsone the algorithm starts building a MST where eachbranch represent a different vehicle departing from thespecified departure stop at a different time The algo-rithm process the connections and requests new frag-ments for it and if such a route exists within the trans-port network eventually it will complete at least one ofthe branches of the MST It will keep processing con-nections until the departure time of the next connec-tions to be processed is greater than the arrival time ofthe fastest found route as this means that is no longerpossible to find a faster route Finally the client will re-port the found routes in a JSON object as seen in figure10

Fig 10 CSA result for a query over the NMBS Linked Connectionsstream

4 Evaluation Design

We design a set of experiments in order to evaluateour approach In this section we describe the main fea-tures of our developments the data used for the exper-iments and the testbed we created We implement dif-ferent components in JavaScript for the Nodejs plat-form We chose JavaScript as it allows us to use thecomponents both on a command-line environment aswell as in the browser which is ideal for client sideapplications

ndash GTFSRT2LC A tool to convert live updates andtimetables as open data to Linked ConnectionsVocabulary

ndash LC server Publishes streams of connections inJSON-LD as we explain in the Section 3

ndash CSA algorithm We developed the CSA algorithmon the top of the Linked Connections Frameworkas the client

We also use other tools that we developed previ-ously as the GTFS2LC library that it is able to convertstatic timetables as open data to Linked ConnectionsVocabulary and the Memento Protocol on the top ofthe Linked Connections Server [6] to enable reliableaccess to historical data as we describe in the Section3

These tools are combined in different set-ups with aroute planning algorithm in the client-side and are usedin the following scenarios

10 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Static data fragmentation by size without cacheThe first experiment executes the queries tak-ing into account only timetable static data wherethe size of the exposed fragments are always thesame Client caching is disabled making this sim-ulate the LC without cache set-up where everyrequest could originate from an end-userrsquos deviceWe test different fragments size to test how it af-fects the query answering performance

ndash Static data fragmentation by size with cache Thesecond experiment does the same as the first ex-cept that it uses a client side cache and simulatesthe LC with cache set-up

ndash Historical data fragmentation by size withoutcache The third experiment does the same as thefirst except that it uses the Memento protocol toensure data coherence As the Memento protocolinvolves HTTP redirections to access specific re-sources at different points in time this experimentis meant to measure its impact on query answer-ing performance

ndash Historical data fragmentation by size with cacheThe fourth experiment does the same as the thirdexcept that it uses a client side cache and simu-lates the LC with cache set-up

ndash Live and historical data fragmentation by sizewithout cache The fifth experiment does thesame as the third but at this tame live data is alsotake into account modifying the fragments sizedue to the delays We also test different fragmentssizes

ndash Live and historical data fragmentation by sizewith cache The sixth experiment does the sameas the fifth except that it uses a client side cacheand simulates the LC with cache set-up

ndash Live and historical data fragmentation by timewithout cache The seventh experiment fragmentsthe data based on time as our previous approacheswith the client cache disable We want to test ifthe performance of our pagination by the size isbetter than by time

ndash Live and historical data fragmentation by timewith cache The eighth does the same as the sev-enth but it uses the client side cache

As far as we are aware of there are no general-purpose benchmarks in the state of the art for testingour main contributions with the Linked Connectionsframework Therefore we have created a testbed asfollows 1) we have selected and downloaded 4 dif-ferent representative GTFS datasets available as open

Dataset Stops Routes Trips ConnectionsTBS Barcelona 27 3 5186 1957354

MLM Madrid 81 4 2872 3293575

NMBS Belgium 2615 479 12621 6941813

De Lijn Flanders 36050 1412 305079 63006596Table 1

Characteristics of the transport networks datasets used for thebenchmarks

data that also cover several geographical locations 2)we have created several route planning queries (depar-ture and arrival stop with a departure time) that reflectreal-life trips with multiple levels of complexity and3) we have evaluated these queries in the different setups taking into account the features of each datasetThe query set created for each GTFS dataset was de-fined by randomly taking departure stops and arrivalstops with a random departure time and using the CSAalgorithm to find a route between them This processwas repeated for each transport network and spanninga 24 hours time frame of a regular week day In the endthis gave us a set of queries for each network with dif-ferent levels of complexity measured in terms of theamount of connections that need to be processed bya client to obtain a valid answer The complexity ofthe queries is directly related to the complexity of thetransport network itself Meaning that the amount ofconnections that need to be processed in a query in-crease as the amount of stops and vehicles of the net-work also increase Table 1 shows the characteristics(amount of stops routes and trips) of each transportnetwork and table 2 portrays the least and the mostcomplex queries used during the benchmarks for eachtransport network

The used GTFS datasets and GTFS-RT feeds areavailable as open data on the Web There are twodatasets that represent the Belgian Rail Transport Sys-tem (NMBS) and the Flemish Transport Company (DeLijn) and other two that represent the Tram Companyof Barcelona (TBS) and the Tram Company of Madrid(MLM) NMBS also provides open access to its real-time updates following the GTFS-RT format thereforewe use this data to carry out the evaluation that takesinto account live updates TBS have also developedan ad-hoc live API31 and we are currently developinga tool to transform the ad-hoc live information to theGTFS-RT format but at this moment only historicaldata can be taken into account in this case De Lijn

31httpsopendatatramcatmanual_enpdf

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 11

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Query Dataset of ConnectionsdepStop httpirailbestationsNMBS008841608

NMBS 3371arrStop httpirailbestationsNMBS008841319

depTime 2018-06-14T170000000Z

depStop httpirailbestationsNMBS008886504

NMBS 13797arrStop httpirailbestationsNMBS008841004

depTime 2018-06-14T140000000Z

depStop httpsdatadelijnbestops120281

De Lijn 34696arrStop httpsdatadelijnbestops126536

depTime 2018-06-07T130000000Z

depStop httpsdatadelijnbestops112358

De Lijn 114059arrStop httpsdatadelijnbestops111446

depTime 2018-06-07T130000000Z

depStop httpsbarcelonatbsesstops22

TBS 41arrStop httpsbarcelonatbsesstops21

depTime 2018-06-07T200000000Z

depStop httpsbarcelonatbsesstops14

TBS 556arrStop httpsbarcelonatbsesstops10

depTime 2018-06-07T050000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 148arrStop httpsmadridtramesstopspar_10_32

depTime 2018-06-07T210000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 799arrStop httpsmadridtramesstopspar_10_26

depTime 2018-06-07T160000000Z

Table 2Least and most complex queries of the LC testbed created for everybechmarked transport network

and MLM does not provide access to their live updateswhich is why we only perform static and historicalevaluations over this transport datasets

We ran the experiments on a laptop with an IntelCore i5-7440HQ 28GHz x 4 and 16GB of RAMEach query is executed at least 2 times and we takethe arithmetic average response time Our experimentscan be reproduced using the code at httpsgithubcomjulianrojas87linked-connections-server and thetestbed and the data at httpsgithubcomcef-oasislinkedconnections-tests

5 Results

Next we present the obtained results for each of thedescribed scenarios in the above section

51 Static data with fragmentation by size

These results are obtained from the evaluation madeover Linked Connection stream derived from GTFS

datasets of each transport network The queries exe-cuted on this evaluation do not follow the Mementoprotocol We present the results for both the evaluationwith and without an active client-side cache coveringthe first and second experiment described above

511 NMBSFor NMBS we created fragmentations of 10KB

50KB 300KB 500KB 1MB 3MB 7MB and 10MBWe used the same query set for both cache and nocache evaluations

Fig 11 NMBS static evaluation without cache

Figure 11 shows the response time distribution foreach fragmentation in a set up without a cache Herewe have that for a fragmentation of 300KB we havethe best performance with a median response time of1196ms for the used query set

Fig 12 NMBS static evaluation with cache

Figure 12 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 704ms

12 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

512 TBS BarcelonaFor TBS we used fragmentations going from 10KB

to 3MB and we used the same query set for both cacheand no cache set ups

Fig 13 TBS static evaluation without cache

Figure 13 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of75ms

Fig 14 TBS static evaluation with cache

Figure 14 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 10KB and a medianresponse time of 28ms

513 MLMFor MLM we used fragmentations going from

10KB to 3MB and we used the same query set for bothcache and no cache set ups

Fig 15 MLM static evaluation without cache

Figure 15 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of128ms

Fig 16 MLM static evaluation with cache

Figure 16 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 57ms

514 De LijnFor De Lijn the fragmentations created go from

500KB to 15MB The query set used for the bench-marks is the same in both active and inactive cache setups

Figure 17 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained witha fragmentation of 500KB and a median response timeof 33863ms

Figure 18 shows the response time distribution foreach fragmentation in a set up with an active cache on

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 5: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 5

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

think that it is the moment to improve the features ofthe current LC framework to create a standard mecha-nism for publishing reliable public transport data Themain motivations to do that are on one hand the ne-cessity to improve the current access of the transportinformation services to the data where at the momentthat real-time is involved the solutions are basicallyad-hoc and they are not feasible if we are based onthe proposal of the EU Commission for publish pub-lic transport data On the other hand supported bythe characteristics of the transport data where a com-plex data management environment emerges the newapproach that we present in this paper can serve asa source of inspiration for historical and live LinkedData management on the Web in other domains

3 The Linked Connections Framework

In this section we describe the Linked Connectionsframework for providing reliable access to live andhistorical data First we describe a summary of theLinked Connections specification including new prop-erties about features of live data Second we describethe extensions we develop to deal with these types ofdata the linked connections library for transforminglive feeds into connections and the LC server for man-aging and exposing that connections on the Web effi-ciently

31 The Linked Connections specification

The LC specification19 explains how to implement adata publishing sever and explains what you can relyon when writing a route planning client Following theLinked Data principles and the REST constraints wemake sure that from any HTTP response hypermediacontrols can be followed to discover the rest of thedataset A Linked Connections graph is a paged col-lection of connections describing the time transit ve-hicles leave and arrive The Linked Connections vo-cabulary20 defines the basic properties used within thelcConnection class

ndash lcdepartureTime It is a date-time includ-ing delay at which the vehicle will leave for thelcarrivalStop

ndash lcdepartureStop The departure stop URI

19httpslinkedconnectionsorgspecification20httpsemwebmmlabbenslinkedconnections

ndash lcdepartureDelay Provides time in sec-onds when the lcdepartureTime is not theplanned departureTime

ndash lcarrivalTime It is a date-time includ-ing delay at which the vehicle will arrives atlcarrivalStop

ndash lcarrivalStop The arrival stop URIndash lcarrivalDelay Provides time in second

when the lcarrivalTime is not the plannedarrivalTime

We also reuse the terms from the Linked GTFS vo-cabulary21 in order to describe other properties of alcConnection class This vocabulary it is alignedwith the de-facto standard to exchange public transportdata on the Web GTFS

ndash gtfstrip Must be set to link a gtfsTripwith a lcConnection identifying whetheranother connection is part of the current trip of avehicle

ndash gtfspickupType Should be set to indi-cate whether people can be picked up at thisstop The possible values gtfsRegulargtfsNotAvailable gtfsMustPhoneand gtfsMustCoordinateWithDriver

ndash gtfsdropOffType Should be set to indicatewhether people can be dropped off at this stopThe possible values are the same as the defined inthe gtfspickupType property

It is important to remark that each Linked Connec-tions page must contain some metadata about itselfThe URL after redirection or the one indicated by theLocation HTTP header therefore must occur in thetriples of the response The current page must containthe hypermedia controls to discover under what condi-tions the data can be legally reused and must containthe hypermedia controls to understand how to navigatethrough the paged collection(s) Different ways existto implement the paging strategy At least one of thestrategies must be implemented for a Linked Connec-tions client to find the next pages to be processed

1 Each response must contain a hydranext andhydraprevious page link

2 Each response should contain ahydrasearch describing that you cansearch for a specific time and that the clientwill be redirected to a page containing in-

21httpvocabgtfsorgterms

6 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 4 Metadata LC response

formation about that timestamp To describethis functionality the hydrapropertylcdepartureTimeQuery is used Anexample is shown in the Figure 4

Finally for each document that is published by aLinked Connections server a Cross Origin ResourceSharing22 HTTP header must set the property Access-Control-Allow-Origin for sharing the response withany origin The server also should implement thor-ough caching strategies as the cachability is one of thebiggest advantages of the Linked Connections frame-work Both conditional requests23 as regular caching24

are recommended As every connection will need tohave a unique an persistent identifier an HTTP URIthe server must support at least one RDF11 formatsuch as JSON-LD [25] TriG [26] or N-Quads [27]The connection URI should follow the Linked Dataprinciples [1]

32 Live data in Linked Connections

As we aforementioned the transport domain shouldinvolve in their route planning algorithms the live datato provide consistent information to the passengers andimprove the current informational services The de-facto standard for exchange transport data on the WebGTFS has created a version for live updates GTFS-RT Providing globally unique identifiers to the differ-ent entities that comprise a public transport network isfundamental to lower the adoption cost of public trans-port data in route-planning applications Specificallyin the case of live updates about the schedules is im-portant to maintain stable identifiers that remain validover time Here we use the Linked Data principles totransform schedule updates given in the GTFS-RT for-

22httpenable-corsorg23httpsdevelopermozillaorgen-USdocsWebHTTP

Conditional_requests24httpsdevelopermozillaorgen-USdocsWebHTTPCaching

mat to Linked Connections and we give the option toserialize them in JSON CSV or RDF (turtle N-Triplesor JSON-LD) format

The URI strategy to be used during the conversionprocess is given following the RFC 657025 specifica-tion for URI templates The parameters used to buildthe URIs are given following an object-like notation(objectvariable) where the left side references a CSVfile present in the provided GTFS data source and theright side references a specific column of such fileWe use the data from a reference GTFS data sourceto create the URIs as with only the data present in aGTFS-RT update may not be feasible to create persis-tent URIs The GFTS files that can be used to createthe URIs are routes and trips files As for the variablesany column that exists in those files can be referencedA simple example is shown in Figure 5 and an standardtemplate is also available26 Next we describe how arethe entities URIs build based on these templates

ndash stop A Linked Connection references two differ-ent stops (departure and arrival stop) The dataused to build these specific URIs comes directlyfrom the GTFS-RT update so we do not specifyany CSV file and header from the reference GTFSdata source The variable name chosen in our caseis the stop_id but it can be freely named

ndash route For the route identifier we relyon the routesroute_short_name and thetripstrip_short_name variables

ndash trip In the case of the trip we add theexpected departure time on top of theroute URI based on the associated connec-tiondepartureTime(YYYYMMDD) and theinformation about the delay

ndash connection For a connection identifier weresort to its departure stop with connec-tiondepartureStop its departure time withconnectiondepartureTime(YYYYMMDD)and the routesroute_short_name and thetripstrip_short_name In this case we referencea special entity we called connection whichcontains the related basic data that can beextracted from a GTFS-RT update for everyLinked Connection A connection entity containsthese parameters that can be used on the URIsdefinition connectiondepartureStop connec-

25httpstoolsietforghtmlrfc657026httpsgithubcomlinkedconnectionsgtfsrt2lcblobmaster

uris_template_examplejson

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 7

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 5 Connection example with live updates

Fig 6 The GTFSRT2LC tool

tionarrivalStop connectiondepartureTime andconnectionarrivalTime As both departureTimeand arrivalTime are date objects the expectedformat can be defined using brackets

Finally we developed a tool27 that analyses the in-formation from an static GTFS dataset and the updatesfrom a GTFS-RT feed exploits the implicit relationsamong the CSV files and creates the live Linked Con-nections feed in the desirable format with the help ofthe URIs template as it is shown in Figure 6

33 Managing live and historical data with theLinked Connections server

The third main contribution of this paper after de-velop a way to create Linked Connections feeds tak-ing into account live data is how to provide reliableaccess to historical and live transport data in an cost-efficiently way For that reason we develop a LinkedConnections server which exposes data fragments us-ing JSON-LD serialization format First the serveruses the GTFS2LC28 library to convert a GTFS datasetinto a time sorted directed acyclic graph of connec-tions After that the fragmenting process starts using

27httpsgithubcomlinkedconnectionsgtfsrt2lc28httpsgithubcomlinkedconnectionsgtfs2lc

the information provided in the configuration of theserver about the fragment size Once the fragmentationprocess is completed it starts using the GTFSRT2LClibrary for mapping real time updates into a LinkedConnections feed as we describe in the above section

As we aforementioned the previous process forfragmenting LC was based on a predefined time spanwhich used this variable for the pagination of the dataand created an heterogeneous environment in terms ofthe fragment size In other words setting a fixed timewindow for the creation of the fragments (eg 10 min-utes) could lead to the creation of fragments contain-ing a high number of connections specially in the rushhours and thus being big fragments in terms of sizeAt the same time smaller fragments could also be cre-ated at times where there are few or no vehicles de-parting This is the main reason why we start to frag-menting the historical data of the LC feed homoge-neously providing a specific fragment size this has arelevant effect in terms of performance as we demon-strate in the Section 5 As for the LC live feed wekeep an unique timeline of connection updates that isstored and fragmented using a predefined time span incontrast to static timetable updates which may containa complete re-write or partially overlap with previousversions from a time perspective

The server is able to manage both the historical andthe live connections First the server create the LinkedConnections from the information provided by a GTFSdataset and fragment the historical data into severalfragments After that if the transport system has anopen live service the sever starts to process the updatesfrom the GTFS-RT feed and create another streamof connections based on that information Then theserver searches the correspondence between the his-torical and live connections based on the URIs of thetrips and keeps an registry over time of the historicalevolution of the connections regarding delays andorcancellations Finally the server exposes the stream ofhistorical and live connections on an API following thefragmentation approach This process shown in Fig-ure 7 is repeated every time that the server process anupdate

The server also allows querying historic data bymeans of the Memento Framework [28] which enablestime-based content negotiation over HTTP By usingthe Accept-Datetime header a client can request thestate of a resource at a given moment If existing theserver will respond with a 302 Found containing theURI of the stored version of such resource So if a re-quest is made to obtain a Linked Connections fragment

8 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 7 The Linked Connections server

identified by a departure time the headers of the frag-ment will provide the information about the Accept-Datetime That means that is possible to know what isthe state of the delays at a given date-time for a set ofconnections enabling different kind of analysis of thebehaviour of a transport network that may contributeon the improvement of its design This approach alsoguarantees the coherence of the data used to answer agive query ie when a query is issued by a client itstarts to download data fragments of relevant connec-tions to form a MST that leads to an appropriate an-swer but if a new update is processed by the serverwhile the client is still processing the query some ofthe connections that the client has already processedmay reappear to the client in later fragments due toupdated delays which will render invalid the MSTcreated so far and may lead to incoherent answersHowever by using Memento and including a Accept-Datetime header in the requests made by clients theserver will guarantee that all the provided responseswill belong to the same moment and will lead to validanswers This mechanism is further described in [6]

34 A running example Providing reliable access toNMBS historical and real time data

The NMBS is the national company of trains in Bel-gium They provide their static and live data as opendata in GTFS and GTFS-RT formats We use this caseto explain how our approach works

As we incorporate the GTFS2LC and GTFSRT2LCtools to the Linked Connections server we only have toconfigure it to start producing the fragments of live andhistorical data The basic configuration for the GTFSdataset is shown in Figure 8 where the companyNamemust be a unique identifier of the dataset the down-loadUrl must be a public URL where the server has

Fig 8 NMBS GTFS configuration

Fig 9 NMBS GTFS-RT configuration

to download the GTFS dataset the downloadOnLanchboolean indicates if the dataset must be downloadedand processed upon server launch the updatePeriod isa node-cron29 expression to check if a new version ofthe dataset has been published for creating a new LCfeed and finally the fragmentSize represents the sizeof each fragment in bytes

For the GTFS-RT feed the configuration is showin the Figure 9 where the downloadUrl is the URLwhere the updates are published the updatePeriod is anode-cron expression that indicates how often shouldthe server look for and process a new version of thefeed the fragmentTimeSpan defines the fragmentationof live data and it represents the time span of everyfragment in seconds and finally the compressionPeriodis also a node-cron expression that defines how oftenwill the live data be compressed using gzip in order toreduce storage consumption

Finally we have to configure the URIs using thetemplate we describe in the Section 32 and launch theserver After the GTFS datasets have been processedwe are able to start querying the data All the frag-ments have a uniform starting point which is the ex-pected departure time of its first lcconnectionSo the clients can start accessing the transport datausing using the departure time as a parameter likethis example httphostportNMBSconnectionsdepartureTime=2017-08-11T164500000Z In thisexample we use a client implementing the CSA routeplanning algorithm to perform a query The clientsource code is available at github30 The parameters ofthe query are as follows

29httpswwwnpmjscompackagenode-cron30httpsgithubcomzoeparmanconnectionscan

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 9

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Departure Stop httpirailbestationsNMBS008811189

ndash Arrival Stop httpirailbestationsNMBS008821246

ndash Departure Time 2018-06-14T150000000Z

The client starts by sending a request to the serverusing the departure time defined in the query as a pa-rameter for the fragments query URL Once the serverreceives the request it will determine which staticfragment contains connections relevant for the query Itdoes this through a binary search looking into the firstconnection of every fragment until it finds a fragmentFk whose first connectionrsquos departure time is greaterthan the requested value and a fragment Fk-1 whosefirst connectionrsquos departure time is less or equal to therequested value determining then that Fk-1 is the frag-ment to be returned in the response to client After thisthe server checks if there are live updates for the con-nections contained in the selected fragment and if soit proceeds to update the departure and arrival times ofthe connections according to the reported delays Thisprocess may involve the inclusion of new connectionsthat originally belonged to previous fragments but dueto the delays are now relevant for the query In thesame way some connections may have to be removedas their delays make them belong to further fragmentsnow Once this is done the connections are resorted bytheir new departure times to ensure a correct executionof the CSA algorithm on the client side Finally theserver adds the necessary metadata (as shown on fig-ure 4) and gives back the response using the JSON-LDformat

Once the client receives the first response it startslooking for the first connection with a departure stopequals to the one defined in the query When it findsone the algorithm starts building a MST where eachbranch represent a different vehicle departing from thespecified departure stop at a different time The algo-rithm process the connections and requests new frag-ments for it and if such a route exists within the trans-port network eventually it will complete at least one ofthe branches of the MST It will keep processing con-nections until the departure time of the next connec-tions to be processed is greater than the arrival time ofthe fastest found route as this means that is no longerpossible to find a faster route Finally the client will re-port the found routes in a JSON object as seen in figure10

Fig 10 CSA result for a query over the NMBS Linked Connectionsstream

4 Evaluation Design

We design a set of experiments in order to evaluateour approach In this section we describe the main fea-tures of our developments the data used for the exper-iments and the testbed we created We implement dif-ferent components in JavaScript for the Nodejs plat-form We chose JavaScript as it allows us to use thecomponents both on a command-line environment aswell as in the browser which is ideal for client sideapplications

ndash GTFSRT2LC A tool to convert live updates andtimetables as open data to Linked ConnectionsVocabulary

ndash LC server Publishes streams of connections inJSON-LD as we explain in the Section 3

ndash CSA algorithm We developed the CSA algorithmon the top of the Linked Connections Frameworkas the client

We also use other tools that we developed previ-ously as the GTFS2LC library that it is able to convertstatic timetables as open data to Linked ConnectionsVocabulary and the Memento Protocol on the top ofthe Linked Connections Server [6] to enable reliableaccess to historical data as we describe in the Section3

These tools are combined in different set-ups with aroute planning algorithm in the client-side and are usedin the following scenarios

10 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Static data fragmentation by size without cacheThe first experiment executes the queries tak-ing into account only timetable static data wherethe size of the exposed fragments are always thesame Client caching is disabled making this sim-ulate the LC without cache set-up where everyrequest could originate from an end-userrsquos deviceWe test different fragments size to test how it af-fects the query answering performance

ndash Static data fragmentation by size with cache Thesecond experiment does the same as the first ex-cept that it uses a client side cache and simulatesthe LC with cache set-up

ndash Historical data fragmentation by size withoutcache The third experiment does the same as thefirst except that it uses the Memento protocol toensure data coherence As the Memento protocolinvolves HTTP redirections to access specific re-sources at different points in time this experimentis meant to measure its impact on query answer-ing performance

ndash Historical data fragmentation by size with cacheThe fourth experiment does the same as the thirdexcept that it uses a client side cache and simu-lates the LC with cache set-up

ndash Live and historical data fragmentation by sizewithout cache The fifth experiment does thesame as the third but at this tame live data is alsotake into account modifying the fragments sizedue to the delays We also test different fragmentssizes

ndash Live and historical data fragmentation by sizewith cache The sixth experiment does the sameas the fifth except that it uses a client side cacheand simulates the LC with cache set-up

ndash Live and historical data fragmentation by timewithout cache The seventh experiment fragmentsthe data based on time as our previous approacheswith the client cache disable We want to test ifthe performance of our pagination by the size isbetter than by time

ndash Live and historical data fragmentation by timewith cache The eighth does the same as the sev-enth but it uses the client side cache

As far as we are aware of there are no general-purpose benchmarks in the state of the art for testingour main contributions with the Linked Connectionsframework Therefore we have created a testbed asfollows 1) we have selected and downloaded 4 dif-ferent representative GTFS datasets available as open

Dataset Stops Routes Trips ConnectionsTBS Barcelona 27 3 5186 1957354

MLM Madrid 81 4 2872 3293575

NMBS Belgium 2615 479 12621 6941813

De Lijn Flanders 36050 1412 305079 63006596Table 1

Characteristics of the transport networks datasets used for thebenchmarks

data that also cover several geographical locations 2)we have created several route planning queries (depar-ture and arrival stop with a departure time) that reflectreal-life trips with multiple levels of complexity and3) we have evaluated these queries in the different setups taking into account the features of each datasetThe query set created for each GTFS dataset was de-fined by randomly taking departure stops and arrivalstops with a random departure time and using the CSAalgorithm to find a route between them This processwas repeated for each transport network and spanninga 24 hours time frame of a regular week day In the endthis gave us a set of queries for each network with dif-ferent levels of complexity measured in terms of theamount of connections that need to be processed bya client to obtain a valid answer The complexity ofthe queries is directly related to the complexity of thetransport network itself Meaning that the amount ofconnections that need to be processed in a query in-crease as the amount of stops and vehicles of the net-work also increase Table 1 shows the characteristics(amount of stops routes and trips) of each transportnetwork and table 2 portrays the least and the mostcomplex queries used during the benchmarks for eachtransport network

The used GTFS datasets and GTFS-RT feeds areavailable as open data on the Web There are twodatasets that represent the Belgian Rail Transport Sys-tem (NMBS) and the Flemish Transport Company (DeLijn) and other two that represent the Tram Companyof Barcelona (TBS) and the Tram Company of Madrid(MLM) NMBS also provides open access to its real-time updates following the GTFS-RT format thereforewe use this data to carry out the evaluation that takesinto account live updates TBS have also developedan ad-hoc live API31 and we are currently developinga tool to transform the ad-hoc live information to theGTFS-RT format but at this moment only historicaldata can be taken into account in this case De Lijn

31httpsopendatatramcatmanual_enpdf

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 11

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Query Dataset of ConnectionsdepStop httpirailbestationsNMBS008841608

NMBS 3371arrStop httpirailbestationsNMBS008841319

depTime 2018-06-14T170000000Z

depStop httpirailbestationsNMBS008886504

NMBS 13797arrStop httpirailbestationsNMBS008841004

depTime 2018-06-14T140000000Z

depStop httpsdatadelijnbestops120281

De Lijn 34696arrStop httpsdatadelijnbestops126536

depTime 2018-06-07T130000000Z

depStop httpsdatadelijnbestops112358

De Lijn 114059arrStop httpsdatadelijnbestops111446

depTime 2018-06-07T130000000Z

depStop httpsbarcelonatbsesstops22

TBS 41arrStop httpsbarcelonatbsesstops21

depTime 2018-06-07T200000000Z

depStop httpsbarcelonatbsesstops14

TBS 556arrStop httpsbarcelonatbsesstops10

depTime 2018-06-07T050000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 148arrStop httpsmadridtramesstopspar_10_32

depTime 2018-06-07T210000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 799arrStop httpsmadridtramesstopspar_10_26

depTime 2018-06-07T160000000Z

Table 2Least and most complex queries of the LC testbed created for everybechmarked transport network

and MLM does not provide access to their live updateswhich is why we only perform static and historicalevaluations over this transport datasets

We ran the experiments on a laptop with an IntelCore i5-7440HQ 28GHz x 4 and 16GB of RAMEach query is executed at least 2 times and we takethe arithmetic average response time Our experimentscan be reproduced using the code at httpsgithubcomjulianrojas87linked-connections-server and thetestbed and the data at httpsgithubcomcef-oasislinkedconnections-tests

5 Results

Next we present the obtained results for each of thedescribed scenarios in the above section

51 Static data with fragmentation by size

These results are obtained from the evaluation madeover Linked Connection stream derived from GTFS

datasets of each transport network The queries exe-cuted on this evaluation do not follow the Mementoprotocol We present the results for both the evaluationwith and without an active client-side cache coveringthe first and second experiment described above

511 NMBSFor NMBS we created fragmentations of 10KB

50KB 300KB 500KB 1MB 3MB 7MB and 10MBWe used the same query set for both cache and nocache evaluations

Fig 11 NMBS static evaluation without cache

Figure 11 shows the response time distribution foreach fragmentation in a set up without a cache Herewe have that for a fragmentation of 300KB we havethe best performance with a median response time of1196ms for the used query set

Fig 12 NMBS static evaluation with cache

Figure 12 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 704ms

12 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

512 TBS BarcelonaFor TBS we used fragmentations going from 10KB

to 3MB and we used the same query set for both cacheand no cache set ups

Fig 13 TBS static evaluation without cache

Figure 13 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of75ms

Fig 14 TBS static evaluation with cache

Figure 14 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 10KB and a medianresponse time of 28ms

513 MLMFor MLM we used fragmentations going from

10KB to 3MB and we used the same query set for bothcache and no cache set ups

Fig 15 MLM static evaluation without cache

Figure 15 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of128ms

Fig 16 MLM static evaluation with cache

Figure 16 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 57ms

514 De LijnFor De Lijn the fragmentations created go from

500KB to 15MB The query set used for the bench-marks is the same in both active and inactive cache setups

Figure 17 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained witha fragmentation of 500KB and a median response timeof 33863ms

Figure 18 shows the response time distribution foreach fragmentation in a set up with an active cache on

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 6: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

6 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 4 Metadata LC response

formation about that timestamp To describethis functionality the hydrapropertylcdepartureTimeQuery is used Anexample is shown in the Figure 4

Finally for each document that is published by aLinked Connections server a Cross Origin ResourceSharing22 HTTP header must set the property Access-Control-Allow-Origin for sharing the response withany origin The server also should implement thor-ough caching strategies as the cachability is one of thebiggest advantages of the Linked Connections frame-work Both conditional requests23 as regular caching24

are recommended As every connection will need tohave a unique an persistent identifier an HTTP URIthe server must support at least one RDF11 formatsuch as JSON-LD [25] TriG [26] or N-Quads [27]The connection URI should follow the Linked Dataprinciples [1]

32 Live data in Linked Connections

As we aforementioned the transport domain shouldinvolve in their route planning algorithms the live datato provide consistent information to the passengers andimprove the current informational services The de-facto standard for exchange transport data on the WebGTFS has created a version for live updates GTFS-RT Providing globally unique identifiers to the differ-ent entities that comprise a public transport network isfundamental to lower the adoption cost of public trans-port data in route-planning applications Specificallyin the case of live updates about the schedules is im-portant to maintain stable identifiers that remain validover time Here we use the Linked Data principles totransform schedule updates given in the GTFS-RT for-

22httpenable-corsorg23httpsdevelopermozillaorgen-USdocsWebHTTP

Conditional_requests24httpsdevelopermozillaorgen-USdocsWebHTTPCaching

mat to Linked Connections and we give the option toserialize them in JSON CSV or RDF (turtle N-Triplesor JSON-LD) format

The URI strategy to be used during the conversionprocess is given following the RFC 657025 specifica-tion for URI templates The parameters used to buildthe URIs are given following an object-like notation(objectvariable) where the left side references a CSVfile present in the provided GTFS data source and theright side references a specific column of such fileWe use the data from a reference GTFS data sourceto create the URIs as with only the data present in aGTFS-RT update may not be feasible to create persis-tent URIs The GFTS files that can be used to createthe URIs are routes and trips files As for the variablesany column that exists in those files can be referencedA simple example is shown in Figure 5 and an standardtemplate is also available26 Next we describe how arethe entities URIs build based on these templates

ndash stop A Linked Connection references two differ-ent stops (departure and arrival stop) The dataused to build these specific URIs comes directlyfrom the GTFS-RT update so we do not specifyany CSV file and header from the reference GTFSdata source The variable name chosen in our caseis the stop_id but it can be freely named

ndash route For the route identifier we relyon the routesroute_short_name and thetripstrip_short_name variables

ndash trip In the case of the trip we add theexpected departure time on top of theroute URI based on the associated connec-tiondepartureTime(YYYYMMDD) and theinformation about the delay

ndash connection For a connection identifier weresort to its departure stop with connec-tiondepartureStop its departure time withconnectiondepartureTime(YYYYMMDD)and the routesroute_short_name and thetripstrip_short_name In this case we referencea special entity we called connection whichcontains the related basic data that can beextracted from a GTFS-RT update for everyLinked Connection A connection entity containsthese parameters that can be used on the URIsdefinition connectiondepartureStop connec-

25httpstoolsietforghtmlrfc657026httpsgithubcomlinkedconnectionsgtfsrt2lcblobmaster

uris_template_examplejson

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 7

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 5 Connection example with live updates

Fig 6 The GTFSRT2LC tool

tionarrivalStop connectiondepartureTime andconnectionarrivalTime As both departureTimeand arrivalTime are date objects the expectedformat can be defined using brackets

Finally we developed a tool27 that analyses the in-formation from an static GTFS dataset and the updatesfrom a GTFS-RT feed exploits the implicit relationsamong the CSV files and creates the live Linked Con-nections feed in the desirable format with the help ofthe URIs template as it is shown in Figure 6

33 Managing live and historical data with theLinked Connections server

The third main contribution of this paper after de-velop a way to create Linked Connections feeds tak-ing into account live data is how to provide reliableaccess to historical and live transport data in an cost-efficiently way For that reason we develop a LinkedConnections server which exposes data fragments us-ing JSON-LD serialization format First the serveruses the GTFS2LC28 library to convert a GTFS datasetinto a time sorted directed acyclic graph of connec-tions After that the fragmenting process starts using

27httpsgithubcomlinkedconnectionsgtfsrt2lc28httpsgithubcomlinkedconnectionsgtfs2lc

the information provided in the configuration of theserver about the fragment size Once the fragmentationprocess is completed it starts using the GTFSRT2LClibrary for mapping real time updates into a LinkedConnections feed as we describe in the above section

As we aforementioned the previous process forfragmenting LC was based on a predefined time spanwhich used this variable for the pagination of the dataand created an heterogeneous environment in terms ofthe fragment size In other words setting a fixed timewindow for the creation of the fragments (eg 10 min-utes) could lead to the creation of fragments contain-ing a high number of connections specially in the rushhours and thus being big fragments in terms of sizeAt the same time smaller fragments could also be cre-ated at times where there are few or no vehicles de-parting This is the main reason why we start to frag-menting the historical data of the LC feed homoge-neously providing a specific fragment size this has arelevant effect in terms of performance as we demon-strate in the Section 5 As for the LC live feed wekeep an unique timeline of connection updates that isstored and fragmented using a predefined time span incontrast to static timetable updates which may containa complete re-write or partially overlap with previousversions from a time perspective

The server is able to manage both the historical andthe live connections First the server create the LinkedConnections from the information provided by a GTFSdataset and fragment the historical data into severalfragments After that if the transport system has anopen live service the sever starts to process the updatesfrom the GTFS-RT feed and create another streamof connections based on that information Then theserver searches the correspondence between the his-torical and live connections based on the URIs of thetrips and keeps an registry over time of the historicalevolution of the connections regarding delays andorcancellations Finally the server exposes the stream ofhistorical and live connections on an API following thefragmentation approach This process shown in Fig-ure 7 is repeated every time that the server process anupdate

The server also allows querying historic data bymeans of the Memento Framework [28] which enablestime-based content negotiation over HTTP By usingthe Accept-Datetime header a client can request thestate of a resource at a given moment If existing theserver will respond with a 302 Found containing theURI of the stored version of such resource So if a re-quest is made to obtain a Linked Connections fragment

8 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 7 The Linked Connections server

identified by a departure time the headers of the frag-ment will provide the information about the Accept-Datetime That means that is possible to know what isthe state of the delays at a given date-time for a set ofconnections enabling different kind of analysis of thebehaviour of a transport network that may contributeon the improvement of its design This approach alsoguarantees the coherence of the data used to answer agive query ie when a query is issued by a client itstarts to download data fragments of relevant connec-tions to form a MST that leads to an appropriate an-swer but if a new update is processed by the serverwhile the client is still processing the query some ofthe connections that the client has already processedmay reappear to the client in later fragments due toupdated delays which will render invalid the MSTcreated so far and may lead to incoherent answersHowever by using Memento and including a Accept-Datetime header in the requests made by clients theserver will guarantee that all the provided responseswill belong to the same moment and will lead to validanswers This mechanism is further described in [6]

34 A running example Providing reliable access toNMBS historical and real time data

The NMBS is the national company of trains in Bel-gium They provide their static and live data as opendata in GTFS and GTFS-RT formats We use this caseto explain how our approach works

As we incorporate the GTFS2LC and GTFSRT2LCtools to the Linked Connections server we only have toconfigure it to start producing the fragments of live andhistorical data The basic configuration for the GTFSdataset is shown in Figure 8 where the companyNamemust be a unique identifier of the dataset the down-loadUrl must be a public URL where the server has

Fig 8 NMBS GTFS configuration

Fig 9 NMBS GTFS-RT configuration

to download the GTFS dataset the downloadOnLanchboolean indicates if the dataset must be downloadedand processed upon server launch the updatePeriod isa node-cron29 expression to check if a new version ofthe dataset has been published for creating a new LCfeed and finally the fragmentSize represents the sizeof each fragment in bytes

For the GTFS-RT feed the configuration is showin the Figure 9 where the downloadUrl is the URLwhere the updates are published the updatePeriod is anode-cron expression that indicates how often shouldthe server look for and process a new version of thefeed the fragmentTimeSpan defines the fragmentationof live data and it represents the time span of everyfragment in seconds and finally the compressionPeriodis also a node-cron expression that defines how oftenwill the live data be compressed using gzip in order toreduce storage consumption

Finally we have to configure the URIs using thetemplate we describe in the Section 32 and launch theserver After the GTFS datasets have been processedwe are able to start querying the data All the frag-ments have a uniform starting point which is the ex-pected departure time of its first lcconnectionSo the clients can start accessing the transport datausing using the departure time as a parameter likethis example httphostportNMBSconnectionsdepartureTime=2017-08-11T164500000Z In thisexample we use a client implementing the CSA routeplanning algorithm to perform a query The clientsource code is available at github30 The parameters ofthe query are as follows

29httpswwwnpmjscompackagenode-cron30httpsgithubcomzoeparmanconnectionscan

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 9

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Departure Stop httpirailbestationsNMBS008811189

ndash Arrival Stop httpirailbestationsNMBS008821246

ndash Departure Time 2018-06-14T150000000Z

The client starts by sending a request to the serverusing the departure time defined in the query as a pa-rameter for the fragments query URL Once the serverreceives the request it will determine which staticfragment contains connections relevant for the query Itdoes this through a binary search looking into the firstconnection of every fragment until it finds a fragmentFk whose first connectionrsquos departure time is greaterthan the requested value and a fragment Fk-1 whosefirst connectionrsquos departure time is less or equal to therequested value determining then that Fk-1 is the frag-ment to be returned in the response to client After thisthe server checks if there are live updates for the con-nections contained in the selected fragment and if soit proceeds to update the departure and arrival times ofthe connections according to the reported delays Thisprocess may involve the inclusion of new connectionsthat originally belonged to previous fragments but dueto the delays are now relevant for the query In thesame way some connections may have to be removedas their delays make them belong to further fragmentsnow Once this is done the connections are resorted bytheir new departure times to ensure a correct executionof the CSA algorithm on the client side Finally theserver adds the necessary metadata (as shown on fig-ure 4) and gives back the response using the JSON-LDformat

Once the client receives the first response it startslooking for the first connection with a departure stopequals to the one defined in the query When it findsone the algorithm starts building a MST where eachbranch represent a different vehicle departing from thespecified departure stop at a different time The algo-rithm process the connections and requests new frag-ments for it and if such a route exists within the trans-port network eventually it will complete at least one ofthe branches of the MST It will keep processing con-nections until the departure time of the next connec-tions to be processed is greater than the arrival time ofthe fastest found route as this means that is no longerpossible to find a faster route Finally the client will re-port the found routes in a JSON object as seen in figure10

Fig 10 CSA result for a query over the NMBS Linked Connectionsstream

4 Evaluation Design

We design a set of experiments in order to evaluateour approach In this section we describe the main fea-tures of our developments the data used for the exper-iments and the testbed we created We implement dif-ferent components in JavaScript for the Nodejs plat-form We chose JavaScript as it allows us to use thecomponents both on a command-line environment aswell as in the browser which is ideal for client sideapplications

ndash GTFSRT2LC A tool to convert live updates andtimetables as open data to Linked ConnectionsVocabulary

ndash LC server Publishes streams of connections inJSON-LD as we explain in the Section 3

ndash CSA algorithm We developed the CSA algorithmon the top of the Linked Connections Frameworkas the client

We also use other tools that we developed previ-ously as the GTFS2LC library that it is able to convertstatic timetables as open data to Linked ConnectionsVocabulary and the Memento Protocol on the top ofthe Linked Connections Server [6] to enable reliableaccess to historical data as we describe in the Section3

These tools are combined in different set-ups with aroute planning algorithm in the client-side and are usedin the following scenarios

10 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Static data fragmentation by size without cacheThe first experiment executes the queries tak-ing into account only timetable static data wherethe size of the exposed fragments are always thesame Client caching is disabled making this sim-ulate the LC without cache set-up where everyrequest could originate from an end-userrsquos deviceWe test different fragments size to test how it af-fects the query answering performance

ndash Static data fragmentation by size with cache Thesecond experiment does the same as the first ex-cept that it uses a client side cache and simulatesthe LC with cache set-up

ndash Historical data fragmentation by size withoutcache The third experiment does the same as thefirst except that it uses the Memento protocol toensure data coherence As the Memento protocolinvolves HTTP redirections to access specific re-sources at different points in time this experimentis meant to measure its impact on query answer-ing performance

ndash Historical data fragmentation by size with cacheThe fourth experiment does the same as the thirdexcept that it uses a client side cache and simu-lates the LC with cache set-up

ndash Live and historical data fragmentation by sizewithout cache The fifth experiment does thesame as the third but at this tame live data is alsotake into account modifying the fragments sizedue to the delays We also test different fragmentssizes

ndash Live and historical data fragmentation by sizewith cache The sixth experiment does the sameas the fifth except that it uses a client side cacheand simulates the LC with cache set-up

ndash Live and historical data fragmentation by timewithout cache The seventh experiment fragmentsthe data based on time as our previous approacheswith the client cache disable We want to test ifthe performance of our pagination by the size isbetter than by time

ndash Live and historical data fragmentation by timewith cache The eighth does the same as the sev-enth but it uses the client side cache

As far as we are aware of there are no general-purpose benchmarks in the state of the art for testingour main contributions with the Linked Connectionsframework Therefore we have created a testbed asfollows 1) we have selected and downloaded 4 dif-ferent representative GTFS datasets available as open

Dataset Stops Routes Trips ConnectionsTBS Barcelona 27 3 5186 1957354

MLM Madrid 81 4 2872 3293575

NMBS Belgium 2615 479 12621 6941813

De Lijn Flanders 36050 1412 305079 63006596Table 1

Characteristics of the transport networks datasets used for thebenchmarks

data that also cover several geographical locations 2)we have created several route planning queries (depar-ture and arrival stop with a departure time) that reflectreal-life trips with multiple levels of complexity and3) we have evaluated these queries in the different setups taking into account the features of each datasetThe query set created for each GTFS dataset was de-fined by randomly taking departure stops and arrivalstops with a random departure time and using the CSAalgorithm to find a route between them This processwas repeated for each transport network and spanninga 24 hours time frame of a regular week day In the endthis gave us a set of queries for each network with dif-ferent levels of complexity measured in terms of theamount of connections that need to be processed bya client to obtain a valid answer The complexity ofthe queries is directly related to the complexity of thetransport network itself Meaning that the amount ofconnections that need to be processed in a query in-crease as the amount of stops and vehicles of the net-work also increase Table 1 shows the characteristics(amount of stops routes and trips) of each transportnetwork and table 2 portrays the least and the mostcomplex queries used during the benchmarks for eachtransport network

The used GTFS datasets and GTFS-RT feeds areavailable as open data on the Web There are twodatasets that represent the Belgian Rail Transport Sys-tem (NMBS) and the Flemish Transport Company (DeLijn) and other two that represent the Tram Companyof Barcelona (TBS) and the Tram Company of Madrid(MLM) NMBS also provides open access to its real-time updates following the GTFS-RT format thereforewe use this data to carry out the evaluation that takesinto account live updates TBS have also developedan ad-hoc live API31 and we are currently developinga tool to transform the ad-hoc live information to theGTFS-RT format but at this moment only historicaldata can be taken into account in this case De Lijn

31httpsopendatatramcatmanual_enpdf

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 11

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Query Dataset of ConnectionsdepStop httpirailbestationsNMBS008841608

NMBS 3371arrStop httpirailbestationsNMBS008841319

depTime 2018-06-14T170000000Z

depStop httpirailbestationsNMBS008886504

NMBS 13797arrStop httpirailbestationsNMBS008841004

depTime 2018-06-14T140000000Z

depStop httpsdatadelijnbestops120281

De Lijn 34696arrStop httpsdatadelijnbestops126536

depTime 2018-06-07T130000000Z

depStop httpsdatadelijnbestops112358

De Lijn 114059arrStop httpsdatadelijnbestops111446

depTime 2018-06-07T130000000Z

depStop httpsbarcelonatbsesstops22

TBS 41arrStop httpsbarcelonatbsesstops21

depTime 2018-06-07T200000000Z

depStop httpsbarcelonatbsesstops14

TBS 556arrStop httpsbarcelonatbsesstops10

depTime 2018-06-07T050000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 148arrStop httpsmadridtramesstopspar_10_32

depTime 2018-06-07T210000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 799arrStop httpsmadridtramesstopspar_10_26

depTime 2018-06-07T160000000Z

Table 2Least and most complex queries of the LC testbed created for everybechmarked transport network

and MLM does not provide access to their live updateswhich is why we only perform static and historicalevaluations over this transport datasets

We ran the experiments on a laptop with an IntelCore i5-7440HQ 28GHz x 4 and 16GB of RAMEach query is executed at least 2 times and we takethe arithmetic average response time Our experimentscan be reproduced using the code at httpsgithubcomjulianrojas87linked-connections-server and thetestbed and the data at httpsgithubcomcef-oasislinkedconnections-tests

5 Results

Next we present the obtained results for each of thedescribed scenarios in the above section

51 Static data with fragmentation by size

These results are obtained from the evaluation madeover Linked Connection stream derived from GTFS

datasets of each transport network The queries exe-cuted on this evaluation do not follow the Mementoprotocol We present the results for both the evaluationwith and without an active client-side cache coveringthe first and second experiment described above

511 NMBSFor NMBS we created fragmentations of 10KB

50KB 300KB 500KB 1MB 3MB 7MB and 10MBWe used the same query set for both cache and nocache evaluations

Fig 11 NMBS static evaluation without cache

Figure 11 shows the response time distribution foreach fragmentation in a set up without a cache Herewe have that for a fragmentation of 300KB we havethe best performance with a median response time of1196ms for the used query set

Fig 12 NMBS static evaluation with cache

Figure 12 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 704ms

12 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

512 TBS BarcelonaFor TBS we used fragmentations going from 10KB

to 3MB and we used the same query set for both cacheand no cache set ups

Fig 13 TBS static evaluation without cache

Figure 13 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of75ms

Fig 14 TBS static evaluation with cache

Figure 14 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 10KB and a medianresponse time of 28ms

513 MLMFor MLM we used fragmentations going from

10KB to 3MB and we used the same query set for bothcache and no cache set ups

Fig 15 MLM static evaluation without cache

Figure 15 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of128ms

Fig 16 MLM static evaluation with cache

Figure 16 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 57ms

514 De LijnFor De Lijn the fragmentations created go from

500KB to 15MB The query set used for the bench-marks is the same in both active and inactive cache setups

Figure 17 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained witha fragmentation of 500KB and a median response timeof 33863ms

Figure 18 shows the response time distribution foreach fragmentation in a set up with an active cache on

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 7: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 7

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 5 Connection example with live updates

Fig 6 The GTFSRT2LC tool

tionarrivalStop connectiondepartureTime andconnectionarrivalTime As both departureTimeand arrivalTime are date objects the expectedformat can be defined using brackets

Finally we developed a tool27 that analyses the in-formation from an static GTFS dataset and the updatesfrom a GTFS-RT feed exploits the implicit relationsamong the CSV files and creates the live Linked Con-nections feed in the desirable format with the help ofthe URIs template as it is shown in Figure 6

33 Managing live and historical data with theLinked Connections server

The third main contribution of this paper after de-velop a way to create Linked Connections feeds tak-ing into account live data is how to provide reliableaccess to historical and live transport data in an cost-efficiently way For that reason we develop a LinkedConnections server which exposes data fragments us-ing JSON-LD serialization format First the serveruses the GTFS2LC28 library to convert a GTFS datasetinto a time sorted directed acyclic graph of connec-tions After that the fragmenting process starts using

27httpsgithubcomlinkedconnectionsgtfsrt2lc28httpsgithubcomlinkedconnectionsgtfs2lc

the information provided in the configuration of theserver about the fragment size Once the fragmentationprocess is completed it starts using the GTFSRT2LClibrary for mapping real time updates into a LinkedConnections feed as we describe in the above section

As we aforementioned the previous process forfragmenting LC was based on a predefined time spanwhich used this variable for the pagination of the dataand created an heterogeneous environment in terms ofthe fragment size In other words setting a fixed timewindow for the creation of the fragments (eg 10 min-utes) could lead to the creation of fragments contain-ing a high number of connections specially in the rushhours and thus being big fragments in terms of sizeAt the same time smaller fragments could also be cre-ated at times where there are few or no vehicles de-parting This is the main reason why we start to frag-menting the historical data of the LC feed homoge-neously providing a specific fragment size this has arelevant effect in terms of performance as we demon-strate in the Section 5 As for the LC live feed wekeep an unique timeline of connection updates that isstored and fragmented using a predefined time span incontrast to static timetable updates which may containa complete re-write or partially overlap with previousversions from a time perspective

The server is able to manage both the historical andthe live connections First the server create the LinkedConnections from the information provided by a GTFSdataset and fragment the historical data into severalfragments After that if the transport system has anopen live service the sever starts to process the updatesfrom the GTFS-RT feed and create another streamof connections based on that information Then theserver searches the correspondence between the his-torical and live connections based on the URIs of thetrips and keeps an registry over time of the historicalevolution of the connections regarding delays andorcancellations Finally the server exposes the stream ofhistorical and live connections on an API following thefragmentation approach This process shown in Fig-ure 7 is repeated every time that the server process anupdate

The server also allows querying historic data bymeans of the Memento Framework [28] which enablestime-based content negotiation over HTTP By usingthe Accept-Datetime header a client can request thestate of a resource at a given moment If existing theserver will respond with a 302 Found containing theURI of the stored version of such resource So if a re-quest is made to obtain a Linked Connections fragment

8 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 7 The Linked Connections server

identified by a departure time the headers of the frag-ment will provide the information about the Accept-Datetime That means that is possible to know what isthe state of the delays at a given date-time for a set ofconnections enabling different kind of analysis of thebehaviour of a transport network that may contributeon the improvement of its design This approach alsoguarantees the coherence of the data used to answer agive query ie when a query is issued by a client itstarts to download data fragments of relevant connec-tions to form a MST that leads to an appropriate an-swer but if a new update is processed by the serverwhile the client is still processing the query some ofthe connections that the client has already processedmay reappear to the client in later fragments due toupdated delays which will render invalid the MSTcreated so far and may lead to incoherent answersHowever by using Memento and including a Accept-Datetime header in the requests made by clients theserver will guarantee that all the provided responseswill belong to the same moment and will lead to validanswers This mechanism is further described in [6]

34 A running example Providing reliable access toNMBS historical and real time data

The NMBS is the national company of trains in Bel-gium They provide their static and live data as opendata in GTFS and GTFS-RT formats We use this caseto explain how our approach works

As we incorporate the GTFS2LC and GTFSRT2LCtools to the Linked Connections server we only have toconfigure it to start producing the fragments of live andhistorical data The basic configuration for the GTFSdataset is shown in Figure 8 where the companyNamemust be a unique identifier of the dataset the down-loadUrl must be a public URL where the server has

Fig 8 NMBS GTFS configuration

Fig 9 NMBS GTFS-RT configuration

to download the GTFS dataset the downloadOnLanchboolean indicates if the dataset must be downloadedand processed upon server launch the updatePeriod isa node-cron29 expression to check if a new version ofthe dataset has been published for creating a new LCfeed and finally the fragmentSize represents the sizeof each fragment in bytes

For the GTFS-RT feed the configuration is showin the Figure 9 where the downloadUrl is the URLwhere the updates are published the updatePeriod is anode-cron expression that indicates how often shouldthe server look for and process a new version of thefeed the fragmentTimeSpan defines the fragmentationof live data and it represents the time span of everyfragment in seconds and finally the compressionPeriodis also a node-cron expression that defines how oftenwill the live data be compressed using gzip in order toreduce storage consumption

Finally we have to configure the URIs using thetemplate we describe in the Section 32 and launch theserver After the GTFS datasets have been processedwe are able to start querying the data All the frag-ments have a uniform starting point which is the ex-pected departure time of its first lcconnectionSo the clients can start accessing the transport datausing using the departure time as a parameter likethis example httphostportNMBSconnectionsdepartureTime=2017-08-11T164500000Z In thisexample we use a client implementing the CSA routeplanning algorithm to perform a query The clientsource code is available at github30 The parameters ofthe query are as follows

29httpswwwnpmjscompackagenode-cron30httpsgithubcomzoeparmanconnectionscan

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 9

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Departure Stop httpirailbestationsNMBS008811189

ndash Arrival Stop httpirailbestationsNMBS008821246

ndash Departure Time 2018-06-14T150000000Z

The client starts by sending a request to the serverusing the departure time defined in the query as a pa-rameter for the fragments query URL Once the serverreceives the request it will determine which staticfragment contains connections relevant for the query Itdoes this through a binary search looking into the firstconnection of every fragment until it finds a fragmentFk whose first connectionrsquos departure time is greaterthan the requested value and a fragment Fk-1 whosefirst connectionrsquos departure time is less or equal to therequested value determining then that Fk-1 is the frag-ment to be returned in the response to client After thisthe server checks if there are live updates for the con-nections contained in the selected fragment and if soit proceeds to update the departure and arrival times ofthe connections according to the reported delays Thisprocess may involve the inclusion of new connectionsthat originally belonged to previous fragments but dueto the delays are now relevant for the query In thesame way some connections may have to be removedas their delays make them belong to further fragmentsnow Once this is done the connections are resorted bytheir new departure times to ensure a correct executionof the CSA algorithm on the client side Finally theserver adds the necessary metadata (as shown on fig-ure 4) and gives back the response using the JSON-LDformat

Once the client receives the first response it startslooking for the first connection with a departure stopequals to the one defined in the query When it findsone the algorithm starts building a MST where eachbranch represent a different vehicle departing from thespecified departure stop at a different time The algo-rithm process the connections and requests new frag-ments for it and if such a route exists within the trans-port network eventually it will complete at least one ofthe branches of the MST It will keep processing con-nections until the departure time of the next connec-tions to be processed is greater than the arrival time ofthe fastest found route as this means that is no longerpossible to find a faster route Finally the client will re-port the found routes in a JSON object as seen in figure10

Fig 10 CSA result for a query over the NMBS Linked Connectionsstream

4 Evaluation Design

We design a set of experiments in order to evaluateour approach In this section we describe the main fea-tures of our developments the data used for the exper-iments and the testbed we created We implement dif-ferent components in JavaScript for the Nodejs plat-form We chose JavaScript as it allows us to use thecomponents both on a command-line environment aswell as in the browser which is ideal for client sideapplications

ndash GTFSRT2LC A tool to convert live updates andtimetables as open data to Linked ConnectionsVocabulary

ndash LC server Publishes streams of connections inJSON-LD as we explain in the Section 3

ndash CSA algorithm We developed the CSA algorithmon the top of the Linked Connections Frameworkas the client

We also use other tools that we developed previ-ously as the GTFS2LC library that it is able to convertstatic timetables as open data to Linked ConnectionsVocabulary and the Memento Protocol on the top ofthe Linked Connections Server [6] to enable reliableaccess to historical data as we describe in the Section3

These tools are combined in different set-ups with aroute planning algorithm in the client-side and are usedin the following scenarios

10 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Static data fragmentation by size without cacheThe first experiment executes the queries tak-ing into account only timetable static data wherethe size of the exposed fragments are always thesame Client caching is disabled making this sim-ulate the LC without cache set-up where everyrequest could originate from an end-userrsquos deviceWe test different fragments size to test how it af-fects the query answering performance

ndash Static data fragmentation by size with cache Thesecond experiment does the same as the first ex-cept that it uses a client side cache and simulatesthe LC with cache set-up

ndash Historical data fragmentation by size withoutcache The third experiment does the same as thefirst except that it uses the Memento protocol toensure data coherence As the Memento protocolinvolves HTTP redirections to access specific re-sources at different points in time this experimentis meant to measure its impact on query answer-ing performance

ndash Historical data fragmentation by size with cacheThe fourth experiment does the same as the thirdexcept that it uses a client side cache and simu-lates the LC with cache set-up

ndash Live and historical data fragmentation by sizewithout cache The fifth experiment does thesame as the third but at this tame live data is alsotake into account modifying the fragments sizedue to the delays We also test different fragmentssizes

ndash Live and historical data fragmentation by sizewith cache The sixth experiment does the sameas the fifth except that it uses a client side cacheand simulates the LC with cache set-up

ndash Live and historical data fragmentation by timewithout cache The seventh experiment fragmentsthe data based on time as our previous approacheswith the client cache disable We want to test ifthe performance of our pagination by the size isbetter than by time

ndash Live and historical data fragmentation by timewith cache The eighth does the same as the sev-enth but it uses the client side cache

As far as we are aware of there are no general-purpose benchmarks in the state of the art for testingour main contributions with the Linked Connectionsframework Therefore we have created a testbed asfollows 1) we have selected and downloaded 4 dif-ferent representative GTFS datasets available as open

Dataset Stops Routes Trips ConnectionsTBS Barcelona 27 3 5186 1957354

MLM Madrid 81 4 2872 3293575

NMBS Belgium 2615 479 12621 6941813

De Lijn Flanders 36050 1412 305079 63006596Table 1

Characteristics of the transport networks datasets used for thebenchmarks

data that also cover several geographical locations 2)we have created several route planning queries (depar-ture and arrival stop with a departure time) that reflectreal-life trips with multiple levels of complexity and3) we have evaluated these queries in the different setups taking into account the features of each datasetThe query set created for each GTFS dataset was de-fined by randomly taking departure stops and arrivalstops with a random departure time and using the CSAalgorithm to find a route between them This processwas repeated for each transport network and spanninga 24 hours time frame of a regular week day In the endthis gave us a set of queries for each network with dif-ferent levels of complexity measured in terms of theamount of connections that need to be processed bya client to obtain a valid answer The complexity ofthe queries is directly related to the complexity of thetransport network itself Meaning that the amount ofconnections that need to be processed in a query in-crease as the amount of stops and vehicles of the net-work also increase Table 1 shows the characteristics(amount of stops routes and trips) of each transportnetwork and table 2 portrays the least and the mostcomplex queries used during the benchmarks for eachtransport network

The used GTFS datasets and GTFS-RT feeds areavailable as open data on the Web There are twodatasets that represent the Belgian Rail Transport Sys-tem (NMBS) and the Flemish Transport Company (DeLijn) and other two that represent the Tram Companyof Barcelona (TBS) and the Tram Company of Madrid(MLM) NMBS also provides open access to its real-time updates following the GTFS-RT format thereforewe use this data to carry out the evaluation that takesinto account live updates TBS have also developedan ad-hoc live API31 and we are currently developinga tool to transform the ad-hoc live information to theGTFS-RT format but at this moment only historicaldata can be taken into account in this case De Lijn

31httpsopendatatramcatmanual_enpdf

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 11

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Query Dataset of ConnectionsdepStop httpirailbestationsNMBS008841608

NMBS 3371arrStop httpirailbestationsNMBS008841319

depTime 2018-06-14T170000000Z

depStop httpirailbestationsNMBS008886504

NMBS 13797arrStop httpirailbestationsNMBS008841004

depTime 2018-06-14T140000000Z

depStop httpsdatadelijnbestops120281

De Lijn 34696arrStop httpsdatadelijnbestops126536

depTime 2018-06-07T130000000Z

depStop httpsdatadelijnbestops112358

De Lijn 114059arrStop httpsdatadelijnbestops111446

depTime 2018-06-07T130000000Z

depStop httpsbarcelonatbsesstops22

TBS 41arrStop httpsbarcelonatbsesstops21

depTime 2018-06-07T200000000Z

depStop httpsbarcelonatbsesstops14

TBS 556arrStop httpsbarcelonatbsesstops10

depTime 2018-06-07T050000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 148arrStop httpsmadridtramesstopspar_10_32

depTime 2018-06-07T210000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 799arrStop httpsmadridtramesstopspar_10_26

depTime 2018-06-07T160000000Z

Table 2Least and most complex queries of the LC testbed created for everybechmarked transport network

and MLM does not provide access to their live updateswhich is why we only perform static and historicalevaluations over this transport datasets

We ran the experiments on a laptop with an IntelCore i5-7440HQ 28GHz x 4 and 16GB of RAMEach query is executed at least 2 times and we takethe arithmetic average response time Our experimentscan be reproduced using the code at httpsgithubcomjulianrojas87linked-connections-server and thetestbed and the data at httpsgithubcomcef-oasislinkedconnections-tests

5 Results

Next we present the obtained results for each of thedescribed scenarios in the above section

51 Static data with fragmentation by size

These results are obtained from the evaluation madeover Linked Connection stream derived from GTFS

datasets of each transport network The queries exe-cuted on this evaluation do not follow the Mementoprotocol We present the results for both the evaluationwith and without an active client-side cache coveringthe first and second experiment described above

511 NMBSFor NMBS we created fragmentations of 10KB

50KB 300KB 500KB 1MB 3MB 7MB and 10MBWe used the same query set for both cache and nocache evaluations

Fig 11 NMBS static evaluation without cache

Figure 11 shows the response time distribution foreach fragmentation in a set up without a cache Herewe have that for a fragmentation of 300KB we havethe best performance with a median response time of1196ms for the used query set

Fig 12 NMBS static evaluation with cache

Figure 12 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 704ms

12 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

512 TBS BarcelonaFor TBS we used fragmentations going from 10KB

to 3MB and we used the same query set for both cacheand no cache set ups

Fig 13 TBS static evaluation without cache

Figure 13 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of75ms

Fig 14 TBS static evaluation with cache

Figure 14 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 10KB and a medianresponse time of 28ms

513 MLMFor MLM we used fragmentations going from

10KB to 3MB and we used the same query set for bothcache and no cache set ups

Fig 15 MLM static evaluation without cache

Figure 15 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of128ms

Fig 16 MLM static evaluation with cache

Figure 16 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 57ms

514 De LijnFor De Lijn the fragmentations created go from

500KB to 15MB The query set used for the bench-marks is the same in both active and inactive cache setups

Figure 17 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained witha fragmentation of 500KB and a median response timeof 33863ms

Figure 18 shows the response time distribution foreach fragmentation in a set up with an active cache on

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 8: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

8 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 7 The Linked Connections server

identified by a departure time the headers of the frag-ment will provide the information about the Accept-Datetime That means that is possible to know what isthe state of the delays at a given date-time for a set ofconnections enabling different kind of analysis of thebehaviour of a transport network that may contributeon the improvement of its design This approach alsoguarantees the coherence of the data used to answer agive query ie when a query is issued by a client itstarts to download data fragments of relevant connec-tions to form a MST that leads to an appropriate an-swer but if a new update is processed by the serverwhile the client is still processing the query some ofthe connections that the client has already processedmay reappear to the client in later fragments due toupdated delays which will render invalid the MSTcreated so far and may lead to incoherent answersHowever by using Memento and including a Accept-Datetime header in the requests made by clients theserver will guarantee that all the provided responseswill belong to the same moment and will lead to validanswers This mechanism is further described in [6]

34 A running example Providing reliable access toNMBS historical and real time data

The NMBS is the national company of trains in Bel-gium They provide their static and live data as opendata in GTFS and GTFS-RT formats We use this caseto explain how our approach works

As we incorporate the GTFS2LC and GTFSRT2LCtools to the Linked Connections server we only have toconfigure it to start producing the fragments of live andhistorical data The basic configuration for the GTFSdataset is shown in Figure 8 where the companyNamemust be a unique identifier of the dataset the down-loadUrl must be a public URL where the server has

Fig 8 NMBS GTFS configuration

Fig 9 NMBS GTFS-RT configuration

to download the GTFS dataset the downloadOnLanchboolean indicates if the dataset must be downloadedand processed upon server launch the updatePeriod isa node-cron29 expression to check if a new version ofthe dataset has been published for creating a new LCfeed and finally the fragmentSize represents the sizeof each fragment in bytes

For the GTFS-RT feed the configuration is showin the Figure 9 where the downloadUrl is the URLwhere the updates are published the updatePeriod is anode-cron expression that indicates how often shouldthe server look for and process a new version of thefeed the fragmentTimeSpan defines the fragmentationof live data and it represents the time span of everyfragment in seconds and finally the compressionPeriodis also a node-cron expression that defines how oftenwill the live data be compressed using gzip in order toreduce storage consumption

Finally we have to configure the URIs using thetemplate we describe in the Section 32 and launch theserver After the GTFS datasets have been processedwe are able to start querying the data All the frag-ments have a uniform starting point which is the ex-pected departure time of its first lcconnectionSo the clients can start accessing the transport datausing using the departure time as a parameter likethis example httphostportNMBSconnectionsdepartureTime=2017-08-11T164500000Z In thisexample we use a client implementing the CSA routeplanning algorithm to perform a query The clientsource code is available at github30 The parameters ofthe query are as follows

29httpswwwnpmjscompackagenode-cron30httpsgithubcomzoeparmanconnectionscan

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 9

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Departure Stop httpirailbestationsNMBS008811189

ndash Arrival Stop httpirailbestationsNMBS008821246

ndash Departure Time 2018-06-14T150000000Z

The client starts by sending a request to the serverusing the departure time defined in the query as a pa-rameter for the fragments query URL Once the serverreceives the request it will determine which staticfragment contains connections relevant for the query Itdoes this through a binary search looking into the firstconnection of every fragment until it finds a fragmentFk whose first connectionrsquos departure time is greaterthan the requested value and a fragment Fk-1 whosefirst connectionrsquos departure time is less or equal to therequested value determining then that Fk-1 is the frag-ment to be returned in the response to client After thisthe server checks if there are live updates for the con-nections contained in the selected fragment and if soit proceeds to update the departure and arrival times ofthe connections according to the reported delays Thisprocess may involve the inclusion of new connectionsthat originally belonged to previous fragments but dueto the delays are now relevant for the query In thesame way some connections may have to be removedas their delays make them belong to further fragmentsnow Once this is done the connections are resorted bytheir new departure times to ensure a correct executionof the CSA algorithm on the client side Finally theserver adds the necessary metadata (as shown on fig-ure 4) and gives back the response using the JSON-LDformat

Once the client receives the first response it startslooking for the first connection with a departure stopequals to the one defined in the query When it findsone the algorithm starts building a MST where eachbranch represent a different vehicle departing from thespecified departure stop at a different time The algo-rithm process the connections and requests new frag-ments for it and if such a route exists within the trans-port network eventually it will complete at least one ofthe branches of the MST It will keep processing con-nections until the departure time of the next connec-tions to be processed is greater than the arrival time ofthe fastest found route as this means that is no longerpossible to find a faster route Finally the client will re-port the found routes in a JSON object as seen in figure10

Fig 10 CSA result for a query over the NMBS Linked Connectionsstream

4 Evaluation Design

We design a set of experiments in order to evaluateour approach In this section we describe the main fea-tures of our developments the data used for the exper-iments and the testbed we created We implement dif-ferent components in JavaScript for the Nodejs plat-form We chose JavaScript as it allows us to use thecomponents both on a command-line environment aswell as in the browser which is ideal for client sideapplications

ndash GTFSRT2LC A tool to convert live updates andtimetables as open data to Linked ConnectionsVocabulary

ndash LC server Publishes streams of connections inJSON-LD as we explain in the Section 3

ndash CSA algorithm We developed the CSA algorithmon the top of the Linked Connections Frameworkas the client

We also use other tools that we developed previ-ously as the GTFS2LC library that it is able to convertstatic timetables as open data to Linked ConnectionsVocabulary and the Memento Protocol on the top ofthe Linked Connections Server [6] to enable reliableaccess to historical data as we describe in the Section3

These tools are combined in different set-ups with aroute planning algorithm in the client-side and are usedin the following scenarios

10 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Static data fragmentation by size without cacheThe first experiment executes the queries tak-ing into account only timetable static data wherethe size of the exposed fragments are always thesame Client caching is disabled making this sim-ulate the LC without cache set-up where everyrequest could originate from an end-userrsquos deviceWe test different fragments size to test how it af-fects the query answering performance

ndash Static data fragmentation by size with cache Thesecond experiment does the same as the first ex-cept that it uses a client side cache and simulatesthe LC with cache set-up

ndash Historical data fragmentation by size withoutcache The third experiment does the same as thefirst except that it uses the Memento protocol toensure data coherence As the Memento protocolinvolves HTTP redirections to access specific re-sources at different points in time this experimentis meant to measure its impact on query answer-ing performance

ndash Historical data fragmentation by size with cacheThe fourth experiment does the same as the thirdexcept that it uses a client side cache and simu-lates the LC with cache set-up

ndash Live and historical data fragmentation by sizewithout cache The fifth experiment does thesame as the third but at this tame live data is alsotake into account modifying the fragments sizedue to the delays We also test different fragmentssizes

ndash Live and historical data fragmentation by sizewith cache The sixth experiment does the sameas the fifth except that it uses a client side cacheand simulates the LC with cache set-up

ndash Live and historical data fragmentation by timewithout cache The seventh experiment fragmentsthe data based on time as our previous approacheswith the client cache disable We want to test ifthe performance of our pagination by the size isbetter than by time

ndash Live and historical data fragmentation by timewith cache The eighth does the same as the sev-enth but it uses the client side cache

As far as we are aware of there are no general-purpose benchmarks in the state of the art for testingour main contributions with the Linked Connectionsframework Therefore we have created a testbed asfollows 1) we have selected and downloaded 4 dif-ferent representative GTFS datasets available as open

Dataset Stops Routes Trips ConnectionsTBS Barcelona 27 3 5186 1957354

MLM Madrid 81 4 2872 3293575

NMBS Belgium 2615 479 12621 6941813

De Lijn Flanders 36050 1412 305079 63006596Table 1

Characteristics of the transport networks datasets used for thebenchmarks

data that also cover several geographical locations 2)we have created several route planning queries (depar-ture and arrival stop with a departure time) that reflectreal-life trips with multiple levels of complexity and3) we have evaluated these queries in the different setups taking into account the features of each datasetThe query set created for each GTFS dataset was de-fined by randomly taking departure stops and arrivalstops with a random departure time and using the CSAalgorithm to find a route between them This processwas repeated for each transport network and spanninga 24 hours time frame of a regular week day In the endthis gave us a set of queries for each network with dif-ferent levels of complexity measured in terms of theamount of connections that need to be processed bya client to obtain a valid answer The complexity ofthe queries is directly related to the complexity of thetransport network itself Meaning that the amount ofconnections that need to be processed in a query in-crease as the amount of stops and vehicles of the net-work also increase Table 1 shows the characteristics(amount of stops routes and trips) of each transportnetwork and table 2 portrays the least and the mostcomplex queries used during the benchmarks for eachtransport network

The used GTFS datasets and GTFS-RT feeds areavailable as open data on the Web There are twodatasets that represent the Belgian Rail Transport Sys-tem (NMBS) and the Flemish Transport Company (DeLijn) and other two that represent the Tram Companyof Barcelona (TBS) and the Tram Company of Madrid(MLM) NMBS also provides open access to its real-time updates following the GTFS-RT format thereforewe use this data to carry out the evaluation that takesinto account live updates TBS have also developedan ad-hoc live API31 and we are currently developinga tool to transform the ad-hoc live information to theGTFS-RT format but at this moment only historicaldata can be taken into account in this case De Lijn

31httpsopendatatramcatmanual_enpdf

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 11

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Query Dataset of ConnectionsdepStop httpirailbestationsNMBS008841608

NMBS 3371arrStop httpirailbestationsNMBS008841319

depTime 2018-06-14T170000000Z

depStop httpirailbestationsNMBS008886504

NMBS 13797arrStop httpirailbestationsNMBS008841004

depTime 2018-06-14T140000000Z

depStop httpsdatadelijnbestops120281

De Lijn 34696arrStop httpsdatadelijnbestops126536

depTime 2018-06-07T130000000Z

depStop httpsdatadelijnbestops112358

De Lijn 114059arrStop httpsdatadelijnbestops111446

depTime 2018-06-07T130000000Z

depStop httpsbarcelonatbsesstops22

TBS 41arrStop httpsbarcelonatbsesstops21

depTime 2018-06-07T200000000Z

depStop httpsbarcelonatbsesstops14

TBS 556arrStop httpsbarcelonatbsesstops10

depTime 2018-06-07T050000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 148arrStop httpsmadridtramesstopspar_10_32

depTime 2018-06-07T210000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 799arrStop httpsmadridtramesstopspar_10_26

depTime 2018-06-07T160000000Z

Table 2Least and most complex queries of the LC testbed created for everybechmarked transport network

and MLM does not provide access to their live updateswhich is why we only perform static and historicalevaluations over this transport datasets

We ran the experiments on a laptop with an IntelCore i5-7440HQ 28GHz x 4 and 16GB of RAMEach query is executed at least 2 times and we takethe arithmetic average response time Our experimentscan be reproduced using the code at httpsgithubcomjulianrojas87linked-connections-server and thetestbed and the data at httpsgithubcomcef-oasislinkedconnections-tests

5 Results

Next we present the obtained results for each of thedescribed scenarios in the above section

51 Static data with fragmentation by size

These results are obtained from the evaluation madeover Linked Connection stream derived from GTFS

datasets of each transport network The queries exe-cuted on this evaluation do not follow the Mementoprotocol We present the results for both the evaluationwith and without an active client-side cache coveringthe first and second experiment described above

511 NMBSFor NMBS we created fragmentations of 10KB

50KB 300KB 500KB 1MB 3MB 7MB and 10MBWe used the same query set for both cache and nocache evaluations

Fig 11 NMBS static evaluation without cache

Figure 11 shows the response time distribution foreach fragmentation in a set up without a cache Herewe have that for a fragmentation of 300KB we havethe best performance with a median response time of1196ms for the used query set

Fig 12 NMBS static evaluation with cache

Figure 12 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 704ms

12 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

512 TBS BarcelonaFor TBS we used fragmentations going from 10KB

to 3MB and we used the same query set for both cacheand no cache set ups

Fig 13 TBS static evaluation without cache

Figure 13 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of75ms

Fig 14 TBS static evaluation with cache

Figure 14 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 10KB and a medianresponse time of 28ms

513 MLMFor MLM we used fragmentations going from

10KB to 3MB and we used the same query set for bothcache and no cache set ups

Fig 15 MLM static evaluation without cache

Figure 15 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of128ms

Fig 16 MLM static evaluation with cache

Figure 16 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 57ms

514 De LijnFor De Lijn the fragmentations created go from

500KB to 15MB The query set used for the bench-marks is the same in both active and inactive cache setups

Figure 17 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained witha fragmentation of 500KB and a median response timeof 33863ms

Figure 18 shows the response time distribution foreach fragmentation in a set up with an active cache on

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 9: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 9

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Departure Stop httpirailbestationsNMBS008811189

ndash Arrival Stop httpirailbestationsNMBS008821246

ndash Departure Time 2018-06-14T150000000Z

The client starts by sending a request to the serverusing the departure time defined in the query as a pa-rameter for the fragments query URL Once the serverreceives the request it will determine which staticfragment contains connections relevant for the query Itdoes this through a binary search looking into the firstconnection of every fragment until it finds a fragmentFk whose first connectionrsquos departure time is greaterthan the requested value and a fragment Fk-1 whosefirst connectionrsquos departure time is less or equal to therequested value determining then that Fk-1 is the frag-ment to be returned in the response to client After thisthe server checks if there are live updates for the con-nections contained in the selected fragment and if soit proceeds to update the departure and arrival times ofthe connections according to the reported delays Thisprocess may involve the inclusion of new connectionsthat originally belonged to previous fragments but dueto the delays are now relevant for the query In thesame way some connections may have to be removedas their delays make them belong to further fragmentsnow Once this is done the connections are resorted bytheir new departure times to ensure a correct executionof the CSA algorithm on the client side Finally theserver adds the necessary metadata (as shown on fig-ure 4) and gives back the response using the JSON-LDformat

Once the client receives the first response it startslooking for the first connection with a departure stopequals to the one defined in the query When it findsone the algorithm starts building a MST where eachbranch represent a different vehicle departing from thespecified departure stop at a different time The algo-rithm process the connections and requests new frag-ments for it and if such a route exists within the trans-port network eventually it will complete at least one ofthe branches of the MST It will keep processing con-nections until the departure time of the next connec-tions to be processed is greater than the arrival time ofthe fastest found route as this means that is no longerpossible to find a faster route Finally the client will re-port the found routes in a JSON object as seen in figure10

Fig 10 CSA result for a query over the NMBS Linked Connectionsstream

4 Evaluation Design

We design a set of experiments in order to evaluateour approach In this section we describe the main fea-tures of our developments the data used for the exper-iments and the testbed we created We implement dif-ferent components in JavaScript for the Nodejs plat-form We chose JavaScript as it allows us to use thecomponents both on a command-line environment aswell as in the browser which is ideal for client sideapplications

ndash GTFSRT2LC A tool to convert live updates andtimetables as open data to Linked ConnectionsVocabulary

ndash LC server Publishes streams of connections inJSON-LD as we explain in the Section 3

ndash CSA algorithm We developed the CSA algorithmon the top of the Linked Connections Frameworkas the client

We also use other tools that we developed previ-ously as the GTFS2LC library that it is able to convertstatic timetables as open data to Linked ConnectionsVocabulary and the Memento Protocol on the top ofthe Linked Connections Server [6] to enable reliableaccess to historical data as we describe in the Section3

These tools are combined in different set-ups with aroute planning algorithm in the client-side and are usedin the following scenarios

10 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Static data fragmentation by size without cacheThe first experiment executes the queries tak-ing into account only timetable static data wherethe size of the exposed fragments are always thesame Client caching is disabled making this sim-ulate the LC without cache set-up where everyrequest could originate from an end-userrsquos deviceWe test different fragments size to test how it af-fects the query answering performance

ndash Static data fragmentation by size with cache Thesecond experiment does the same as the first ex-cept that it uses a client side cache and simulatesthe LC with cache set-up

ndash Historical data fragmentation by size withoutcache The third experiment does the same as thefirst except that it uses the Memento protocol toensure data coherence As the Memento protocolinvolves HTTP redirections to access specific re-sources at different points in time this experimentis meant to measure its impact on query answer-ing performance

ndash Historical data fragmentation by size with cacheThe fourth experiment does the same as the thirdexcept that it uses a client side cache and simu-lates the LC with cache set-up

ndash Live and historical data fragmentation by sizewithout cache The fifth experiment does thesame as the third but at this tame live data is alsotake into account modifying the fragments sizedue to the delays We also test different fragmentssizes

ndash Live and historical data fragmentation by sizewith cache The sixth experiment does the sameas the fifth except that it uses a client side cacheand simulates the LC with cache set-up

ndash Live and historical data fragmentation by timewithout cache The seventh experiment fragmentsthe data based on time as our previous approacheswith the client cache disable We want to test ifthe performance of our pagination by the size isbetter than by time

ndash Live and historical data fragmentation by timewith cache The eighth does the same as the sev-enth but it uses the client side cache

As far as we are aware of there are no general-purpose benchmarks in the state of the art for testingour main contributions with the Linked Connectionsframework Therefore we have created a testbed asfollows 1) we have selected and downloaded 4 dif-ferent representative GTFS datasets available as open

Dataset Stops Routes Trips ConnectionsTBS Barcelona 27 3 5186 1957354

MLM Madrid 81 4 2872 3293575

NMBS Belgium 2615 479 12621 6941813

De Lijn Flanders 36050 1412 305079 63006596Table 1

Characteristics of the transport networks datasets used for thebenchmarks

data that also cover several geographical locations 2)we have created several route planning queries (depar-ture and arrival stop with a departure time) that reflectreal-life trips with multiple levels of complexity and3) we have evaluated these queries in the different setups taking into account the features of each datasetThe query set created for each GTFS dataset was de-fined by randomly taking departure stops and arrivalstops with a random departure time and using the CSAalgorithm to find a route between them This processwas repeated for each transport network and spanninga 24 hours time frame of a regular week day In the endthis gave us a set of queries for each network with dif-ferent levels of complexity measured in terms of theamount of connections that need to be processed bya client to obtain a valid answer The complexity ofthe queries is directly related to the complexity of thetransport network itself Meaning that the amount ofconnections that need to be processed in a query in-crease as the amount of stops and vehicles of the net-work also increase Table 1 shows the characteristics(amount of stops routes and trips) of each transportnetwork and table 2 portrays the least and the mostcomplex queries used during the benchmarks for eachtransport network

The used GTFS datasets and GTFS-RT feeds areavailable as open data on the Web There are twodatasets that represent the Belgian Rail Transport Sys-tem (NMBS) and the Flemish Transport Company (DeLijn) and other two that represent the Tram Companyof Barcelona (TBS) and the Tram Company of Madrid(MLM) NMBS also provides open access to its real-time updates following the GTFS-RT format thereforewe use this data to carry out the evaluation that takesinto account live updates TBS have also developedan ad-hoc live API31 and we are currently developinga tool to transform the ad-hoc live information to theGTFS-RT format but at this moment only historicaldata can be taken into account in this case De Lijn

31httpsopendatatramcatmanual_enpdf

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 11

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Query Dataset of ConnectionsdepStop httpirailbestationsNMBS008841608

NMBS 3371arrStop httpirailbestationsNMBS008841319

depTime 2018-06-14T170000000Z

depStop httpirailbestationsNMBS008886504

NMBS 13797arrStop httpirailbestationsNMBS008841004

depTime 2018-06-14T140000000Z

depStop httpsdatadelijnbestops120281

De Lijn 34696arrStop httpsdatadelijnbestops126536

depTime 2018-06-07T130000000Z

depStop httpsdatadelijnbestops112358

De Lijn 114059arrStop httpsdatadelijnbestops111446

depTime 2018-06-07T130000000Z

depStop httpsbarcelonatbsesstops22

TBS 41arrStop httpsbarcelonatbsesstops21

depTime 2018-06-07T200000000Z

depStop httpsbarcelonatbsesstops14

TBS 556arrStop httpsbarcelonatbsesstops10

depTime 2018-06-07T050000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 148arrStop httpsmadridtramesstopspar_10_32

depTime 2018-06-07T210000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 799arrStop httpsmadridtramesstopspar_10_26

depTime 2018-06-07T160000000Z

Table 2Least and most complex queries of the LC testbed created for everybechmarked transport network

and MLM does not provide access to their live updateswhich is why we only perform static and historicalevaluations over this transport datasets

We ran the experiments on a laptop with an IntelCore i5-7440HQ 28GHz x 4 and 16GB of RAMEach query is executed at least 2 times and we takethe arithmetic average response time Our experimentscan be reproduced using the code at httpsgithubcomjulianrojas87linked-connections-server and thetestbed and the data at httpsgithubcomcef-oasislinkedconnections-tests

5 Results

Next we present the obtained results for each of thedescribed scenarios in the above section

51 Static data with fragmentation by size

These results are obtained from the evaluation madeover Linked Connection stream derived from GTFS

datasets of each transport network The queries exe-cuted on this evaluation do not follow the Mementoprotocol We present the results for both the evaluationwith and without an active client-side cache coveringthe first and second experiment described above

511 NMBSFor NMBS we created fragmentations of 10KB

50KB 300KB 500KB 1MB 3MB 7MB and 10MBWe used the same query set for both cache and nocache evaluations

Fig 11 NMBS static evaluation without cache

Figure 11 shows the response time distribution foreach fragmentation in a set up without a cache Herewe have that for a fragmentation of 300KB we havethe best performance with a median response time of1196ms for the used query set

Fig 12 NMBS static evaluation with cache

Figure 12 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 704ms

12 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

512 TBS BarcelonaFor TBS we used fragmentations going from 10KB

to 3MB and we used the same query set for both cacheand no cache set ups

Fig 13 TBS static evaluation without cache

Figure 13 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of75ms

Fig 14 TBS static evaluation with cache

Figure 14 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 10KB and a medianresponse time of 28ms

513 MLMFor MLM we used fragmentations going from

10KB to 3MB and we used the same query set for bothcache and no cache set ups

Fig 15 MLM static evaluation without cache

Figure 15 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of128ms

Fig 16 MLM static evaluation with cache

Figure 16 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 57ms

514 De LijnFor De Lijn the fragmentations created go from

500KB to 15MB The query set used for the bench-marks is the same in both active and inactive cache setups

Figure 17 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained witha fragmentation of 500KB and a median response timeof 33863ms

Figure 18 shows the response time distribution foreach fragmentation in a set up with an active cache on

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 10: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

10 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

ndash Static data fragmentation by size without cacheThe first experiment executes the queries tak-ing into account only timetable static data wherethe size of the exposed fragments are always thesame Client caching is disabled making this sim-ulate the LC without cache set-up where everyrequest could originate from an end-userrsquos deviceWe test different fragments size to test how it af-fects the query answering performance

ndash Static data fragmentation by size with cache Thesecond experiment does the same as the first ex-cept that it uses a client side cache and simulatesthe LC with cache set-up

ndash Historical data fragmentation by size withoutcache The third experiment does the same as thefirst except that it uses the Memento protocol toensure data coherence As the Memento protocolinvolves HTTP redirections to access specific re-sources at different points in time this experimentis meant to measure its impact on query answer-ing performance

ndash Historical data fragmentation by size with cacheThe fourth experiment does the same as the thirdexcept that it uses a client side cache and simu-lates the LC with cache set-up

ndash Live and historical data fragmentation by sizewithout cache The fifth experiment does thesame as the third but at this tame live data is alsotake into account modifying the fragments sizedue to the delays We also test different fragmentssizes

ndash Live and historical data fragmentation by sizewith cache The sixth experiment does the sameas the fifth except that it uses a client side cacheand simulates the LC with cache set-up

ndash Live and historical data fragmentation by timewithout cache The seventh experiment fragmentsthe data based on time as our previous approacheswith the client cache disable We want to test ifthe performance of our pagination by the size isbetter than by time

ndash Live and historical data fragmentation by timewith cache The eighth does the same as the sev-enth but it uses the client side cache

As far as we are aware of there are no general-purpose benchmarks in the state of the art for testingour main contributions with the Linked Connectionsframework Therefore we have created a testbed asfollows 1) we have selected and downloaded 4 dif-ferent representative GTFS datasets available as open

Dataset Stops Routes Trips ConnectionsTBS Barcelona 27 3 5186 1957354

MLM Madrid 81 4 2872 3293575

NMBS Belgium 2615 479 12621 6941813

De Lijn Flanders 36050 1412 305079 63006596Table 1

Characteristics of the transport networks datasets used for thebenchmarks

data that also cover several geographical locations 2)we have created several route planning queries (depar-ture and arrival stop with a departure time) that reflectreal-life trips with multiple levels of complexity and3) we have evaluated these queries in the different setups taking into account the features of each datasetThe query set created for each GTFS dataset was de-fined by randomly taking departure stops and arrivalstops with a random departure time and using the CSAalgorithm to find a route between them This processwas repeated for each transport network and spanninga 24 hours time frame of a regular week day In the endthis gave us a set of queries for each network with dif-ferent levels of complexity measured in terms of theamount of connections that need to be processed bya client to obtain a valid answer The complexity ofthe queries is directly related to the complexity of thetransport network itself Meaning that the amount ofconnections that need to be processed in a query in-crease as the amount of stops and vehicles of the net-work also increase Table 1 shows the characteristics(amount of stops routes and trips) of each transportnetwork and table 2 portrays the least and the mostcomplex queries used during the benchmarks for eachtransport network

The used GTFS datasets and GTFS-RT feeds areavailable as open data on the Web There are twodatasets that represent the Belgian Rail Transport Sys-tem (NMBS) and the Flemish Transport Company (DeLijn) and other two that represent the Tram Companyof Barcelona (TBS) and the Tram Company of Madrid(MLM) NMBS also provides open access to its real-time updates following the GTFS-RT format thereforewe use this data to carry out the evaluation that takesinto account live updates TBS have also developedan ad-hoc live API31 and we are currently developinga tool to transform the ad-hoc live information to theGTFS-RT format but at this moment only historicaldata can be taken into account in this case De Lijn

31httpsopendatatramcatmanual_enpdf

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 11

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Query Dataset of ConnectionsdepStop httpirailbestationsNMBS008841608

NMBS 3371arrStop httpirailbestationsNMBS008841319

depTime 2018-06-14T170000000Z

depStop httpirailbestationsNMBS008886504

NMBS 13797arrStop httpirailbestationsNMBS008841004

depTime 2018-06-14T140000000Z

depStop httpsdatadelijnbestops120281

De Lijn 34696arrStop httpsdatadelijnbestops126536

depTime 2018-06-07T130000000Z

depStop httpsdatadelijnbestops112358

De Lijn 114059arrStop httpsdatadelijnbestops111446

depTime 2018-06-07T130000000Z

depStop httpsbarcelonatbsesstops22

TBS 41arrStop httpsbarcelonatbsesstops21

depTime 2018-06-07T200000000Z

depStop httpsbarcelonatbsesstops14

TBS 556arrStop httpsbarcelonatbsesstops10

depTime 2018-06-07T050000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 148arrStop httpsmadridtramesstopspar_10_32

depTime 2018-06-07T210000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 799arrStop httpsmadridtramesstopspar_10_26

depTime 2018-06-07T160000000Z

Table 2Least and most complex queries of the LC testbed created for everybechmarked transport network

and MLM does not provide access to their live updateswhich is why we only perform static and historicalevaluations over this transport datasets

We ran the experiments on a laptop with an IntelCore i5-7440HQ 28GHz x 4 and 16GB of RAMEach query is executed at least 2 times and we takethe arithmetic average response time Our experimentscan be reproduced using the code at httpsgithubcomjulianrojas87linked-connections-server and thetestbed and the data at httpsgithubcomcef-oasislinkedconnections-tests

5 Results

Next we present the obtained results for each of thedescribed scenarios in the above section

51 Static data with fragmentation by size

These results are obtained from the evaluation madeover Linked Connection stream derived from GTFS

datasets of each transport network The queries exe-cuted on this evaluation do not follow the Mementoprotocol We present the results for both the evaluationwith and without an active client-side cache coveringthe first and second experiment described above

511 NMBSFor NMBS we created fragmentations of 10KB

50KB 300KB 500KB 1MB 3MB 7MB and 10MBWe used the same query set for both cache and nocache evaluations

Fig 11 NMBS static evaluation without cache

Figure 11 shows the response time distribution foreach fragmentation in a set up without a cache Herewe have that for a fragmentation of 300KB we havethe best performance with a median response time of1196ms for the used query set

Fig 12 NMBS static evaluation with cache

Figure 12 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 704ms

12 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

512 TBS BarcelonaFor TBS we used fragmentations going from 10KB

to 3MB and we used the same query set for both cacheand no cache set ups

Fig 13 TBS static evaluation without cache

Figure 13 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of75ms

Fig 14 TBS static evaluation with cache

Figure 14 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 10KB and a medianresponse time of 28ms

513 MLMFor MLM we used fragmentations going from

10KB to 3MB and we used the same query set for bothcache and no cache set ups

Fig 15 MLM static evaluation without cache

Figure 15 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of128ms

Fig 16 MLM static evaluation with cache

Figure 16 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 57ms

514 De LijnFor De Lijn the fragmentations created go from

500KB to 15MB The query set used for the bench-marks is the same in both active and inactive cache setups

Figure 17 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained witha fragmentation of 500KB and a median response timeof 33863ms

Figure 18 shows the response time distribution foreach fragmentation in a set up with an active cache on

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 11: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 11

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Query Dataset of ConnectionsdepStop httpirailbestationsNMBS008841608

NMBS 3371arrStop httpirailbestationsNMBS008841319

depTime 2018-06-14T170000000Z

depStop httpirailbestationsNMBS008886504

NMBS 13797arrStop httpirailbestationsNMBS008841004

depTime 2018-06-14T140000000Z

depStop httpsdatadelijnbestops120281

De Lijn 34696arrStop httpsdatadelijnbestops126536

depTime 2018-06-07T130000000Z

depStop httpsdatadelijnbestops112358

De Lijn 114059arrStop httpsdatadelijnbestops111446

depTime 2018-06-07T130000000Z

depStop httpsbarcelonatbsesstops22

TBS 41arrStop httpsbarcelonatbsesstops21

depTime 2018-06-07T200000000Z

depStop httpsbarcelonatbsesstops14

TBS 556arrStop httpsbarcelonatbsesstops10

depTime 2018-06-07T050000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 148arrStop httpsmadridtramesstopspar_10_32

depTime 2018-06-07T210000000Z

depStop httpsmadridtramesstopspar_10_37

MLM 799arrStop httpsmadridtramesstopspar_10_26

depTime 2018-06-07T160000000Z

Table 2Least and most complex queries of the LC testbed created for everybechmarked transport network

and MLM does not provide access to their live updateswhich is why we only perform static and historicalevaluations over this transport datasets

We ran the experiments on a laptop with an IntelCore i5-7440HQ 28GHz x 4 and 16GB of RAMEach query is executed at least 2 times and we takethe arithmetic average response time Our experimentscan be reproduced using the code at httpsgithubcomjulianrojas87linked-connections-server and thetestbed and the data at httpsgithubcomcef-oasislinkedconnections-tests

5 Results

Next we present the obtained results for each of thedescribed scenarios in the above section

51 Static data with fragmentation by size

These results are obtained from the evaluation madeover Linked Connection stream derived from GTFS

datasets of each transport network The queries exe-cuted on this evaluation do not follow the Mementoprotocol We present the results for both the evaluationwith and without an active client-side cache coveringthe first and second experiment described above

511 NMBSFor NMBS we created fragmentations of 10KB

50KB 300KB 500KB 1MB 3MB 7MB and 10MBWe used the same query set for both cache and nocache evaluations

Fig 11 NMBS static evaluation without cache

Figure 11 shows the response time distribution foreach fragmentation in a set up without a cache Herewe have that for a fragmentation of 300KB we havethe best performance with a median response time of1196ms for the used query set

Fig 12 NMBS static evaluation with cache

Figure 12 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 704ms

12 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

512 TBS BarcelonaFor TBS we used fragmentations going from 10KB

to 3MB and we used the same query set for both cacheand no cache set ups

Fig 13 TBS static evaluation without cache

Figure 13 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of75ms

Fig 14 TBS static evaluation with cache

Figure 14 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 10KB and a medianresponse time of 28ms

513 MLMFor MLM we used fragmentations going from

10KB to 3MB and we used the same query set for bothcache and no cache set ups

Fig 15 MLM static evaluation without cache

Figure 15 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of128ms

Fig 16 MLM static evaluation with cache

Figure 16 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 57ms

514 De LijnFor De Lijn the fragmentations created go from

500KB to 15MB The query set used for the bench-marks is the same in both active and inactive cache setups

Figure 17 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained witha fragmentation of 500KB and a median response timeof 33863ms

Figure 18 shows the response time distribution foreach fragmentation in a set up with an active cache on

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 12: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

12 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

512 TBS BarcelonaFor TBS we used fragmentations going from 10KB

to 3MB and we used the same query set for both cacheand no cache set ups

Fig 13 TBS static evaluation without cache

Figure 13 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of75ms

Fig 14 TBS static evaluation with cache

Figure 14 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 10KB and a medianresponse time of 28ms

513 MLMFor MLM we used fragmentations going from

10KB to 3MB and we used the same query set for bothcache and no cache set ups

Fig 15 MLM static evaluation without cache

Figure 15 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained with afragmentation of 50KB and a median response time of128ms

Fig 16 MLM static evaluation with cache

Figure 16 shows the response time distribution foreach fragmentation in a set up with an active cache onthe client side In this case the best performance wasachieved with a fragmentation of 50KB and a medianresponse time of 57ms

514 De LijnFor De Lijn the fragmentations created go from

500KB to 15MB The query set used for the bench-marks is the same in both active and inactive cache setups

Figure 17 portrays the response time distribution foreach tested fragmentation in a set up without a cacheFor this case the best performance was obtained witha fragmentation of 500KB and a median response timeof 33863ms

Figure 18 shows the response time distribution foreach fragmentation in a set up with an active cache on

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 13: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 13

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 17 De Lijn static evaluation without cache

Fig 18 De Lijn static evaluation with cache

the client side In this case the best performance wasachieved with a fragmentation of 500KB and a medianresponse time of 33729ms

52 Historical data with fragmentation by size

The results presented here correspond to the evalu-ations made for historical queries on top of the NMBSand the MLM datasets We took 3 different versionsof each dataset and performed the queries defined onthe query set of each transport network but in this casethe client and the server follow the Memento proto-col to access historical data By including the Accept-Datetime header on the fragment requests the client isable to calculate a route at a specific moment in timeEg calculate a route from Ghent to Brussels using thedata as it was 3 months ago Historical queries weretested in set ups with and without a cache This resultscover the third and fourth experiments described in theprevious section

521 NMBSIn this set up we used fragmentations going from

50KB to 10MB We performed the evaluations usingan active and an inactive client-side cache

Fig 19 NMBS historic evaluation without cache

Figure 19 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of300KB and a median response time of 1207ms

Fig 20 NMBS historic evaluation with cache

Figure 20 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 714ms

522 MLMIn this set up we used fragmentations going from

10KB to 7MB We performed the evaluations using anactive and an inactive client-side cache

Figure 21 shows the response time distribution ofthe historical queries in a set up without a cache Thebest performance is achieved with a fragmentation of50KB and a median response time of 158ms

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 14: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

14 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 21 MLM historic evaluation without cache

Fig 22 MLM historic evaluation with cache

Figure 22 shows the response time distribution ofthe historical queries in a set up with an active cacheon the client side The best performance is achievedwith a fragmentation of 50KB and a median responsetime of 76ms

53 Live and historical data with fragmentation bysize

The results presented here were obtained throughthe evaluation made over the NMBS dataset includinglive data We only use the NMBS dataset as is the onlyone that provides a GTFS-RT feed as open data We re-fer to this experiment also as historical because whenwe are dealing with live data we use the Memento pro-tocol to ensure coherence and consistency on the queryanswers as described in section 33 We test differentset ups using a range of fragment sizes and measure theimpact of having an active cache This results cover thefifth ans sixth experiments described in the previoussection

Figure 23 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-

Fig 23 NMBS live and historic evaluation without cache

tained using a fragmentation of 300KB and a medianresponse time of 1701ms

Fig 24 NMBS live and historic evaluation with cache

Figure 24 shows the response time distribution forthe live and historical queries executed in a set upwithout an active cache The best performance is ob-tained using a fragmentation of 50KB and a medianresponse time of 859ms

54 Live and historical data with fragmentation bytime

The results presented here correspond to tests runfor the NMBS transport network using their GTFSdataset and GTFS-RT feed We defined a fragmenta-tion for the LC stream based on a time window of 10minutes as we arbitrarily did in previous versions ofthe LC framework

Figure 25 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 1571ms fora set up without a cache

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 15: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 15

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

Fig 25 NMBS live and historic evaluation based on time withoutcache

Fig 26 NMBS live and historic evaluation based on time with cache

Figure 26 shows the response distribution for a frag-mentation based on a time window of 10 minutes andwe can observe a median response time of 896ms fora set up with an active cache

6 Discussion

In the previous section we presented the results ob-tained for the tests we executed in 4 different scenar-ios The main goal of these tests was to determine howthe underlining fragmentation strategy of a LC datasetcould affect the performance of route planning queriesunder different conditions In previous versions of LCimplementations we always used an arbitrary fragmen-tation strategy based on time windows of 10 minutesthat lead to uneven data fragments in terms of sizewhich depended on the complexity of the transport net-work and that could affect the performance of routeplanning algorithms (eg in rush hours by having frag-ments too big) Therefore we decided to move to-wards a fragmentation strategy based on even file sizes

that allow route planners to always deal with similarfragments and thus have a more predictable behaviourWe tested different file sizes on different transport net-works trying to determine an optimal fragment size oneach case and also run the tests with and without an ac-tive cache to measure its impact on the performance ofroute planning queries The first scenario consisted onroute planning queries over LC streams derived onlyfrom GTFS datasets We preformed these tests for allour considered transport networks (NMBS De LijnTBS and MLM) in order to have a clear picture of theperformance of a route planning algorithm (CSA al-gorithm) running on the client side and relying on animplementation of the LC specification for transportnetworks of different sizes and complexities The sec-ond scenario focused on historical queries followingthe Memento protocol The capacity to query through-out different versions in time of the same dataset is animportant tool for performing behavioural analysis ofa transport network We run the tests for two differ-ent transport networks (NMBS and MLM) and on eachcase used three different versions of the GTFS datasetto perform historical queries The third scenario testedthe fragmentation strategies involving live data updatesand measure its impact on answering route planningqueries In this case we only used the NMBS transportnetwork as is the only company that publishes their liveupdates as open data and using the GTFS-RT specifi-cation This scenario also takes into account historicalqueries as we use the Memento protocol when dealingwith live data to maintain coherence in the query an-swers Finally the fourth scenario consisted on testingour previous fragmentation strategy based on a timewindow of 10 minutes and compare the performanceof route planning queries with file size-based fragmen-tation strategies Again in this case we only used theNMBS transport network including live data

The results allows us to draw some generals remarksthat apply for all the tested scenarios First of all isclear in every case that using a cache has a positive im-pact on the performance of route planning query an-swering In the case of static data we can see an im-provement on the performance of 411 for NMBS626 for TBS 555 for MLM and 039 for DeLijn In the case of historical queries we have an im-provement of performance of 386 for NMBS and518 for MLM In the scenario of live and historicalqueries we see a performance improvement of 495for NMBS Finally using a fragmentation based on atime window of 10 minutes we can observe a perfor-mance improvement of 429 for NMBS Allowing

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 16: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

16 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the usage of caching mechanisms is one of main char-acteristics that the LC framework brings to the routeplanning ecosystem Traditional route planning APIsbased on RPC (Remote Procedure Call) architecturesdo not support caching as every route planning queryrequires a new request to the API with an unique re-sponse in contrast to the LC framework where datafragments can be cached and reused for multiple routeplanning queries Is important to highlight this resultas in previous work caching proved to be a main factorfor increasing server scalability and cost-efficiency forroute planning applications [14] and now it has alsoproven to be a main factor for increasing the perfor-mance of route planning algorithms executed on theclient side

Another general remark that can be drawn is relatedto the significant difference in the query response timesthat can be observed from one transport network toanother If we take into account the characteristics ofeach network (shown in table 1) we can infer that thebigger and complex a transport network is the higherthe time to answer a specific route planning query Thisis an important issue from the usability perspective forroute planning applications as response times that aretoo long could render useless all the benefits that theLC framework brings since users wonrsquot be willing towait that long to get an answer for a query and mayresort to other route planning alternatives In the re-sults we can see that for a big and complex transportnetwork as De Lijn with 36050 stops and 1412 de-fined routes the lower median response time obtainedwas over 33 seconds which from an usability perspec-tive can be perceived as unacceptable In contrast wecan see that a much smaller transport network as TBSwith 27 stops and only 3 defined routes or MLM with81 stops and 4 defined routes achieve median responsetimes of 28ms and 57ms respectively which from ausability perspective is significantly fast These resultscan be explained also from a geographical perspec-tive If we consider that the TBS network comprehendsonly the city of Barcelona and the MLM network isrestricted only to the city of Madrid and we com-pare them De Lijn which covers several cities through-out the region of Flanders in Belgium and with dif-ferent transport modes then we can justify the signif-icant difference in query response times For examplewhen a query is being processed for a network as DeLijn to go from one stop to another within the city ofGhent the client will need to process also connectionsfrom vehicles departing in Antwerp Bruges and all thecities that the network covers Therefore this results

shred light over a very important requirement for theLC framework where datasets need to be as geograph-ically restricted as possible and clients should be ableto automatically select the LC streams that are relevantfor a particular query in order to maximize the queryanswering performance

Zooming in on the results we can also observe thatfor NMBS handling historical queries and includinglive data increment the response time of the route plan-ning queries For plain static data we have a medianresponse time of 704ms Querying for historical datathroughout different versions of a dataset raises themedian response time to 714ms And using live datafor answering route planning queries raises further themedian response time up to 859ms This was the ex-pected behaviour since using the Memento protocolfor historical queries involves additional HTTP redi-rections for the clients which increases the processingtime In the same way handling live data requires addi-tional processing by the LC server to retrieve and cor-rectly add the live updates into the data fragments be-fore giving them back to the clients However we canargue that the increment measured by adding live andhistorical data is not significantly high from a usabil-ity perspective For historical queries only we have anincrement of 10ms and for live data we have and in-crement of 155ms compared to plain static data In thesame way we observe that for MLM the increment ofthe median response time when dealing with historicalqueries goes from 57ms to 76ms giving and incrementof 19ms Such increments are small enough to notbe perceived by an end-user issuing a route planningquery These results represent an important achieve-ment for the LC framework as they show that evenhandling historical queries and taking into account livedata updates the LC framework in conjunction witha client implementing the CSA route planning algo-rithm are capable of providing an acceptable user ex-perience for real world route planning applications

As we previously mentioned the main goal of theseset of experiments was to determine if there is an op-timal fragment size for transport network datasets thatmaximize the performance of answering route plan-ning queries Looking at the results we can observethat there is For NMBS that was tested in all the de-fined scenarios we have that 50KB fragments with anactive cache always provide the best performance forroute planning queries over this transport network Thesame goes for the case of TBS having 50KB frag-ments with an active cache as the enablers of the high-est performance In the case of MLM we have that

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 17: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 17

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

10KB fragments provide the best performance how-ever when compared to 50KB fragments the differ-ence is only of 9ms which can be considered not sig-nificant and can be attributed to unpredictable delaysof the test environment In the case of De Lijn we havethat 500KB fragments provide the best results but thenas previously mentioned the De Lijn dataset does notconstitutes an acceptable input for the LC frameworkas its size and complexity cause unacceptable process-ing delays for a real world application and datasetswith this characteristics need to be further processedin a geographical sense to be exposed as LC There-fore we can affirm that a 50KB fragment size with anactive cache provides the optimal performance for an-swering route planning queries on top of the LC frame-work We can also observe that the performance de-creases when moving away from this optimal pointwhich can be explain by higher processing times whenthe fragment size increases and a lower probability ofcaching of smaller fragments Another particular re-mark from these results is that when there is not an ac-tive cache available there is also an optimal point butthis is higher than when there is a cache present Forthe case of NMBS we see such point with fragmentsof 300KB which represent the balance point betweenthe amount of requests needed for answering a queryand the processing time on the client side for a givenfragment For TBS and MLM we see that the optimalpoint is still 50KB which seems to highlight that with-out a cache there may be a relation between the opti-mal fragment size and the size and complexity of thetransport network

Finally we compared the performance of the NMBSdataset with historical queries and including live up-dates using the previous fragmentation strategy us-ing fragments based on a time window of 10 minutesagainst the current approach based on a constant frag-ment size The results show that using a fragment sizeof 50KB and an active cache there is an improvementof 41 in the performance At first glance this re-sult seems to be not a significant improvement how-ever we need to consider that since the time-based ap-proach does not have a constant fragment size the per-formance for a given query will depend on the query it-self and the complexity of the network thus renderingthe file size-based fragmentation approach more pre-dictable and appropriate for real world route planningapplications supported by the LC framework

7 Conclusions and Future work

In this work we present the Linked Connectionsframework for providing reliable access to historicaland live data First we analyse the previous worksmade over the Linked Connections and their problemson efficiency when historical and live data are takeninto account Second we extend the Linked Connec-tions vocabulary to consider these types of data Thirdwe develop a tool for creating streams of connectionsbased on the information provided by live updatesFourth we adapt the LC server to efficiently manageand expose the transport data on the Web Finally wetest our approach by analysing how the fragment sizeaffects the performance of the CSA route planning al-gorithm run on the client side and compare them withour previous approach that fragmented the data basedon a time span

The conclusions that can be drawn from this workare the following

ndash Using a cache mechanism significantly improvesthe performance of answering route planningqueries using the CSA algorithm on top of an im-plementation of the LC framework In previouswork we were able to demonstrate the positiveimpact that a cache mechanism has for the scala-bility and cost-efficiency of a LC server comparedto traditional approaches where caching is notpossible Now we have also proven how cachingdata fragments reduce the processing time ofroute planning queries as the clients keep in mem-ory the data fragments that can be reused for an-swering multiple queries that are on the a similartime frame and do not have the need to requestthem again to the server

ndash The size and complexity of a transport networkare directly related to the response times for an-swering route planning queries We observed thatas the size and complexity in terms of numberof stops routes an trips of a transport dataset in-crease the response times also increase up to apoint where it becomes unusable for a real worldroute planning application This fact allowed us toidentify a new requirement for datasets to be pub-lished as LC where the smaller and less complexthe dataset the higher the performance This canbe seen from a geographical perspective wherethe datasets may be also fragmented by geograph-ical areas and also by transport modes thus low-ering their complexity However this will require

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 18: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

18 J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

the clients to have the capacity to automaticallydiscover and consume the relevant LC sources fora specific query taking into account the involvedgeographical regions and transport modes

ndash Adding the capacity to execute historical queriesby means of the Memento protocol and han-dling live data updates openly published usingthe GTFS-RT format reduce the performance ofquery answering as expected However the mea-sured increase in the response time for viabledatasets using the LC framework do not representa significant decrease to render unusable a realworld route planning application based on the LCframework

ndash The main conclusion drawn from this work isrelated to the different fragmentation strategiesthat may be used within the LC framework Weobserved that for a set up that includes an ac-tive cache the best performance for route plan-ning query answering is achieved by having a filesize-based fragmentation of 50KB We also ob-served that without a cache mechanism the opti-mal fragment size seems to be related to the trans-port network size and complexity however as wepreviously mentioned caching is a main featureof the LC framework which brings out its wholepotential and differentiates it from traditional ap-proaches

ndash Finally we proved that by having a constant frag-ment size it is possible to predict and maximizethe performance of route planning query answer-ing in the LC framework in contrast to our pre-vious approach where we used time-based frag-mentations that created a heterogeneous environ-ment of fragments regarding their size where theperformance depends on the type of query and thecomplexity of the transport network

The LC framework stands as cost-efficient alterna-tive to build knowledge graphs from open data in thetransport sector that can be reused to build richer ap-plications within the Web of data By relaying on theLinked Data principles and defining a set of commonsemantics the LC framework allows to publish trans-port datasets on the Web optimized for route planningapplications in a decentralized fashion that lower theadoption costs of the data This establishes the foun-dations for creating a route planning ecosystem sup-ported on open data that fosters innovation on the sec-tor and aligns to the directives given by the EU re-

garding the discoverability and accessibility of publictransport data

For future work we plan to develop toolslibrariesable to transform other transport format standards likeDatex II Transmodel or SIRI and expose streams ofconnections following the LC approach on the WebWe also want to create a method able to find the op-timal fragment size of a dataset based on several fea-tures like the type of the transport the size and com-plexity of the full Linked Connections feed the vari-ability of the updates or the geographical location Im-proving the discoverability of transport datasets is an-other line we want to take into account we started towork on a registry by extending DCAT-AP to a Trans-port application profile32 to improve the discoverabil-ity of these datasets The implementation of an usableclient that can perform multimodal route planning andeven take into account other types of datasets throughquery federation is also one of the lines of work weplan to pursue Finally we want to create a standardbenchmarking with GTFS and GTFS-RT datasets sev-eral representative route planning algorithms and ad-hoc relevant metrics

Acknowledgements

This work has been partially supported by a predoc-toral grant from the I+D+i program of the UniversidadPoliteacutecnica de Madrid and also by the CEF Europeanproject OASIS CEF-26696297

References

[1] C Bizer T Heath and T Berners-Lee Linked data-the storyso far International journal on semantic web and informationsystems 5(3) (2009) 1ndash22

[2] T Heath and C Bizer Linked data Evolving the web into aglobal data space Synthesis lectures on the semantic web the-ory and technology 1(1) (2011) 1ndash136

[3] E Prud A Seaborne et al SPARQL query language for RDF(2006)

[4] C Buil-Aranda M Arenas O Corcho and A Polleres Fed-erating queries in SPARQL 11 Syntax semantics and eval-uation Web Semantics Science Services and Agents on theWorld Wide Web 18(1) (2013) 1ndash17

[5] P Colpaert A Llaves R Verborgh O Corcho E Mannensand R Van de Walle Intermodal public transit routing usingLiked Connections in International Semantic Web Confer-ence Posters and Demos 2015 pp 1ndash5

32httpsgithubcomcef-oasisDCAT-AP

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References
Page 19: Semantic Web 0 (0) 1 IOS Press Publishing and archiving planned … · Julian Rojasa, David Chaves-Fragab, Pieter Colpaerta, Oscar Corchob and Ruben Verborgha ... on the Web [2]

J Rojas et al Publishing and archiving planned and live public transport events with the Linked Connections framework 19

1 1

2 2

3 3

4 4

5 5

6 6

7 7

8 8

9 9

10 10

11 11

12 12

13 13

14 14

15 15

16 16

17 17

18 18

19 19

20 20

21 21

22 22

23 23

24 24

25 25

26 26

27 27

28 28

29 29

30 30

31 31

32 32

33 33

34 34

35 35

36 36

37 37

38 38

39 39

40 40

41 41

42 42

43 43

44 44

45 45

46 46

47 47

48 48

49 49

50 50

51 51

[6] JA Rojas Melendez D Chaves P Colpaert R Verborghand E Mannens Providing reliable access to real-time andhistoric public transport data using linked v-connections inISWC2017 the 16e International Semantic Web ConferenceVol 1931 2017 pp 1ndash4

[7] P Colpaert A Chua R Verborgh E Mannens R Van deWalle and A Vande Moere What public transit API logstell us about travel flows in Proceedings of the 25th Inter-national Conference Companion on World Wide Web Inter-national World Wide Web Conferences Steering Committee2016 pp 873ndash878

[8] R Verborgh O Hartig B De Meester G HaesendonckL De Vocht M Vander Sande R Cyganiak P ColpaertE Mannens and R Van de Walle Querying datasets on the webwith high availability in International Semantic Web Confer-ence Springer 2014 pp 180ndash196

[9] R Verborgh M Vander Sande O Hartig J Van HerwegenL De Vocht B De Meester G Haesendonck and P ColpaertTriple Pattern Fragments a low-cost knowledge graph inter-face for the Web Web Semantics Science Services and Agentson the World Wide Web 37 (2016) 184ndash206

[10] R Verborgh M Vander Sande P Colpaert S CoppensE Mannens and R Van de Walle Web-Scale Querying throughLinked Data Fragments in LDOW Citeseer 2014

[11] R Taelman J Van Herwegen M Vander Sande and R Ver-borgh Comunica a Modular SPARQL Query Engine for theWeb in International Semantic Web Conference 2018

[12] JD Fernaacutendez MA Martiacutenez-Prieto C GutieacuterrezA Polleres and M Arias Binary RDF representation forpublication and exchange (HDT) Web Semantics ScienceServices and Agents on the World Wide Web 19 (2013) 22ndash41

[13] M Lanthaler and C Guumltl Hydra A Vocabulary forHypermedia-Driven Web APIs LDOW 996 (2013)

[14] P Colpaert R Verborgh and E Mannens Public Transit RoutePlanning Through Lightweight Linked Data Interfaces in In-ternational Conference on Web Engineering Springer 2017pp 403ndash411

[15] P Colpaert S Ballieu R Verborgh and E Mannens The Im-pact of an Extra Feature on the Scalability of Linked Connec-tions in COLD ISWC 2016

[16] D Chaves-Fraga J Rojas P-J Vandenberghe P Colpaert andO Corcho The tripscore Linked Data client calculating spe-

cific summaries over large time series in Proceedings of theWorkshop on Decentralizing the Semantic Web (DeSemWeb)2017

[17] J Dibbelt T Pajor B Strasser and D Wagner Intriguinglysimple and fast transit routing in International Symposium onExperimental Algorithms Springer 2013 pp 43ndash54

[18] J Tennison G Kellogg and I Herman Model for tabular dataand metadata on the web W3C recommendation World WideWeb Consortium (W3C) (2015)

[19] A Dimou M Vander Sande P Colpaert R Verborgh E Man-nens and R Van de Walle RML A Generic Language for In-tegrated RDF Mappings of Heterogeneous Data in LDOW2014

[20] S Das S Sundara and R Cyganiak R2RML RDB toRDF Mapping Language W3C Recommendation 27 Septem-ber 2012 Cambridge MA World Wide Web Consortium(W3C)(www w3 orgTRr2rml) (2012)

[21] H Bast E Carlsson A Eigenwillig R Geisberger C Harrel-son V Raychev and F Viger Fast routing in very large publictransportation networks using transfer patterns in EuropeanSymposium on Algorithms Springer 2010 pp 290ndash301

[22] D Delling T Pajor and RF Werneck Round-based publictransit routing Transportation Science 49(3) (2014) 591ndash604

[23] S Witt Trip-based public transit routing in Algorithms-ESA2015 Springer 2015 pp 1025ndash1036

[24] B Strasser and D Wagner Connection scan accelerated in2014 Proceedings of the Sixteenth Workshop on Algorithm En-gineering and Experiments (ALENEX) SIAM 2014 pp 125ndash137

[25] WWW Consortium et al JSON-LD 10 a JSON-based seri-alization for linked data (2014)

[26] C Bizer and R Cyganiak RDF 11 TriG RDF Dataset Lan-guage W3C recommendation W3C 25 (2014)

[27] R Cyganiak A Harth and A Hogan N-quads Extending n-triples with context W3C Recommendation (2008)

[28] H Van de Sompel R Sanderson ML Nelson LL Bal-akireva H Shankar and S Ainsworth An HTTP-basedversioning mechanism for linked data arXiv preprintarXiv10033661 (2010)

  • Introduction
  • Related Work
  • The Linked Connections Framework
    • The Linked Connections specification
    • Live data in Linked Connections
    • Managing live and historical data with the Linked Connections server
    • A running example Providing reliable access to NMBS historical and real time data
      • Evaluation Design
      • Results
        • Static data with fragmentation by size
          • NMBS
          • TBS Barcelona
          • MLM
          • De Lijn
            • Historical data with fragmentation by size
              • NMBS
              • MLM
                • Live and historical data with fragmentation by size
                • Live and historical data with fragmentation by time
                  • Discussion
                  • Conclusions and Future work
                  • Acknowledgements
                  • References