Industrialized Linked Data
Dave Reynolds, Epimorphics Ltd @der42
Context: public sector Linked Data
Linked Data journey ... explore
what is Linked Data?
what use is it to us?
what’s involved?
self-describing: carries semantics with it; annotate and explain; data in context ...
integration: comparable; slice and dice; web API ...
Linked Data journey ... explore → pilot
pilot: model the data, convert, publish, apply
Photo of The Thinker © dSeneste.dk @ Flickr, CC BY
Linked Data journey ... explore → pilot → routine?
Great pilot but ...
can we reduce the time and cost?
how do we handle changes and updates?
how can we make the published data easier to use?
How do we make Linked Data “business as usual”?
Example case study: Environment Agency
monitoring of bathing water quality
static pilot: historic annual assessments
live pilot: weekly assessments
operational system: additional data feeds, live update, integrated API, data explorer
From pilot to practice
reduce modelling costs: patterns, reuse
handling change and update: patterns, publication process
automation: conversion, publication
embed in the business process: use internally as well as externally; publish once, use many; data platform
dive 1: reduce modelling costs
Reduce costs - modelling
1. Don’t do it
map source data into isomorphic RDF, synthesize URIs (see the sketch after this list)
loses some of the value proposition
2. Reuse existing ontologies intact, or mix-and-match
best solution when available
W3C GLD work on vocabularies – people, organizations, datasets ...
3. Reusable vocabulary patterns
example: Data Cube plus reference URI sets
adaptable to a broad range of data – environmental, statistical, financial ...
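To illustrate option 1, a minimal sketch: a CSV row becomes a resource with a synthesized URI and one property per column (all URIs and column names here are invented for illustration):

<http://example.org/data/row/1>
    <http://example.org/def/site> "Clevedon" ;
    <http://example.org/def/date> "2012-06-01" ;
    <http://example.org/def/NO2>  "34.5" .

Cheap to produce, but the result carries no more semantics than the source table did.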
Reusable patterns: Data cube
Much public sector data has regularities
sets of measures: observations, forecasts, budgets, assessments, estimates ...
organized along some dimensions: region, agency, time, category, cost centre ...
interpreted according to attributes: units, multipliers, status
[diagrams: a grid of quality-assessment counts by classification (excellent, good, poor), and a cube of spend figures with dimensions time, cost centre and objective code, and attributes $k units and provisional/final status]
Data cube vocabulary
Data cube pattern
Pattern, not a fixed ontology
customize by selecting measures, dimensions and attributes
originated in publishing of statistics
applied to environment measurements, weather forecasts, budgets and spend, quality assessments, regional demographics ...
Supports reuse
widely reusable URI sets – geography, time periods, agencies, units
organization-wide sets
modelling often only requires small increments on top of core pattern and reusable components
opens door for reusable visualization tools
standardization through W3C GLD
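As a rough sketch of a customized cube (the qb: terms are from the Data Cube vocabulary; the eg: names are invented for illustration):

eg:obs1 a qb:Observation ;
    qb:dataSet    eg:spendDataset ;                            # links to the base Data Set
    eg:year       <http://reference.data.gov.uk/id/year/2011> ;  # dimension, reusing a reference time URI
    eg:costCentre eg:cc42 ;                                    # dimension
    eg:spend      120000 ;                                     # measure
    eg:status     eg:provisional .                             # attribute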
Application to case study
Data Cubes for water quality measurement
in-season weekly assessments
end of season annual assessments
dimensions:
time intervals – UK reference time service
location – reference URI set for bathing waters and sampling points
cubes can reuse these dimensions
just need to define the specific measures (a sketch follows)
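A sketch of how small that increment can be (the qb: terms are standard Data Cube; the exact bwq: definitions and the eg: name are illustrative):

bwq:classification a qb:MeasureProperty ;   # the only genuinely new term
    rdfs:label "classification"@en .

eg:bwqDsd a qb:DataStructureDefinition ;
    qb:component [ qb:dimension bwq:sampleYear ] ,    # reused time dimension
                 [ qb:dimension bwq:bathingWater ] ,  # reused location dimension
                 [ qb:measure   bwq:classification ] .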
From pilot to practice
reduce modelling costs: patterns, reuse
handling change and update: patterns, publication process
automation: conversion, publication
embed in the business process: use internally as well as externally; publish once, use many; data platform
dive 2: handling change and update
Handling change
critical challenge
most initial pilots choose a snapshot dataset
and go stale, fast
understanding the nature of data updates and how to handle them is critical to successful scaling to business as usual
types of change:
new data relating to a different time period
corrections to data
entities change: properties, identity
Modelling change 1. Individual data items relate to new time period
pattern: n-ary relation – an observation resource relates a value to a time period and other context
use Data Cube dimensions for this
history or latest? latest is non-monotonic but helpful for many practical uses
materialize (SPARQL Update), implement in query, or implement in the API
choose whether to keep history as well (water quality vs. weather forecasts)
[diagram: yearly assessment observations for one bathing water, roughly:]
_:obs2009 bwq:bathingWater <http://environment.data.gov.uk/id/bathing-water/ukk1202-36000> ;  # "Clevedon Beach"
          bwq:sampleYear <http://reference.data.gov.uk/id/year/2009> ;
          bwq:classification "Higher" .
_:obs2010 bwq:bathingWater <http://environment.data.gov.uk/id/bathing-water/ukk1202-36000> ;
          bwq:sampleYear <http://reference.data.gov.uk/id/year/2010> ;
          bwq:classification "Minimum" .
_:obs2011 bwq:bathingWater <http://environment.data.gov.uk/id/bathing-water/ukk1202-36000> ;
          bwq:sampleYear <http://reference.data.gov.uk/id/year/2011> ;
          bwq:classification "Higher" .
Modelling change 2. Corrections
patterns:
silent change (!)
explicit replacement
the API level hides replaced values, but a SPARQL query can retrieve and trace them
explicit change event
[diagram: replacement pattern – the withdrawn 2011 observation for Clevedon Beach (classification "Minimum", status: replaced, reason: reanalysis) and its correction (classification "Higher") are linked by dct:isReplacedBy / dct:replaces; an analysis event records the change via ev:before, ev:after, ev:occuredOn and ev:agent]
Modelling change 3. Mutation
infrequent change of properties; essential identity remains
e.g. renaming a school, adding another building
routine accesses see the property value, not a function of time
patterns:
in-place update
named graphs: current graph + graphs for each previous state + a meta-graph
explicit versioning with open periods
Modelling change 3. Mutation
explicit versioning with open periods
find the right version by querying on the validity interval
simplify use through:
a non-monotonic “latest value” link
an API that implements the query filters automatically
[diagram: versioning – an endurant bathing-water resource has dct:hasVersion links to two versions: "Clevedon Beach", whose dct:valid interval starts in 2003 and finishes in 2011, and "Clevedon Sands", whose dct:valid interval starts in 2011 and remains open]
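A sketch of that validity-interval query, assuming the intervals expose date-typed start/end values (the eg: properties are placeholders for whatever the data actually uses; standard prefixes omitted as elsewhere in these slides):

SELECT ?version ?name WHERE {
    <http://environment.data.gov.uk/id/bathing-water/ukk1202-36000> dct:hasVersion ?version .
    ?version rdfs:label ?name ;
             dct:valid  ?interval .
    ?interval eg:start ?start .            # interval start date
    OPTIONAL { ?interval eg:end ?end }     # absent for the open, current version
    FILTER ( ?start <= "2010-07-01"^^xsd:date )
    FILTER ( !bound(?end) || ?end > "2010-07-01"^^xsd:date )
}

The “latest value” link and the API filters in the list above exist precisely so that routine consumers never have to write this query themselves.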
Application to case study
weekly and annual samples: use the Data Cube pattern (n-ary relation)
withdrawn samples: replacement pattern (no explicit change event)
Data Cube slice for “latest valid assessment”
generated by a SPARQL Update query
the API gives easy access to the latest valid values
linked data following, or a raw SPARQL query, allows drilling into the changes
changes to the bathing water profile: versioning pattern
the bathing water entity points to the latest profile (SPARQL Update again)
From pilot to practice
reduce modelling costs: patterns, reuse
handling change and update: patterns, publication process
automation: conversion, publication
embed in the business process: use internally as well as externally; publish once, use many; data platform
dive 3: automation
Automation: transform and publish data feed increments
transformation engine service
reusable mappings, low cost to adapt to new feeds
linking to reference data
publication service that supports non-monotonic changes
[diagram: data increments (CSV) flow through a transform service, driven by declarative xform specs and a reconciliation service linking to reference data, into a publication service that feeds replicated publication servers]
Transformation service
declarative specification of the transform
a single service supports a range of transformations
easy to adapt a transformation to new feeds and modelling changes
R2RML – RDB to RDF Mapping Language
specifies mappings from database tables to RDF triples
W3C candidate recommendation
D2RML
an R2RML extension to treat a CSV feed as a database table
Small D2RML example:

:dataSource a dr:CSVDataSource ;
    rdfs:label "dataSource" .

:bathingWaterTermMap a dr:SubjectMap ;
    dr:template "http://environment.data.gov.uk/id/bathing-water/{EUBWID2}" ;
    dr:class def-bw:BathingWater .

:bathingWaterMap
    dr:logicalTable :dataSource ;
    dr:subjectMap :bathingWaterTermMap ;
    dr:predicateObjectMap [
        dr:predicate rdfs:label ;
        dr:objectMap [ dr:column "description_english" ; dr:language "en" ] ] ;
    dr:predicateObjectMap [
        dr:predicate def-bw:eubwidNotation ;
        dr:objectMap [ dr:column "EUBWID2" ; dr:datatype def-bw:eubwid ] ] .
Using patterns
problem: verbosity increases reuse costs
extend D2RML to support modelling patterns
Data Cube
specify a mapping to observations with measures and dimensions
the engine generates the Data Set and Data Structure Definition automatically
D2RML cube map example:

:dataCubeMap a dr:DataCubeMap ;
    rr:logicalTable "dataSource" ;
    dr:datasetIRI "http://example.org/datacube1"^^xsd:anyURI ;   # instances will automatically link to this base Data Set
    dr:dsdIRI "http://example.org/myDsd"^^xsd:anyURI ;           # the Data Structure Definition is auto-generated
    dr:observationMap [
        rr:subjectMap [
            rr:termType rr:IRI ;
            rr:template "http://example.org/observation/{PLACE}/{DATE}" ] ;
        # each component map implies an entry in the auto-generated Data Structure Definition
        rr:componentMap [
            dr:componentType qb:measure ;
            rr:predicate aq:concentration ;
            # the object map defines how the measure value is to be represented
            rr:objectMap [ rr:column "NO2" ; rr:datatype xsd:decimal ] ]
    ] ;
    ...
But what about linking?
connect observations to reference data
a core value of linked data
R2RML has Term Maps to create values
constants and templates
extend to allow maps based on other data sources
Lookup map
look up a resource in a store, matching on a predicate
Reconcile
specify a lookup in a remote service
uses the Google Refine reconciliation API
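The actual D2RML vocabulary for this is not shown in these slides, so the following is a purely hypothetical sketch of what a lookup term map could look like (every dr: lookup term below is invented):

:siteTermMap a dr:LookupMap ;                # hypothetical class
    dr:lookupStore    :referenceDataStore ;  # store holding the reference data
    dr:lookupColumn   "SITE_NAME" ;          # CSV value to reconcile
    dr:lookupProperty skos:prefLabel .       # property to match it against; the resource found becomes the object value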
Publication service
goals:
cope with the non-monotonic effects of change representation
so replication is robust and cheap (=> make it idempotent)
solution: SPARQL Update
publish each transformed increment as a simple INSERT DATA
then run a SPARQL Update script for the non-monotonic links
dct:isReplacedBy links
latest value slices
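For illustration, the increment publish step might be no more than this (the observation URI is invented):

INSERT DATA {
    <http://example.org/data/obs/2012-ukk1202> a qb:Observation ;
        bwq:bathingWater <http://environment.data.gov.uk/id/bathing-water/ukk1202-36000> ;
        bwq:sampleYear <http://reference.data.gov.uk/id/year/2012> ;
        bwq:classification "Higher" .
}

Because RDF graphs are sets of triples, re-running the same INSERT DATA changes nothing, which is what keeps replication idempotent.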
Sample update script:

# clear the old latest-value links
DELETE {
    ?bw bwq:latestComplianceAssessment ?o .
} WHERE {
    ?bw bwq:latestComplianceAssessment ?o .
} ;

# link each bathing water to the observation in the most recent year's slice
INSERT {
    ?bw bwq:latestComplianceAssessment ?o .
} WHERE {
    {
        ?slice a bwq:ComplianceByYearSlice ;
               bwq:sampleYear [ interval:ordinalYear ?year ] .
        OPTIONAL {
            ?slice2 a bwq:ComplianceByYearSlice ;
                    bwq:sampleYear [ interval:ordinalYear ?year2 ] .
            FILTER (?year2 > ?year)
        }
        FILTER ( !bound(?slice2) )
    }
    ?slice qb:observation ?o .
    ?o bwq:bathingWater ?bw .
}
Application to case study
Update server
transforms based on scripts (an earlier scripting utility)
linking to reference data
distributed publication via SPARQL Update
extensible range of data sets:
annual assessments
in-season assessments
bathing water profile
features (e.g. pollution sources)
reference data
From pilot to practice
reduce modelling costs: patterns, reuse
handling change and update: patterns, publication process
automation: conversion, publication
embed in the business process: use internally as well as externally; publish once, use many; data platform
dive 4: embed in the business process
Embed in business process
embedding is critical to ensure the data is kept up to date
which in turn needs usage
=> lower the barrier to use
[diagram: two cycles – data that is not used is hard to justify and goes stale; external and internal use justify investment in rich, up-to-date data]
Lowering barrier to use
simple REST APIs
use Linked Data API specification
rich query without learning SPARQL
easy consumption as JSON, XML
gets developers used to data and data model
[diagram: transform service → publication service → LD API]
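For illustration, typical requests against such an API might look like this (the endpoint paths are illustrative; _pageSize and _sort are standard Linked Data API parameters, and "name" stands for a short property the API configuration would define):

GET /doc/bathing-water.json?_pageSize=10&_sort=name
GET /doc/bathing-water/ukk1202-36000.xml

Plain JSON or XML over GET, with no SPARQL in sight, is what lets ordinary web developers start consuming the data.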
Application to case study
embedded in process for weekly/daily updates
infrastructure to automate conversion and publishing
API plus extensive developer documentation
third-party and in-house applications built over the API
publish once, use many
information products as applications over a data platform, usable externally as well as internally
The next stage
grow the range of data publications and uses
the growing range of reference data and datasets brings new challenges:
discover reference terms and models to reuse
discover datasets to use for an application
discover models and links between datasets
needs a coordination or registry service
a story for another day ...
Conclusions
illustrated how public sector users of linked data are moving from static pilots to operational systems
keys are:
reduce modelling costs through patterns and reuse
design for continuous update
automate publication using declarative mappings and SPARQL Update
lower the barrier to use through API design and documentation
embed in the organization’s processes so the data is used and useful
Acknowledgements
Only possible thanks to many smart colleagues: Stuart Williams, Andy Seaborne, Ian Dickinson, Brian McBride, Chris Dollin, plus Alex Coley and team from the Environment Agency