tpdl2013 tutorial linked data for digital libraries 2013-10-22
DESCRIPTION
Tutorial on Linked Data for Digital Libraries, given by me, Uldis Bojars, and Nuno Lopes in Valletta, Malta at TPDL 2013 on 2013-10-22. http://tpdl2013.upatras.gr/tut-lddl.php

This half-day tutorial is aimed at academics and practitioners interested in creating and using Library Linked Data. Linked Data has been embraced as the way to bring complex information onto the Web, enabling discoverability while maintaining the richness of the original data. This tutorial will offer participants an overview of how digital libraries are already using Linked Data, followed by a more detailed exploration of how to publish, discover and consume Linked Data. The practical part of the tutorial will include hands-on exercises in working with Linked Data and will be based on two main case studies: (1) linked authority data and VIAF; (2) place name information as Linked Data.

For practitioners, this tutorial provides a greater understanding of what Linked Data is, and how to prepare digital library materials for conversion to Linked Data. For researchers, this tutorial updates the state of the art in digital libraries, while remaining accessible to those learning Linked Data principles for the first time. For library and iSchool instructors, the tutorial provides a valuable introduction to an area of growing interest for information organization curricula. For digital library project managers, this tutorial provides a deeper understanding of the principles of Linked Data, which is needed for bespoke projects that involve data mapping and the reuse of existing metadata models.

TRANSCRIPT
Linked Data for Digital Libraries
Uldis Bojars, Nuno Lopes, & Jodi Schneider
TPDL 2013, September 22, 2013, Valletta, Malta
Nuno: Digital Repository of Ireland & DERI
Uldis: National Library of Latvia
Jodi: DERI
Schedule for the day
9:00 - Introduction of presenters, tutorial schedule, and learning outcomes
9:10 - Motivation and concepts of Linked Data
9:30 - Discuss: How would you envision using Linked Data in your institution?
9:45 - Lifecycle of Linked Data & Exploring Linked Data
10:10 - Case Study 1: Authority Data
10:30 - 11:00 COFFEE BREAK
11:00 - Recap
11:10 - Modelling data as Linked Data
11:30 - Case Study 2: Geographical Linked Data
11:50 - Choice of Hands-on Activities
12:25 - Conclusions
Hands-on Activities
11:50 – 12:25: Choice of Activities…
• Data Modelling
• Data Cleaning & Structuring
• Querying (SPARQL)
Please share your expertise!
• In the room
• On paper
• Online - shared folder: http://tinyurl.com/tpdl2013-ld-notes
– PDF of the programme
– Shared notes
– More materials later
• What is Linked Data? Why use it?
• What are some examples of Linked Data in Digital Libraries?
• What are the best practices for exploring & creating Linked Data?
Objectives for Today
Motivation and concepts of Linked Data
• Using identifiers
• to enable access
• to add structure
• to link to other stuff
What is Linked Data?
Why use Linked Data?
Key technology for library data! Representing, Publishing, Exchanging
• Powerful querying
• Ability to mix/match vocabularies
• Same technology stack as everybody else
– Findability
– Interoperability
Who is using Linked Data?
Aggregators
Integrated Library Systems & OPACs
Thesauri
Repositories
What is Linked Data (redux)?
Rob Styles
Towards RDF
Subject Predicate Object
RDF triple
Subject Predicate Object
RDF graph
Reuses the existing Web infrastructure to publish your data along with your documents:
– Using URI identifiers
– and HTTP for accessing the information
How Linked Data works
Linked Data Principles
1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.
http://www.w3.org/wiki/LinkedData
http://www.w3.org/DesignIssues/LinkedData
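Principles 2 and 3 in practice: looking up an HTTP URI and asking for RDF rather than HTML via the Accept header. A minimal sketch using only the Python standard library; `rdf_request` is a hypothetical helper name, and only building the request is shown, since actually opening it needs network access.

```python
from urllib.request import Request

# Build an HTTP request that asks the server for RDF via content
# negotiation, instead of the HTML page a browser would get.
def rdf_request(uri, accept="text/turtle"):
    return Request(uri, headers={"Accept": accept})

req = rdf_request("http://dbpedia.org/resource/Valletta")
# urlopen(req).read() would return Turtle, if the server honours the header
```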
• We need a proper infrastructure for a real Web of Data
– data is available on the Web
• accessible via standard Web technologies
– data is interlinked over the Web
– i.e., data can be integrated over the Web
• We need Linked Data
Data on the Web is not enough…
Slide credit: Ivan Herman
In groups of 2-3: Discuss
• How would you envision using Linked Data? What are the opportunities?
• Is your institution already using Linked Data? Planning a Linked Data project?
Lifecycle of Linked Data
Lifecycle of Linked Data
• Find
• Explore
• Transform
• Model
• Store
• Query
• Interlink
• Publish
Uldis Bojars, Nuno Lopes, & Jodi Schneider
Semantic Web for Digital Libraries
Exploring Linked Data(Practical Tools and Approaches)
Objectives
• Learn about Linked Data (LD) by looking at existing data sources
• Discover tools and approaches for exploring Linked Data
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Exploring Linked Data
• Discovering Linked Data
• Accessing RDF data
• Making sense of the data
– Validating RDF data
– Converting between formats
– Browsing Linked Data
• Querying RDF data
RDF graph
What RDF looks like
• RDF can be expressed in a number of formats:
– some are good for machines; some are understandable to people
• Common formats:
– RDF/XML – common, but difficult to read
– NTriples – a simple list of RDF triples
– Turtle – human-readable, easier to understand
• Can be represented visually
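To make "a simple list of RDF triples" concrete: each N-Triples line is just subject, predicate, object, full stop. A toy parser as an illustration only; `parse_triple` is a hypothetical helper that handles just URI terms and plain literals, and real data should go through a proper RDF library such as rdflib.

```python
import re

# One N-Triples line: <subject> <predicate> object .
# Object may be a URI (<...>) or a plain literal ("...").
TRIPLE = re.compile(r'<([^>]*)>\s+<([^>]*)>\s+(<[^>]*>|"[^"]*")\s*\.')

def parse_triple(line):
    m = TRIPLE.match(line.strip())
    if not m:
        raise ValueError("not a recognised N-Triples line")
    s, p, o = m.groups()
    return s, p, o[1:-1]  # strip the <...> or "..." wrapper from the object

triple = parse_triple(
    '<http://example.org/valletta> <http://xmlns.com/foaf/0.1/name> "Valletta" .')
# → ('http://example.org/valletta', 'http://xmlns.com/foaf/0.1/name', 'Valletta')
```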
Accessing RDF data
RDF data on the Web can be found as:
• Linked Data
– follow links, request data by URI
– returned data can be in various RDF formats
• Data dumps
– download the data
• SPARQL endpoints
– query Linked Data (more on that later)
http://www.ivan-herman.net/
Discovering Linked Data
a) find a link in a Web page
b) have some tools alert you that Linked Data is there
– Tabulator
– Semantic Radar
c) explore a project you heard about – and know LOD should be there
d) use a registry of sources http://datahub.io/group/lodcloud
e) Just ask someone
RDF discovery example
• data at Ivan Herman’s page can be found via:
– finding the RDF icon (with the link to the FOAF file)
– letting browser tools alert you that RDF is present
• RDF auto-discovery
– extracting RDFa data embedded in the page
• for other data sources RDF content negotiation might work
Making sense of the data
• Validating RDF data
– Ensures that the data representation is correct
• Converting between formats
– Convert to a [more] human-readable RDF format
• Browsing Linked Data
– Browse the data without worrying about “reading” RDF
Validating and Converting RDF
• W3C RDF validator http://www.w3.org/RDF/Validator/
• URI debugger – “Swiss Army knife” of Linked Data: http://linkeddata.informatik.hu-berlin.de/uridbg/
• RDFa distiller – extracts RDF embedded in web pages http://www.w3.org/2012/pyRdfa/
• Command-line tools (we’ll return to that)
<http://www.ivan-herman.net/> a foaf:PersonalProfileDocument ;
    dc:creator "Ivan Herman" ;
    dc:date "2009-06-17"^^xsd:date ;
    dc:title "Ivan Herman’s home page" ;
    xhv:stylesheet <http://www.ivan-herman.net/Style/gray.css> ;
    foaf:primaryTopic <http://www.ivan-herman.net/foaf#me> .

<http://twitter.com/ivan_herman> a foaf:OnlineAccount ;
    foaf:accountName "ivan_herman" ;
    foaf:accountServiceHomepage <http://twitter.com/> .

<http://www.ivan-herman.net/cgi-bin/rss2to1.py> a rss:channel .

<http://www.ivan-herman.net/foaf#me> a dc:Agent, foaf:Person ;
    rdfs:seeAlso <http://www.ivan-herman.net/AboutMe>,
        <http://www.ivan-herman.net/cgi-bin/rss2to1.py>,
        <http://www.ivan-herman.net/foaf.rdf> ; ...
Extracted from http://www.ivan-herman.net/ using RDFa Distiller
Browsing Linked Data (DBpedia): http://live.dbpedia.org/resource/Valletta
Command Line Tools
• wget – command line network downloader
$ wget http://dbpedia.org/resource/Valletta
• curl – specify HTTP headers
$ curl -L -H "Accept: text/rdf+n3" http://dbpedia.org/resource/Valletta
• Redland rapper – RDF parsing and serialisation
$ rapper -o turtle http://dbpedia.org/resource/Valletta
Querying Linked Data
• SPARQL Protocol and RDF Query Language
• Graph Matching
• Components of a SPARQL Query:
– Prefix Declarations
– Result type (SELECT, CONSTRUCT, DESCRIBE, ASK)
– Dataset
– Query pattern
– Solution modifiers
Europeana SPARQL endpoint
http://europeana.ontotext.com/
Sample queries provided: http://europeana.ontotext.com/sparql
http://tinyurl.com/europeana-rights-sparql
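The SPARQL Protocol makes an endpoint like the one above queryable with a plain HTTP GET carrying a `query=` parameter. A minimal sketch with the standard library; `sparql_get_url` is a hypothetical helper, and whether the Europeana endpoint is still live at this address may have changed since the tutorial.

```python
from urllib.parse import urlencode

ENDPOINT = "http://europeana.ontotext.com/sparql"

def sparql_get_url(endpoint, query):
    # SPARQL Protocol: a query can be sent as GET ?query=<urlencoded query>
    return endpoint + "?" + urlencode({"query": query})

url = sparql_get_url(ENDPOINT, "SELECT * WHERE { ?s ?p ?o } LIMIT 10")
# fetching `url` (e.g. with urllib) would return a SPARQL result set
```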
Tool catalogues: many more tools
• Collections of tools from other projects:
– http://www.w3.org/2001/sw/wiki/LLDtools
– http://www.w3.org/2001/sw/wiki/Tools
– http://semanticweb.org/wiki/Tools
– http://dbpedia.org/Applications
Interesting Projects
• LOCAH – a stylesheet to transform UK Archives Hub EAD to RDF/XML, with examples of the process using XSLT: http://data.archiveshub.ac.uk/ead2rdf/
• AliCAT (Archival Linked-data Cataloguing) – tool for editing collection-level records: http://data.aim25.ac.uk/step-change/
• Axiell CALM – solution for LAM that includes Linked Data functionality, allowing archivists to tag their collections with URIs from any chosen Linked Dataset: http://www.axiell.com/calm
Tools for Converting MARC records
• MariMba – tool to translate MARC to RDF and Linked Data: http://mayor2.dia.fi.upm.es/oeg-upm/index.php/en/downloads/228-marimba
• marcauth-2-madsrdf – XQuery utility to convert MARC/XML Authority records to MADS/RDF and SKOS resources: https://github.com/kefo/marcauth-2-madsrdf
Tools for museum curators
• Karma (http://isi.edu/integration/karma/) was used to map the records of the Smithsonian American Art Museum to RDF and link them to the Web and the Linked Open Data Cloud. Demo: http://www.youtube.com/watch?v=kUIqTI56oeQ
Authority Linked Data
VIAF and Wikipedia case study
library links
Slide credit: Jindřich Mynarz
• Use a single, distinct name for each person, organization, …
• Name is consistently used throughout library systems
• Issues:
– “Strings” not “things”
– in the Linked Data world we’d just use URIs
VIAF
• Virtual International Authority File (viaf.org)
• Integrating authority information from a number of national libraries
– Linked data + links to related information
• Matching authority data from multiple sources
– using related bibliographic records to help matching
Wikipedia + VIAF
• How can people discover useful information in VIAF and via VIAF?
• Linked Data eco-system – let’s explore!
– Wikipedia -> VIAF -> National Library LD
• Example (Andrejs Pumpurs):
– http://en.wikipedia.org/wiki/Andrejs_Pumpurs
– http://viaf.org/viaf/44427367/
http://en.wikipedia.org/wiki/Andrejs_Pumpurs
http://viaf.org/viaf/44427367/
VIAF
• Ontologies used:
– FOAF, SKOS, RDA (FRBR entities and elements), Dublin Core, VIAF, UMBEL
• Related datasets:
– National authority data: Germany (d-nb.info), Sweden (LIBRIS), France (idref.fr)
– DBpedia
http://viaf.org/viaf/44427367/
How did VIAF get into Wikipedia?
• VIAFbot
– algorithmically matched by name, important dates, and selected works
• “The principal benefit of VIAFbot is the interconnected structure.”
One Direction
[diagram: VIAF – English Wiki]
Slide credit: Maximilian Klein, Wikipedian in Residence at OCLC
Enter VIAFbot: Wikipedia Robot
Idea: Reciprocate
VIAF – summary:
– an efficient way of putting library authority data online as linked data
– if the organization also provides Linked Data itself, VIAF can carry links back to the organization’s LD records (which may contain richer / additional information)
Data Modelling
Publishing Data
• Naïve Transform
– Direct Mapping of Relational Data to RDF (see RDB2RDF)
OR
• Model & Transform
– Figure out how to represent data
– Then transform according to the model
Model
• Describe the domain
– What are the important concepts?
– What are their properties?
– What are their relations?
• Choose vocabularies
DC TERMS RDF Vocabulary: http://purl.org/dc/terms/
Deciding on URI patterns
• Use a domain that you control
• Use consistent patterns
• Manage change: transparent isn’t always best
• Consider what concepts are worth distinguishing
Example URI patterns
• Designing URI Sets for the UK Public Sector
• Defines patterns for:
– Identifier URI
– Document URI
– Representation URI
• Identifier example: http://{domain}/id/{concept}/{reference}
http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist
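A consistent URI pattern is easy to enforce in code. A sketch of minting identifier URIs along the http://{domain}/id/{concept}/{reference} lines above; `mint_uri` and its slug rules (lowercase, hyphens for spaces, drop everything else) are illustrative choices of mine, not part of the UK guidance.

```python
import re

def mint_uri(domain, concept, reference):
    # Normalise the reference into a stable, URL-safe slug.
    slug = re.sub(r"[^a-z0-9-]", "", reference.lower().replace(" ", "-"))
    return "http://%s/id/%s/%s" % (domain, concept, slug)

mint_uri("data.example.org", "person", "Beverley Skinner")
# → 'http://data.example.org/id/person/beverley-skinner'
```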
Choosing Vocabularies
• Audience & Purpose – e.g. search engine vs. bibliographic exchange
• Domain
– Biomedical, geographical, …
• Granularity
• Popularity: potential for interlinking & reuse
Finding vocabularies & ontologies
Look at examples
Find examples: Linked Open Data Cloud
Look at Publications & Lists: http://www.w3.org/2005/Incubator/lld/XGR-lld-vocabdataset-20111025/
Ask the community
• Mailing lists
– LOD-LAM
– Code4Lib
– OKFN Open-Bibliography Working Group
– W3C Schema.org BibEx Community Group
• Domain-specific Linked Data groups & lists
Popularity
Popularity: Semantic search engines
http://sindice.com/
Modeling spectrum: lightweight to heavyweight
An ontology “spectrum” (in order of complexity). Source: [Lassila and McGuinness, 2001]. Image from Bojars 2009.
Some popular vocabularies
• DC
• BIBO
• FOAF
• LODE (Linked Events)
• OAI-ORE
• SKOS
Be aware of & connect to
• Authority data– e.g. VIAF
• Thesauri– e.g. Agrovoc
• Linked Data is about Linking!
Modeling examples
• BIBFRAME
• British Library Data Model
• EDM
• LIBRIS
• VIAF
LIBRIS Modeling
British Library Data Model: http://www.bl.uk/bibliographic/pdfs/bldatamodelbook.pdf
Semantic Web for Digital Libraries: Geographical LD case study
• Collections refer to Geographical Data in many forms…
• The Longfield Maps are a set of 1,570 surveys carried out in Ireland between 1770 and 1840.
• Currently catalogued in MARCXML, using data from Logainm, GeoNames and DBpedia.
The NLI Longfield Map Collection
<marc:datafield tag="650" ind1="" ind2="">
<marc:subfield code="a">Land tenure</marc:subfield>
<marc:subfield code="z">Ireland</marc:subfield>
<marc:subfield code="z">Rathdown (Barony)</marc:subfield>
</marc:datafield>
<marc:datafield tag="650" ind1="" ind2="">
<marc:subfield code="a">Land use surveys</marc:subfield>
<marc:subfield code="z">Ireland</marc:subfield>
<marc:subfield code="z">Wicklow (County)</marc:subfield>
</marc:datafield>
Longfield Map example
DBpedia
– Includes latitude and longitude for geographic entities
LinkedGeoData
– Export of data from OpenStreetMap
– Beyond lat/lon (areas as polygons)
GeoNames
– Access data as RDF (download requires subscription)
Geographic Data Providers
GeoLinkedData Spain
Ordnance Survey UK
• The authority list of Irish place names, validated by the Place Names Branch.
• Delivers a more detailed level than DBpedia or GeoNames.
• Unique source of Irish-language place names.
• NLI is looking to integrate Logainm data into their workflow, allowing search for place names in Irish.
Logainm.ie
• W3C Geo (very basic)
– SpatialThing, latitude and longitude
• Most providers have defined their own
• NeoGeo (http://geovocab.org/doc/neogeo/)
– Feature vs Geometry
– Spatial Relations (is_part_of)
Geo-Vocabularies
NeoGeo Overview
• Classes
– Feature (spatial:Feature)
• A geographical feature, capable of holding spatial relations.
– Geometry (geom:Geometry)
• Super-class of all geometrical representations (RDF, KML, GML, WKT...).
• Connected by the geometry property (geom:geometry)
Relations between geometries
Properties
• connects with (spatial:C)
• overlaps (spatial:O)
• is part of (spatial:P)
• contains (spatial:Pi)
• …
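A relation like "is part of" is useful precisely because it chains: if Rathdown is part of Wicklow and Wicklow is part of Ireland, Rathdown is part of Ireland. A sketch of computing that transitive closure over a toy set of facts (place names and the `is_part_of` helper are illustrative; the data is assumed acyclic).

```python
# Toy spatial:P ("is part of") facts.
part_of = {
    ("Rathdown", "Wicklow"),
    ("Wicklow", "Ireland"),
}

def is_part_of(place, region, facts):
    # Direct fact, or reachable through an intermediate place.
    if (place, region) in facts:
        return True
    return any(is_part_of(mid, region, facts)
               for (p, mid) in facts if p == place)

is_part_of("Rathdown", "Ireland", part_of)  # True, via Wicklow
```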
Creating a LD Dataset
Steps:
1. Data transformation / access
• Vocabulary assessment
2. Link Discovery
• Evaluation of generated links
3. Deployment
• Virtuoso OpenSource
[diagram: <http://data.logainm.ie/1375542> foaf:name "Dublin" ; owl:sameAs <http://sws.geonames.org/2964574/> .]
~100,000 place names
~1.3M triples
Converting Logainm to RDF
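The conversion behind those figures boils down to emitting a few triples per place record. A minimal sketch in Python: `place_to_ntriples` is a hypothetical helper, but the foaf:name / owl:sameAs property choices follow the diagram above.

```python
def place_to_ntriples(uri, name, same_as):
    # One place record -> a name triple plus one owl:sameAs link per match.
    lines = ['<%s> <http://xmlns.com/foaf/0.1/name> "%s" .' % (uri, name)]
    for target in same_as:
        lines.append('<%s> <http://www.w3.org/2002/07/owl#sameAs> <%s> .'
                     % (uri, target))
    return lines

triples = place_to_ntriples(
    "http://data.logainm.ie/1375542", "Dublin",
    ["http://sws.geonames.org/2964574/"])
```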
Link Discovery
• Silk
– http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/
• LIMES
– http://aksw.org/Projects/LIMES.html
• Based on specifying rules that compare pairs of entities
• Rules based on:
– Place names
– Geographical coordinates
– Name of the county / parent place name
– Hierarchy of places
Rules to discover links to other datasets
• # entities matched:
– DBpedia: 1,552
– LinkedGeoData: 6,611
– GeoNames: 8,229
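One link-discovery rule of the kind Silk or LIMES would express, sketched in plain Python: treat two records as the same place when their normalised names agree and their coordinates lie within a few kilometres. The 5 km threshold, the `same_place` helper, and the sample coordinates are illustrative assumptions, not the rules actually used for Logainm.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/lon points, in kilometres.
    lat1, lon1, lat2, lon2 = map(math.radians, (lat1, lon1, lat2, lon2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 6371 * 2 * math.asin(math.sqrt(a))

def same_place(rec1, rec2, max_km=5.0):
    # Rule: names match after normalisation AND coordinates are close.
    if rec1["name"].strip().lower() != rec2["name"].strip().lower():
        return False
    return haversine_km(rec1["lat"], rec1["lon"],
                        rec2["lat"], rec2["lon"]) <= max_km

logainm = {"name": "Dublin", "lat": 53.3498, "lon": -6.2603}
geonames = {"name": "Dublin", "lat": 53.3331, "lon": -6.2489}
same_place(logainm, geonames)  # True: same name, roughly 2 km apart
```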
<marc:datafield tag="650" ind1="" ind2="">
<marc:subfield code="a">Land tenure</marc:subfield>
<marc:subfield code="z">Ireland</marc:subfield>
<marc:subfield code="z">Rathdown (Barony)</marc:subfield>
</marc:datafield>
<marc:datafield tag="650" ind1="" ind2="">
<marc:subfield code="a">Land use surveys</marc:subfield>
<marc:subfield code="z">Ireland</marc:subfield>
<marc:subfield code="z">Wicklow (County)</marc:subfield>
</marc:datafield>
<marc:datafield tag="651" ind2="7" ind1="">
<marc:subfield code="2">logainm.ie</marc:subfield>
<marc:subfield code="a">Rathdown</marc:subfield>
<marc:subfield code="0">http://data.logainm.ie/place/283</marc:subfield>
</marc:datafield>
Longfield Map example
Demo: Location LODer: http://apps.dri.ie/locationLODer/locationLODer
Hands-on Activities
11:50 – 12:25: Choice of Activities…
• Data Modelling
• Data Cleaning & Structuring
• Querying (SPARQL)
Semantic Web for Digital Libraries: OpenRefine Exercise
OpenRefine
• Useful for batch transformation of large amounts of data
– data cleanup (misspellings, splitting multiple-valued columns, …)
• Linking to other databases
– Freebase
– Any SPARQL-enabled LD source
• Website: http://openrefine.org/
• RDF extension: http://refine.deri.ie/
Exercise
• Examples from: http://freeyourmetadata.org/
• Sample Data (collection metadata from the Sydney Powerhouse Museum): http://data.freeyourmetadata.org/powerhouse-museum/phm-collection.zip
• Screencast: http://www.youtube.com/watch?v=NnCA1dnCT-c
Task 1 - Data Cleanup
1. Import the collection into OpenRefine
2. Get to know your data
3. Remove blank rows
4. Remove duplicate rows
5. Split cells with multiple values
6. Remove blank cells
7. Cluster values
8. Remove double category values
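Several of these steps can be sketched in plain Python on a toy table, to make the intent of each OpenRefine operation concrete (the column names and values are made up, and real cleanup should of course be done in OpenRefine itself):

```python
rows = [
    {"Object Title": "teapot", "Categories": "Ceramics|Ceramics|Tea ware"},
    {"Object Title": "teapot", "Categories": "Ceramics|Ceramics|Tea ware"},
    {"Object Title": "", "Categories": ""},
]

# Steps 3-4: remove blank rows and duplicate rows.
seen, cleaned = set(), []
for row in rows:
    key = tuple(row.values())
    if any(row.values()) and key not in seen:
        seen.add(key)
        cleaned.append(row)

# Steps 5 and 8: split multi-valued cells, dropping double category values.
for row in cleaned:
    row["Categories"] = sorted(set(row["Categories"].split("|")))
```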
Task 2 - Data Reconciliation & RDF Export
1. Pick a column to reconcile
2. Pick a vocabulary to reconcile with
3. Tell OpenRefine about the vocabulary
4. Start the reconciliation process
5. Understanding the reconciliation results
6. Interpreting the new reconciliation results
7. Exporting RDF
Semantic Web for Digital Libraries: SPARQL Hands-on Session
SPARQL
• Query Language for RDF data
• W3C Standard
• Components of a SPARQL Query:
– Prefix Declarations
– Result type (SELECT, CONSTRUCT, DESCRIBE, ASK)
– Dataset
– Query pattern
– Solution modifiers
Further information
• In-Depth SPARQL tutorials
– http://www.cambridgesemantics.com/semantic-university/sparql-by-example
– http://axel.deri.ie/presentations/20100922SPARQL1.1Tutorial.pptx
– http://web.ing.puc.cl/~marenas/talks/BNCOD13.pdf
• SPARQL:
– http://sparql.org/ (Jena)
– http://dydra.org/
SPARQL by example – Europeana Endpoint
Endpoint: http://europeana.ontotext.com/sparql
1. SPARQL Select template
2. List of data providers having contributed content to Europeana
3. List of provided objects with their aggregators
4. 18th century Europeana objects from France
5. Write your own