TPDL 2013 Tutorial: Linked Data for Digital Libraries (2013-10-22)


DESCRIPTION

Tutorial on Linked Data for Digital Libraries, given by me, Uldis Bojars, and Nuno Lopes in Valletta, Malta at TPDL2013 on 2013-10-22. http://tpdl2013.upatras.gr/tut-lddl.php This half-day tutorial is aimed at academics and practitioners interested in creating and using Library Linked Data. Linked Data has been embraced as the way to bring complex information onto the Web, enabling discoverability while maintaining the richness of the original data. This tutorial will offer participants an overview of how digital libraries are already using Linked Data, followed by a more detailed exploration of how to publish, discover and consume Linked Data. The practical part of the tutorial will include hands-on exercises in working with Linked Data and will be based on two main case studies: (1) linked authority data and VIAF; (2) place name information as Linked Data. For practitioners, this tutorial provides a greater understanding of what Linked Data is, and how to prepare digital library materials for conversion to Linked Data. For researchers, this tutorial updates the state of the art in digital libraries, while remaining accessible to those learning Linked Data principles for the first time. For library and iSchool instructors, the tutorial provides a valuable introduction to an area of growing interest for information organization curricula. For digital library project managers, this tutorial provides a deeper understanding of the principles of Linked Data, which is needed for bespoke projects that involve data mapping and the reuse of existing metadata models.

TRANSCRIPT

Linked Data for Digital Libraries

Uldis Bojars, Nuno Lopes, & Jodi Schneider
TPDL 2013

September 22, 2013, Valletta, Malta


Nuno: Digital Repository of Ireland & DERI

Uldis: National Library of Latvia

Jodi: DERI

Schedule for the day
9:00 - Introduction of presenters, tutorial schedule, and learning outcomes
9:10 - Motivation and concepts of Linked Data
9:30 - Discuss: How would you envision using Linked Data in your institution?
9:45 - Lifecycle of Linked Data & Exploring Linked Data
10:10 - Case Study 1: Authority Data

10:30 – 11:00 COFFEE BREAK

11:00 - Recap
11:10 - Modelling data as Linked Data
11:30 - Case Study 2: Geographical Linked Data
11:50 - Choice of Hands-on Activities
12:25 - Conclusions

Hands-on Activities

11:50 – 12:25: Choice of Activities…

• Data Modelling
• Data Cleaning & Structuring
• Querying (SPARQL)

Please share your expertise!

• In the room
• On paper
• Online - shared folder: http://tinyurl.com/tpdl2013-ld-notes
  – PDF of the programme
  – Shared notes
  – More materials later

• What is Linked Data? Why use it?
• What are some examples of Linked Data in Digital Libraries?
• What are the best practices for exploring & creating Linked Data?

Objectives for Today

Motivation and concepts of Linked Data

• Using identifiers
  – to enable access
  – to add structure
  – to link to other stuff

What is Linked Data?

Why use Linked Data?

Key technology for library data!
Representing, publishing, exchanging
• Powerful querying
• Ability to mix/match vocabularies
• Same technology stack as everybody else
– Findability
– Interoperability

Who is using Linked Data?

Aggregators

Integrated Library Systems & OPACs

Thesauri

Repositories

What is Linked Data (redux)?

Rob Styles

Towards RDF

Diagram: an RDF triple (Subject, Predicate, Object)

Diagram: an RDF graph (multiple triples connected through shared subjects and objects)
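
For concreteness, here is a single triple written out in Turtle (an illustrative snippet, not taken from the slides): the subject and predicate are URIs, and the object here is a literal.

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# subject                               # predicate   # object
<http://dbpedia.org/resource/Valletta>  rdfs:label    "Valletta"@en .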

Reuses the existing Web infrastructure to publish your data along with your documents:

– Using URI identifiers
– and HTTP for accessing the information

How Linked Data works

Linked Data Principles

1. Use URIs as names for things.
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).
4. Include links to other URIs, so that they can discover more things.
(a small Turtle sketch of principles 3 and 4 follows below)

http://www.w3.org/wiki/LinkedData
http://www.w3.org/DesignIssues/LinkedData
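
As a simplified sketch of principles 3 and 4 in Turtle (illustrative, not an exact copy of what the endpoint returns): dereferencing a DBpedia URI yields useful RDF, including links to further URIs.

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# Looking up the HTTP URI returns useful information about the thing it names ...
<http://dbpedia.org/resource/Valletta>
    rdfs:label            "Valletta"@en ;
    # ... including links to other URIs, so that clients can discover more
    foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/Valletta> .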

• We need a proper infrastructure for a real Web of Data
  – data is available on the Web
    • accessible via standard Web technologies
  – data is interlinked over the Web
    – i.e., data can be integrated over the Web

• We need Linked Data

Data on the Web is not enough…

Slide credit: Ivan Herman

In groups of 2-3: Discuss

• How would you envision using Linked Data? What are the opportunities?

• Is your institution already using Linked Data? Planning a Linked Data project?

Lifecycle of Linked Data

Lifecycle of Linked Data

• Find
• Explore
• Transform
• Model
• Store
• Query
• Interlink
• Publish

Uldis Bojars, Nuno Lopes, & Jodi Schneider

Semantic Web for Digital Libraries

Exploring Linked Data (Practical Tools and Approaches)

Objectives

• Learn about Linked Data (LD) by looking at existing data sources

• Discover tools and approaches for exploring Linked Data

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Exploring Linked Data

• Discovering Linked Data
• Accessing RDF data
• Making sense of the data
  – Validating RDF data
  – Converting between formats
  – Browsing Linked Data

• Querying RDF data

RDF graph

What RDF looks like

• RDF can be expressed in a number of formats:
  – some are good for machines; some are understandable to people
• Common formats (compare the two snippets below):
  – RDF/XML – common, but difficult to read
  – NTriples – a simple list of RDF triples
  – Turtle – human-readable, easier to understand

• Can be represented visually
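
To make the difference concrete, here is the same statement in N-Triples and in Turtle (an illustrative snippet, not from the slides):

# N-Triples: one complete triple per line, everything written out in full
<http://dbpedia.org/resource/Valletta> <http://www.w3.org/2000/01/rdf-schema#label> "Valletta"@en .

# Turtle: prefixes make the same triple easier for people to read
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
<http://dbpedia.org/resource/Valletta> rdfs:label "Valletta"@en .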

Accessing RDF data

RDF data on the Web can be found as:

• Linked Data
  – follow links, request data by URI
  – returned data can be in various RDF formats
• Data dumps
  – download the data
• SPARQL endpoints
  – query Linked Data (more on that later)

http://www.ivan-herman.net/

Discovering Linked Data

a) find a link in a Web page
b) have some tools alert you Linked Data is there
  – Tabulator
  – Semantic Radar

c) explore a project you heard about – and know LOD should be there

d) use a registry of sources http://datahub.io/group/lodcloud

e) Just ask someone

RDF discovery example

• data at Ivan Herman’s page can be found via:
  – finding the RDF icon (with the link to the FOAF file)
  – letting browser tools alert you that RDF is present
    • RDF auto-discovery
  – extracting RDFa data embedded in the page
• for other data sources RDF content negotiation might work

Making sense of the data

• Validating RDF data
  – Ensures that data representation is correct
• Converting between formats
  – Convert to a [more] human-readable RDF format
• Browsing Linked Data
  – Browse the data without worrying about “reading” RDF

Validating and Converting RDF

• W3C RDF validator http://www.w3.org/RDF/Validator/

• URI debugger – “Swiss knife” of Linked Data http://linkeddata.informatik.hu-berlin.de/uridbg/

• RDFa distiller – extracts RDF embedded in web pages http://www.w3.org/2012/pyRdfa/

• Command-line tools (we’ll return to that)

<http://www.ivan-herman.net/> a foaf:PersonalProfileDocument ;
    dc:creator "Ivan Herman" ;
    dc:date "2009-06-17"^^xsd:date ;
    dc:title "Ivan Herman’s home page" ;
    xhv:stylesheet <http://www.ivan-herman.net/Style/gray.css> ;
    foaf:primaryTopic <http://www.ivan-herman.net/foaf#me> .

<http://twitter.com/ivan_herman> a foaf:OnlineAccount ;
    foaf:accountName "ivan_herman" ;
    foaf:accountServiceHomepage <http://twitter.com/> .

<http://www.ivan-herman.net/cgi-bin/rss2to1.py> a rss:channel .

<http://www.ivan-herman.net/foaf#me> a dc:Agent ,
        foaf:Person ;
    rdfs:seeAlso <http://www.ivan-herman.net/AboutMe> ,
        <http://www.ivan-herman.net/cgi-bin/rss2to1.py> ,
        <http://www.ivan-herman.net/foaf.rdf> ; ...

Extracted from http://www.ivan-herman.net/ using RDFa Distiller

Browsing Linked Data (DBpedia): http://live.dbpedia.org/resource/Valletta

Command Line Tools

• wget – command line network downloader
  $ wget http://dbpedia.org/resource/Valletta
• curl – specify HTTP headers
  $ curl -L -H "Accept: text/rdf+n3" http://dbpedia.org/resource/Valletta
• Redland rapper – RDF parsing and serialisation
  $ rapper -o turtle http://dbpedia.org/resource/Valletta

Querying Linked Data

• SPARQL Protocol and RDF Query Language
• Graph Matching
• Components of a SPARQL Query (a minimal example follows below):
  – Prefix Declarations
  – Result type (SELECT, CONSTRUCT, DESCRIBE, ASK)
  – Dataset
  – Query pattern
  – Solution modifiers
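
A minimal example labelling the components above; it is a sketch that could be run against a general-purpose endpoint such as DBpedia, not part of the original tutorial materials.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>     # prefix declaration

SELECT ?label                                            # result type: SELECT
WHERE {                                                  # query pattern (graph matching)
  <http://dbpedia.org/resource/Valletta> rdfs:label ?label .
}
LIMIT 10                                                 # solution modifier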

Europeana SPARQL endpoint

http://europeana.ontotext.com/

http://tinyurl.com/europeana-rights-sparql

Tool catalogues: many more tools

• Collection of tools from other projects
  – http://www.w3.org/2001/sw/wiki/LLDtools
  – http://www.w3.org/2001/sw/wiki/Tools
  – http://semanticweb.org/wiki/Tools
  – http://dbpedia.org/Applications

Interesting Projects
• LOCAH – a stylesheet to transform UK Archives Hub EAD to RDF/XML, with examples of the process using XSLT. http://data.archiveshub.ac.uk/ead2rdf/

• AliCAT (Archival Linked-data Cataloguing) – tool for editing collection-level records. http://data.aim25.ac.uk/step-change/

• Axiell CALM – a solution for libraries, archives, and museums (LAM) that includes Linked Data functionality, allowing archivists to tag their collections with URIs from any chosen Linked Data dataset.

http://www.axiell.com/calm

Tools for Converting MARC records

• MariMba – tool to translate MARC to RDF and Linked Data. http://mayor2.dia.fi.upm.es/oeg-upm/index.php/en/downloads/228-marimba

• marcauth-2-madsrdf – XQuery utility to convert MARC/XML Authority records to MADS/RDF and SKOS resources. https://github.com/kefo/marcauth-2-madsrdf

Tools for museum curators

• Karma (http://isi.edu/integration/karma/) was used to map the records of the Smithsonian American Art Museum to RDF and link them to the Web and the Linked Open Data Cloud. Demo: http://www.youtube.com/watch?v=kUIqTI56oeQ

Authority Linked Data

VIAF and Wikipedia case study

Library links (Slide credit: Jindřich Mynarz)

• Use a single, distinct name for each person, organization, …

• Name is consistently used throughout library systems

• Issues:
  – “Strings” not “things”
  – in the Linked Data world we’d just use URIs

http://viaf.org

VIAF

• Virtual International Authority File (viaf.org)

• Integrating authority information from a number of national libraries
  – Linked data + links to related information
• Matching authority data from multiple sources
  – using related bibliographic records to help matching

Wikipedia + VIAF

• How can people discover useful information in VIAF and via VIAF?

• Linked Data eco-system – let’s explore (!)
  – Wikipedia -> VIAF -> National Library LD
• Example (Andrejs Pumpurs):
  – http://en.wikipedia.org/wiki/Andrejs_Pumpurs
  – http://viaf.org/viaf/44427367/

http://en.wikipedia.org/wiki/Andrejs_Pumpurs

http://viaf.org/viaf/44427367/

VIAF

• Ontologies used:
  – FOAF, SKOS, RDA (FRBR entities and elements), Dublin Core, VIAF, UMBEL
• Related datasets:
  – National authority data:
    • Germany (d-nb.info), Sweden (LIBRIS), France (idref.fr)
  – DBpedia

http://viaf.org/viaf/44427367/

How did VIAF get into Wikipedia?

• VIAFbot
  – algorithmically matched by name, important dates, and selected works
• “The principal benefit of VIAFbot is the interconnected structure.”

One Direction

VIAF English Wiki

Slide credit: Maximilian Klein, Wikipedian in Residence at OCLC

Enter VIAFBot: Wikipedia Robot

VIAF English Wiki

Slide credit: Maximilian Klein, Wikipedian in Residence at OCLC

Idea: Reciprocate

VIAF English Wiki

Slide credit: Maximilian Klein, Wikipedian in Residence at OCLC

VIAF – summary:

– an efficient way of putting library authority data online as linked data
– if an organization also publishes its own Linked Data, VIAF can include links back to that organization’s LD records (which may contain richer / additional information)

Data Modelling

Publishing Data

• Naïve Transform
  – Direct Mapping of Relational Data to RDF (see RDB2RDF)
OR
• Model & Transform
  – Figure out how to represent data
  – Then transform according to the model

Model

• Describe the domain
  – What are the important concepts?
  – What are their properties?
  – What are their relations?

• Choose vocabularies

DC TERMS RDF Vocabulary
http://purl.org/dc/terms/

Deciding on URI patterns

• Use a domain that you control
• Use consistent patterns
• Manage change: transparent isn’t always best
• Consider what concepts are worth distinguishing

Example URI patterns

• Designing URI Sets for the UK Public Sector
• Defines patterns for
  – Identifier URI
  – Document URI
  – Representation URI
• Identifier example: http://{domain}/id/{concept}/{reference}

http://data.archiveshub.ac.uk/id/person/ncarules/skinnerbeverley1938-1999artist

Choosing Vocabularies

• Audience & Purpose – e.g. search engine vs. bibliographic exchange

• Domain
  – Biomedical, geographical, …
• Granularity
• Popularity: potential for interlinking & reuse

Finding vocabularies & ontologies

Look at examples

Look at examples

Find examples: Linked Open Data Cloud

Ask the community

• Mailing lists
  – LOD-LAM
  – Code4Lib
  – OKFN Open-Bibliography Working Group
  – W3C Schema.org BibEx Community Group

• Domain-specific Linked Data groups & lists

Popularity

Popularity: Semantic search engines

http://sindice.com/

Modeling spectrum: lightweight to heavyweight

An ontology “spectrum” (in order of complexity). Source: [Lassila and McGuinness, 2001]. Image from Bojars 2009.

Some popular vocabularies

• DC
• BIBO
• FOAF
• LODE (Linked Events)
• OAI-ORE
• SKOS
(a small example mixing several of these follows below)
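
An illustrative Turtle description of a book that mixes several of these vocabularies; the example.org URIs and the record itself are hypothetical, chosen only to show how DC Terms, BIBO and FOAF can be combined.

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix bibo:    <http://purl.org/ontology/bibo/> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .

<http://example.org/id/book/1234>
    a               bibo:Book ;
    dcterms:title   "An Example Book About Linked Data" ;
    dcterms:issued  "2013" ;
    dcterms:creator <http://example.org/id/person/42> .

<http://example.org/id/person/42>
    a         foaf:Person ;
    foaf:name "Example Author" .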

Be aware of & connect to

• Authority data
  – e.g. VIAF
• Thesauri
  – e.g. Agrovoc

• Linked Data is about Linking!

Modeling examples

• BIBFRAME
• British Library Data Model
• EDM
• LIBRIS
• VIAF

VIAF

• Ontologies used:
  – FOAF, SKOS, RDA (FRBR entities and elements), Dublin Core, VIAF, UMBEL
• Related datasets:
  – National authority data:
    • Germany (d-nb.info), Sweden (LIBRIS), France (idref.fr)
  – DBpedia

LIBRIS Modeling

British Library Data Model
http://www.bl.uk/bibliographic/pdfs/bldatamodelbook.pdf

Uldis Bojars, Nuno Lopes, & Jodi Schneider

Semantic Web for Digital Libraries
Geographical LD case study

• Collections refer to Geographical Data in many forms…

• The Longfield Maps are a set of 1,570 surveys carried out in Ireland between 1770 and 1840.

• Currently catalogued in MARCXML, using data from Logainm, GeoNames and DBpedia.

The NLI Longfield Map Collection

<marc:datafield tag="650" ind1="" ind2="">

<marc:subfield code="a">Land tenure</marc:subfield>

<marc:subfield code="z">Ireland</marc:subfield>

<marc:subfield code="z">Rathdown (Barony)</marc:subfield>

</marc:datafield>

<marc:datafield tag="650" ind1="" ind2="">

<marc:subfield code="a">Land use surveys</marc:subfield>

<marc:subfield code="z">Ireland</marc:subfield>

<marc:subfield code="z">Wicklow (County)</marc:subfield>

</marc:datafield>

Longfield Map example

DBpedia
  – Includes latitude and longitude for geographic entities
LinkedGeoData
  – Export of data from OpenStreetMap
  – Beyond lat/lon (areas as polygons)
GeoNames
  – Access data as RDF (download requires subscription)

Geographic Data Providers

GeoLinkedData Spain

Ordnance Survey UK

• The authority list of Irish place names, validated by the Place Names Branch.

• Delivers a more detailed level of coverage than DBpedia or GeoNames.

• Unique source of Irish language place names.

• NLI is looking to integrate Logainm data into their workflow, allowing search for place names in Irish.

Logainm.ie

• W3C Geo (very basic)
  – SpatialThing, latitude and longitude (see the sketch below)
• Most providers have defined their own
• NeoGeo (http://geovocab.org/doc/neogeo/)
  – Feature vs Geometry
  – Spatial Relations (is_part_of)

Geo-Vocabularies
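
Before turning to NeoGeo, a minimal sketch using the basic W3C Geo vocabulary mentioned above (the coordinates are approximate and the statement is illustrative, not drawn from any of the datasets):

@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .

<http://dbpedia.org/resource/Valletta>
    a        geo:SpatialThing ;
    geo:lat  "35.898" ;
    geo:long "14.514" .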

NeoGeo Overview

• Classes
  – Feature (spatial:Feature)
    • A geographical feature, capable of holding spatial relations.
  – Geometry (geom:Geometry)
    • Super-class of all geometrical representations (RDF, KML, GML, WKT...)
• Connected by the geometry (geom:geometry)

Relations between geometries

Properties
• connects with (spatial:C)
• overlaps (spatial:O)
• is part of (spatial:P)
• contains (spatial:Pi)
• …
(a short sketch follows below)
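
A sketch of how NeoGeo separates features from geometries and expresses spatial relations; the prefixes follow geovocab.org, but the example.org resources are hypothetical.

@prefix spatial: <http://geovocab.org/spatial#> .
@prefix geom:    <http://geovocab.org/geometry#> .

# A feature (a place), its geometry, and an "is part of" relation
<http://example.org/id/place/rathdown>
    a             spatial:Feature ;
    geom:geometry <http://example.org/id/geometry/rathdown> ;
    spatial:P     <http://example.org/id/place/wicklow> .

<http://example.org/id/geometry/rathdown>
    a geom:Geometry .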

Creating a LD Dataset

Steps:
1. Data transformation / access
   • Vocabulary assessment
2. Link Discovery
   • Evaluation of generated links
3. Deployment
   • Virtuoso OpenSource

Diagram: <http://data.logainm.ie/1375542> with foaf:name "Dublin" and owl:sameAs <http://sws.geonames.org/2964574/> (written out as Turtle below)

~100,000 place names, ~1.3M triples

Converting Logainm to RDF
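
Written out as Turtle, the example in the diagram above looks roughly like this (a simplified sketch using only the properties visible on the slide):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .

<http://data.logainm.ie/1375542>
    foaf:name  "Dublin" ;
    owl:sameAs <http://sws.geonames.org/2964574/> .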

Link Discovery

• Silk
  – http://wifo5-03.informatik.uni-mannheim.de/bizer/silk/
• LIMES
  – http://aksw.org/Projects/LIMES.html

• Based on specifying rules that compare pairs of entities

• Rules based on:
  – Place names
  – Geographical coordinates
  – Name of the county / parent place name
  – Hierarchy of places

Rules to discover links to other datasets

• # entities matched:
  – DBpedia: 1,552
  – LinkedGeoData: 6,611
  – GeoNames: 8,229
(a query sketch for inspecting such links follows below)
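
One way to inspect the generated links, assuming the converted Logainm data and its owl:sameAs links have been loaded into a SPARQL endpoint (a sketch, not part of the tutorial materials):

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Count the owl:sameAs links that point at DBpedia resources
SELECT (COUNT(?target) AS ?dbpediaLinks)
WHERE {
  ?place owl:sameAs ?target .
  FILTER (STRSTARTS(STR(?target), "http://dbpedia.org/resource/"))
}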

<marc:datafield tag="650" ind1="" ind2="">

<marc:subfield code="a">Land tenure</marc:subfield>

<marc:subfield code="z">Ireland</marc:subfield>

<marc:subfield code="z">Rathdown (Barony)</marc:subfield>

</marc:datafield>

<marc:datafield tag="650" ind1="" ind2="">

<marc:subfield code="a">Land use surveys</marc:subfield>

<marc:subfield code="z">Ireland</marc:subfield>

<marc:subfield code="z">Wicklow (County)</marc:subfield>

</marc:datafield>

<marc:datafield tag="651" ind2="7" ind1="">

<marc:subfield code="2">logainm.ie</marc:subfield>

<marc:subfield code="a">Rathdown</marc:subfield>

<marc:subfield code="0”>http://data.logainm.ie/place/283</marc:subfield>

</marc:datafield>

Longfield Map example

Demo: Location LODer
http://apps.dri.ie/locationLODer/locationLODer

Hands-on Activities

11:50 – 12:25: Choice of Activities…

• Data Modelling
• Data Cleaning & Structuring
• Querying (SPARQL)

Uldis Bojars, Nuno Lopes, & Jodi Schneider

Semantic Web for Digital Libraries
Open Refine Exercise

Open Refine

• Useful for batch transformation of large amounts of data
  – data cleanup (misspellings, splitting multiple-valued columns, …)
• Linking to other databases
  – Freebase
  – Any SPARQL-enabled LD source
• Website: http://openrefine.org/
• RDF extension: http://refine.deri.ie/

Task 1 - Data Cleanup

1. Import the collection into OpenRefine
2. Get to know your data
3. Remove blank rows
4. Remove duplicate rows
5. Split cells with multiple values
6. Remove blank cells
7. Cluster values
8. Remove double category values

Task 2 - Data Reconciliation & RDF Export

1. Pick a column to reconcile
2. Pick a vocabulary to reconcile with
3. Tell OpenRefine about the vocabulary
4. Start the reconciliation process
5. Understanding the reconciliation results
6. Interpreting the new reconciliation results
7. Exporting RDF

Uldis Bojars, Nuno Lopes, & Jodi Schneider

Semantic Web for Digital Libraries
SPARQL Hands-on Session

SPARQL

• Query Language for RDF data
• W3C Standard
• Components of a SPARQL Query (see the sketch below):
  – Prefix Declarations
  – Result type (SELECT, CONSTRUCT, DESCRIBE, ASK)
  – Dataset
  – Query pattern
  – Solution modifiers
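
As a reminder that SELECT is not the only result type, the sketch below uses ASK, which simply returns true or false; it could be run against a general-purpose endpoint such as DBpedia (illustrative only, not from the tutorial materials).

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Is anything labelled "Valletta" in English?
ASK
WHERE {
  ?something rdfs:label "Valletta"@en .
}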

SPARQL by example – Europeana Endpoint

Endpoint: http://europeana.ontotext.com/sparql

1. SPARQL Select template
2. List of data providers having contributed content to Europeana (a starting sketch follows below)
3. List of provided objects with their aggregators
4. 18th century Europeana objects from France
5. Write your own
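
A hedged starting point for exercise 2: the Europeana Data Model defines an edm:dataProvider property on aggregations, but the exact graph layout on this endpoint may differ, so treat the query below as a sketch to adapt rather than a ready-made answer.

PREFIX edm: <http://www.europeana.eu/schemas/edm/>

# Data providers that have contributed content to Europeana
SELECT DISTINCT ?provider
WHERE {
  ?aggregation edm:dataProvider ?provider .
}
LIMIT 100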
