Methodological Guidelines for Publishing Linked Data presented at CONSEGI 2011

Methodological Guidelines for Publishing Linked Data

Boris Villazón-Terrazas, Asunción Gómez-Pérez, and Óscar Corcho

Facultad de Informática, Universidad Politécnica de Madrid
Campus de Montegancedo sn, 28660 Boadilla del Monte, Madrid

http://www.oeg-upm.net
{bvillazon,asun,ocorcho}@fi.upm.es

Phone: 34.91.3366605, Fax: 34.91.3524819

CONSEGI 2011 – Brasília, Brazil
12th May, 2011

ToC

• Introduction to Linked Data
• Guidelines for Publishing Linked Data
• Demo

Classic Web

• Data exposed to the Web via HTML, PDF, etc.

[Figure: MovieDB and CIA World FactBook as separate data sources]

© Slide adapted from “5min Introduction to Linked Data”- Olaf Hartig

Classic Web

• Information from single pages can be found via search engines
• Complex queries over multiple pages / data sources?

© Slide adapted from “5min Introduction to Linked Data”- Olaf Hartig

What do we actually want?

• Use the Web like a single global database

[Figure: MovieDB and CIA World FactBook]

© Slide adapted from “5min Introduction to Linked Data”- Olaf Hartig

Linked Data enables such a Web of Data

• Global Identifier: URI (Uniform Resource Identifier), which is a string of characters used to identify a name or a resource on the Internet.
• Data Model: RDF (Resource Description Framework), which is a standard model for data interchange on the Web
• Access Mechanism: HTTP
• Connection: Typed Links

[Figure: example graph linking MovieDB and CIA World FactBook. http://imdb.../TLLuvia has http://.../name “Even the Rain” and http://.../filming_location http://cia.../Bolivia, which has http://.../population 8000000.]

© Slide adapted from “5min Introduction to Linked Data”- Olaf Hartig
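As a minimal sketch of the data model, the example on this slide can be written as three RDF triples and serialized in N-Triples syntax. The example.org URIs are placeholders (assumptions) for the abbreviated URIs on the slide, and the helper names are hypothetical.

```python
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"

# Placeholder URIs (assumptions): the slide abbreviates the real ones.
MOVIE = "http://example.org/moviedb/TLLuvia"
COUNTRY = "http://example.org/factbook/Bolivia"
NAME = "http://example.org/vocab/name"
FILMING_LOCATION = "http://example.org/vocab/filming_location"
POPULATION = "http://example.org/vocab/population"

triples = [
    (MOVIE, NAME, "Even the Rain"),
    (MOVIE, FILMING_LOCATION, COUNTRY),  # typed link between two data sources
    (COUNTRY, POPULATION, 8000000),
]


def format_term(value):
    """Render a URI, an integer, or a plain string as an N-Triples term.

    Simplification: any string starting with "http://" is treated as a URI.
    """
    if isinstance(value, int):
        return f'"{value}"^^<{XSD_INTEGER}>'
    if value.startswith("http://"):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(rows):
    """Serialize (subject, predicate, object) tuples, one triple per line."""
    return "\n".join(
        f"{format_term(s)} {format_term(p)} {format_term(o)} ." for s, p, o in rows
    )


print(to_ntriples(triples))
```

The second triple is the typed link: its object is a URI from another data source, which is what lets a client move from the movie to the country.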

In a nutshell

• An extension of the current Web…
• … where information and services (data) are given well-defined and explicitly represented meaning, …
• … so that it can be shared and used by humans and machines, ...
• ... better enabling them to work in cooperation

• How?
  • Promoting information exchange by tagging web content with machine processable descriptions of its meaning.
  • And technologies and infrastructure to do this
  • And clear principles on how to publish data

The four principles (Tim Berners-Lee, 2006)

1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names.
3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
4. Include links to other URIs, so that they can discover more things.

• http://www.w3.org/DesignIssues/LinkedData.html
• http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
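Principles 2 and 3 can be illustrated with a small sketch of how a client might look up an HTTP URI and ask for RDF instead of HTML. The request is only built, not sent. That the server honours the `text/turtle` media type is an assumption, and `build_lookup_request` is a hypothetical helper name.

```python
from urllib.request import Request


def build_lookup_request(uri, media_type="text/turtle"):
    """Build (but do not send) an HTTP GET request that asks for RDF."""
    return Request(uri, headers={"Accept": media_type}, method="GET")


request = build_lookup_request("http://geo.linkeddata.es/resource/Provincia/Madrid")
print(request.get_method(), request.full_url)
print(request.get_header("Accept"))
# Sending it would be urllib.request.urlopen(request), which needs network access.
```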

So does that mean I have to publish my data as Linked Data, now?

• But, why?

• What was your incentive to publish an HTML page in 1990?
  • Share data in documents and because your neighbor was doing it

• So, why should we publish Linked Data in 2011?
  • Share data as data and because your neighbor is doing it

© Slide adapted from “Introduction to Linked Data”- Juan Sequeda

And guess who is starting to publish Linked Data now?

• UK Government
• US Government
• BBC
• Open Calais
• Freebase
• NY Times
• CNET
• DBpedia
• ….

Linked Open Data evolution

[Figure: Linked Open Data cloud diagrams for 2007, 2008 and 2009]

Linked Open Data

[Figure: Linked Open Data cloud diagram for 2010]

http://richard.cyganiak.de/2007/10/lod/

Linked Data in OEG

• GeoLinkedData is an open initiative whose aim is to enrich the Web of Data with Spanish geospatial data.
  http://geo.linkeddata.es

• El Viajero Linked Data is a project that focuses on the integration of the contents produced by newspapers and digital platforms belonging to Prisa Group.
  http://webenemasuno.linkeddata.es/

• A project with the Biblioteca Nacional to publish the library information as Linked Data.
  http://cultura.linkeddata.es/visualizer/

Linked Data in OEG

• Tools for generating and consuming Linked Data, e.g.,
  • geometry2rdf http://www.oeg-upm.net/index.php/downloads/151-geometry2rdf
  • map4rdf http://oegdev.dia.fi.upm.es/projects/map4rdf/

• Spanish Thematic Network of Linked Data http://red.linkeddata.es
  » Group leader: Ontology Engineering Group
  » 19 Research Groups
  » 4 companies

Guidelines for Publishing Linked Data

Identification of the data sources

• Guidelines based on the Open Data Manual 1

• Two possibilities

• To find the data sources already available in a public data catalog, e.g., Aporta project 2

• To reach an agreement with a particular government body to publish its data sources, e.g., GeoLinkedData - IGN

1 http://opendatamanual.org/
2 http://aporta.es

Identification of the data sources: GeoLinkedData

• Agreement with the IGN
  • IGN: National Geographic Institute of Spain
  • Oracle & MySQL

• Data sources available in a public data catalog
  • INE: National Statistic Institute of Spain

Identification of the data sources: IGN & INE

[Figure: table of the Industry Production Index by Province and Year]

Vocabulary Modelling: Ontology

• An ontology is an engineering artifact, which provides:
  • A set of terms
  • A set of explicit assumptions regarding the intended meaning of the terms.
    • Almost always including concepts and their classification
    • Almost always including properties between concepts

• Shared understanding of a domain of interest

Vocabulary Modelling: Reuse available vocabularies

• Search for suitable vocabularies (Linked Open Vocabularies)
• Are there suitable vocabularies?
  • Yes: Build the vocabulary by reusing available vocabularies
  • No: continue with non-ontological resources

Vocabulary Modelling: Reuse available non-ontological resources

• Search for suitable non-ontological resources
  • Highly reliable Web Sites
  • Domain-related sites
  • Government Catalogs
• Are there suitable resources?
  • Yes: Build the vocabulary by transforming available resources
  • No: Build the vocabulary from scratch
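The two decision slides can be read as one small procedure. The sketch below assumes that the "No" branch of the vocabulary search leads to the search for non-ontological resources. The function name and the returned strings are illustrative only.

```python
def choose_modelling_strategy(suitable_vocabularies, suitable_resources):
    """Return the vocabulary-building step suggested by the two decision slides."""
    if suitable_vocabularies:
        return "build the vocabulary by reusing available vocabularies"
    if suitable_resources:
        return "build the vocabulary by transforming available resources"
    return "build the vocabulary from scratch"


# Vocabulary reuse takes priority whenever a suitable vocabulary was found.
print(choose_modelling_strategy(["scv"], []))
print(choose_modelling_strategy([], ["government catalog"]))
print(choose_modelling_strategy([], []))
```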

Vocabulary Modelling: GeoLinkedData

• scv:Dimension, scv:Item, scv:Dataset
• WGS84 Geo Positioning: an RDF vocabulary
• hydrographical phenomena (rivers, lakes, etc.)
• Vocabulary for instants, intervals, durations, etc.
• Ontology for OGC Geography Markup Language
• Names and international code systems for territories and groups

Classes: 33
Object Properties: 44
Data Properties: 318

http://neon-toolkit.org/

Generation of the RDF Data

• INE → NOR2O
• IGN → ODEMapster
• IGN geospatial column → Geometry2RDF

Generation of the RDF Data: NOR2O

[Figure: NOR2O applied to the Industry Production Index table (Province, Year)]

Generation of the RDF Data: R2O & ODEMapster

• R2O is an extensible, fully declarative language to describe mappings between relational database schemas and ontologies.
• The ODEMapster processor generates RDF instances from relational instances based on the mapping description expressed in the R2O document

www.oeg-upm.net/index.php/en/downloads/9-r2o-odempaster

Generation of the RDF Data: R2O & ODEMapster

• Creation of the R2O Mappings

[Figure: Excerpt of the R2O document]
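The idea of a declarative mapping that a processor applies to relational rows can be sketched as follows. This is not R2O syntax and not how ODEMapster is implemented. The mapping dictionary, the table, its columns, the row values and the example.org namespace are all assumptions made up for illustration.

```python
import sqlite3
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/"  # placeholder namespace (assumption)

# Hypothetical mapping description. This is NOT R2O syntax; it only mimics the
# idea of declaring how a table and its columns map to a class and properties.
MAPPING = {
    "table": "province",
    "key_column": "name",
    "class_uri": BASE + "ontology/Province",
    "uri_template": BASE + "resource/Province/{key}",
    "properties": {"code": BASE + "ontology/code"},
}


def rows_to_triples(connection, mapping):
    """Generate one rdf:type triple plus one triple per mapped column, per row."""
    columns = [mapping["key_column"], *mapping["properties"]]
    query = f'SELECT {", ".join(columns)} FROM {mapping["table"]}'
    triples = []
    for row in connection.execute(query):
        key, values = row[0], row[1:]
        subject = mapping["uri_template"].format(key=quote(str(key), safe=""))
        triples.append((subject, RDF_TYPE, mapping["class_uri"]))
        for property_uri, value in zip(mapping["properties"].values(), values):
            triples.append((subject, property_uri, value))
    return triples


# In-memory database with one made-up row, standing in for the real sources.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE province (name TEXT, code TEXT)")
connection.execute("INSERT INTO province VALUES ('Madrid', 'P-01')")
for triple in rows_to_triples(connection, MAPPING):
    print(triple)
```

Keeping the mapping as data rather than code is the point of the declarative approach: the same processor can be reused for another table by changing only the mapping.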

Generation of the RDF Data: geometry2rdf

• Tool for generating RDF from geometrical information
• The geometry could be available in GML or WKT
• The RDF generated follows our Geometry Model

http://www.oeg-upm.net/index.php/en/downloads/151-geometry2rdf

Generation of the RDF Data: geometry2rdf

Oracle SDO_UTIL package

SELECT TO_CHAR(SDO_UTIL.TO_GML311GEOMETRY(geometry)) AS Gml311Geometry
FROM "BCN200"."BCN200_0301L_RIO" c
WHERE c.Etiqueta='Arroyo'

Generation of the RDF Data: Geometry Model

geoes: http://geo.linkeddata.es/
geo: http://www.w3.org/2003/01/geo/wgs84_pos#

• geo:Point rdfs:subClassOf geoes:ontology/Geometría, with geo:lat and geo:long
• geoes:ontology/Curva rdfs:subClassOf geoes:ontology/Geometría, formadoPor a collection of 2 or more geo:Points
• geoes:ontology/Polígono rdfs:subClassOf geoes:ontology/Geometría, formadoPor a collection of 3 or more geo:Points
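A minimal sketch of the constraint in the Geometry Model, assuming a curve needs at least two points and a polygon at least three. It is not the geometry2rdf implementation: it only parses a plain WKT LINESTRING, assumes coordinates are given as longitude then latitude, and uses made-up coordinates and hypothetical helper names.

```python
import re

# Minimum number of points per geometry class, as read from the model.
MIN_POINTS = {"Curva": 2, "Polígono": 3}


def parse_wkt_points(wkt):
    """Extract (longitude, latitude) pairs from a simple WKT LINESTRING."""
    match = re.fullmatch(r"\s*LINESTRING\s*\((.*)\)\s*", wkt)
    if match is None:
        raise ValueError("only plain LINESTRING is handled in this sketch")
    points = []
    for pair in match.group(1).split(","):
        x, y = pair.split()
        points.append((float(x), float(y)))
    return points


def build_geometry(kind, points):
    """Check the minimum point count, then describe the geometry as a dict."""
    if len(points) < MIN_POINTS[kind]:
        raise ValueError(f"{kind} needs at least {MIN_POINTS[kind]} points")
    return {
        "type": f"geoes:ontology/{kind}",
        "formadoPor": [{"geo:long": x, "geo:lat": y} for x, y in points],
    }


# Made-up coordinates for illustration.
curve = build_geometry("Curva", parse_wkt_points("LINESTRING (-3.7 40.4, -3.6 40.5)"))
print(curve["type"], len(curve["formadoPor"]))
```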

Generation of the RDF Data: RDF generated according to our Geometry Model

[Figure: example of RDF generated according to the Geometry Model]

Generation of the RDF Data: URI Generation

• URIs are extremely relevant in this process since they are the key for the alignment of heterogeneous resources that come from different data sources.
  • Cool URIs 1
  • UK Cabinet Office 2

• Examples:
  http://geo.linkeddata.es/ontology/{class/property}
  http://geo.linkeddata.es/ontology/Lago
  http://geo.linkeddata.es/resource/dataset/type/{resourcename}
  http://geo.linkeddata.es/resource/Provincia/Madrid

1 http://www.w3.org/TR/cooluris/
2 http://www.cabinetoffice.gov.uk/media/301253/puiblic sector uri.pdf
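The URI patterns above can be applied mechanically. The sketch below follows the shape of the two concrete examples on the slide (`ontology/Lago` and `resource/Provincia/Madrid`). The function names are hypothetical and this is not the project's actual tooling.

```python
from urllib.parse import quote

BASE_URI = "http://geo.linkeddata.es/"


def ontology_uri(term):
    """URI for a class or property, following ontology/{class/property}."""
    return f"{BASE_URI}ontology/{quote(term, safe='')}"


def resource_uri(resource_type, name):
    """URI for an instance, following the resource/Provincia/Madrid example."""
    return f"{BASE_URI}resource/{quote(resource_type, safe='')}/{quote(name, safe='')}"


print(ontology_uri("Lago"))
print(resource_uri("Provincia", "Madrid"))
```

Generating every URI through one pair of functions keeps the scheme consistent across data sources, which is what makes later alignment possible.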

Generation of the RDF Data: Provenance Information

• It is relevant
  • to manage the provenance information of the resources
  • to establish the license of the information

• Example

[Figure: provenance example]

Pubby: http://www4.wiwiss.fu-berlin.de/pubby/

Publication of the RDF data

• SPARQL: Virtuoso 6.1.0
• Linked Data / HTML: Pubby 0.3, including provenance support
  http://www4.wiwiss.fu-berlin.de/pubby/
• map4rdf
  http://oegdev.dia.fi.upm.es/projects/map4rdf/

Data Cleansing

• To find possible errors, identified by Hogan et al.
  • http-level issues, such as accessibility and dereferenceability, e.g., HTTP URIs return 40x/50x errors
  • reasoning issues, such as namespace without vocabulary, e.g., rss:item term invented
  • malformed/incompatible datatypes, e.g., “true” as xsd:int

• To fix the identified errors

• Example, encoding URIs
  • Special characters á, é, ñ
  • http://geo.linkeddata.es/resource/Provincia/M%C3%A1laga
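Two of the checks above lend themselves to a short sketch: percent-encoding special characters in a URI and detecting a literal whose lexical form does not fit its datatype. Both helpers are hypothetical, and the datatype check is deliberately rough.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_uri_path(uri):
    """Percent-encode non-ASCII characters in the path of a URI (UTF-8)."""
    parts = urlsplit(uri)
    # '%' stays safe so that already-encoded URIs are not encoded twice.
    return urlunsplit(parts._replace(path=quote(parts.path, safe="/%")))


def is_valid_xsd_int(lexical_form):
    """Rough check that a literal's lexical form can be an xsd:int."""
    try:
        value = int(lexical_form)
    except ValueError:
        return False
    return -2**31 <= value <= 2**31 - 1


print(encode_uri_path("http://geo.linkeddata.es/resource/Provincia/Málaga"))
print(is_valid_xsd_int("true"), is_valid_xsd_int("42"))
```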

Linking the RDF Data

• Identify suitable data sets as linking targets
  http://ckan.net

• Discover relationships between data items
  • Silk Framework http://www4.wiwiss.fu-berlin.de/bizer/silk/
  • LIMES http://aksw.org/Projects/limes

• Validate the relationships discovered
  • sameAs Validator http://oegdev.dia.fi.upm.es:8080/sameAs/

Linking the RDF Data: GeoLinkedData

[Figure: links between GeoLinkedData, GeoNames and DBpedia]

• GeoLinkedData: http://geo.linkeddata.es/.../Madrid
• GeoNames: http://sws.geonames.org/6355233/
• DBpedia: http://dbpedia.org/resource/Madrid

Linking the RDF Data: sameAs Validator

http://oegdev.dia.fi.upm.es:8080/sameAs/
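The discovery step can be illustrated with a deliberately naive matcher that proposes owl:sameAs candidates when two labels are equal after normalization. This is not how Silk or LIMES work; both support far richer comparison rules. The labels are illustrative assumptions rather than data fetched from the datasets, and every candidate would still go through the validation step.

```python
import unicodedata

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalize(label):
    """Lower-case a label and strip accents so 'Málaga' matches 'malaga'."""
    decomposed = unicodedata.normalize("NFKD", label)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_accents.casefold().strip()


def candidate_links(source, target):
    """Pair URIs whose normalized labels are equal; results are only candidates."""
    by_label = {}
    for uri, label in target.items():
        by_label.setdefault(normalize(label), []).append(uri)
    return [
        (uri, OWL_SAME_AS, match)
        for uri, label in source.items()
        for match in by_label.get(normalize(label), [])
    ]


# Labels below are illustrative assumptions, not data fetched from the datasets.
source = {"http://geo.linkeddata.es/resource/Provincia/Madrid": "Madrid"}
target = {
    "http://dbpedia.org/resource/Madrid": "madrid",
    "http://dbpedia.org/resource/Barcelona": "Barcelona",
}
print(candidate_links(source, target))
```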

Enable Effective Discovery: Register the dataset into CKAN Registry

• Add the dataset to CKAN, the open registry of data and content packages

• Minimum information
  • Name, unique ID for your data set on CKAN
  • Title, full name of your data set
  • URL, link to the data set home page

http://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/DataSets/CKANmetainformation
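Before registering, the minimum information can be checked locally. The sketch below only validates a plain dictionary and does not call any CKAN API. The field keys mirror the three items listed on the slide, and the record values are assumptions, not the registered ones.

```python
REQUIRED_FIELDS = ("name", "title", "url")


def missing_fields(record):
    """Return the required fields that are absent or empty in a dataset record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Illustrative record; the name and title are assumptions.
record = {
    "name": "geolinkeddata",
    "title": "GeoLinkedData",
    "url": "http://geo.linkeddata.es",
}
print(missing_fields(record))
print(missing_fields({"name": "geolinkeddata"}))
```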

Enable Effective Discovery: Sitemap protocol

• Used by web crawlers
  • Efficiently find all your content & discover what has been updated
  http://sitemaps.org/

A sitemap file contains information regarding one or more URLs on your Web site. The information that is stored there helps search engines better spider your website.

Enable Effective Discovery: Sindice, the best RDF search engine

Enable Effective Discovery: sitemap4rdf

• Simple command line tool
  • Sends a SPARQL query to list all URIs
  • Generates sitemap

sitemap4rdf http://yoursite/sparql http://yoursite/resource/

• run sitemap4rdf specifying the SPARQL endpoint and the prefix of the URLs to include in the Sitemap

Example:

sitemap4rdf http://geo.linkeddata.es/sparql http://geo.linkeddata.es/

http://lab.linkeddata.deri.ie/2010/sitemap4rdf/
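To make the output of such a tool concrete, the sketch below builds a sitemap document from a list of URIs that is assumed to have been obtained from the SPARQL endpoint already. It is not the sitemap4rdf implementation, and `build_sitemap` is a hypothetical name. Only the sitemap namespace and the `urlset`/`url`/`loc` elements come from the Sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(uris, prefix):
    """Build a sitemap document for the URIs that start with the given prefix."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for uri in sorted(set(uris)):
        if uri.startswith(prefix):
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = uri
    return ET.tostring(urlset, encoding="unicode")


# URIs assumed to come from a query against the SPARQL endpoint.
uris = [
    "http://geo.linkeddata.es/resource/Provincia/Madrid",
    "http://geo.linkeddata.es/ontology/Lago",
    "http://dbpedia.org/resource/Madrid",
]
print(build_sitemap(uris, "http://geo.linkeddata.es/"))
```

The prefix filter plays the role of the second command line argument: URIs from other sites, such as link targets in DBpedia, stay out of the sitemap.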

Enable Effective Discovery: Submit the sitemap location - Sindice

• http://sindice.com/main/submit

Enable Effective Discovery: Submit the sitemap location - Google

• https://www.google.com/webmasters/tools/

DEMO: http://geo.linkeddata.es/browser

[Screenshots: Provinces; Capital of Province; Provinces – Industry Production Index; Beaches]

DEMO: http://webenemasuno.linkeddata.es/

[Screenshots: Trips; Guide Locations; Guide]

Future Work
