Linked Data and Language
A. Gómez-Pérez
Universidad Politécnica de Madrid
Acknowledgements: Jorge Gracia, Victor Rodríguez Doncel, LIDER Consortium members
Linked Data and Language © Asunción Gómez-Pérez. 7th International Conference on Corpus Linguistics, 5-7 March 2015, Valladolid, Spain
License
• This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License
• You are free:
- to Share: to copy, distribute and transmit the work
- to Remix: to adapt the work
• Under the following conditions:
- Attribution: you must attribute the work by inserting
  • “[source http://www.oeg-upm.net/]” at the footer of each reused slide
  • a credits slide stating: “Linked Data and Language” by A. Gómez-Pérez
- Non-commercial
- Share-Alike
Table of Contents
• Linked Data
• Linguistic Linked Data
• Multilingual Linked Data
• The LIDER project
www.lider-project.eu
Linked Data Foundations
LD domains in August 2014
• Media
• Geographic
• Life Sciences
• Publications
• Government
• Social Networking
• Cross-domain
• User-Generated Content
• Linguistics
Foundations
• Unique identifiers: a URI identifies or names a resource
- Cervantes: http://datos.bne.es/resource/XX1718747
- El Quijote: http://datos.bne.es/resource/XX3383563
• RDF(S) models
- Data: Cervantes “is creator of” El Quijote
- Model: Person “is creator of” Work
- Cervantes “is a” Person (http://iflastandards.info/ns/fr/frbr/frbrer/C1005)
- El Quijote “is a” Work (http://iflastandards.info/ns/fr/frbr/frbrer/C1001)
• Equivalence links to other datasets (“same as”)
- Cervantes “same as” http://viaf.org/viaf/17220427
- Cervantes “same as” http://dbpedia.org/resource/Miguel_de_Cervantes
• Data navigation
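The statements on this slide could be written down in Turtle roughly as follows. This is only a sketch: the resource and class URIs are the ones shown on the slide, but the property `ex:isCreatorOf` and its namespace are hypothetical placeholders, since the slide gives the property only by its label “is creator of”.

```turtle
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix frbrer: <http://iflastandards.info/ns/fr/frbr/frbrer/> .
@prefix bne:    <http://datos.bne.es/resource/> .
@prefix ex:     <http://example.org/vocab#> .   # hypothetical namespace

# Cervantes: typed as a Person, creator of El Quijote,
# and linked to equivalent resources in other datasets.
bne:XX1718747 a frbrer:C1005 ;
    ex:isCreatorOf bne:XX3383563 ;              # placeholder property
    owl:sameAs <http://viaf.org/viaf/17220427> ,
               <http://dbpedia.org/resource/Miguel_de_Cervantes> .

# El Quijote: typed as a Work.
bne:XX3383563 a frbrer:C1001 .
```

Following the `owl:sameAs` links is what the slide calls data navigation: a client that starts at the BNE URI can continue to VIAF or DBpedia for further statements about the same person.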
The model (Ontology) and the data
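[Diagram: an ontology layer and a data layer that instantiates it]
Ontology:
- Classes: Work, Person, Language, Year, Library, Place
- Properties: is creator of, has subject, translation, publication date, located in, birthPlace
Data:
- Instances: El Quijote, Cervantes, Vida de Cervantes, Catalán, 1960, BNE, Alcalá de Henares
- For example: Cervantes “is creator of” El Quijote; Cervantes “birthPlace” Alcalá de Henares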
Linked data is full of URIs
[Diagram: the same ontology and data, with classes and instances identified by URIs]
Ontology (labels shown: Work, Person, Language, Library, Year):
- http://iflastandards.info/ns/fr/frbr/frbrer/C1001
- http://iflastandards.info/ns/fr/frbr/frbrer/C1002
- http://iflastandards.info/ns/fr/frbr/frbrer/C1005
- http://xmlns.com/foaf/0.1/Organization
- http://geo.linkeddata.es/ontology/Municipio
- http://datos.bne.es/#
- Properties: is creator of, has subject, translation, publication date, located in, birthPlace
Data:
- http://datos.bne.es/resource/XX3383563 (Don Quijote de la Mancha)
- http://datos.bne.es/resource/XX1718747 (Cervantes Saavedra, Miguel de), linked by “is author of”
- http://datos.bne.es/resource/XX1924295
- http://datos.bne.es/resource/bimo0002045496
- Vida de Miguel de Cervantes Saavedra (has subject)
- Catalán (translation), 1960 (publication date), BNE (located in)
- http://geo.linkeddata.es/resource/Alcalá de Henares (birthPlace)
Linked Data without ontologies
[Diagram: five URIs, all labelled “Cervantes”, connected by “same as” links]
- http://www.server1.org/resource/Cervantes
- http://www.server2.es/resource/Cervantes
- http://datos.bne.es/resource/XX1718747
- http://d-nb.info/gnd/11851993X
- http://geo.linkeddata.es/page/resource/Municipio/Cervantes
Property values that end up attached to “Cervantes”:
- Phone: 914 296 093
- Size: 276,4 km²
- #People: 1547
- Date of Birth: 1547
- Author: D. Quijote
Linked Data and ontologies
[Diagram: the same five URIs, now each typed with rdf:type]
- http://www.server1.org/resource/Cervantes
- http://www.server2.es/resource/Cervantes
- http://datos.bne.es/resource/XX1718747
- http://d-nb.info/gnd/11851993X
- http://geo.linkeddata.es/page/resource/Municipio/Cervantes
Types (rdf:type): Person, Restaurant, Street, Municipality
A single “same as” link remains.
Cervantes (Person):
- Date of Birth: 1547
- Author: D. Quijote
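A minimal Turtle sketch of the idea on this slide: once each URI carries an `rdf:type`, resources that share the name “Cervantes” can be told apart. The `ex:` classes are hypothetical stand-ins for real ontology classes, and the assignment of Restaurant and Street to the two `server` URIs is an assumption, because the slide text does not say which URI has which type.

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix ex:  <http://example.org/vocab#> .   # hypothetical vocabulary

# Two descriptions of the writer, explicitly declared equivalent.
<http://datos.bne.es/resource/XX1718747> a ex:Person ;
    owl:sameAs <http://d-nb.info/gnd/11851993X> .
<http://d-nb.info/gnd/11851993X> a ex:Person .

# A place that happens to share the name.
<http://geo.linkeddata.es/page/resource/Municipio/Cervantes> a ex:Municipality .

# Assumed assignment of the remaining types (not readable from the slide).
<http://www.server1.org/resource/Cervantes> a ex:Restaurant .
<http://www.server2.es/resource/Cervantes>  a ex:Street .
```

With types in place, a statement such as a date of birth can be checked against the class of its subject, and an equivalence link is asserted only between resources of the same kind.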
Linked Data allows uniform access
1. Agree on vocabularies for describing metadata and domain data
2. Unified and standardized language for describing resources (RDF(S))
3. Unified and standardized query language (SPARQL)
4. Standardized non-proprietary APIs
5. Links to other resources
Uses of Linked Data
1. Programmers build applications that make queries in SPARQL and get RDF
2. Citizens/users access LD through a user interface (they do not see RDF)
3. Machine-to-machine data exchange and semantic interoperability in RDF
Example data sources: Culture (@BNE), Geographical (@IGN), Meteorological (@AEMET), Smart Cities
The new Linked Data Ecosystem
• Culture (@BNE)
• Geographical (@IGN)
• Meteorological (@AEMET)
• Smart Cities
Linguistic Linked Licensed Data
The problem
Finding and reusing language resources (LRs) in third-party applications is manual and time-consuming
Lack of interoperability of language resources
• Ecosystem of
- Open and closed resources
- Silos of LRs
- Complementary resources
  • Lexicons, corpora, dictionaries, grammars, …
- Heterogeneous formats
  • E.g., for lexicons: Lexinfo, LMF, LIR, Lemon, …
- Several repositories with different metadata and schemas
- Many APIs and services for querying
Discovering and reusing LRs in third-party applications is hard, manual and time-consuming
Use cases for LR Discovery
• Language metadata content
- Give me bilingual dictionaries in Spanish and German that account for grammatical number and gender, with Creative Commons licenses
• Language resources content
- Give me all occurrences in corpora of the token “bank” disambiguated as the WordNet synset http://wordnet-rdf.princeton.edu/wn31/108437235-n
• Language services
- Give me all RESTful services that can extract terms from text in Latvian
http://rae.es
Motivation
*Picture attribution: http://commons.wikimedia.org/wiki/User:Gugerell
An example: “Red” (computer network)
Sources:
- http://es.wiktionary.org
- http://rae.es
- http://www.wikilengua.org/index.php/Terminesp:red
- http://es.wikipedia.org
- http://www.wordreference.com/sinonimos/
“Red”
- Etymology: from Latin “rete”
- Gender: “f”
- Definition: “Conjunto de ordenadores o de equipos informáticos conectados entre sí….” (a set of computers or computing devices connected to each other)
“Red”
- Synonyms: “sistema”, “malla”, “distribución”
“Red”
- Norm: UNE 21302-131
- English: network
- German: Netzwerk
“Red”
- Pronunciation: [red]
- Grammar category: feminine noun (“sustantivo femenino”)
- Singular: “red”
- Plural: “redes”
“Red_de_computadores”
- Category: “redes informáticas” (computer networks)
- Image
Complementary but not connected
LD allows linguistic data integration
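[Diagram: the separate “Red” descriptions linked into one graph]
- Red: form with written form “red” and gender feminine
- Red: form with number singular and phonetic form [RED]; form with number plural and phonetic form [REDES]
- Red: sense with written form “red”, equivalent to the sense with written form “malla”
- Red: sense with written form “red”, translation (es - en) to the sense with written form “network”
- Red: image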
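A hedged sketch of how the integrated graph might look in Turtle, using the lemon model and LexInfo categories. The `ex:` and `exv:` namespaces and the `exv:translation` property are hypothetical; the lemon and LexInfo namespace IRIs are given as commonly published and should be checked against the versions actually in use.

```turtle
@prefix lemon:   <http://lemon-model.net/lemon#> .
@prefix lexinfo: <http://www.lexinfo.net/ontology/2.0/lexinfo#> .
@prefix ex:      <http://example.org/lexicon/> .   # hypothetical entries
@prefix exv:     <http://example.org/vocab#> .     # hypothetical property

# Spanish entry "red": gender, singular and plural forms, one sense.
ex:red a lemon:LexicalEntry ;
    lexinfo:gender lexinfo:feminine ;
    lemon:canonicalForm [ lemon:writtenRep "red"@es ;
                          lexinfo:number lexinfo:singular ] ;
    lemon:otherForm     [ lemon:writtenRep "redes"@es ;
                          lexinfo:number lexinfo:plural ] ;
    lemon:sense ex:red_sense .

# Synonym from another source.
ex:malla a lemon:LexicalEntry ;
    lemon:canonicalForm [ lemon:writtenRep "malla"@es ] ;
    lemon:sense ex:malla_sense .

# English translation from a terminology source.
ex:network a lemon:LexicalEntry ;
    lemon:canonicalForm [ lemon:writtenRep "network"@en ] ;
    lemon:sense ex:network_sense .

# Links between senses connect what used to be separate resources.
ex:red_sense lemon:equivalent ex:malla_sense ;
    exv:translation ex:network_sense .             # placeholder property
```

The links are made at the level of senses rather than entries, so that “red” as a computer network can be related to “network” without claiming the same for its other meanings.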
Linguistic Linked Licensed Data (3LD)
• Linguistic: language resources such as
- Lexica
- Corpora
- Dictionaries ...
• Linked: using RDF and standard data models (vocabularies), such as NIF (NLP Interchange Format), for
- Lexica
- Corpora
• Licensed: published along with a machine-readable license, such as ODRL (Open Digital Rights Language)
www.lider-project.eu
Linguistic Linked Licensed Data
• Subset of LOD
• Linguistic domain
• Open license
• Resources in RDF
• Interconnected with other LD resources
Requirements:
• Keep track of the license (open or closed) information
• Keep track of the provenance of the resource
• Keep track of the use of the resource
www.lider-project.eu
Linguistic Linked Licensed Data Evolution
[Figure: snapshots of the cloud for Jan. 2013, Sept. 2013, 2014 and Sept. 2014]
• 65 Resources
• 82 Links
• Unbalanced: mostly lexicons
• Centralized: 30% of links to/from DBpedia
www.lider-project.eu
Linguistic Linked Licensed Data @ Nov 2014
LLOD Cloud in November 2014
• 103 Resources (+58%)
• 165 Links (+101%)
• More balanced (14 Corpora, +367%)
• Less centralized: BabelNet, LexVo and LexInfo are new hubs
Criteria for inclusion:
• Resolvable: URLs that resolve
• RDF: resolve to RDF
• 1000 Triples: self-explaining
• Links: to one resource from the cloud, or 50 other links
• Crawlable: get the whole resource by crawling
• Linguistic: data must be a language resource
• Registered: at CKAN
www.lider-project.eu
Linguistic Linked Licensed Data
1. Agree on vocabularies for describing
• LR metadata and content (Lemon-Ontolex, NIF, …)
2. Unified and standardized language for describing resources (RDF(S))
3. Unified and standardized query language (SPARQL)
4. Standardized non-proprietary APIs
5. Links to other resources
BabelNet
Linguistic Linked Licensed Data
How do we represent license information?
…
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix odrl: <http://www.w3.org/ns/odrl/2/> .
<http://example.com/nc-nored-nd-ff> a odrl:Policy ;
    rdfs:label "NC-NoReD-ND-FF" ;
    rdfs:comment "MetaShare NonCommercial, No Redistribution, No Derivatives, for a fee. Perpetual, worldwide, allowing no redistribution of the original. "@en ;
    rdfs:seeAlso <http://www.meta-net.eu/meta-share....pdf> ;
    # Permitted: reproduction, subject to the duty to pay a fee
    odrl:permission [ a odrl:Permission ;
        odrl:action odrl:reproduce ;
        odrl:duty [ a odrl:Duty ;
            odrl:action odrl:pay ;
            odrl:target "XXX EUR"
        ]
    ] ;
    # Prohibited: commercial use, redistribution and derivative works
    odrl:prohibition [ a odrl:Prohibition ;
        odrl:action odrl:commercialize, odrl:distribute, odrl:derive
    ] .
META-SHARE NonCommercial NoRedistribution NoDerivatives For-a-Fee Licence
See demo at http://conditional.linkeddata.es
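To make a resource “published along with a machine-readable license”, its metadata can point at a policy like the one above. The following is a minimal sketch under assumptions: the dataset IRI and title are hypothetical, and `dct:license` from Dublin Core Terms is one common way to attach a license, not necessarily the one used in the demo.

```turtle
@prefix dct:  <http://purl.org/dc/terms/> .
@prefix dcat: <http://www.w3.org/ns/dcat#> .

# Hypothetical dataset description; the policy IRI is the one defined above.
<http://example.com/dataset/my-lexicon> a dcat:Dataset ;
    dct:title "Example lexicon"@en ;
    dct:license <http://example.com/nc-nored-nd-ff> .
```

A client that dereferences the license IRI gets the ODRL policy and can check, before using the data, which actions are permitted and under which duties.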
Linked Data is multilingual
LOD is dominated by the English Language
Some questions:
1. Distribution of natural languages across RDF datasets?
2. Usage of language tags to indicate the natural language of RDF literals?
- Distribution of usage of language tags
- Distribution of literals tagged as English vs other languages
- Distribution of literals tagged in languages other than English
[Figure: snapshots for 2007, 2009 and 2014]
LOD is dominated by the English language
[Charts comparing Jan 2014 and Jan 2015]
1. Number of monolingual and multilingual datasets
2. Current usage of language tagging capabilities in RDF (RDF literals with language tag, RDF literals without language tag); values shown: 7% and 93%; 9% and 91%
3. English tags versus other languages' tags (RDF literals in English, RDF literals other than English); values shown: 67% and 33%; 71% and 29%
4. Evolution of the top-10 non-English languages (es, de, zh, fr, it, ru, pl, nl, pt, sv), Jan 2014 vs Jan 2015, axis from 0 to 70,000
Multilingualism and Linked Data
How to represent multilingual Linked Data?
• Traditional annotation properties for most cases
dbpedia:Miguel_de_Cervantes
    rdfs:label "Miguel de Cervantes"@es ,
               "ミゲル・デ・セルバンテス"@ja ,
               "미겔 데 세르반테스"@ko .
• Richer models for more demanding applications: association of the vocabulary to an external lexicon model
# LEMON
isbd:T1001 lemon:isReferenceOf [ lemon:isSenseOf :cartographic ] .
:cartographic a lemon:LexicalEntry ;
    lemon:form [ lemon:writtenRep "cartográfico"@es ;
                 isocat:grammaticalGender isocat:masculine ] ;
    lemon:form [ lemon:writtenRep "cartográfica"@es ;
                 isocat:grammaticalGender isocat:feminine ] .
isocat:grammaticalGender rdfs:subPropertyOf lemon:property .
The LIDER project
LIDER Vision
[Diagram: from producers to consumers]
• Producers
• Language resources (lexicon, corpora, ...); some of them are open, others are closed
• Linguistic LOD generation (metadata and content)
• Metadata as LD
• Language resources as LD
• LOD-aware NLP services
• Multimedia and multilingual content
• Metadata generation
• Content analytics
• Consumers
The use of LOD for NLP in Content Analytics
• Which extensions to the LOD are needed to support a new generation of large-scale content analytics applications that will overcome language barriers?
- Identification of key NLP tasks that require background knowledge
- Specification of a new generation of NLP services that are LOD-aware and can exploit LOD
- Licensed linguistic linked data (3LD)
[Diagram: LIDER activities]
• Industry use cases
• Technical activities
• Community building and networking
W3C Community Groups:
- LD4LT W3C-CG: http://www.w3.org/community/ld4lt/
- BP-MLOD W3C-CG: http://www.w3.org/community/bpmlod/
- OntoLex W3C-CG: http://www.w3.org/community/ontolex/
Outputs listed on the diagram:
- 1. Surveys, 2. Requirements
- 1. Use cases, 2. Industry board
- 1. Vocabularies, 2. Guidelines, 3. Roadmap, 4. Reference Architecture
www.lider-project.eu
Key Use Cases and Requirements
www.lider-project.eu
LIDER Reference Architecture proposed to LD4LT
• LLD Publishing
• Vocabularies
• Hosting
• Metadata (author, language, structure, …)
• Licensing (terms and conditions of use)
• Provenance (origin and history of data)
• Multilingual Data: Terminologies; (Multimodal) Corpora; Bilingual Dictionaries; Parallel Data; Translation Memories; Ontologies; Glossaries; Classification Schemas
• LLD-aware Services
• Scalability
• Streaming
• Interoperability
• LLD Linking
• Service Composition
• Discovery (indexing and aggregation services)
• Benchmarking & Validation of datasets and services
• Certification
• Guidelines and Standardization
Linguistic Linked Licensed Data life cycle
1. Clear methodologies, methods and tools for monolingual LD generation and publication
2. Initial guidelines for:
- Including the language dimension
- Multilingual LD
- Linguistic LD
Villazón-Terrazas, B.; Vilches, L.; Corcho, O.; Gómez-Pérez, A. Methodological Guidelines for Publishing Government Linked Data. In D. Wood, ed. Linking Government Data. Springer, pp. 27-49. 2011
[Diagram: life cycle activities: Specification, Modelling, Generation, Publication, Exploitation, Linking]
Best practices and guidelines (BPMLOD @ W3C)
1. Best practices for Multilingual Linked Data Publication (BPMLOD @ W3C)
- Practices for Naming (URIs)
- Practices for Dereferencing
- Practices for Textual Information
- Practices for Linking
- Practices for Language Identification
2. Guidelines for Linguistic Linked Licensed Data
- Wordnets
- Multilingual lexicographic resources
- Bilingual dictionaries
- Terminologies in TBX
- NIF-based NLP Web services
How many linguistic resources are exposed in RDF?
www.lider-project.eu
Models for metadata and data
1. Metadata (LD4LT@W3C)
- Definition of the metadata OWL ontology @ LD4LT W3C group
  • UPF’s Metashare model as starting point
  • Expanded with data and process PROVENANCE and LICENSE modules
  • Backwards compatible with MS and LREMap models
  • In agreement with members of LD4LT W3C group
- Exposure of Metashare, Clarin, LREMap, datahub metadata as LD
- Metadata Linguistic Observatory (linghub): http://linghub.lider-project.eu/
Relation between W3C groups
[Diagram: four groups and the documents they produce]
Groups:
- Linked Data for Language Technologies (LD4LT)
- Best practices (BPMLOD)
- Ontology lexica (Ontolex)
- Data on the Web Best Practices
Documents:
- Use Cases
- BP for LD in LT
- lemon specification
- BP for using lemon
- BP for Multilingual Data on the Web
- BP for Data on the Web
www.lider-project.eu
http://datathon.lider-project.eu/
Follow Linguistic LD for Language Technologies
www.lider-project.eu
twitter.com/multilingweb
Hashtag: #LiderEU
Join the community
www.w3c.org/community/ld4lt
http://datathon.lider-project.eu/