Building the Multilingual Web of Data
Integrating NLP with Linked Data and RDF using the NLP Interchange Format
7/10/14 2Building the Multilingual Web of Data – ISWC tutorial
Outline
1. Introduction
2. NIF Basics
3. NIF corpora
4. NIF tools & services
Introduction
LOD-aware NLP services
Not only data, but also LOD-aware services using:
Lexica and dictionaries in RDF (lemon model)
Training data for NLP in RDF (NIF model)
Service metadata descriptions in RDF
Combination with real-world facts (e.g. DBpedia or GeoNames)
Long Term Goal
→ index of tools and data
→ users can easily produce ready-made, preconfigured NLP services and pipelines
→ freemium / pay-per-use business models
NLP2RDF project
http://nlp2rdf.org
Realize this long-term goal
Sustainably maintain and consolidate results from short-term projects
Bootstrap the Eco-System
NLP2RDF: NLP Interchange Format
The NLP Interchange Format (NIF) is an RDF/OWL-based format that aims to achieve interoperability between Natural Language Processing (NLP) tools, language resources and annotations.
In a nutshell: Way to mint URIs for arbitrary strings and content of documents on the web
Logical formalisation of strings and annotations via an ontology
Quick and easy format
Builds on existing standards, e.g. RDF, LAF/GrAF, RFC 5147
Reusability of RDF tools and implementation
Decreases development cost for integration
NIF: Motivation
Developer's nightmare: all tools belong to a similar class of NLP tools
They fulfill similar functions but often form non-interoperable ecosystems
All have:
Heterogeneous output formats (JSON, XML, …)
Heterogeneous API parameters
Heterogeneous ways of annotating text:
Some remove HTML internally, so their offsets are not usable on the original document
Some use byte offsets instead of character offsets
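To make the last point concrete, here is a minimal Python sketch of how byte offsets and character offsets diverge as soon as the text contains non-ASCII characters. The sample sentence and variable names are made up for illustration, and UTF-8 is assumed as the byte encoding.

```python
# Illustrative text with two non-ASCII characters before/inside the target.
text = "Café Zürich is nice"
target = "Zürich"

# Character offsets: positions counted in Unicode code points.
char_begin = text.index(target)
char_end = char_begin + len(target)

# Byte offsets: positions counted in UTF-8 bytes (assumed encoding).
data = text.encode("utf-8")
target_bytes = target.encode("utf-8")
byte_begin = data.index(target_bytes)
byte_end = byte_begin + len(target_bytes)

print("char offsets:", (char_begin, char_end))   # (5, 11)
print("byte offsets:", (byte_begin, byte_end))   # (6, 13)

# Using byte offsets as if they were character offsets selects the wrong span.
print(repr(text[char_begin:char_end]))  # 'Zürich'
print(repr(text[byte_begin:byte_end]))  # 'ürich i'
```

A consumer that mixes the two conventions silently annotates the wrong substring, which is why a shared offset definition matters for interoperability.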
NIF Basics
NIF architecture
Annotations
Example: Tripadvisor Corpus
Corpus contains hotel reviews and review metadata
1760 semi-structured files
In NIF: every file's content becomes one nif:Context resource
Strings in the file can be addressed via URIs
Context
Address the content of a document
nif:isString contains the document content
Note that in NIF the document is not the same thing as the content of the document
Two different documents can have the same content, so they must not share the same URI
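A minimal Python sketch of the context idea, written as plain string formatting so it runs without an RDF library. The example.org document URIs and the helper names turtle_literal and context_turtle are hypothetical. Writing the context URI as #char=0,<length> is one possible convention assumed here. The namespace is the NIF 2.0 core ontology namespace.

```python
NIF = "http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#"


def turtle_literal(text: str) -> str:
    """Quote a Python string as a short Turtle string literal."""
    escaped = (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'


def context_turtle(document_uri: str, content: str) -> str:
    """Describe the content of one document as a nif:Context resource."""
    # Assumption: the context URI is the document URI plus a fragment
    # covering the whole content, from offset 0 to the content length.
    context_uri = f"{document_uri}#char=0,{len(content)}"
    return (
        f"<{context_uri}>\n"
        f"    a <{NIF}Context> ;\n"
        f"    <{NIF}isString> {turtle_literal(content)} .\n"
    )


# Two different (hypothetical) documents with identical content.
review = "Great hotel, friendly staff."
first = context_turtle("http://example.org/reviews/1.txt", review)
second = context_turtle("http://example.org/reviews/2.txt", review)

print(first)
print(second)
```

Both snippets carry the same nif:isString value, but each context keeps its own URI because it is derived from its own document.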
other Strings
Address arbitrary strings in the document
Use string offsets relative to the context to address strings
nif:anchorOf contains the string
Additional properties can now be added to the string
a tripadvisor:Review ;
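The offset-based addressing can be sketched in Python as follows. This is an illustration under assumptions, not the tutorial's own code: the example.org URI, the helper name substring_resource, and the ex:sentiment property are hypothetical, and the resource is modelled as a plain dictionary instead of RDF triples.

```python
def substring_resource(document_uri: str, content: str, target: str) -> dict:
    """Build the properties of a string resource for the first match of target."""
    begin = content.index(target)  # raises ValueError if target is absent
    end = begin + len(target)      # end offset is exclusive
    return {
        "uri": f"{document_uri}#char={begin},{end}",
        "nif:anchorOf": content[begin:end],
        "nif:beginIndex": begin,
        "nif:endIndex": end,
        # Assumed convention: the context spans the whole content.
        "nif:referenceContext": f"{document_uri}#char=0,{len(content)}",
    }


content = "Great hotel, friendly staff."
resource = substring_resource("http://example.org/reviews/1.txt", content, "friendly")

# Once the string has a URI, further properties can be attached to it.
resource["ex:sentiment"] = "positive"  # hypothetical annotation property

for key, value in resource.items():
    print(key, "=", value)
```

Because the offsets are relative to the context, the string can be recovered from the context alone: content[13:21] yields "friendly" again.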
Words and Phrases
Sentiment values, POS tags and other annotations can now be added to words and phrases
String counting
Ontology
Demo: http://demo.nlp2rdf.org/
NIF corpora
NIF corpora overview
Name                        Size (triples)
Wikilinks                   500M
News-100                    13K
RSS-500                     10K
Reuters-128                 7K
Spotlight                   3K
KORE50                      2K
Brown                       500K
Wikipedia abstract corpus   in progress
Tag "nif" on datahub: http://datahub.io/dataset?tags=nif&q=nif
Wikilinks corpus – Overview
Large-scale coreference resolution corpus by UMass / Google
Over 10M crawled websites that contain text (named entities) linking to Wikipedia
Converted to the NIF format and published as LOD: http://wiki-link.nlp2rdf.org/
Additional processing done to extract relevant text snippets and to add DBpedia ontology classes and coarse-grained classes (entity types)
Over 500 million triples, 79GB LOD, 12GB gzipped dumps
Over 30 million links to over 3 million entities
Brown corpus – Overview
Converted to the NIF format and published as LOD: http://brown.nlp2rdf.org
Corpus showcases handling of POS tags in NIF
POS tags: mapped via OLiA to predefined categories
<#char=643,647>
    a nif:String , nif:Word , nif:RFC5147String ;
    nif:anchorOf "Jury"^^xsd:string ;
    nif:referenceContext <#char=0,> ;
    nif:oliaLink brown:NN ;
    nif:sentence <#char=619,777> ;
    nif:beginIndex "643"^^xsd:nonNegativeInteger ;
    nif:endIndex "647"^^xsd:nonNegativeInteger .
Categories can be used to query all resources of a certain POS regardless of the tagset used in the corpus
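The offsets in the snippet above can be checked mechanically. The following Python sketch parses a #char=begin,end fragment and verifies that it agrees with the length of the nif:anchorOf value. The helper name parse_char_fragment and the regular expression are assumptions for illustration, and only the simple begin,end form used in this example is handled.

```python
import re

# Matches fragments such as "#char=643,647" or the open-ended "#char=0,".
CHAR_FRAGMENT = re.compile(r"#char=(\d+),(\d*)$")


def parse_char_fragment(uri: str):
    """Return (begin, end) from a '#char=b,e' fragment; end is None if omitted."""
    match = CHAR_FRAGMENT.search(uri)
    if match is None:
        raise ValueError(f"no char fragment in {uri!r}")
    begin = int(match.group(1))
    end = int(match.group(2)) if match.group(2) else None
    return begin, end


# Values taken from the Brown corpus snippet.
begin, end = parse_char_fragment("#char=643,647")
anchor = "Jury"
print(begin, end, end - begin == len(anchor))  # 643 647 True

# The reference context uses an open end, meaning "to the end of the text".
print(parse_char_fragment("#char=0,"))  # (0, None)
```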
Brown corpus – POS tags
Querying all nouns using the OLiA mapping
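The query shown on the slide is not part of this transcript. As a stand-in, the following pure-Python sketch illustrates the idea behind it: once tagset-specific tags are mapped to shared categories, one query finds all nouns regardless of the tagset. The mapping table, the category labels and the sample tokens are hand-written for illustration and are not taken from the OLiA linking models.

```python
# Hypothetical, hand-written mapping from (tagset, tag) to a shared category.
TAG_TO_CATEGORY = {
    ("brown", "NN"): "olia:Noun",
    ("brown", "NP"): "olia:Noun",
    ("brown", "VBD"): "olia:Verb",
    ("penn", "NNP"): "olia:Noun",
    ("penn", "VBD"): "olia:Verb",
}

# Sample annotated words: (word, tagset, tag). Illustrative data only.
tokens = [
    ("Jury", "brown", "NN"),
    ("said", "brown", "VBD"),
    ("London", "penn", "NNP"),
    ("visited", "penn", "VBD"),
]


def words_with_category(annotated, category):
    """Return all words whose tag maps to the given shared category."""
    return [
        word
        for word, tagset, tag in annotated
        if TAG_TO_CATEGORY.get((tagset, tag)) == category
    ]


nouns = words_with_category(tokens, "olia:Noun")
print(nouns)  # ['Jury', 'London']
```

The caller never mentions brown:NN or the Penn tag NNP; the category is the only thing the query depends on.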
NIF tools & services
NIF tools
Available NIF tools:
Stanford CoreNLP
OpenNLP
RDFaCE
Validator
CoNLL converter
...
NIF tools: DBpedia Spotlight
https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki
NIF tools: Stanford Core
Done!
Thank you very much!
http://nlp2rdf.org
http://github.com/NLP2RDF