What You Can Make out of Linked Data
Tutorial given at the 38th Internationalization & Unicode Conference.
Marco Fossati, Steven R. Loomis
Let's meet the presenters first!
Marco Fossati
Natural Language Processing Advocate
Recommender Systems Aficionado
Open Data Apologist
Steven R. Loomis
IBM
Chair, Unicode ULI-TC
Projects: ICU, CLDR, ULI
Outline
1. Linked Open Data 101
2. DBpedia
3. The ULI use case
Warning! Highly interactive tutorial
Let's get started!
Linked Open Data 101
The Big Picture
What is data?
Data is how we express facts in a reusable form
Why data?
The ingredients for...
...Information
Knowledge
Wisdom
OK it's data, what else?
Big: billions of facts, e.g. “Santa Clara is a city”
Linked: richly structured
Open: open licenses
Facts, not words
A fact is...
Human mind: an assertion about the world
Natural language: subject + predicate + object
Machine: a triple
Human mind
Perceiving relationships between entities
Natural language
"Elvis Presley sings Jailhouse Rock"
Machine
The triple
Subject: Elvis Presley
Predicate: sings
Object: Jailhouse Rock
The graph
Rich structure made of triples
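A minimal sketch of how triples add up to a graph, assuming plain Python tuples stand in for RDF terms. Only the "sings" fact comes from the slides; the other two triples and the names `triples` and `build_graph` are added here for illustration.

```python
from collections import defaultdict

# Toy facts as (subject, predicate, object) triples. Only the first one
# comes from the slides; the others are added here for illustration.
triples = [
    ("Elvis_Presley", "sings", "Jailhouse_Rock"),
    ("Elvis_Presley", "bornIn", "Tupelo"),
    ("Jailhouse_Rock", "type", "Song"),
]

def build_graph(facts):
    """Index triples by subject: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in facts:
        graph[subject].append((predicate, obj))
    return dict(graph)

graph = build_graph(triples)
for predicate, obj in graph["Elvis_Presley"]:
    print("Elvis_Presley", predicate, obj)
```

Each subject becomes a node and each (predicate, object) pair an outgoing edge, so adding a fact never requires changing a schema.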
From the web of documents...
...to the web of entities
The web of entities
An entity can be...
Identified
Described through relationships
Understood both by humans and machines
Towards a WWW of entities
Identify via HTTP URIs
http://dbpedia.org/resource/Elvis_Presley
Describe via RDF statements
:Elvis_Presley :sings :Jailhouse_Rock
Understand via
HTML for humans
RDF for machines
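A small sketch of the "identify" and "describe" steps, assuming the empty prefix `:` stands for the DBpedia resource namespace. The `:sings` predicate is the slide's illustrative name, not a real DBpedia property, and `expand` and `to_ntriples` are hypothetical helpers.

```python
# Assumption: the empty prefix ":" stands for the DBpedia resource namespace.
# ":sings" is the slide's illustrative predicate, not a real DBpedia property.
BASE = "http://dbpedia.org/resource/"

def expand(name):
    """Turn a prefixed name such as ':Elvis_Presley' into a full HTTP URI."""
    if not name.startswith(":"):
        raise ValueError(f"expected a ':'-prefixed name, got {name!r}")
    return BASE + name[1:]

def to_ntriples(subject, predicate, obj):
    """Serialize one triple of prefixed names as an N-Triples line."""
    return " ".join(f"<{expand(term)}>" for term in (subject, predicate, obj)) + " ."

line = to_ntriples(":Elvis_Presley", ":sings", ":Jailhouse_Rock")
print(line)
```

The same URI that identifies the entity in the statement is the one a browser or a crawler can dereference over HTTP.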
Next in line…
DBpedia
Extracting Knowledge from Wikipedia
DBpedia is…
A. …a data extraction framework
from Wikipedia semi-structured data
B. …an open-source community effort
Why?
Wikipedia can’t answer simple questions
“What do Santa Clara and San Francisco have in common?”
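Once facts are stored as triples, a question like this one reduces to a set intersection. The sketch below assumes a handful of hand-written toy triples (not extracted from DBpedia) and hypothetical helper names.

```python
# Toy triples, hand-written for illustration (not extracted from DBpedia).
triples = [
    ("Santa_Clara", "type", "City"),
    ("Santa_Clara", "isPartOf", "California"),
    ("Santa_Clara", "isPartOf", "Santa_Clara_County"),
    ("San_Francisco", "type", "City"),
    ("San_Francisco", "isPartOf", "California"),
]

def facts_about(entity, facts):
    """All (predicate, object) pairs asserted about one entity."""
    return {(p, o) for s, p, o in facts if s == entity}

def in_common(a, b, facts):
    """Facts shared by two entities: the intersection of their descriptions."""
    return facts_about(a, facts) & facts_about(b, facts)

common = in_common("Santa_Clara", "San_Francisco", triples)
print(sorted(common))
```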
Wikipedia can’t answer complex questions
“Which are the black and white movies produced in Italy that have soundtracks which were composed by musicians who were born in a city of the Trentino-Alto-Adige region with less than 40,000 inhabitants?”
The story so far
Project started in 2007
From good ol’ PHP to Java + Scala
Steadily growing community
Internationalization Committee
Freely available on GitHub
Data in Wikipedia
Title
Short abstract
Long abstract
Structure in Wikipedia
Infobox
Images
Structure in Wikipedia
Links
Categories
Structure in Wikipedia
Interlanguage Links
DBpedia Extraction Framework (DEF)
Wikipedia dump → Extractors → RDF graph
Extractors
Article Features
Abstract, redirects, categories, geo-coordinates, interlanguage links, etc.
Infobox
Raw
Mapping-based
Raw Infobox Extractor
:Elvis_Presley
:born “Elvis Aaron Presley…”
:died “August 16, 1977…”
:restingPlace “Graceland…”
:education “L.C. Humes…”
:occupation “Singer…”
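A rough sketch of what raw infobox extraction amounts to: every `key = value` line becomes one triple with the raw key as predicate. The wikitext below is a hypothetical, heavily simplified infobox, and the regex is an assumption for illustration, not the DBpedia framework's parser.

```python
import re

# Hypothetical, heavily simplified infobox wikitext; real infoboxes nest
# templates and links, which this sketch does not handle.
WIKITEXT = """{{Infobox person
| born = Elvis Aaron Presley
| died = August 16, 1977
| occupation = Singer
}}"""

FIELD = re.compile(r"^\|\s*(\w+)\s*=\s*(.+?)\s*$")

def raw_infobox_triples(subject, wikitext):
    """Emit one (subject, :key, literal) triple per 'key = value' line."""
    triples = []
    for line in wikitext.splitlines():
        match = FIELD.match(line)
        if match:
            key, value = match.groups()
            triples.append((subject, ":" + key, value))
    return triples

for triple in raw_infobox_triples(":Elvis_Presley", WIKITEXT):
    print(*triple)
```

Because the keys are copied verbatim, two infobox templates that name the same property differently yield two unrelated predicates.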
The Big Issues
Data is heterogeneous!
Data is multilingual!
Solution
• The DBpedia ontology as a multilingual glue
• Wikipedia-to-ontology Mapping
DBpedia Ontology
Encoding the worldwide encyclopedic knowledge
Mapping-based Extractor
Combines what belongs together
Separates what is different
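One way to picture "combines what belongs together": a lookup table that sends differently named infobox keys to a single ontology property. The table below is hypothetical (template names, keys, and property names are illustrative), and the real mappings are maintained by the community rather than hard-coded like this.

```python
# Hypothetical mapping table: (infobox template, raw key) -> ontology property.
MAPPINGS = {
    ("Infobox person", "born"): "birthDate",
    ("Infobox musical artist", "birth_date"): "birthDate",
    ("Infobox person", "occupation"): "occupation",
}

def mapped_triples(subject, template, infobox):
    """Rename mapped raw keys to ontology properties; drop unmapped keys."""
    triples = []
    for key, value in infobox.items():
        prop = MAPPINGS.get((template, key))
        if prop is not None:
            triples.append((subject, prop, value))
    return triples

a = mapped_triples(":Elvis_Presley", "Infobox person",
                   {"born": "1935-01-08", "signature": "Elvis.svg"})
b = mapped_triples(":Elvis_Presley", "Infobox musical artist",
                   {"birth_date": "1935-01-08"})
print(a == b)
```

Keying the table on the template as well as the raw key is what lets the same key mean different things in different infoboxes.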
DIEF: Mapping-Based Infobox Extractor
Download the latest DBpedia dump at
http://downloads.dbpedia.org/current/
Language chapters
DBpedia in your mother tongue
Active chapters
International (English-based)
Basque, Czech, Dutch, French, German, Greek, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Spanish
Host your own language chapter!
Applications
Get the best out of DBpedia data
Knowledge Graphs
Highly informative summaries in your own language
Question Answering
“Who is Bram Stoker?”
Entity Linking
Detecting Things in Text
Language and Domain-specific Resources for Short Sentences Classification
Automatic Huge Gazetteers
DBpedia Stakeholders
Who is using the knowledge base?
Open Government
Linking Local Data
Digital Libraries
Enriching the Catalogue
Data-driven Journalism
Building Infographics
And now the final part!
The ULI use case
Putting Linked Open Data to work
What’s wrong with Localization Interoperability?
Inconsistent application, implementation, and interpretation of standards
Lack of clear requirements for localization data interchange
Unicode Localization Interoperability
Technical Committee of Unicode
Focus Areas:
1. Translation memory
2. Translation source strings / translations
3. Segmentation rules
ULI: Segmentation
Given: Thanks to Dr. Jones for this effort.
UAX #29 segmentation: |Thanks to Dr.| Jones for this effort.|
English: |Thanks to Dr. Jones for this effort.|
ULI Suppression: Abbreviations
English: Mr., Mrs., Dr., St., …
Spanish: Sr., Dto., Sra., Avda., …
Russian: проф., февр., тел., кв., …
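The effect of a suppression list can be sketched with a deliberately naive splitter: break after sentence-final punctuation unless the token before the break is a known abbreviation. This is a stand-in for illustration only, not the Unicode segmentation rules or ICU's implementation; `split_sentences` is a hypothetical helper.

```python
import re

# Small suppression set taken from the English examples on the slide.
SUPPRESSIONS = {"Mr.", "Mrs.", "Dr.", "St."}

def split_sentences(text, suppressions=frozenset()):
    """Break after '.', '!' or '?' followed by whitespace, unless the token
    ending at the break is in the suppression set."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        # Token that ends at this punctuation mark, e.g. "Dr."
        token = text[start:match.start() + 1].split()[-1]
        if token in suppressions:
            continue
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    if text[start:].strip():
        sentences.append(text[start:].strip())
    return sentences

text = "Thanks to Dr. Jones for this effort."
print(split_sentences(text))
print(split_sentences(text, SUPPRESSIONS))
```

Without suppressions the text breaks after "Dr."; with them it stays one sentence, matching the two segmentations shown on the slide.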
Demo: ULI Breaks
http://demo.icu-project.org/icu-bin/icusegments
DEMO
DBpedia applied to ULI
University of Leipzig: Sebastian Hellmann, Martin Brümmer, Dimitris Kontokostas
Opportunity:
Help segmentation by supplying abbreviation data
Yes!
Evaluation shows that especially for small texts, abbreviations can contribute to precision and recall of segmentation
Success rate
multilingual with over 100 languages
structured data eases extraction
additional data like entity types and categories
Example: Mr.
“MR” disambiguation page links to “Mr.” article.
Ends in full stop, so may be an abbreviation.
The “Mr.” SPARQL query
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?entryExample ?exampleTested ?indegreeRanking
WHERE {
  <http://dbpedia.org/resource/Mr.> rdfs:label ?entryExample ;
                                    rdfs:comment ?exampleTested .
  FILTER ( lang(?entryExample) = lang(?exampleTested) )
  # subselect: count the incoming links
  {
    SELECT (COUNT(?in) AS ?indegreeRanking)
    WHERE { ?in ?p <http://dbpedia.org/resource/Mr.> }
  }
}
LIMIT 100
DEMO
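For readers who want to try a query like the "Mr." one outside the demo, the sketch below builds a SPARQL Protocol GET URL without sending it. It assumes the public DBpedia endpoint at http://dbpedia.org/sparql and uses a shortened label-only query; `sparql_get_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, parse_qs, urlparse

# Assumption: the public DBpedia endpoint; a local mirror would use its own URL.
ENDPOINT = "http://dbpedia.org/sparql"

# Shortened variant of the slide's query: labels only, no in-degree subselect.
QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Mr.> rdfs:label ?label .
} LIMIT 100
"""

def sparql_get_url(endpoint, query):
    """Build a SPARQL Protocol GET URL; the query travels in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})

url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

Fetching that URL with an `Accept: application/sparql-results+json` header should return the bindings as JSON, though the exact response depends on the endpoint.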
Example DBpedia data (English)
St.
Street
<http://en.wikipedia.org/wiki/Street>
<http://schema.org/Place>
<http://dbpedia.org/ontology/Place>
<http://dbpedia.org/ontology/PopulatedPlace>
Example DBpedia data (Russian)
Проф.
Профессор (Professor)
<http://ru.wikipedia.org/wiki/Профессор>
1. Get abbreviation URIs
2. Load DBpedia data into local DB
3. Query the data with SPARQL and output TSV
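The first and last of these steps can be sketched in a few lines, assuming the query results have already been reduced to (label, meaning, language) rows. The "St." and "Проф." rows mirror the slides' examples; the bare "Street" row is added here as a non-abbreviation for contrast, and the function names and TSV column names are hypothetical.

```python
import csv
import io

# Rows shaped like the slides' examples: (label, meaning, language).
# The bare "Street" row is added here as a non-abbreviation for contrast.
ROWS = [
    ("St.", "Street", "en"),
    ("Проф.", "Профессор", "ru"),
    ("Street", "Street", "en"),
]

def looks_like_abbreviation(label):
    """Slide heuristic: a label ending in a full stop may be an abbreviation."""
    return label.endswith(".")

def to_tsv(rows, header=("abbreviation", "meaning", "lang")):
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

candidates = [row for row in ROWS if looks_like_abbreviation(row[0])]
tsv = to_tsv(candidates)
print(tsv, end="")
```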
22859 abbreviations with 78197 meanings in 99 languages
Long Tail
Only 25 languages have more than 100 abbreviations.
Only 7 languages have more than 1000 abbreviations.
Long tail (total abbrevs)
Long tail (total abbrevs) (zoom)
ULI Process
[Diagram: Wikipedia, DBpedia, Extraction, Translation Memory, Comparison, ULI Review]
"Lupa.na.encyklopedii" by Julo - Own work. Licensed under Public domain via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Lupa.na.encyklopedii.jpg#mediaviewer/File:Lupa.na.encyklopedii.jpg
[Diagram: Manual review, CLDR abbrs., CLDR Suppressions]
Comparison with Translation Memory
Entry     % in TM
Corp.     0.0307%
St.       0.0023%
P.T.T.C.  0%
"Trichtermitfilter" by Gmhofmann - Own work. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - https://commons.wikimedia.org/wiki/File:Trichtermitfilter.jpg#mediaviewer/File:Trichtermitfilter.jpg
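The slides do not say how "% in TM" was computed. One plausible reading, sketched below purely as an assumption, is the share of translation-memory segments that contain the candidate as a whitespace-delimited token. The segments are invented toy data and `tm_frequency` is a hypothetical helper.

```python
# Invented toy segments standing in for a translation memory.
SEGMENTS = [
    "Acme Corp. ships today.",
    "Turn left on Main St. now.",
    "No abbreviations here.",
    "Acme Corp. again.",
]

def tm_frequency(entry, segments):
    """Percentage of segments containing the entry as a whitespace token.
    Assumption: this is one possible definition, not the one used by ULI."""
    if not segments:
        return 0.0
    hits = sum(1 for segment in segments if entry in segment.split())
    return 100.0 * hits / len(segments)

for entry in ("Corp.", "St.", "P.T.T.C."):
    print(entry, tm_frequency(entry, SEGMENTS))
```

A candidate that never appears in the translation memory, like "P.T.T.C." on the slide, is a natural one to flag for manual review.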
CLDR Input
Extract abbreviations from CLDR localized data
Days of week: Sun. Mon. Tue. Wed. Thu. …
Months: Jan. Feb. Mar. …
etc…
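A minimal sketch of this extraction step, assuming the localized names have already been loaded into a dictionary. The sample is hand-copied in the shape of the slide's examples ("May" is added here as a name without a full stop); a real run would read the strings from the CLDR locale data files.

```python
# Hand-copied sample shaped like the slide's examples; "May" is added here
# as a name without a full stop. Real data would come from CLDR locale files.
CLDR_SAMPLE = {
    "days": ["Sun.", "Mon.", "Tue.", "Wed.", "Thu."],
    "months": ["Jan.", "Feb.", "Mar.", "May"],
}

def extract_abbreviations(localized):
    """Collect every localized string that ends in a full stop."""
    found = set()
    for names in localized.values():
        found.update(name for name in names if name.endswith("."))
    return sorted(found)

abbreviations = extract_abbreviations(CLDR_SAMPLE)
print(abbreviations)
```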
Manual Review
CLDR output format
<segmentations>
  <segmentation type="SentenceBreak">
    <!--From ULI data, http://uli.unicode.org-->
    <suppressions type="standard">
      <suppression>Port.</suppression>
      <suppression>Alt.</suppression>
      <suppression>Di.</suppression>
      <suppression>Ges.</suppression>
      <suppression>frz.</suppression>
      …
    </suppressions>
  </segmentation>
</segmentations>
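Producing this fragment from a reviewed list is straightforward with the standard library. The sketch below assumes only the element and attribute names visible on the slide; `build_suppressions` is a hypothetical helper, and any enclosing elements of a full CLDR file are left out.

```python
import xml.etree.ElementTree as ET

def build_suppressions(abbreviations):
    """Build a <segmentations> element shaped like the slide's example."""
    root = ET.Element("segmentations")
    segmentation = ET.SubElement(root, "segmentation", type="SentenceBreak")
    suppressions = ET.SubElement(segmentation, "suppressions", type="standard")
    for abbreviation in abbreviations:
        ET.SubElement(suppressions, "suppression").text = abbreviation
    return root

root = build_suppressions(["Port.", "Alt.", "Di."])
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```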
CLDR 26 Output
http://cldr.unicode.org
“Break Suppression”
de 239
en 151
es 164
fr 82
it 45
pt 170
ru 18
Challenges
"Long Tail" Languages
harder to find existing TM data
harder to find linguistic rules/review
harder to find tagged corpora to benchmark
Systematic issues with using redirects/disambiguation
Opportunity
Scope:
Non-full-stop punctuation, e.g. "Yahoo!"
Language specific abbreviation rules
Context (Medical, Business, …)
Leverage
Schema/Taxonomy ( “Place” vs “Person” etc. ) to filter
DBpedia lists
Additional LOD
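As a sketch of the schema/taxonomy idea, candidates can be filtered by the ontology types attached to their meanings. The "St." record mirrors the types shown in the earlier example data; the untyped "Mr." record and the `filter_by_type` helper are assumptions made for illustration.

```python
# "St." mirrors the types shown in the example data; the untyped "Mr."
# record is an assumption added for contrast.
CANDIDATES = [
    {"abbr": "St.", "meaning": "Street",
     "types": {"http://schema.org/Place",
               "http://dbpedia.org/ontology/Place",
               "http://dbpedia.org/ontology/PopulatedPlace"}},
    {"abbr": "Mr.", "meaning": "Mr.", "types": set()},
]

def filter_by_type(candidates, wanted_type):
    """Keep the abbreviations whose meaning carries the wanted ontology type."""
    return [c["abbr"] for c in candidates if wanted_type in c["types"]]

places = filter_by_type(CANDIDATES, "http://dbpedia.org/ontology/Place")
print(places)
```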
Thank You!
Further Q&A?
Slides & contact info: https://pad.okfn.org/p/DBpediaULI