Linked Data & DBpedia
M. Freudenberg, K. Müller & M. Ackermann
AKSW/KILT - Leipzig
DBpedia Association
AKSW/KILT
- Knowledge Integration and Language Technology (KILT)
- Part of Agile Knowledge Engineering and Semantic Web (AKSW)
Tim Berners-Lee
- British computer scientist
- Director of the W3C
- Inventor of the World Wide Web (1989 @ CERN)
- Motivated by his frustration with disconnected islands of information (about scientists, projects and results)
- Published the first website:
http://info.cern.ch/hypertext/WWW/TheProject.html
From the WWW to the Web of Data
- applying the principles of the WWW to data
- data is relationships, not only properties
TimBL’s next leap: from WWW to WOD
Use Linked Data to build a Web of Data
- applying the principles of the WWW to data
- data is relationships → not only properties
- the more data you have to connect together → the more you can find out
- using Linked Data to:
  - bridge disciplines and domains (by linking their data)
  - unlock the potential of island repositories
→ don’t hoard your data; if possible, share it
Watch the TED talk of TimBL about Linked Data
Linked Data Principles
1. Use HTTP URIs as identifiers for resources
   → so people can look up the names
2. Provide data at the location of URIs
   → to provide data for interested parties
3. Include links to other resources
   → so people can discover more things
   → bridging disciplines and domains
   → the more linked resources, the more one can find out
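Principles 1 and 2 mean that a client can "look up" a resource name with a plain HTTP GET and ask for data rather than a web page. A minimal sketch using only the Python standard library; the request is prepared but deliberately not sent, and the media type `text/turtle` is an assumption about what the server offers:

```python
from urllib.request import Request


def build_lookup_request(uri: str, media_type: str = "text/turtle") -> Request:
    """Prepare (but do not send) an HTTP GET that asks for RDF instead of HTML."""
    return Request(uri, headers={"Accept": media_type}, method="GET")


request = build_lookup_request("http://dbpedia.org/resource/Siemens")
print(request.get_method(), request.full_url)
print("Accept:", request.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would perform the actual lookup; whether and how a given server answers is outside this sketch.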
RDF - Resource Description Framework
- Statements of subject > predicate > object
- Such a statement is a so-called ‘Triple’.
Subject: http://dbpedia.org/resource/Siemens
Predicate: label
Object: "Siemens"
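As a rough illustration, the triple from this slide can be held in a three-field record and written out as one N-Triples line. The sketch assumes that "label" stands for `rdfs:label` and only handles plain string literals as objects:

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # IRI of the resource being described
    predicate: str  # IRI of the property
    obj: str        # in this sketch: always a plain string literal


def to_ntriples(t: Triple) -> str:
    """Serialize one triple with a literal object as an N-Triples line."""
    escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{t.subject}> <{t.predicate}> "{escaped}" .'


siemens_label = Triple(
    "http://dbpedia.org/resource/Siemens",
    "http://www.w3.org/2000/01/rdf-schema#label",  # assumed expansion of "label"
    "Siemens",
)
print(to_ntriples(siemens_label))
```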
Knowledge Graphs
● A set of combined triples is known as a graph
● Resources link to other resources inside or outside the current graph/dataset
● A knowledge base of this style is considered a Knowledge Graph (KG)
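A small sketch of this idea: a graph is just a set of triples, and following links means chaining predicates from resource to resource. The two triples reuse the Siemens/Munich example from these slides; the helper names `objects` and `follow` are made up for illustration:

```python
DBR = "http://dbpedia.org/resource/"
DBO = "http://dbpedia.org/ontology/"

# A graph is a set of (subject, predicate, object) triples.
graph = {
    (DBR + "Siemens", DBO + "location", DBR + "Munich"),
    (DBR + "Munich", DBO + "country", DBR + "Germany"),
}


def objects(graph, subject, predicate):
    """All objects of triples with the given subject and predicate."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def follow(graph, start, *predicates):
    """Follow a chain of predicates from start and return the reached resources."""
    current = {start}
    for predicate in predicates:
        current = {o for node in current for o in objects(graph, node, predicate)}
    return current


print(follow(graph, DBR + "Siemens", DBO + "location", DBO + "country"))
```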
The Data (in RDF/XML)
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:dbo="http://dbpedia.org/ontology/">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Siemens">
    <rdfs:label>Siemens</rdfs:label>
    <dbo:type rdf:resource="http://dbpedia.org/resource/Aktiengesellschaft"/>
    <dbo:location rdf:resource="http://dbpedia.org/resource/Munich"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://dbpedia.org/resource/Munich">
    <dbo:country rdf:resource="http://dbpedia.org/resource/Germany"/>
  </rdf:Description>
</rdf:RDF>
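To show how such a document maps back to triples, here is a deliberately simplified reader built on the standard-library XML parser. It is a sketch, not a full RDF/XML parser: it assumes flat `rdf:Description` blocks whose properties are either plain literals or `rdf:resource` links, and the sample document adds the `rdf:RDF` wrapper with namespace declarations so it is well-formed on its own:

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

RDF_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:dbo="http://dbpedia.org/ontology/">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Siemens">
    <rdfs:label>Siemens</rdfs:label>
    <dbo:location rdf:resource="http://dbpedia.org/resource/Munich"/>
  </rdf:Description>
</rdf:RDF>"""


def simple_triples(rdf_xml: str):
    """Yield (subject, predicate, object) for flat rdf:Description blocks only."""
    root = ET.fromstring(rdf_xml)
    for description in root.findall(f"{{{RDF}}}Description"):
        subject = description.get(f"{{{RDF}}}about")
        for prop in description:
            # ElementTree spells tags as "{namespace}local"; joining both
            # parts gives the predicate IRI.
            predicate = prop.tag[1:].replace("}", "", 1)
            resource = prop.get(f"{{{RDF}}}resource")
            obj = resource if resource is not None else prop.text
            yield subject, predicate, obj


for triple in simple_triples(RDF_XML):
    print(triple)
```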
The Data (in Turtle)
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbo: <http://dbpedia.org/ontology/> .
<http://dbpedia.org/resource/Siemens>
    dbo:type <http://dbpedia.org/resource/Aktiengesellschaft> ;
    rdfs:label "Siemens"@de ;
    dbo:location <http://dbpedia.org/resource/Munich> .
<http://dbpedia.org/resource/Munich>
    dbo:country <http://dbpedia.org/resource/Germany> .
Linked Data vs Open Data
4. Final principle: Open your data using open licenses
• Not all linked data is open
• Licensed data can still profit from using standards
• Can be enriched with links to Linked Data
• Can be accessed by standard tools
Why publish Linked Data
• Ease of discovery through linking
• Easy to consume by humans and machines
• Reduce data redundancy
• Support collaboration & interoperability
• Add value and visibility
Benefits of Linked Data: Consumer View
- discover more related data by following links
- reuse the data of other datasets
- combine data safely from different sources
- formulate sophisticated queries → example in appendix
- query data over multiple repositories
- semantic enrichment of text resources
- semantic features for machine learning models (e.g. deep learning, word embeddings, etc.)
Benefits of Linked Data: Publisher View
- link data to any other resource on the web, thereby increasing the value of your data
- make your data discoverable (via links)
- exhaustive descriptions of large and changing domains (Gene Ontology, Human Disease Ontology)
- structured representation of large, versatile datasets (Knowledge Graphs, Thesauri, Taxonomies)
- deal with unstructured data (text) as no DB schema could
- data and schemata use the same format (RDF)
- store metadata alongside the actual data (e.g. DCAT)
Linked Open Data
LOD-Cloud 2014
Linked Data datasets under open access
- 1014 datasets
- any subject
- over 50B triples
- over 100M links
DBpedia
● First public Knowledge Graph
● Has become the focal point of the so-called “Linked Open Data Cloud”.
● Is the most universal dataset (since it’s based on Wikipedia).
● Links actively to many relevant Linked Open Datasets.
● Is a link destination for many other datasets.
(more on DBpedia later…)
Other Linked Data Sets: Freebase
● Managed and hosted by Google until 2015
● Now (in part) subsumed by Wikidata
● extracted structured data from Wikipedia and other sources
● available in RDF
● Differences to DBpedia
  ○ Freebase used several sources (but DBpedia+ does as well)
  ○ Freebase could be directly edited by users
  ○ Ontology and mappings were not coordinated by a community
→ never established a community which enriched or validated the data, mostly generated by crawlers
Wikidata
● Initiated by Wikimedia Germany e.V. in 2012
● free knowledge base about the world that can be read and edited by humans and machines alike
● can offer a variety of statements from different sources
● DBpedia is extracting information from Wikidata to fuse it with knowledge from Wikipedia
● Goal is to provide a single point of truth for facts in Wikipedia across different language versions
Other Datasets
● Geonames
  ○ geographical database that covers all countries
  ○ contains over eleven million placenames
  ○ e.g. http://www.geonames.org/3399415/fortaleza.html
● Linked Open Vocabularies (LOV)
  ○ Keeps track of available open ontologies and provides them as a graph
  ○ Search for available ontologies, open for reuse
  ○ e.g. http://lov.okfn.org/dataset/lov/vocabs/foaf
● Lexvo.org
  ○ information about languages, words, characters, and other human language-related entities
  ○ e.g. http://www.lexvo.org/page/iso639-3/deu
Excursus: Ontologies
This is a concise introduction to ontologies and their role as schemata in Linked Data.
(No worries, we keep this short ;)
Ontologies in Computer Science
● An ontology has a common language (symbols, expressions) → Syntax
● The meaning of symbols and expressions is clear → Semantics
● Symbols and expressions with similar semantics are grouped in concepts (classes) → Conceptualization
● Concepts are organized in a hierarchical way → Taxonomy
● Concepts might be related to others → Relations
● Implicit knowledge can be made explicit → Reasoning
Ontology, a definition
“An ontology is an explicit, formal specification of a shared conceptualization.”
(after Thomas R. Gruber, 1993; this wording follows Studer et al., 1998)
Axioms
Axioms are statements in the ontology that are explicitly asserted and accepted as true without proof.
Implicit knowledge can be made explicit by logical inference: → Reasoning over an ontology
Source for the ontology related slides: http://www.slideshare.net/SergeLinckels/semantic-web-ontologies
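A toy illustration of making implicit knowledge explicit: from a few asserted subclass axioms, a simple fixed-point loop derives all superclasses. The class names are only examples, and real reasoners cover far more than this one rule:

```python
def infer_superclasses(subclass_of: dict[str, set[str]]) -> dict[str, set[str]]:
    """Compute the transitive closure of asserted subclass axioms."""
    closure = {cls: set(parents) for cls, parents in subclass_of.items()}
    changed = True
    while changed:
        changed = False
        for cls, parents in closure.items():
            inherited = set()
            for parent in parents:
                inherited |= closure.get(parent, set())
            if not inherited <= parents:
                parents |= inherited  # add the newly derived superclasses
                changed = True
    return closure


# Asserted axioms (illustrative): each class maps to its direct superclasses.
axioms = {
    "Company": {"Organisation"},
    "Organisation": {"Agent"},
    "Agent": {"Thing"},
}
print(sorted(infer_superclasses(axioms)["Company"]))
```

Only "Company is an Organisation" was asserted for Company; that it is also an Agent and a Thing is derived.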
Ontology Language
To express ontologies in a formal, machine-readable way, so that we can reason over the outlined knowledge, we need a specialized language.
→ most common: Web Ontology Language (OWL)
● represents rich and complex knowledge about things
● based on a subset of First Order Logic (FOL)
● can be used to verify the consistency of a knowledge base
● can make implicit knowledge explicit
● like the data it conceptualizes, it is serializable in RDF
How to utilize Linked Data Standards
Any OWL ontology/taxonomy can be used in a non-LD context.
- through their ability to link resources, RDF-based ontologies can easily be amalgamated, thereby making them reusable
- extending ontologies to fit a narrower use case
- reducing ontologies of a certain area to fit a broader scope
- separating semantic structure (classes, properties) from use-case-specific restrictions (e.g. cardinalities) → SHACL
- Example: DataID
The W3C, responsible for common standards on the Web, is focusing on RDF-based standards in many fields.
Incremental adoption of LD technologies
Linked Data standards and technologies are manifold and, at times, confusing.
Fortunately, introducing Linked Data into existing IT environments can be accomplished in an incremental fashion:
● Collect data without given schemata/ontologies
  ○ Very helpful when dealing with semi- or unstructured data
● Use RDF Views on top of an existing DBMS
  ○ With an easy-to-change R2RML mapping
● Develop a domain ontology over time (‘Open World Assumption’)
  ○ Especially useful in fast-changing domains
● Enrich data with every iteration of your data management cycle
  ○ See ALIGNED methods for more
● Start using LD-based tooling: e.g. Rel Finder
Linked Data in the context of Big Data
In 2006, Clive Humby coined the phrase "the new oil" for (digital) data, heralding the ever-expanding realm of what is now summarised as: Big Data.
What role does Linked Data play in the context of buzzwords like:
● Big Data
● Smart Data (e.g. Smart Data Forum)
The four V’s heatmap for Linked Data
A Gartner study in 2013 found:
- many organizations find the “variety” dimension a much bigger challenge than volume or velocity.
Linked Data to the rescue:
- combine multiple sources with different structures
- while retaining the flexibility to add new ones
- without adapting schemata
- query combined data, or multiple sources at once
- detect patterns in the data
Linked Data in the context of Big Data
- Linked Data can describe any kind of data
  - no matter the amount or domain
- especially useful for:
  - graph structured data
    - social media, knowledge graphs
  - (multi-)lingual data
  - easy incorporation of unstructured data
    - perfect for annotation purposes
  - data for complex domains
    - e.g. taxonomies in life sciences
  - ontologies, metadata and provenance info
    - ontologies are modeled with OWL, an RDF extension
[Diagram labels: Big Data, Smart Data, Linked Data]
Linked Data in Research
● Computer science: especially graph- and NLP-related, QA, AI
  ○ e.g. IBM: Natural Language Understanding of Unstructured Data
● Life sciences: to describe complex domains (large ontologies & taxonomies)
  ○ e.g. Human Disease Ontology
● (Digital) Humanities: to manage (record, annotate…) large text records
  ○ e.g. Homer Multitext Project
● Libraries: recording of metadata and interlinking it with other institutions
  ○ e.g. Deutsche Nationalbibliothek
Linked Data in Industry
● Online Search (large Knowledge Graphs)
  ○ e.g. Google
● Social Media (social network analysis)
  ○ e.g. Facebook
● Publishing Industry (large text corpora annotation)
  ○ e.g. Wolters Kluwer, NYT
● Broadcasting (ontology-centered Linked Data services)
  ○ e.g. BBC
● (Open) Government Data
  ○ e.g. US publishing data as RDF http://data.gov
Who uses RDF (in public)
DBpedia… a fused, multi-domain, multilingual dataset
● Is a crowd-sourced community effort to extract structured information from Wikipedia and Wikidata.
● Enriches the extracted information with a semantic layer.
● Provides a query service and many additional tools.
DBpedia History
● DBpedia project was started in 2006 as a collaboration of Freie U. Berlin, U. Leipzig and U. Mannheim
● Has been a key factor in the rapid growth of the LOD initiative and the overall success of Linked Data
● http://wiki.dbpedia.org
Example: http://dbpedia.org/page/Leipzig_University
http://en.wikipedia.org/wiki/Monty_Python ⇔ http://dbpedia.org/resource/Monty_Python
Some statistics
The latest release (2015-10):
● was extracted from 127 language editions
● describing up to 20 million things
● with 8.8 billion RDF statements (triples)
● Mirroring every Wikipedia page = 6.2M things
  ○ of which 4.6M have abstracts,
  ○ 955K have geo coordinates
  ○ and 1.54M depictions
● “In general we observed a significant growth in raw infobox and mapping-based statements of close to 10%”.
Structure of DBpedia
● Structured in language datasets with multiple subsets
● Each sub-dataset specializes in a certain type of data (see below)
● mapping-based types and facts governed by the DBpedia Ontology
DBpedia Ontology
● A cross-domain ontology
● maintained and extended by the community in the DBpedia Mappings Wiki
● manually created based on the most commonly used infoboxes
● currently covers 685 classes which form a subsumption hierarchy and are described by 2,795 different properties
● subsumption hierarchy with a maximal depth of 5
DBpedia Mappings Wiki
● a community effort to:
  ○ develop an ontology schema
  ○ provide mappings from Wikipedia infobox properties to this ontology
→ creating an alignment between Wikipedia and DBpedia
→ eliminating name variations in properties and classes
→ big boost for precision
http://mappings.dbpedia.org/
Extracting a DBpedia
● Wikipedia articles consist mostly of free text
● but also comprise various types of structured information
● depending on the template used for a specific article (e.g. Actor, Village etc.)
● including: infobox templates, categorisation information, images, geo-coordinates, links to external web pages, disambiguation pages, redirects between pages, other language links
Wikipedia Article Structure
● Title
● Abstract
● Infoboxes
● Geo-coordinates
● Categories
● Images
● Links
  ○ other language versions
  ○ other Wikipedia pages
  ○ to the Web
  ○ Redirects
  ○ Disambiguations
DIEF - DBpedia Information Extraction Framework
● extracts structured information from Wikipedia and turns it into a rich knowledge base
  ○ Mapping-Based Infobox Extraction
  ○ Raw Infobox Extraction
  ○ Feature Extraction
  ○ Statistical Extraction
● Updated to adapt to changes in Wikipedia
● Expanded for new knowledge extraction methods
  ○ E.g. by multiple GSoC projects (table extraction, NIF, ...)
● Open-source code in Scala & Java
DBpedia Live
● Wikipedia articles are continuously revised at a very high rate
● English Wikipedia, in June 2013, had approximately 3.3 million edits per month (≈ 77 edits per minute)
● DBpedia Live was developed to keep DBpedia in synchronization with Wikipedia
● works on a continuous stream of updates from Wikipedia and processes that stream on the fly
Accessing and Querying DBpedia
● per resource view
  ○ Linked Data interfaces
    http://dbpedia.org/page/Immanuel_Kant
● navigation view
  ○ LodLive Browser
    http://en.lodlive.it/?http://dbpedia.org/resource/Immanuel_Kant
● querying for resources
  ○ SPARQL (introduced later)
  ○ DBpedia Lookup Service
DBpedia Lookup Service
● REST service to query for DBpedia resources
● index of DBpedia resources, including alternative names/labels (page redirects, disambiguation links, …)
● search by complete keywords and prefix search
● results ranked by relevance (Page Rank)
● filtering by DBpedia ontology classes
http://lookup.dbpedia.org/api/search/KeywordSearch?QueryClass=place&QueryString=berlin
http://lookup.dbpedia.org/api/search/KeywordSearch?QueryClass=person&QueryString=berlin
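The two example requests differ only in their query parameters, so they can be assembled programmatically. A small sketch that builds the URL without sending it; the endpoint and the parameter names `QueryClass` and `QueryString` are taken from the URLs above, while the helper name `keyword_search_url` is made up:

```python
from typing import Optional
from urllib.parse import urlencode

LOOKUP_ENDPOINT = "http://lookup.dbpedia.org/api/search/KeywordSearch"


def keyword_search_url(query: str, ontology_class: Optional[str] = None) -> str:
    """Build a keyword search URL, optionally filtered by an ontology class."""
    params = {}
    if ontology_class:
        params["QueryClass"] = ontology_class
    params["QueryString"] = query
    return f"{LOOKUP_ENDPOINT}?{urlencode(params)}"


print(keyword_search_url("berlin", "place"))
print(keyword_search_url("berlin", "person"))
```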
DBpedia internationalised
● non-English versions of DBpedia offer
  ○ coverage of more entities
  ○ more detailed or up-to-date information for entities associated with the particular countries
● international mapping community
  ○ helps in the provision of localized DBpedia datasets for 125 languages
● 15 DBpedia chapters (by language)
  ○ autonomous management of mappings
  ○ organisation of the local community
  ○ hosting of datasets and services
● canonicalized datasets
  ○ facts derived from localized Wikipedias, but only statements for resources also present in the English DBpedia
DBpedia Association
- founded in 2014, based in Leipzig
- goal: supporting the DBpedia community and providing free data and services to the general public
  - Data Releases
  - Software Maintenance
  - Dissemination
  - Data Accessibility
  - Communication (internal and external)
- persons and organisations can become members:
  - gaining support for all DBpedia-specific problems (queries, tools etc.)
  - deciding on the future of DBpedia
  - acquiring help for creating and linking their own datasets
A need for information
“Which films starred John Cleese without any other members of Monty Python?”
SPARQL Protocol and RDF Query Language
● RDF data query language
  ○ also query and data transfer protocol specifications (HTTP-based)
● graph-data oriented, designed independently from ontology & related reasoning
  ○ but some SPARQL implementations can provide reasoning (e.g. RDFS+)
● declarative approach carrying several similarities to SQL
tutorial: https://jena.apache.org/tutorials/sparql.html
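At its core, a SPARQL engine matches triple patterns containing variables against the graph and joins the resulting bindings. The sketch below imitates that in plain Python on a tiny hand-written graph and applies it to the John Cleese question. The short names are not DBpedia identifiers, the member list is intentionally incomplete, and the final set difference only mimics what MINUS would do in a real query:

```python
def match_pattern(pattern, triple, binding):
    """Try to extend a variable binding so that the pattern equals the triple."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):  # variable: bind it or check the old binding
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:       # constant: must match exactly
            return None
    return extended


def solve(graph, patterns):
    """Evaluate a basic graph pattern by joining all triple patterns."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in graph
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


PYTHONS = {"John_Cleese", "Michael_Palin", "Eric_Idle"}  # only part of the group
graph = [
    ("A_Fish_Called_Wanda", "starring", "John_Cleese"),
    ("A_Fish_Called_Wanda", "starring", "Michael_Palin"),
    ("Clockwise", "starring", "John_Cleese"),
]

with_cleese = {b["?film"] for b in solve(graph, [("?film", "starring", "John_Cleese")])}
with_other_python = {
    b["?film"]
    for b in solve(graph, [("?film", "starring", "?actor")])
    if b["?actor"] in PYTHONS - {"John_Cleese"}
}
print(sorted(with_cleese - with_other_python))
```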
SPARQL - additional constructs
● alternative result types:
  ○ ASK ⇒ true, if a valid binding can be found
  ○ CONSTRUCT ⇒ create new graph from result bindings
● combinators and modifiers for queries/graph patterns:
  ○ UNION, MINUS, Subqueries
  ○ LIMIT, OFFSET, DISTINCT, ORDER BY
● property paths as regular language (*, +, ^, {n,m})
  ○ e.g. rel:hasParent / rel:hasChild{2} / rel:hasFriend+
● sizable library of functions and operators for resources and literal values
find it all at: http://www.w3.org/TR/sparql11-query/
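To give a feel for what a property path such as `rel:hasFriend+` computes, the following sketch evaluates "one or more steps along a predicate" over a toy graph. The data and the helper name `one_or_more` are invented for illustration; a SPARQL engine does this internally:

```python
def one_or_more(graph, start, predicate):
    """Nodes reachable from start via one or more steps along predicate."""
    reached, frontier = set(), {start}
    while frontier:
        # One more step from the current frontier, skipping known nodes.
        frontier = {o for s, p, o in graph if p == predicate and s in frontier} - reached
        reached |= frontier
    return reached


graph = {
    ("alice", "rel:hasFriend", "bob"),
    ("bob", "rel:hasFriend", "carol"),
    ("carol", "rel:hasFriend", "alice"),
    ("carol", "rel:hasParent", "dave"),
}
print(sorted(one_or_more(graph, "alice", "rel:hasFriend")))
```

Because the friendship links form a cycle, the start node itself is reachable again, while `dave` is not, since `rel:hasParent` is a different predicate.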
End of session 1
Grab a coffee…
Next session: helpful LD technologies for:
- NLP
- Link Discovery
- Data Fusion
A common Use Case for integrating LD technologies
NIF - Natural Language Processing Interchange Format
● RDF/OWL-based
● utilizes various existing standards: RDF, OWL2, PROV Ontology, ITSRDF, …
● promotes stable URIs to identify primary text, its structure, annotations and their meta-data
Interoperability for Language Data and Tools
Structural Interoperability: uniform data format and structure of annotations ⇒ RDF & NIF vocab
Conceptual Interoperability: identical vocabularies/taxonomies for annotations (or linkage to common reference vocabulary) ⇒ Ontology of Linguistic Annotations (OLiA), GOLD, ...
Access Interoperability: uniform, widespread, easily adoptable method for access ⇒ REST
Linking OWL Ontologies for Conceptual Interoperability
● Ontology of Linguistic Annotations
  ○ linking specific annotation tag sets into linguistic reference models/ontologies
  ○ machine-actionable, granular representations of semantics of tags (beyond string values)
NIF: Further Widespread Requirements Covered
● Provenance and Confidence for Annotations
● Multiple Alternative Annotations
exdoc:2_offset_23_29
    nif:anchorOf "Berlin" ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Berlin> ;
    nif-ann:taIdentConf "0.9"^^xsd:decimal ;
    nif-ann:taIdentProv exdoc:eEntityProdServiceInvocation ;
    nif:annotationUnit [
        itsrdf:taIdentRef <http://dbpedia.org/resource/Berlin,%20Nevada> ;
        nif-ann:taIdentConf "0.32"^^xsd:decimal ;
        nif-ann:taIdentProv exdoc:eEntityExpServiceInvocation
    ] .
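The identifier `exdoc:2_offset_23_29` encodes the character offsets of the annotated string. A sketch of how such an identifier could be derived from a text; the sentence is a made-up example chosen so that "Berlin" lands at offsets 23 to 29, and the helper name is hypothetical:

```python
def offset_annotation(doc_prefix: str, text: str, mention: str) -> dict:
    """Build an offset-based identifier for the first occurrence of mention."""
    begin = text.index(mention)  # raises ValueError if the mention is absent
    end = begin + len(mention)
    return {
        "id": f"{doc_prefix}_offset_{begin}_{end}",
        "nif:beginIndex": begin,
        "nif:endIndex": end,
        "nif:anchorOf": text[begin:end],
    }


text = "Siemens has offices in Berlin."  # hypothetical document text
annotation = offset_annotation("exdoc:2", text, "Berlin")
print(annotation["id"], annotation["nif:anchorOf"])
```

Because the offsets are part of the identifier, any tool that sees the same text can recompute the URI and attach further annotations to the same string.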
Available NIF Resources
● Corpora
● Services
  ○ Tokenisation
  ○ Annotation
  ○ Validation
  ○ Combining Tools Outputs
● Documentation, Specs
NIF Corpora
● Brown Corpus
● AQUAINT News Corpus
● NER corpora
  ○ RSS-500, Reuters-128, KORE 50
  ○ Microposts NEEL
  ○ ACE Multilingual
● DBpedia abstract corpora
  ○ English, French, German, Dutch, ...
NIF Services
● OpenNLP
  ○ POS tags
● Stanford NLP
  ○ POS tags, lemmatization
● Snowball
  ○ Stemming
● DBpedia Spotlight
● Validation (via RDFUnit)
Mapping Languages
● Help to create class mappings between a source dataset and a target RDF ontology
  ○ E.g. table headings to RDF predicates (e.g. rdfs:label)
Mapping Languages cont.
● R2RML: Only supports mappings between relational databases and RDF
● RML: Extension of R2RML and supports other input data formats such as CSV, JSON, XML
● SML: Extension of R2RML and supports other input formats such as CSV
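The core idea behind these mapping languages, table headings mapped to RDF predicates plus a template for the subject URI, can be sketched in a few lines. This is not R2RML, RML or SML syntax; the mapping dictionary, the `example.org` subject template and the sample table are all assumptions for illustration:

```python
import csv
import io

# Hypothetical mapping from table headings to RDF predicates.
COLUMN_TO_PREDICATE = {
    "name": "http://www.w3.org/2000/01/rdf-schema#label",
    "city": "http://dbpedia.org/ontology/location",
}
SUBJECT_TEMPLATE = "http://example.org/company/{id}"  # hypothetical URI scheme


def rows_to_triples(csv_text: str):
    """Turn each CSV row into one triple per mapped, non-empty column."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = SUBJECT_TEMPLATE.format(id=row["id"])
        for column, predicate in COLUMN_TO_PREDICATE.items():
            if row.get(column):
                yield subject, predicate, row[column]


table = "id,name,city\n1,Siemens,Munich\n"
triples = list(rows_to_triples(table))
for triple in triples:
    print(triple)
```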
ETL Frameworks
● Extract Transform Load (ETL)
● Common in Data Warehousing
● Extract phase:
  ○ Extract data from data sources (e.g. CSV, JSON, database, etc.)
● Transform phase:
  ○ Transform data for storing in target format
  ○ Boilerplating/Normalization
  ○ Content Enrichment (e.g. loading of geo-information)
● Load phase:
  ○ Loads data into target data store (e.g. Virtuoso)
ETL Frameworks - LDIF
● Linked Data Integration Framework (LDIF)
● Hadoop-based ETL pipeline
● Supports Provenance Metadata
● Components:
  ○ Scheduler
  ○ Data Import (Crawl, SPARQL, Dump)
  ○ ETL
● Custom Mapping Language
ETL Frameworks - Unified Views
● Joint project between Semantic Web Company and Semantica.cz
  ○ Supported by LOD2 FP7 project
● Components:
  ○ Frontend UI
  ○ Backend
  ○ Database
  ○ Scheduler
● Possible to add custom plugins
● Link
Link Discovery Frameworks
● Finding links between related data items in different datasets
● Use cases:
  ○ owl:sameAs links
  ○ Class mappings (e.g. like R2RML)
  ○ data transformation
● Survey
● Matching algorithms:
  ○ string similarity, geo-location matching, regular expressions, etc.
● Link Discovery Strategies
  ○ Rule based: using predefined rules to find matching data items
  ○ Statistical based: using machine learning techniques to find matching data items
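A minimal sketch of the rule-based strategy with string similarity as the matching algorithm: two resources are linked with owl:sameAs when their labels are similar enough. The threshold, the `example.org` URIs and the labels are assumptions, and the quadratic comparison is exactly what frameworks such as LIMES or Silk are built to avoid at scale:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def discover_links(source: dict, target: dict, threshold: float = 0.9):
    """Rule-based linking: emit owl:sameAs when labels are similar enough."""
    for s_uri, s_label in source.items():
        for t_uri, t_label in target.items():
            if similarity(s_label, t_label) >= threshold:
                yield s_uri, "owl:sameAs", t_uri


# Hypothetical datasets: resource URI -> label
source = {
    "http://example.org/a/1": "Siemens AG",
    "http://example.org/a/2": "Leipzig University",
}
target = {
    "http://example.org/b/x": "siemens ag",
    "http://example.org/b/y": "Munich",
}
links = list(discover_links(source, target))
print(links)
```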
Link Discovery Frameworks - LIMES
● Link discovery framework for MEtric Spaces
● Fast, large-scale link discovery using specification language
● Link
Link Discovery Frameworks - Silk
● UI driven linking framework
● Uses its own specification language
● Supports data transformation
● Link
Data Fusion
● Fusing of multiple records representing the same real-world object into a single, consistent, and clean representation (Bleiholder & Naumann 2008)
● Possible use cases:
  ○ Same value for the same property in all datasets (e.g. name)
  ○ Different values for the same property across datasets (e.g. age)
  ○ New information
● Problems:
  ○ No unique IDs
  ○ Real-world data is dirty, big and complex
  ○ No training data for many linkage applications
  ○ Trustworthiness of external data
Data Fusion - Strategies
● Rule based
  ○ Using observed value from most recently updated source
  ○ Taking average/maximum/minimum for numerical values
  ○ Idea is to improve efficiency
● Statistically based
  ○ Unsupervised/supervised strategies:
    ■ Vote: take the value which is supported by the largest number of sources
    ■ Quality based: evaluate trustworthiness
      ● Web-link-based, IR-based, Bayesian, graphical model
    ■ Relation based
      ● Extends quality-based methods and considers relationships between sources (e.g. copying data around, etc.)
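Three of the simpler strategies (vote, most recently updated source, average) fit in a few lines. The sample values, dates and function names below are invented for illustration and do not come from any real dataset or tool:

```python
from collections import Counter
from statistics import mean


def fuse_by_vote(values):
    """Vote: keep the value supported by the largest number of sources."""
    return Counter(values).most_common(1)[0][0]


def fuse_by_recency(observations):
    """Rule based: keep the value from the most recently updated source."""
    # ISO dates sort correctly as plain strings.
    return max(observations, key=lambda obs: obs["updated"])["value"]


# Hypothetical conflicting observations from three and two sources.
names = ["Siemens AG", "Siemens AG", "Siemens"]
headcounts = [
    {"value": 100, "updated": "2015-01-01"},
    {"value": 120, "updated": "2016-06-30"},
]

print(fuse_by_vote(names))
print(fuse_by_recency(headcounts))
print(mean(obs["value"] for obs in headcounts))
```

Which strategy fits depends on the property: a name is a good candidate for voting, while a number that changes over time is usually better served by recency.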
Data Fusion - LD-FusionTool
● Developed in conjunction with Unified Views project
● Features:
  ○ Resolution of schema and identity conflicts
  ○ Resolution of data conflicts
  ○ Quality Assessment
  ○ Provenance Tracking
● No machine learning based fusion
● Link
Data Fusion - Sieve
● Developed in conjunction with LDIF project
● Features:
  ○ Resolution of data conflicts
  ○ Quality Assessment
  ○ Provenance Tracking
  ○ Support for Plugins
● Link
Data Fusion - Sieve
Examples
ALIGNED - Software & Data Engineering
● quality-centric software and data engineering
● research project funded by Horizon 2020 (EC)
● will develop new ways to build and maintain IT systems that use big data on the web
ALIGNED - Goals
● New methodology for parallel software and data engineering of web-scale information systems.
  ○ Linked Data as the unifying foundation for system specification, process and tool integration.
  ○ Support evolution of software dependent on heterogeneous, complex data of varying quality with an independent lifecycle.
ALIGNing Problem: the example of DBpedia
Lots of code & a lot more data
● Wikipedia evolves over time
  ○ Infobox templates change, are merged or deleted
  ○ New formatting templates
  ○ Structural differences per language edition
● DBpedia Ontology and Mappings change as well
● Code should adapt to all the changes
  ○ hard at this (data) scale
→ Data Quality will suffer
Unit-testing to the rescue?
● Software & Data testing
● Straightforward for software (since the 70’s)
● Preliminary for (RDF) data
  ○ RDFUnit, SPIN…
  ○ W3C Data Shapes WG
Data testing
● Generation: manual, (semi-)automatic, ...
● Linking: data & software tests
RDF Unit
http://rdfunit.aksw.org
RDF Unit
● Input: ontologies, updated Data Quality Patterns (DQP), and the datasets to test against
● Produces Data Unit Test Cases automatically by applying DQPs to the axioms of an ontology
  ○ User-defined test cases are added as well
● Runs all Data Unit Test Cases against a given dataset
● Generates Test Case Result data for every triple violating one of these DQPs
● (evaluate Test Case Results to change triples, software or DQP/Test Cases, then run RDF Unit again…)
FREME: Multilingual Content Enrichment & Curation
● two-year H2020 innovation action: bridge language and data
● driven by four business use cases
business partners:
vistatec - translation, localisation, content creation/curation
tilde - language & terminology services
agroknow - agriculture & food information & research
wripl - content optimisation & personalisation, SEO
Contribution of FREME
various target user groups:
● developers
● content authors
● content architects
● ...
several access modes:
● graphical interfaces
● programmatically
● official endpoints and local service instance
FREME's e-Services
● e-Entity: for enriching content with information on named entities
● e-Link: for enrichment with linked data sources
● e-Terminology: for detecting terms and enriching them with term-related information
● e-Translation: for providing custom machine translation systems
● e-Publishing: for exporting enriched content as ePub
Linked Data in FREME
● service interoperability and integration using NIF
● low entrance barrier: start for test / limited volumes immediately, just using REST queries
● utilization of several popular Linked Data knowledge bases: Europeana (cultural heritage data), ORCID (research identifiers), ONLD (organisation names), Library of Congress Authorities
Choice of the Most Appropriate NER/NEL Tool
● Topic/Domain of
  ○ used training data
  ○ knowledge bases linked against
● Overall Performance (Precision, Recall, ...)
● Support for / Performance for specific Entity Categories (Companies, Artists, …)
● ?
GERBIL - Sustainable Entity Annotation Benchmarks
● unified experiment setups
● extensible for additional services and datasets
● experiment results as Linked Data Resources
  ○ easy documentation
  ○ improving reproducibility
Smart Data Web (SDW)
- BMWi funded project
- Main goal: Data collection for the German industry using state of the art extraction and enrichment technologies
- Use Cases:
  - Supply Chain Management
  - Market Research
SDW - Knowledge Graph
- AKSW/KILT group responsible for Knowledge Graph
- Curated sources:
  - DBpedia, PermID, GRID, etc.
- Uncurated sources:
  - Twitter, news feeds, etc.
- Data quality and persistence
  - RDFUnit: test-driven data-debugging framework
  - LIMES: link discovery for the web of data
SDW - Use Case: Supply Chain Management
● Public KG is fed continuously with information from news channels, websites, etc.
● Corporate/Internal KG is connected to public KG
● Detection of potential problems in supply chain
  ○ Need to check suppliers regularly
    ■ Check for compliance
    ■ Quality of products
    ■ Who else is being supplied by this supplier?
    ■ strike, natural disaster, insolvency, etc.
    ■ ...
  ○ Get information about potential problems as quickly as possible
SDW - Use Case: Market Research
● Public KG is fed continuously with information from news channels, websites, etc.
● Finding more information about value chain:
  ○ Potential leads/potential customers
  ○ Competition
  ○ Potential suppliers
  ○ Price development on the market
  ○ Customer satisfaction
● Connecting information from different data silos
  ○ Finding out about new relations
SDW - NLP
- Extraction of company and traffic events using state of the art NLP and machine learning technologies
- Demo
SDW - Linked Data
● Common public knowledge graph (KG)
  ○ Modular corporate ontology
  ○ ETL pipeline for different datasets
  ○ Fusion of different datasets and web data
● Unique URIs for all entities (through knowledge fusion)
● Store meta-data (e.g. provenance) for each RDF statement
● Use KG for NLP tasks (e.g. Named Entity Recognition, Disambiguation, etc.)
● Use KG for Enterprise Search
Use Case: Introducing Linked Data into an established IT environment
● Task 1: Support transformation from relational database data to RDF from different sources
  ○ Establish a transformation process from SQL data to RDF.
  ○ Combine multiple DBs into a single RDF
● Task 2: Generate software components capable of manipulating the underlying data by any user
  ○ Based on the domain description (ontology)
● Task 3: Enable data quality checks using data constraints
  ○ Applying an iterative process of developing an ontology, providing test results and correcting both data and ontology in a changing overall software environment.
Task 1: Transforming DB-data into RDF
● Using a mapping language like R2RML can generate an RDF (or Triple-) View of the relational data
  ○ Databases like Virtuoso already have all tools needed for this mapping included and provide an automated mapping function if needed.
  ○ Multiple approaches for automated creation of R2RML mappings exist, creating a wrapper layer on top of the Relational Database (RDB)
  ○ A different approach is the query rewriting SPARQL → SQL, which needs an ontology adaptation of the RDB schema
● Multiple mappings for additional databases
Task 2: ALIGNED Tool - Semantic Booster
● Given a domain description as input (ontology, DB schema, etc.)
● Optional mapping to an existing DB schema
● Automatic creation of a Booster Specification
● Automatic generation of a SQL based DB-specification.
● The data is propagated through the Booster web interface
Semantic Booster - Features
● Generating high-quality software components from precise models with metadata annotations.
● Using Model-Driven Engineering techniques to generate a complete information system.
● key feature: Allows for the smooth transition of data when the underlying database is updated.
● Enables domain experts to develop systems conformant with existing standards, datasets or systems.
Task 3: Data Quality Validation
Option A:
● use the Form Validation methods of the Booster Web Interface to validate user input directly
  ○ Useful for simple domains without many restrictions
Option B:
● Use RDF Unit to automatically generate data unit tests
  ○ for more complex domains
Creating data unit tests with RDF Unit
● Common restrictions (e.g. cardinalities, domain/range) are automatically transformed into unit tests by RDF Unit.
● More complex restrictions can be inserted
  ○ with new unit patterns → enabling RDF Unit to generate the test itself
  ○ using custom SPARQL queries → the user defines the pattern to look for
● A failed unit test will produce an error object in RDF serialization
  ○ containing the necessary metadata to pinpoint the offending triple
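To illustrate what a data unit test for a cardinality restriction checks, here is a plain-Python sketch over toy triples. It does not use RDF Unit or its API; the function name, the `ex:` resources and the deliberately wrong triple are assumptions for illustration:

```python
from collections import defaultdict


def max_cardinality_violations(graph, predicate, max_count=1):
    """Return subjects that have more than max_count values for predicate."""
    values = defaultdict(set)
    for s, p, o in graph:
        if p == predicate:
            values[s].add(o)
    return {s: sorted(objs) for s, objs in values.items() if len(objs) > max_count}


graph = [
    ("ex:Munich", "dbo:country", "ex:Germany"),
    ("ex:Munich", "dbo:country", "ex:Austria"),  # deliberately wrong test triple
    ("ex:Leipzig", "dbo:country", "ex:Germany"),
]
violations = max_cardinality_violations(graph, "dbo:country")
print(violations)
```

The result names the offending subject together with its conflicting values, which is the kind of information needed to pinpoint and fix the triple.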
Evaluating RDF Unit results
● Test case results can be used to implement a (semi-)automatic process to improve the tested data or its generating software
● Or to provide statistics about the quality of a dataset
● ...
Completing the tasks
In addition to validation processes, any number of tasks can be executed on the given data
● e.g. extracting NIF annotations on linguistic data
● Spotlight Entity recognition
● etc.
Summary
● WOD: applying the principles of the WWW to data
● Bridging disciplines and domains (by linking their data)
● Linked Data makes Smart Data out of Big Data
● Many Linked Data Standards can be reused for Big Data
● DBpedia can be used for many domains and processes
● Linked Data can be applied in many different parts of commercial environments